Monday, January 17, 2011

Science Online 2011 Thoughts

On January 15, 2011, I co-moderated a Science Online 2011 session on Open Notebook Science with Antony Williams and Carl Boettiger. The projector failed, so we did our best to introduce the topic without relying on visual aids. My main objective was to demonstrate that it is not necessary for researchers (or their machines) to interface with the actual lab notebook to benefit from the information generated by the work. By introducing simple and rapid abstraction steps, both solubility and reaction information can be converted to web services for a variety of uses. As long as a link to the original lab notebook page (including the raw data) is attached, no information is lost and details can be investigated on demand.

One of the most powerful tools to use in this context is the tracking of chemical entities as ChemSpider IDs. This enables direct access to many other web services which Andrew Lang and I have leveraged to generate our own services. Tony spoke a bit more about this in his part and outlined some of the benefits and frustrations with crowdsourcing. Carl spoke eloquently about his experiences with Open Notebook Science as a graduate student for computational projects. The slides from all of us are provided below.

The overall tone of the discussion during our session was quite positive and productive. This was the case with all of the other sessions I attended, as it has been in prior years. The Science Online conference has evolved to attract a large proportion of people advocating Open Science. The presenters and the audience feel that they are among friends, and the result is usually a free and easy exchange of ideas. Not all conferences and symposia relating to the online aspects of science share this spirit. I have seen many examples where the "online science" theme is overrun by Closed Science proponents, for example commercial databases or Electronic Laboratory Notebook (ELN) vendors. Hopefully this conference will retain its Open Science focus in the future.

Kaitlin Thaney proved to be a very effective moderator during her session on "The Digital Toolbox: What's Needed?" and she stirred up some insightful discussion. I also enjoyed Steve Koch's session (co-moderated with Kiyomi Deards and Molly Keener) on "Data Discoverability: Institutional Support Strategies". Steve shared a particularly compelling example of the collaborative benefits of Open Notebook Science, where a computational research group came across images and videos from one of his group's notebooks and incorporated these in their paper - with all due credit acknowledged.

I very much appreciated the opportunity to catch up with old friends and meet some new ones. I had never met Carl Boettiger in person before, and we had some very interesting discussions about Open Science and Open Education. It was good to meet Mark Hahnel from FigShare and explore possible paths for data sharing. I also had some nice chats with Antony Williams, Steve Koch, Steven Bachrach, Heather Piwowar and Ana Nelson.

The Saturday evening banquet proved to be surprisingly entertaining. Despite the sedate title of her talk, "Out on a Limb: Challenges of Training Scientists to Communicate", Meg Lowman pounded the audience with a hilarious performance. Science comedian Brian Malow kicked this up a notch with some very clever material. Later on, using a brilliant comedic judo technique, he repeated some choice derisive comments he received from his performances on YouTube. I hope he comes back next year!


Wednesday, January 12, 2011

Talk on Open Education in Chemistry at the University of the Sciences

I presented on "Open Education in Chemistry Research and Classroom" at the University of the Sciences on January 11, 2011. The talk covered screencasting, wikis, Open Notebook Science, games and smartphones.

This was also an opportunity to present some new work Andrew Lang and I did on Chemical Information Validation, resulting from the students in my Chemical Information Retrieval class during the Fall 2010 term. I also highlighted Don Pellegrino's recent social/chemical network visualization project.

Wednesday, January 05, 2011

Chemical Information Validation Results from Fall 2010

As I mentioned earlier, one of the outcomes from my Fall 2010 Chemical Information Retrieval class involved the collection of chemical property information from different sources in a database format. Now that the course is over, this has resulted in 567 measurements for 24 compounds (including one compound, EGCG, from the previous term). I have curated the dataset to ensure that the original numbers, conversions to common units, categorizations, etc. are correct. Links to the information source or to a screenshot of the source are available for each entry - so if I missed something, anyone can unambiguously verify it for correction.

The dataset is available as a Google Spreadsheet. Andrew Lang has also created a web-based interface: the ChemInfo Validation Explorer. By simply specifying the compound of interest and the property using drop-down menus, the list of measurements from the relevant sources is provided, with values outside of one standard deviation marked in orange. Links to the information source, or an image in cases where the information source cannot be directly linked, are provided in the results. Here is an example for the boiling point of benzene.

The visualization and analysis of the data were greatly facilitated by the use of Tableau Public. After downloading the free program, anyone can easily re-create the queries in this post by first downloading the dataset as an Excel document and then importing it into Tableau Public. Interactive charts can then be freely hosted on the TP server and embedded, as I have done in this post below.

The students were shown how to search both commercial and free information sources and were given complete freedom in choosing which compounds and chemical properties to target. The results can therefore be analyzed as a reasonable sampling of the current state of chemical information available to the average chemist. The 5 most frequently obtained properties were melting point, density, boiling point, flash point and refractive index.

The information sources were categorized and are reported below by frequency. Chemical vendor sites were by far the most frequently used information source.

It is important to note that the information source does not represent the method by which the measurements were found. The source is simply the end of the chain of provenance: the document that provides no specific reference for the reported measurement. For example, even though ChemSpider was frequently used as a search engine, it would not be listed as an information source when it provided links to other sources (mainly MSDS sheets) for properties. ChemSpider was treated as a source for some predicted properties.

The chemical vendor Sigma-Aldrich was the most frequently used information source, followed by Alfa Aesar. Wolfram Alpha - categorized as a "free database" - was third. Oxford University follows closely behind in fourth and is categorized as an "academic website", hosting MSDS sheets. Many universities host MSDS sheets, but the Oxford web site seems to turn up most frequently in chemical property queries on search engines.

The fifth most frequent information source was Wikipedia, reflecting the fact that specific references are usually not provided there for chemical properties. Like ChemSpider, Wikipedia was categorized as a "crowdsourced database".

Flagging Outliers

One of the advantages of this type of collection is that it is much easier to identify outliers. In the case of non-aqueous solubility data, we were able to create an outlier bot to automatically flag potentially problematic results. Since different properties may have very different typical variabilities, outliers are most easily discovered by comparisons within the same property.

For example, consider the following plot showing the standard deviation to mean ratio for melting point measurements.
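The flagging step amounts to computing, for each compound and property, the ratio of the standard deviation to the mean (the coefficient of variation). Here is a minimal sketch in Python of how such a check could work; the compound names, values and the 5% threshold are illustrative assumptions, not taken from the actual dataset or bot.

```python
from statistics import mean, stdev

# Hypothetical melting point measurements (Kelvin) per compound;
# these numbers are illustrative only, not from the real dataset.
melting_points = {
    "compound A": [353.4, 353.6, 353.5],   # tight agreement
    "compound B": [413.0, 491.0],          # suspiciously far apart
}

def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Flag any compound whose measurements vary by more than 5% of the mean
# (the threshold is an assumption for this sketch).
flagged = {name: cv(vals) for name, vals in melting_points.items()
           if cv(vals) > 0.05}
```

Comparing the ratio rather than the raw standard deviation makes compounds with very different melting points comparable on the same chart.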

This reveals that the average melting point for EGCG is suspect. At this point, an easy way to inspect the results is to use the Validation Explorer and look at the individual measurements.

By clicking on the images we can verify that the numbers have been correctly copied from the primary sources. In this case we can also ascertain that the sources - a peer-reviewed paper and the Merck Index - are considered by most chemists to be generally reliable. There is no compelling reason at this point to favor one result over the other, and one has to be careful when using the average value for any practical application. (Note that all temperature data are recorded in Kelvin. A zero-based scale is necessary for the standard deviation to mean ratio to be meaningful.)
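The reason a zero-based scale matters can be seen with a quick check: the same pair of temperatures has an identical spread in Kelvin and in Celsius, but the ratio balloons in Celsius because the mean shrinks toward an arbitrary zero point. The two values below are hypothetical, chosen only to make the effect visible.

```python
from statistics import mean, pstdev

temps_k = [427.0, 491.0]                 # two hypothetical measurements in Kelvin
temps_c = [t - 273.15 for t in temps_k]  # the same measurements in Celsius

cv_k = pstdev(temps_k) / mean(temps_k)
cv_c = pstdev(temps_c) / mean(temps_c)

# The standard deviation is the same either way; only the mean changes,
# so the ratio is larger (and scale-dependent) in Celsius.
```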

The next flagging hit in this collection is the melting point of cyclohexanone. In this case 5 results are returned and the Validation Explorer highlights the Alfa Aesar value as being more than one standard deviation from the average.

However, one has to be careful when assessing this and assuming that the Alfa Aesar value is most likely to be the odd one out. Notice that 3 of the values - Sigma-Aldrich, Acros and Wolfram Alpha - are identical. The most likely explanation is that all three used the same information source, and they should thus be counted as a single measurement.
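One way to implement that correction is to collapse identical values into a single measurement before computing any statistics. A sketch, again with illustrative numbers rather than the real cyclohexanone data:

```python
from statistics import mean

# Hypothetical melting point values (Kelvin) by source; illustrative only.
measurements = {
    "Sigma-Aldrich": 226.0,
    "Acros": 226.0,
    "Wolfram Alpha": 226.0,
    "Alfa Aesar": 242.0,
    "Oxford MSDS": 228.0,
}

# Sources reporting exactly the same value are assumed to share a common
# upstream source, so each distinct value counts only once.
distinct = sorted(set(measurements.values()))

naive_mean = mean(measurements.values())  # duplicates triple-counted
dedup_mean = mean(distinct)               # one vote per distinct value
```

With the duplicates collapsed, the value that initially looked like an outlier pulls less far from a mean it no longer has to fight three copies of the same number to reach.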

We can test this hypothesis by looking for cases where Sigma-Aldrich, Acros and Wolfram Alpha don't share identical values. As shown below, for melting point measurements, there is no case where the values don't match.

The same is true for boiling points:

However, in the case of flash points it is clear that the three are not using a common data source.

Using the data we collected - and will continue to collect - we could start to identify which data sources are likely using the same ultimate sources and avoid over-counting measurements. This would save time in searching since one would know which sources to check for a particular property while avoiding duplication. This information is extremely difficult to obtain using other approaches.

The same type of outlier analysis can be performed for all the properties collected in this study.

I believe that there is much more useful analysis to be done on this dataset, especially for chemistry librarians. When this class is run next year, more data will be added. In the meantime, contributions from other sources would be welcome.

Creative Commons Attribution Share-Alike 2.5 License