Tuesday, July 19, 2011

Rapid analysis of melting point trends and models using Google Apps Scripts

I recently reported on how Google Apps Scripts can be used to facilitate the recording and calculations associated with a chemistry laboratory notebook. (also see resource page)

I will demonstrate here how these scripts can be used to rapidly discover trends in the melting points of analogs, both for the curation of data and for the evaluation of models. The two melting point services that Andrew Lang created under the gONS menu were used to keep track of the measured and predicted melting points of all reactants and products as part of a "dashboard view" of the reaction being performed.

For looking at melting point trends, the following template sheet can be used.

For reasons explained previously, the template sheet has no active scripts in the page (except for the images). The cells simply contain the values generated by running the scripts corresponding to the column headings on the list of common names. To use the template for another series of compounds, make a copy of the entire Google Spreadsheet (File->Make a Copy), enter the new list, and pick the desired script to run from the menus. Once the values are computed, remember to copy and paste them as values.

It is important to understand that our melting point service is not a "trusted source" - it simply reports the average of all recorded values, ignoring those marked as DONOTUSE. That means that not all data points are equal, and it is up to the user to set a threshold of some type to decide how to use a particular data point.
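The averaging rule above can be sketched in plain JavaScript, outside of the spreadsheet context. The record shape ({value, flag}) is an assumption for illustration, not the service's actual schema:

```javascript
// Report the mean of all recorded values, ignoring any marked DONOTUSE.
// The record shape ({value, flag}) is illustrative, not the real schema.
function averageMeltingPoint(records) {
  // keep only values not flagged DONOTUSE
  var usable = records.filter(function (r) { return r.flag !== "DONOTUSE"; });
  if (usable.length === 0) return null; // no usable data for this compound
  var sum = usable.reduce(function (acc, r) { return acc + r.value; }, 0);
  return sum / usable.length;
}
```

For example, two usable methanol-like values of -98 C and -97.6 C alongside one flagged value would average to about -97.8 C, with the flagged value ignored entirely.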

In this investigation, I have marked in green averaged experimental values where at least 3 different values are clustered within a few degrees. A link in column H is automatically generated from the CSID to provide a very convenient way to evaluate the data sources. For example, the link for methanol shows 3 very close but distinct melting point values: -98 C, -97.6 C and -97.53 C. The -98 C value is repeated 7 times because it resulted from the automatic merging of several Open Collections.
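The "mark in green" rule can be expressed as a small check: an average passes if at least some minimum number of reported values fall within a narrow window. This is a sketch under my own naming; the function and parameters are illustrative, not part of the actual scripts:

```javascript
// Confidence check: pass if at least minCount reported values fall within
// a window of `spread` degrees. Names and parameters are illustrative.
function passesClusterTest(values, minCount, spread) {
  if (values.length < minCount) return false;
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  // slide a window of minCount consecutive sorted values and check its range
  for (var i = 0; i + minCount <= sorted.length; i++) {
    if (sorted[i + minCount - 1] - sorted[i] <= spread) return true;
  }
  return false;
}
```

With the methanol values above, passesClusterTest([-98, -97.6, -97.53], 3, 2) returns true, since all three fall within half a degree of each other.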

In general we don't manually add values that are identical across different sources, because they likely all originate from the same original source. We have to make that assumption because proper data provenance is usually lacking in chemical information sources today. A Google search will often return the same one or two melting points from dozens of sites, and these may turn out to be outliers when compared with other independent sources. (CAS numbers are generated in the template sheet because they are useful for searching Google for melting points - for example see here for methanol)

In another scenario, where there are 3 or more different but close values and a few clearly marked outliers, I considered the averages to have passed my threshold and colored these green as well. A good example is ethanol, which I have previously used to illustrate our curation method.
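This second scenario can be sketched as a median-based outlier filter applied before averaging. The 5-degree cutoff and the function names here are illustrative choices of mine, not thresholds taken from the actual curation scripts:

```javascript
// Discard values far from the median before averaging. The 5-degree
// cutoff used in the example below is an illustrative assumption.
function medianOf(values) {
  var s = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function dropOutliers(values, cutoff) {
  var med = medianOf(values);
  return values.filter(function (v) { return Math.abs(v - med) <= cutoff; });
}
```

For an ethanol-like set such as [-114.1, -114.5, -114.14, -130] with a 5-degree cutoff, the lone -130 value is dropped and the remaining three values cluster tightly enough to pass the green threshold.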

It turns out that for the series of n-alcohols from methanol to 1-decanol, I was able to mark in green every experimental melting point average, making the confidence level of the following plot about as high as it can get from current chemical information sources.

It is particularly gratifying to note that the predicted melting points based on Andrew Lang's random forest Model002 perform very well here, even predicting a melting point minimum at 3 carbons. Note that this model is Open Source and uses Open Descriptors derived from the CDK. It does not yet include the results of our most recent curation efforts. Any new models incorporating improved datasets will be listed here.

Extending the analysis to n-alkyl carboxylic acids from formic acid to decanoic acid provides the following plot, with the same confidence for the experimental averages.

For this series, the random forest model not only predicts that the lowest melting point belongs to the 5-carbon analog but also appears to reproduce the zig-zag pattern, especially for the first 6 acids. Since this alternating pattern has been attributed to the way that carboxylic acid dimer bilayers pack in 3D (Bond2004), it is hard to imagine how simple 2D descriptors from the CDK can predict it. We will have to investigate this in more detail.

More generally, molecular symmetry can greatly affect the melting point via the way that crystals pack in 3D (see Carnelley's Rule, Brown2000). At some point we would like to incorporate this factor in our models. The current model should not be able to make predictions based on symmetry or stereochemistry.

We can also explore the melting point patterns of cyclic systems. Going from cyclopropane to cyclohexane, there is a large jump in melting point from the 5- to the 6-membered ring, and this is roughly reflected in the model:

Cycloalkanones behave similarly to cycloalkanes, showing a jump from the 5- to the 6-membered ring that agrees well with the model going from cyclobutanone to cyclohexanone:

However, in going from methylcyclopropane to methylcyclohexane, the model diverges substantially from experimental results. It does become harder to find corroborating melting points in this series, and only 2 values could be found for methylcyclobutane.

Going from cyclopropanecarboxylic acid to cyclohexanecarboxylic acid shows a U-shaped pattern that is not well matched by the model. However, there is additional uncertainty about the melting point of cyclopentanecarboxylic acid.

For the series from cyclopropylamine to cyclohexylamine, there initially appears to be a significant mismatch between the model and experiment. However, because we have retained the provenance information in the spreadsheet, it becomes clear that the cyclobutylamine value (in the orange square below) comes from a single source. The other 3 values actually match the model well. Still, we do not yet have enough information about when the model is reliable to assign the source of the discrepancy at this point.

These examples show that provenance information is a critical dimension in the analysis of trends in melting point data. The Google Apps Scripts and associated Google Spreadsheet template presented here offer a quick and convenient way to access both averaged values and the underlying sources needed to assess confidence in those averages. Performing these tasks manually is generally too time-consuming to encourage researchers to follow such a practice. This is perhaps the reason that the current peer-review process accepts a single "trusted source" in analyses of this kind, even though such a practice inevitably leads to misinterpretations and errors that cascade through the scientific literature.


At 3:37 AM, Blogger Egon Willighagen said...

Nice post! The first graphs show very nicely the concept of generalization! Cool :)

BTW, this approach will indeed become an important tool in publishing. In fact, there are enough in our community who have been advocating this for more than a decade now (e.g. Peter Murray-Rust). They can catch many typing errors, as well as more serious errors, and are a cheap, simple, but very effective extension of 'peer-review'.

At 5:42 AM, Anonymous PostBacc said...

This is the first time I've come across Google scripts used this way. Cloud computing will definitely be used in the future within the scientific community. Guess it's time to learn how to write some scripts :D

At 8:42 AM, Anonymous Don Pellegrino said...

This is a useful example of the role provenance data can play in experimental data analysis.



Creative Commons Attribution Share-Alike 2.5 License