Tuesday, July 19, 2011

Rapid analysis of melting point trends and models using Google Apps Scripts

I recently reported on how Google Apps Scripts can be used to facilitate the recording and calculations associated with a chemistry laboratory notebook. (also see resource page)

I will demonstrate here how these scripts can be used to rapidly discover trends in the melting points of analogs for the curation of data and the evaluation of models. The two melting point services that Andrew Lang created under the gONS menu were used to keep track of the measured and predicted melting points for all reactants and product as part of a "dashboard view" of the reaction being performed.

For looking at melting point trends, the following template sheet can be used.

For reasons explained previously, the template sheet has no active scripts in the page (except for the images). These are just the values generated from running the scripts corresponding to the column headings on the common names. In order to use for another series of compounds just make a copy of the entire Google Spreadsheet (File->Make a Copy) then enter the new list and pick the desired script to run from the menus. Once the values are computed remember to copy and paste as values.

It is important to understand that our melting point service is not a "trusted source" - it simply reports the average of all recorded data sources, ignoring values marked as DONOUSE. That means that not all data points are equal and it is up to the user to determine a threshold of some type to decide how to use a particular data point.

In this investigation, I have marked in green averaged experimental values where at least 3 different values are clustered within a few degrees. A link in column H is automatically generated from the CSID to provide a very convenient way to evaluate the data sources. For example the link for methanol has 3 very close but different melting point values: -98 C, -97.6 C and -97.53 C. The -98 C value is repeated 7 times because this resulted from the automatic merging of several Open Collections.

In general we don't manually add values that are identical from different sources because it is likely that these all originate from the same source. We have to make that assumption because proper data provenance is usually lacking in chemical information sources today. A Google search will often return the same one or two melting points from dozens of sites, which may turn out to be an outlier when compared with other independent sources. (CAS numbers are generated in the template sheet because they are useful for searching Google for melting points - for example see here for methanol)

In another scenario where there are 3 or more different but close values and a few clear marked outliers, I considered these averages as having passed my threshold and colored these green as well. A good example is ethanol, which I have previously used to illustrate our curation method.

It turns out that for the series of n-alcohols from methanol to 1-decanol, I was able to mark in green every experimental melting point average, making the confidence level of the following plot about as high as it can get from current chemical information sources.

It is particularly gratifying to note that the predicted melting points based on Andrew Lang's random forest Model002 perform very well here, even predicting a melting point minimum at 3 carbons. Note that this model is Open Source and uses Open Descriptors derived from the CDK. It does not yet include the results of our most recent curation efforts. Any new models incorporating improved datasets will be listed here.

Extending the analysis to n-alkyl carboxylic acids from formic acid to decanoic acid provides the following plot, with the same confidence for the experimental averages.

For this series, the random forest model not only predicts that the lowest melting point is for the 5 carbon analog but it also appears to take the shape of a zig-zag pattern, especially for the first 6 acids. Since this alternating pattern has been attributed to the way that carboxylic acid dimer bilayers pack in 3D (Bond2004), it is hard to imagine how simple 2D descriptors from the CDK can predict this. We will have to investigate this in more detail.

More generally, molecular symmetry can greatly affect the melting point via the way that crystals pack in 3D (see Carnelley's Rule, Brown2000). At some point we would like to incorporate this factor in our models. The current model should not be able to make predictions based on symmetry or stereochemistry.

We can also explore the melting point patterns of cyclic systems. Going from cyclopropane to cyclohexane there is a large jump from a 5 to a 6 membered ring and this is roughly reflected in the model:

Cycloalkanones behave similarly to cycloalkanes, showing a jump from 5 to 6 membered rings which agrees well with the model going from cyclobutanone to cyclohexanone:

However, in going from methylcylopropane to methylcyclohexane, the model diverges substantially from experimental results. It does start to get harder to find corroborating melting points and only 2 values can be found for methylcyclobutane.

Going from cyclopropanecarboxylic acid to cyclohexanecarboxylic acid shows a U-type pattern and is not well matched by the model. However, there is additional uncertainty about the melting point of cyclopentanecarboxylic acid.

For the series from cyclopropylamine to cyclohexylamine, there initially appears to be a significant mismatch between the model and experiment. However, because we have retained the provenance information in the spreadsheet it becomes clear that the cyclobutylamine number (in the orange square below) comes from a single source. There is actually a good match between the other 3 values. However, as demonstrated here, there has not been enough information on when the model is reliable to assign the source of the discrepancy at this point.

These examples show that provenance information is a critical dimension in the analysis of trends in melting point data. The Google Apps Scripts and associated Google Spreadsheet template presented here offer a quick and convenient way to provide access to both averaged values and a way of assessing confidence in an averaged value. Performing these tasks manually is generally too time-consuming to encourage researchers to follow such a practice. This is perhaps the reason that the current peer-review process accepts a single "trusted source" in analyses of this kind, even though such a practice inevitably leads to mis-interpretations and errors that cascade through the scientific literature.

Thursday, July 14, 2011

Practical Tips on using Google Apps Scripts for Chemistry Applications

A few weeks ago I described our use of Google Apps Scripts, developed by Rich Apodaca and Andrew Lang, as an intuitive interface to information related to a chemistry laboratory notebook. Since then we have been using these tools to actively plan and record experiments (e.g. UC-EXP269) and we have learned their strengths and weaknesses.

The most problematic aspect of Google Apps Scripts running within Google Spreadsheets turns out to be the way caching and refreshing operate. There does not appear to be an obvious way to refresh a single cell. So if a script times out or fails, Google stores that failed output on their servers and will not run it again until some time has elapsed (which seems to be on the order of about an hour). Typing in a new input for that cell will cause the script to run again but entering a previously entered input will only retrieve the cached output, even a failed output. For example, if you have a cell calculating the MW from "benzene" entered in another cell and the script fails for any reason, typing in "ethanol" will get it to run again for the new input, but going back to "benzene" will just pull up the cached output of "Failed".

Nevertheless, I did come across some tricks to force a refresh indirectly. If you insert a row or column then re-enter the desired scripts in the new cells, they will run again. You simply need to then delete the old column with failed outputs. This is fine for simple sheets but it can be a headache for sheets that have several calculation dependencies between cells.

To avoid these complications, simply refresh the entire sheet by duplicating it, deleting the old sheet and then renaming the new one to the original name. The problem now is that it will refresh all the cells, not just those that had failed outputs. And if there are a large number of scripts on that sheet the odds are good that at least one will fail on that particular attempt, especially if several are hitting the same web server.

As a result of all these problems, I would not recommend using these services as I had initially hoped, where a researcher would enter data into a template sheet loaded with scripts to automatically generate a series of calculated outputs. There is a way to achieve this end but it requires thinking about the scripts in a slightly different way.

As I mentioned above, there are tricks for refreshing an entire sheet or a column or row. In order to avoid re-running the scripts that already returned desired outputs, we need to lock them in. This can be done by highlighting the completed cells, copying them (either control-c or Edit->Copy) then pasting them as values (from the Edit menu). Now refreshing will only be done on the cells with failed outputs and these can be locked in as well as soon as they complete.

The downside of this approach is that you lose the information about which script was run to generate the output values. And to change an input requires re-selecting the desired script. But in practice it is so convenient to hit a dropdown menu and hit getMW (for example) that this downside is quite minimal, especially when contrasted with the upside of knowing that others will see your information reliably, independent of how the services are running at a particular time.

Over the past few weeks we have found that some services fail more often than others and it would be advantageous to have some redundancies. This has been particularly problematic for the cactus services recently, which we often use for resolving common names. By using ChemSpiderIDs (CSIDs), the cactus services can be bypassed for several of the gONS services. So a good practice for any application is to generate and lock in SMILES and CSIDs right away from the common name. CAS numbers can be used too but the gChem service that Rich has created sometimes yields multiple CAS numbers and these will fail as input for a subsequent script.

We now have a chemistry Google Apps Scripts spreadsheet to keep track of which inputs are allowed for all the available services, along with information about the output, creator and description. We also keep track of requests and plans for new scripts, marked as "pending" under the status field.

Surprisingly, pasting images "as values" within a Google Spreadsheet cell does not ensure that they will appear consistently - often the cells are just blank upon loading. This makes the idea of using an embedded sheet to display reaction schemes within a wiki lab notebook page not practical. However, using the scripts and a template to generate the scheme by just typing the name, SMILES or CSID for the reactants and product is a very efficient way to generate a consistent look for schemes within a notebook. It only requires a final step of taking the image of the screen and cropping using Paint. For example, here is a scheme thus generated for UC-EXP269.

Taking into account all of these factors, the reaction template sheet we provide does not have by default any scripts running within cells (except for the images). However, it is set up to quickly adapt to other reactions for planning amounts of reactants (by weight or volume), calculating concentrations, yields, melting points (experimental and predicted), solubilities, links to ChemSpider, 2D rendering of structures (including full schemes) and links to interactive NMR spectra using ChemDoodle. It simply requires users to hit one of the 3 drop-down menus (gChem, gCDK or gONS) and select the appropriate script for a particular cell.

Even if the user does not want to use this particular reaction template it still makes sense to make a copy of the template sheet because it is an easy way to copy all of the necessary Google Script without opening the editor.

Labels: , , ,

Friday, July 01, 2011

Open Notebook Science Talk at HUBbub 2011

On April 6, 2011 I presented at the HUBzero Conference in Indianapolis on "Open Notebook Science: Does Transparency Work?".
This presentation will first describe Open Notebook Science, the practice of making the laboratory notebook and all associated raw data available to the public in real time. Examples of current applications in organic chemistry - solubility and chemical reactions - will be detailed. Key details of the current technical implementation will be described and possible applicability to nanotechnology projects will be explored. Finally, the implications for Intellectual Property protection, claims of priority, subsequent publication in peer reviewed journals and the eventual automation of the scientific process will be explored.
The organizers did a great job in making the recording available as either a video or audio podcast.

I learned a great deal at the conference about how researchers from various fields use the HUBzero software to manage and share their data. As described on their website:
HUBzero® is a platform used to create dynamic web sites for scientific research and educational activities. With HUBzero, you can easily publish your research software and related educational materials on the web.
Although the system is not primarily designed for completely Open sharing, I did get the impression that for some applications there was significant interest in making data and processes more Open. There is certainly an enthusiastic user community around HUBzero - check out the recordings for some of the other talks here.

Labels: , ,

Creative Commons Attribution Share-Alike 2.5 License