Saturday, June 18, 2011

Google Apps Scripts for an intuitive interface to organic chemistry Open Notebooks

Rich Apodaca recently demonstrated how Google Apps Scripts can be added to Google Spreadsheets to enable simple calling of web services for chemistry applications (gChem). Although we have been using web service calls from within Google Spreadsheets for some time (solubility calculation by NMR link #3 and misc chem conversions link #1), the process wasn't as intuitive as it could be because one had to find and then paste lengthy URLs.

Rich's approach lets you simply select the desired web service from a menu in Google Spreadsheets, and the functions have simple names like getSMILES. Andrew Lang has now added several web services from our ONS projects and from the CDK. There are now three menus to choose from: gChem, gCDK and gONS.
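To give a flavor of how these menu functions work under the hood, here is a minimal sketch of a gONS-style custom function. The service URL below is a hypothetical placeholder (the real gONS functions call Andrew Lang's ONS web services); only the UrlFetchApp mechanics are standard Apps Script:

```javascript
// Minimal sketch of a gONS-style custom function. The URL is a placeholder,
// not the actual ONS service endpoint.
function getMeltingPoint(chemicalName) {
  var url = "http://example.org/mpservice?name=" + encodeURIComponent(chemicalName);
  var response = UrlFetchApp.fetch(url); // Apps Script's built-in HTTP client
  return response.getContentText();      // the returned text fills the cell
}
```

Once saved in the spreadsheet's script editor, such a function can be called from any cell as =getMeltingPoint(A2), just like a built-in spreadsheet function.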


To demonstrate the power of these tools consider the rapid construction of a customized interface to an experiment in a lab notebook (in this example UC-EXP263).

1) Because Andy has added a gONS service to render images of molecules from ChemSpider, consistent reaction schemes can now be constructed from this template by simply typing the names of the reactants and products and then embedding the result in the wiki.



2) The reaction can then be planned - reactant amounts and product yield calculated - simply by typing the names of the chemicals. Web services supplying molecular weight and density are called automatically with the chemical name as input (see the formula sketch after this list).


3) Typing the name of the solvent then allows easy access to the solubility properties of the reaction components. The calculated concentrations of the reactants and product can be directly compared with their measured maximum solubility. In this experiment the observed separation of the product from the solution is consistent with these measurements.

4) Both experimental and predicted melting points (using Model002) can then be lined up for comparison. A large discrepancy between the two would flag a possible error - in this case good agreement is found. Noting that the product's melting point is relatively low (53 C) explains why two layers were observed to form during the course of the reaction and why cooling to 0 C induced the product to precipitate. Links to the melting point measurements are also provided in column N for easy exploration.

5) Column O provides a quick link to the ChemSpider entries for all compounds, and column P provides links to the Reaction Attempts Explorer where, for example, one can explore other reactions involving the product. Finally, columns Q and R provide one-click access to an interactive NMR spectrum of the product, powered by ChemDoodle.
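Here is the formula sketch promised in step 2, showing what the cells behind such a sheet might look like. The column layout and the getMW function name are hypothetical - the actual function names can be browsed from the gChem, gCDK and gONS menus:

```
A2: benzaldehyde      <- chemical name typed by the user
B2: 2.5               <- mass used, in grams
C2: =getMW(A2)        <- molecular weight fetched by web service (hypothetical name)
D2: =B2/C2            <- moles = mass (g) / molecular weight
```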

The last few columns still use our older code to call web services but over time these should be added to the gONS collection for convenience.

The easiest way to experiment with this interface is probably to just make a copy (File -> Make a Copy from the Google Spreadsheet menu). The sheet can then be customized for other applications.


Thursday, June 16, 2011

My talk at SLA on Trust in Science and Open Melting Point Collections

On June 14 and 15, 2011 I attended the Special Libraries Association conference and presented on two panels about the role of trust in science, with a case study of the Open Melting Point collections that Andrew Lang, Antony Williams and I have been assembling and curating.

The first panel was on the "International Year of Chemistry: Perils and Promises of Modern Communication in the Sciences". My colleague Lawrence Souder from the Department of Culture and Communication at Drexel presented on "Trust in Science and Science by Blogging", using as an example the NASA press release on arsenic replacing phosphorus in bacteria and the subsequent controversy that played out in the blogosphere (see today's post on the Scientific American blog).

Watch Lawrence Souder's presentation screencast and slides.

The second panel was on "New Forms of Scholarly Communications in the Sciences". Don Hagen from the National Technical Information Service presented on "NTIS Focus on Science and Data: Open and Sustainable Models for Science Information Discovery" and Dorothea Salo discussed the evolving role of libraries and institutional repositories on scholarly communication and archiving.

Watch Don Hagen's presentation screencast and slides.

My own slides and screencast from the second panel are available below:




Saturday, June 11, 2011

More on 4-benzyltoluene and the impact of melting point data curation and transparency

There are many motivations for performing scientific research. One of these is the desire to advance public scientific knowledge.

This is a difficult concept to quantify or even qualitatively assess. One can try to use literature citations and impact factors but that captures only a small fraction of the true scientific impact. For example, one formal citation of our solubility dataset doesn't represent the 100,000 anonymous solubility queries made directly to our database. And of these the actual impact will depend on exactly how the information was used. Egon Willighagen has identified this as a problem for the Chemistry Development Kit (CDK) as well: many more people use the CDK than reflected simply by the number of citations to the original paper.

There are a few of us who believe that curating chemistry data is a high impact activity. Antony Williams spends a considerable amount of time on this activity and frequently uncovers very serious errors from a number of data sources. Andrew Lang and I have put in a similar effort in collecting and curating solubility measurements openly - and recently (with Antony) we have been doing the same for melting points.

Although attempting to estimate the total impact of the curation activity isn't really practical, we can look at a specific and representative example to capture the scope.

I recently exposed the situation with the melting point measurements of 4-benzyltoluene. In brief, the literature provided contradictory information that could not be resolved without performing an experiment. Although an exact measurement was not found, a limit was determined that ruled out all measurements except for one.

Ironically it turns out that the melting point of this compound is its most important property for industrial use! Derivatives of diphenylmethane were sought out to replace PCBs as electrical insulating oils for capacitors because of toxicity concerns. As described in this patent (US5134761), for this application one requires the oil to remain liquid down to -50 C. Another key requirement is the ability to absorb hydrogen gas liberated at the electrode surface (a solubility property). Since this is optimal for smaller alkyl groups on the rings, it places benzyltoluene isomers at the focal point of research for this application.

The patent states: "According to references, the melting points of the position isomers of benzyltoluenes are as follows..." but does not give a specific reference. However, by comparing the numbers with other sources we can presume that the reference is the Lamneck1954 paper I discussed previously.

The patent then uses these melting points to calculate the melting behavior of mixtures of these isomers, as obtained without further purification from a Friedel-Crafts reaction.
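For context, the standard way to estimate the freezing behavior of a near-ideal mixture from pure-component melting data is the Schröder-van Laar relation; whether the patent used exactly this form is an assumption on my part:

$$\ln x_i = \frac{\Delta H_{\mathrm{fus},i}}{R}\left(\frac{1}{T_{m,i}} - \frac{1}{T}\right)$$

where $x_i$ is the mole fraction of isomer $i$ remaining in the liquid at temperature $T$, $T_{m,i}$ is its pure melting point and $\Delta H_{\mathrm{fus},i}$ its enthalpy of fusion. The point is that an error in $T_{m,i}$ for one isomer shifts the entire calculated freezing curve of the mixture.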


If our results are correct and the melting point of 4-benzyltoluene is not +4.6 C but well below -15 C, then the calculated properties in the patent may be significantly in error as well. With the information available thus far from our experiments (UC-EXP266), we think it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C. The patent reports that solidification of some viscous mixtures took up to a full week, but we did not observe an appreciable increase in viscosity for 4-benzyltoluene at -15 C. To be sure, we will first freeze the sample again below -40 C, let it warm to -15 C in the freezer, and confirm that it melts completely.


It is in light of this analysis that I make the case that open curation of melting point data is likely to be a high impact activity relative to the amount of time required to perform it. Errors such as these cascade through the scientific record and likely retard scientific progress by causing confusion and wasted effort. Consider the total cost in research and legal fees for just one patent. As I discussed previously, consider also the effect of the compromised and contradictory data now known to exist within training sets on the pace of developing reliable melting point models - errors cascading down to solubility models that depend on melting point predictions or measurements, and ultimately to the efficiency of drug design.

It is important to note that the benefits of curation would be greatly diminished without the component of transparency. We are not claiming to provide a "trusted source" of melting point data. There is no such thing - and operating under the illusion of the trusted source model has produced the mess we are in now, with multiple melting point values for the same compound cascading and multiplying across different databases (a good and still unresolved example is benzylamine).

What we are doing is reporting all the sources we can find and marking some sources as DONOTUSE - with an explanation - so they are not included in the calculation of the average. We never delete data, so users can make informed choices and are not forced to trust our judgement. If someone does not agree with me that failure to freeze after 2 days at -15 C effectively rules out the +4.6 C melting point value for 4-benzyltoluene, they are free to use it.
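In code terms, the averaging logic amounts to something like the sketch below (JavaScript, with illustrative field names rather than the actual schema of our spreadsheet):

```javascript
// Sketch: consensus melting point that skips curator-flagged values.
// Flagged values stay in the dataset and remain visible to users.
function averageMeltingPoint(records) {
  var usable = records.filter(function (r) { return !r.doNotUse; });
  if (usable.length === 0) return null; // everything flagged: no consensus value
  var sum = usable.reduce(function (s, r) { return s + r.mpCelsius; }, 0);
  return sum / usable.length;
}
```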

Under a trusted source model, all values within a collection are equally valid. In the transparency model not all values are equal - we are justifiably more confident in a melting point value near -114 C for ethanol than in a value with only a single source (like this compound).

And finally, an important factor for having an impact on science is discoverability. Someone doing research involving the melting behavior of 4-benzyltoluene would likely perform at least a quick Google search. What they are likely to find is not just a simple number without provenance but rather a collection of results capturing the full subtlety of the situation under discussion. This is a natural outcome of working transparently.


Thursday, June 09, 2011

The quest to determine the melting point of 4-benzyltoluene

I recently reported that we are attempting to curate the open melting point measurements collected from multiple sources such as Alfa Aesar, PhysProp (EPIsuite) and several smaller collections. I mentioned that some values - like those for benzylamine - simply don't converge, and the only way to resolve the issue is to actually get a high purity sample and do a measurement.

Since that report, we found another non-converging situation with 4-benzyltoluene. As shown below, reported measurements range from -30 C to 125 C.

The values in red have been removed from the calculation of the average based on evidence we obtained by ordering the compound from TransWorld Chemicals and observing its behavior at various temperatures. The details can be found in UC-EXP266 (which I performed with Evan Curtin).

Immediately after opening the package it was clear that the compound was a liquid, and thus the 125 C and 98.5 C values became improbable enough to remove.


First, Evan Curtin and I dropped the still-sealed bottle into an ice bath (0 C); after 10 minutes there was no trace of solidification.


At this point the values near 5 C were not necessarily ruled out, because of the short time in the bath.

We then used an acetone/dry ice bath and did see rapid and clear solidification after reaching -30 C to -35 C.



As the bath temperature rose it was difficult to tell exactly what was happening, but there seemed to be some liquefaction around -12 C.

In order to get a more precise measurement, we transferred about 2 mL of the sample into a test tube and placed the thermometer in direct contact with the substance. After quickly freezing the contents in a dry ice/acetone bath, the sample was removed and its behavior was observed over time, as shown below.


I was expecting to see the internal temperature rise, then plateau at the melting point (holding steady while the latent heat of fusion is absorbed) until all the solid disappeared, and then finally rise a second time. This expectation comes from experience making 0 C baths within minutes by simply throwing ice into pure water.

As shown above, that is not at all what happened. Liquid formed gradually starting at about -9 C, and the temperature never reached a plateau even up to +7 C, where there was still much solid left.

If we look at the method used to generate the 4.58 C value (Lamneck1954), we find that a similar method was cited - but not actually described there. The actual melting curves are not available either. However, this paper provides melting points for several compounds within a series, which is often useful for spotting possible errors - unless of course the errors are systematic. In this particular case it doesn't help much: the 2-methyl derivative is similar, but the 3-methyl analogue is very close to the -30 C value listed in our sources.

Notice that one of the "melting points" (3-methyldicyclohexylmethane) is not even measurable because the compound forms a glass. It is easy to see how melting points below room temperature can generate very different values - and how difficult they are to assess when the full experimental details of the measurements are not reported.

To get at more details, let's look at the referenced paper (Goodman1950). Indeed, the researchers determine the melting point by plotting the temperature over time as the sample is heated and looking for a plateau. The obvious difference is that their heating rate is about an order of magnitude slower than in our experiment.
This paper also highlights the fact that there are more twists and turns in the melting point story: one compound (2-butylbiphenyl) was found to have two melting points, observable by seeding with different polymorphic crystals.


At this point, our objective of obtaining an actual melting point was replaced with marking a reasonably confident upper limit. After leaving the sample at -15 C in a freezer for two days, no solidification was observed - not even an appreciable increase in viscosity. For this reason, all melting point values above -15 C were removed from the calculation of the average and show up in red.

With only the -30 C measurement left, this is now the default value for 4-benzyltoluene - until further experimentation.


Monday, February 21, 2011

Alfa Aesar melting point data now openly available

A few weeks ago, John Shirley - Global Marketing Manager at Alfa Aesar - contacted me to discuss the Chemical Information Validation results I posted from my 2010 Chemical Information Retrieval class. Our research showed that Alfa Aesar was the second most common source of chemical property information from the class assignment.
We explored some possible ways that we could collaborate. Given our recent report on the use of melting point measurements to predict temperature-dependent solubility curves, the Alfa Aesar melting point collection could prove immensely useful for our Open Notebook Science solubility project.

However, since we are committed to working transparently, the only way we could accept the dataset was if it were shared as Open Data. I am extremely pleased to report that Alfa Aesar has agreed to this requirement and we hope that this gesture will encourage other chemical companies to follow suit.

The initial file provided by Alfa Aesar did not store melting points in a database-ready format - it included ranges, non-numeric characters and entries reporting decomposition or sublimation. One of the benefits we could provide back to the company was cleaning up the melting point field to pure numerical values ready for sorting and other database processing (see the sketch below). This processed collection contains 12986 entries. Note that these entries are not necessarily distinct chemical compositions, since they refer to specific catalog entries with different purities or packaging.
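The cleanup boiled down to rules along these lines - a simplified JavaScript sketch, not the actual processing code, which handled more cases than shown:

```javascript
// Sketch: reduce a raw catalog melting point string to a single number,
// or null for entries (decomposition, sublimation) that are not true
// melting points.
function parseMeltingPoint(raw) {
  if (/dec|subl/i.test(raw)) return null;            // e.g. "300 (dec.)"
  var s = raw.replace(/(\d)\s*-\s*(\d)/g, "$1 $2");  // split ranges like "52-54"
  var nums = s.match(/-?\d+(?:\.\d+)?/g);            // collect remaining numbers
  if (!nums) return null;                            // no numeric content
  var values = nums.map(Number);
  var sum = values.reduce(function (a, b) { return a + b; }, 0);
  return sum / values.length;                        // midpoint of a range
}
```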

For our purposes of prioritizing organic chemicals for solubility modeling and applications, we curated this initial dataset by collapsing redundant chemical compositions and excluding inorganics (including organometallics) and salts. We did retain organosilicon, organophosphorus and organoboron compounds. Because the primary key for all of our projects depends on ChemSpider IDs, all compounds were assigned CSIDs, by deposition in the ChemSpider database where necessary. SMILES were also provided for each entry, as well as a corresponding link to the Alfa Aesar catalog page. This curated collection contains 8739 entries.

For completeness, we thought it would be useful to merge the Alfa Aesar curated dataset with other collections for convenient federated searches. We thus added the Karthikeyan melting point dataset, which has been used in several cases to model melting point predictions. This dataset was downloaded from Cheminformatics.org. Although we were able to use most of the structures in that collection, a few hundred were left out because of difficulty in resolving some of the SMILES, perhaps related to differences in the algorithms used by OpenBabel and OpenEye. Hopefully this issue can be resolved in a simple way so that the whole dataset can be incorporated in the near future. The curated Karthikeyan collection contains 4084 entries.

Similarly, the smaller Bergstrom dataset was included after processing the original file into a curated collection of 277 drug molecules.

Finally, the melting point entries from the ChemInfo Validation sheet itself, generated by student contributions, were added, bringing the collection to its current total of 13,436 Open Data melting point values. We believe that this is currently the largest such collection and that it should facilitate the development of completely transparent and free models for the prediction of melting points. As we have argued recently, improved access to measured or predicted melting points is critical to predicting the temperature dependence of solubility.

In addition to providing the melting point data in tabular format, Andrew Lang has created a convenient web-based tool to explore the combined dataset. A drop-down menu at the top gives quick access to a specific compound and reports the average melting point as well as a link to the information source. In the case of an Alfa Aesar source, a link to the catalog is provided, where the compound can be conveniently ordered if desired.
In another type of search, a SMARTS string can be entered with an optional range limit for the melting points. In the following example, 14 hits are obtained for benzoic acid derivatives with melting points between 0 C and 25 C. Clicking on an image will reveal its source. (BTW, even if you don't know how to perform sophisticated SMARTS queries, simply looking up the SMILES for a substructure on ChemSpider or ChemSketch will likely be sufficient for most types of queries - for example, the benzoic acid SMILES OC(=O)c1ccccc1 works directly as a substructure query.)

Preliminary tests on a Droid smartphone indicate that these search capabilities work quite well.

Finally, I would like to thank Antony Williams, Andrew Lang and the people at Alfa Aesar (now added as an official sponsor), who contributed many hours to collecting, curating and coding the final product we are presenting here. We hope that this will be of value to researchers in the cheminformatics community for a variety of open projects where melting points play a role.


Friday, November 20, 2009

CAS curates strychnine m.p. - ChemInfo Class 9

What is going to distinguish chemistry databases as we move forward in this Web 2.0 world?

If I was unsure of it when I started teaching Chemical Information Retrieval 2 months ago, I certainly got my answer yesterday afternoon. Cristian Dumitrescu from CAS contacted me to discuss the problems I had encountered when attempting to use SciFinder to find the melting point of strychnine. He had read my blog post and wanted to make sure he understood the problem. So I had a conference call with him and a CAS colleague and I explained that several m.p. values corresponded to strychnine salts instead of the free base. They agreed to rectify the situation.

Apparently Cristian stays on top of what is being said about CAS products from various sources, including the blogosphere. I think that what will distinguish chemistry databases as we move forward is precisely this type of proactivity and responsiveness.

There is a plethora of databases out there to search for chemical information. Most of them contain surprisingly significant amounts of incorrect data. My students are in the process of demonstrating this with their assignment to find five sources for each of five properties of a chemical of their choice. When they are done in two weeks I'll post about it, perhaps as a top-10 list of the worst data points.

CAS is an example of a commercial database. But the same principle applies to free databases as well.

Consider the glatiramer acetate problem I reported on previously. ChemSpider immediately removed the entry because a random polymer was being incorrectly represented as a physical mixture of amino acids. As far as I know no other free databases have corrected the problem, although contact information for people running various databases was provided by Michael Kuhn and Egon Willighagen on FriendFeed.

I spoke with Cristian about this problem as well and he said he would look into it. A search for glatiramer acetate on SciFinder shows that there is currently still a problem: the text correctly explains that this is a polymer, but the empirical formula looks like a simple physical mixture of amino acids, with an extra H2O per unit that should not be there after amide formation. Still, this is minor compared to the problems I reported on previously - for example, there were no incorrectly calculated molecular properties, although the images did not represent the structure of the polymer.
This has been a good week for curation. Yesterday Nick successfully completed the evaluation of the stereochemistry of nargenicin and submitted the corrected SMILES to ChemSpider. Tony Williams has already incorporated the fix and now a search for nargenicin on ChemSpider gives just one entry.

Tony has provided several such puzzles for my students and a few are close to resolving the structures. The main problem is that the structures were entered into ChemSpider with at least one undefined stereocenter. Finding the correct structure from the primary literature can be very challenging for structures of this complexity but it certainly puts the chemical information retrieval methods I am teaching my students to good use.

The class itself was short - and covered mainly just details of student assignments - since we won't have much time for such housekeeping during the last class on December 3, 2009, which will be a workshop. Rajarshi Guha and Tony Williams will be my guest lecturers on that day.


Thursday, November 05, 2009

Sixth Cheminfo Retrieval class: What is the m.p. of strychnine?

It would seem to be a simple task to find the melting point of a well-known alkaloid like strychnine. Our quest to answer that question - and to find other simple properties - in class, using both freely available and commercial databases, reveals how treacherous the process can be. In the end we don't find an unambiguous answer, but we uncover enough information for many applications.

The take-home message is that chemists need to be constantly paranoid: their information - whether from their own lab or the most prestigious journals - can easily be wrong. Strategies such as finding multiple sources and investigating the experimental details provided in the primary sources are shown to reduce the uncertainty, but doing so is often neither easy nor quick.

Here is a summary of the lecture:

This is the lecture from the sixth Chemical Information Retrieval class at Drexel University on October 29, 2009. It starts with a review of some of the new questions answered by students from the chemistry publishing FAQ, which covers patent information and accessing electronic journals at Drexel. Tony Williams submitted a puzzle about resolving conflicting structures in ChemSpider which is too difficult to be a regular assignment: it requires re-analyzing spectroscopic data in the papers where the stereochemical assignments were determined. An example is paromomycin, which has three entries.

The regular assignment for the week is then introduced: obtaining 5 different sources each for 5 different properties of a molecule of the student's choosing. To demonstrate how to do the assignment, strychnine is chosen as an example. Melting point information is obtained from ChemSpider (ultimately an MSDS sheet), Wikipedia, Wolfram Alpha and, via SciFinder, a JACS article. By investigating primary sources, several errors are found in SciFinder, where the recorded melting points correspond to salts of the alkaloid. Difficulties in finding primary sources for the melting point from Wikipedia are highlighted. For LD50 information, Wikipedia did not even provide proper units (mg instead of mg/kg, with no animal or route specified). The usefulness of ChemSpider predicted values for density and boiling point as a corroborating tool is also demonstrated.

In the end, the reported melting point range of strychnine from the JACS paper did not even overlap with the reference to which it was compared. The exercise is meant to highlight the importance of caution in obtaining values from all available sources. Even the seemingly simple question of determining the melting point of a well-known alkaloid cannot be answered definitively.


Creative Commons Attribution Share-Alike 2.5 License