A few weeks ago, John Shirley - Global Marketing Manager at Alfa Aesar - contacted me to discuss the Chemical Information Validation results I posted from my 2010 Chemical Information Retrieval class. Our research showed that Alfa Aesar was the second most common source of chemical property information from the class assignment.
We explored some possible ways that we could collaborate. With our recent report of the use of melting point measurements to predict temperature solubility curves, the Alfa Aesar melting point data collection could prove immensely useful for our Open Notebook Science solubility project.
However, since we are committed to working transparently, the only way we could accept the dataset is if it were shared as Open Data. I am extremely pleased to report that Alfa Aesar has agreed to this requirement and we hope that this gesture will encourage other chemical companies to follow suit.
The initial file provided by Alfa Aesar did not store melting points in a database-ready format - it included ranges, non-numeric characters and entries reporting decomposition or sublimation. One of the benefits we could provide back to the company was cleaning up the melting point field to pure numerical values ready for sorting and other database processing. This processed collection contains 12986 entries. Note that these entries are not necessarily different chemical compositions since they refer to specific catalog entries with different purities or packaging.
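To give a feel for what this kind of cleanup involves, here is a minimal Python sketch. It is not the procedure we actually used: the function name `parse_melting_point`, the choice to take the midpoint of a range, and the choice to return nothing for decomposition or sublimation entries are all assumptions made for illustration.

```python
import re
from typing import Optional

# A range such as "121-123" or "-10 to -8": two numbers joined by a separator.
_RANGE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(?:-|–|to)\s*(-?\d+(?:\.\d+)?)")
# A single (possibly negative, possibly decimal) number anywhere in the text.
_SINGLE = re.compile(r"-?\d+(?:\.\d+)?")
# Assumed markers for entries that report decomposition or sublimation.
_NOT_A_MELTING_POINT = ("dec", "subl")


def parse_melting_point(raw: str) -> Optional[float]:
    """Return one melting point in degrees C, or None if no usable value exists."""
    text = raw.strip().lower()
    # Assumption: decomposition/sublimation entries are skipped, not converted.
    if any(marker in text for marker in _NOT_A_MELTING_POINT):
        return None
    match = _RANGE.search(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        # Assumption: a range is represented by its midpoint.
        return (low + high) / 2
    match = _SINGLE.search(text)
    return float(match.group()) if match else None


if __name__ == "__main__":
    for entry in ["121-123", "ca. 45°C", "250 dec.", "-10 to -8", "subl."]:
        print(repr(entry), "->", parse_melting_point(entry))
```

Whether a range should become a midpoint, a lower bound or two separate columns depends on how the values will be used downstream, so treat the midpoint here as just one reasonable option.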
For our purposes of prioritizing organic chemicals for solubility modeling and applications, we curated this initial dataset by collapsing redundant chemical compositions and excluding inorganics (including organometallics) and salts. We did retain organosilicon, organophosphorus and organoboron compounds. Because the primary key for all of our projects depends on ChemSpiderIDs, all compounds were assigned CSIDs by deposition in the ChemSpider database if necessary. SMILES were also provided for each entry, as well as a corresponding link to the Alfa Aesar catalog page. This curated collection contains 8739 entries.
For completeness, we thought it would be useful to merge the Alfa Aesar curated dataset with other collections for convenient federated searches. We thus added the Karthikeyan melting point dataset, which has been used in several cases to model melting point predictions. This dataset was downloaded from Cheminformatics.org. Although we were able to use most of the structures in that collection, a few hundred were left out because of some difficulty in resolving some of the SMILES, perhaps related to the differences in algorithms used by OpenBabel and OpenEye. Hopefully this issue will be resolved in a simple way and the whole dataset can be incorporated in the near future. This final curated collection contains 4084 entries.
Similarly, the smaller Bergstrom dataset was included after processing the original file to a curated collection of 277 drug molecules.
Finally, the melting point entries from the ChemInfo Validation sheet itself, generated by student contributions, were added, bringing the combined collection to a current total of 13,436 Open Data melting point values. We believe that this is currently the largest such collection and that it should facilitate the development of completely transparent and free models for the prediction of melting points. As we have argued recently, improved access to measured or predicted melting points is critical to the prediction of the temperature dependence of solubility.
In addition to providing the melting point data in tabular format, Andrew Lang has created a convenient web-based tool to explore the combined dataset. A drop-down menu at the top allows quick access to a specific compound and reports the average melting point as well as a link to the information source. In the case of an Alfa Aesar source, a link to the catalog is provided, where the compound can be conveniently ordered if desired.
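For anyone working from the tabular data rather than the web tool, the per-compound average can be reproduced with a few lines of Python. This is only a sketch and not Andrew's code: the row layout, the function name `summarize` and the ChemSpider IDs and values below are placeholders invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows: (ChemSpider ID, melting point in degrees C, source label).
# The IDs and values are invented for illustration, not taken from the dataset.
rows = [
    (1001, 122.0, "Alfa Aesar"),
    (1001, 123.0, "Karthikeyan"),
    (1002, 135.0, "Bergstrom"),
]


def summarize(rows):
    """Group rows by ChemSpider ID and report the mean melting point and sources."""
    grouped = defaultdict(list)
    for csid, melting_point, source in rows:
        grouped[csid].append((melting_point, source))
    return {
        csid: {
            "average_mp": mean(mp for mp, _ in entries),
            "sources": sorted({source for _, source in entries}),
        }
        for csid, entries in grouped.items()
    }


if __name__ == "__main__":
    for csid, info in summarize(rows).items():
        print(csid, info["average_mp"], info["sources"])
```

Keeping the source labels alongside the average makes it easy to trace any suspicious value back to where it came from.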
In another type of search, a SMARTS string can be entered with an optional range limit for the melting points. In the following example, 14 hits are obtained for benzoic acid derivatives with melting points between 0 °C and 25 °C. Clicking on an image will reveal its source. (BTW even if you don't know how to perform sophisticated SMARTS queries, simply looking up the SMILES for a substructure on ChemSpider or ChemSketch will likely be sufficient for most types of queries.)
Preliminary tests on a Droid smartphone indicate that these search capabilities work quite well.
Finally, I would like to thank Antony Williams, Andrew Lang and the people at Alfa Aesar (now added as an official sponsor) who contributed many hours to collecting, curating and coding for the final product we are presenting here. We hope that this will be of value to the researchers in the cheminformatics community for a variety of open projects where melting points play a role.
Jean-Claude, Andrew, others, congratulations!
May I request, though, clear statements on the open data nature, for example, via a CCZero waiver, e.g. at the download page?
Egon - sure we'll put a CC0 logo as we do with the solubility data
Great! I added a CKAN entry:
http://ckan.net/package/open-melting-point-data
Thanks Egon! I added my email under the "maintainer" if there are any questions.
It was interesting, and I might add, rather torturous work :-) The issues with SMILES should certainly be taken up with Noel as I think he would like to see what we were dealing with in regards to the structure set. Do you want to chat with him or shall I? There is a conversation going on right now about SMILES over at the Blue Obelisk Shapado site...
The web-based tool described in the text for exploring the combined dataset (http://lxsrv7.oru.edu/~alang/meltingpoints/) does not work (at least today, 05/22/2013).
Thanks for letting us know - that service is down for me as well - however the other services seem to be ok at http://onswebservices.wikispaces.com/