A few weeks ago, John Shirley, Global Marketing Manager at Alfa Aesar, contacted me to discuss the Chemical Information Validation results I posted from my 2010 Chemical Information Retrieval class. Our research showed that Alfa Aesar was the second most common source of chemical property information from the class assignment.
We explored some possible ways that we could collaborate. With our recent report of the use of melting point measurements to predict temperature solubility curves, the Alfa Aesar melting point data collection could prove immensely useful for our Open Notebook Science solubility project.
However, since we are committed to working transparently, the only way we could accept the dataset is if it were shared as Open Data. I am extremely pleased to report that Alfa Aesar has agreed to this requirement and we hope that this gesture will encourage other chemical companies to follow suit.
The initial file provided by Alfa Aesar did not store melting points in a database-ready format: it included ranges, non-numeric characters and entries reporting decomposition or sublimation. One of the benefits we could provide back to the company was cleaning up the melting point field to pure numerical values ready for sorting and other database processing. This processed collection contains 12986 entries. Note that these entries are not necessarily different chemical compositions, since they refer to specific catalog entries with different purities or packaging.
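To give a feel for the kind of clean-up described above, here is a minimal Python sketch. It is not the procedure we actually ran on the Alfa Aesar file: the function name `clean_melting_point`, the sample strings, and the choice to collapse a range to its midpoint are all assumptions made for illustration.

```python
import re
from typing import Optional

# A signed decimal number such as 122, -5 or 45.5
_NUM = r"-?\d+(?:\.\d+)?"
# Two numbers separated by a hyphen or the word "to", e.g. "120-122"
_RANGE = re.compile(rf"({_NUM})\s*(?:-|to)\s*({_NUM})")
_SINGLE = re.compile(_NUM)


def clean_melting_point(raw: str) -> Optional[float]:
    """Reduce a free-text melting point field to one number in deg C.

    Assumption: a range is replaced by its midpoint, and annotations
    such as "dec." or "subl." are simply ignored. Returns None when the
    field contains no number at all.
    """
    # Normalise the Unicode minus sign and en dash to a plain hyphen
    text = raw.replace("\u2212", "-").replace("\u2013", "-")

    match = _RANGE.search(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2

    match = _SINGLE.search(text)
    return float(match.group(0)) if match else None


# Hypothetical raw field values, not taken from the real file
samples = ["120-122", "ca. 250 dec.", "subl.", "-5 to -3"]
cleaned = [clean_melting_point(s) for s in samples]
```

Whether an entry flagged as decomposing or subliming should be kept at all is a curation decision that this sketch does not try to make.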
For our purposes of prioritizing organic chemicals for solubility modeling and applications, we curated this initial dataset by collapsing redundant chemical compositions and excluding inorganics (including organometallics) and salts. We did retain organosilicon, organophosphorus and organoboron compounds. Because all of our projects use ChemSpider IDs (CSIDs) as the primary key, all compounds were assigned CSIDs, by deposition in the ChemSpider database if necessary. SMILES were also provided for each entry, as well as a corresponding link to the Alfa Aesar catalog page. This curated collection contains 8739 entries.
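Collapsing catalog entries that share a chemical composition can be sketched as a grouping on the CSID. The snippet below is only an illustration under assumed field names (`csid`, `smiles`, `mp`, `url`); the IDs, values and URLs are placeholders, not real catalog data, and it does not attempt the exclusion of inorganics and salts.

```python
def collapse_by_csid(entries):
    """Merge catalog entries that share a CSID into one record each.

    Every merged record keeps the first SMILES seen for that CSID and
    collects all melting point values and catalog links for it.
    """
    compounds = {}
    for entry in entries:
        record = compounds.setdefault(
            entry["csid"],
            {
                "csid": entry["csid"],
                "smiles": entry["smiles"],
                "mp_values": [],
                "catalog_urls": [],
            },
        )
        record["mp_values"].append(entry["mp"])
        record["catalog_urls"].append(entry["url"])
    # dicts preserve insertion order, so compounds come out in the
    # order in which they were first seen
    return list(compounds.values())


# Placeholder entries: two purities of one compound plus another compound
catalog = [
    {"csid": 1001, "smiles": "OC(=O)c1ccccc1", "mp": 121.0,
     "url": "https://example.org/catalog/A1"},
    {"csid": 1001, "smiles": "OC(=O)c1ccccc1", "mp": 122.5,
     "url": "https://example.org/catalog/A2"},
    {"csid": 1002, "smiles": "c1ccc2ccccc2c1", "mp": 80.0,
     "url": "https://example.org/catalog/B1"},
]
curated = collapse_by_csid(catalog)
```

Keeping every measured value alongside the merged record, rather than picking one, leaves the choice of how to summarize them to whoever uses the data.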
For completeness, we thought it would be useful to merge the Alfa Aesar curated dataset with other collections for convenient federated searches. We thus added the Karthikeyan melting point dataset, which has been used in several cases to build melting point prediction models. This dataset was downloaded from Cheminformatics.org. Although we were able to use most of the structures in that collection, a few hundred were left out because some of the SMILES could not be resolved, perhaps because of differences between the algorithms used by OpenBabel and OpenEye. Hopefully this issue will be resolved in a simple way and the whole dataset can be incorporated in the near future. This final curated collection contains 4084 entries.
Similarly, the smaller Bergstrom dataset was included after processing the original file to a curated collection of 277 drug molecules.
Finally, the melting point entries from the ChemInfo Validation sheet itself, generated by student contributions, were added, bringing the combined collection to a current total of 13,436 Open Data melting point values. We believe that this is currently the largest such collection and that it should facilitate the development of completely transparent and free models for the prediction of melting points. As we have argued recently, improved access to measured or predicted melting points is critical to the prediction of the temperature dependence of solubility.
In addition to providing the melting point data in tabular format, Andrew Lang has created a convenient web-based tool to explore the combined dataset. A drop-down menu at the top allows quick access to a specific compound and reports the average melting point as well as a link to the information source. In the case of an Alfa Aesar source, a link to the catalog is provided, where the compound can be conveniently ordered if desired.
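For readers who prefer to work from the tabular data directly, the lookup described above (one compound, its average melting point, and where the values came from) might be sketched as follows. This is not Andrew's code; the function `lookup`, the row layout and the numbers are assumptions for illustration only.

```python
from statistics import mean


def lookup(name, collections):
    """Return the average melting point and sources for one compound.

    `collections` maps a source label to a list of rows, each row being
    a dict with a "name" and an "mp" in deg C. Returns None when the
    compound is absent from every collection.
    """
    values, sources = [], []
    for source, rows in collections.items():
        for row in rows:
            if row["name"].lower() == name.lower():
                values.append(row["mp"])
                sources.append(source)
    if not values:
        return None
    return {"name": name, "average_mp": mean(values), "sources": sources}


# Placeholder rows; the values are illustrative, not real measurements
collections = {
    "Alfa Aesar": [{"name": "benzoic acid", "mp": 121.0}],
    "Karthikeyan": [{"name": "Benzoic acid", "mp": 123.0}],
    "Bergstrom": [{"name": "naphthalene", "mp": 80.0}],
}
result = lookup("benzoic acid", collections)
```

Matching on the name is only for readability here; in the actual collection the CSID is the more reliable key.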
In another type of search, a SMARTS string can be entered with an optional range limit for the melting points. In the following example, 14 hits are obtained for benzoic acid derivatives with melting points between 0 °C and 25 °C. Clicking on an image will reveal its source. (BTW even if you don't know how to perform sophisticated SMARTS queries, simply looking up the SMILES for a substructure on ChemSpider or ChemSketch will likely be sufficient for most types of queries.)
Preliminary tests on a Droid smartphone indicate that these search capabilities work quite well.
Finally, I would like to thank Antony Williams, Andrew Lang and the people at Alfa Aesar (now added as an official sponsor) who contributed many hours to collecting, curating and coding for the final product we are presenting here. We hope that this will be of value to the researchers in the cheminformatics community for a variety of open projects where melting points play a role.
Labels: melting point, open data