Tuesday, March 22, 2011

Open modeling of melting point data

The contribution of Alfa Aesar melting point data to our open collection has made it possible to validate a significant portion of the dataset. However, this process of curation is never-ending. A good example is the discovery of an error in one of the sources for the melting point of warfarin. After David Weinberger posted about our melting point explorer, his brother Andy noticed the problem, which allowed us to fix it.

In a way, creating an open environment that makes it easy to find and report errors - as well as add new data - complicates scientific evaluation. To report a reproducible process and outcome, it is necessary to take a snapshot of the dataset. Choosing the exact composition of a dataset for a particular application is somewhat arbitrary. Besides choosing a threshold for excluding measurements that deviate too much, one may also exclude compounds based on their type.
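As a rough illustration of the kind of filtering choice described above, the following Python sketch groups measurements by compound, sets aside compounds whose reported values disagree by more than a threshold, and averages the rest. The function name, the 5 °C threshold, the "range of reported values" rule and the sample values are all assumptions made for this example; they are not the criteria used to build the archived datasets.

```python
from collections import defaultdict
from statistics import mean


def average_consistent(measurements, max_range_c=5.0):
    """Average melting points per compound.

    Compounds whose reported values span more than max_range_c degrees
    Celsius are excluded. The threshold and the rule are hypothetical.
    """
    by_compound = defaultdict(list)
    for compound, mp_c in measurements:
        by_compound[compound].append(mp_c)

    averaged = {}
    excluded = []
    for compound, values in by_compound.items():
        if max(values) - min(values) > max_range_c:
            excluded.append(compound)  # sources disagree too much
        else:
            averaged[compound] = mean(values)
    return averaged, excluded


# Made-up values for illustration only; not taken from any ONSMP dataset.
sample = [
    ("compound_a", 100.0),
    ("compound_a", 102.0),
    ("compound_b", 50.0),
    ("compound_b", 80.0),
    ("compound_c", 25.0),
]
averaged, excluded = average_consistent(sample)
print(averaged)
print(excluded)
```

Changing max_range_c changes which compounds survive, which is one reason a snapshot of the exact dataset matters for reproducibility.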

For the sake of clarity, we archived the various datasets we created from multiple sources, with brief descriptions of the filtering and merging at each step. From the perspective of an organic chemist, ONSMP013 is probably the most useful at this time. It contains averaged measurements for 12,634 organic compounds and excludes salts, inorganics and organometallics. The original file provided by Alfa Aesar, which contained several of these excluded compounds, can be obtained from ONSMP000. It might be interesting at some point to create a collection of melting points for inorganics or salts. We would welcome contributions of collections of melting points with different filters.

One of the advantages of ONSMP013 is that it is possible to generate CDK descriptors for each entry (and these are included in the spreadsheet). Because the descriptors are not generated with commercial software, the modeling is fully transparent and can be extended by anyone.

With this in mind, Andrew Lang has used ONSMP013 to generate a Random forest melting point model (MPM002). The most important descriptors turned out to be the number of hydrogen bond donors and the Topological Polar Surface Area (TPSA). The scatter plot below shows the correlation (R² = 0.79) between the predicted and experimental values. Color represents TPSA and marker size relates to the number of H-bond donors.
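For readers who want to check this kind of agreement statistic on their own predictions, here is a minimal Python sketch. The post does not say whether R² refers to the squared Pearson correlation or to the coefficient of determination; this sketch assumes the former. The function name is hypothetical and the numbers are toy values, not model output.

```python
def pearson_r_squared(experimental, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n

    # Sums of squares and cross-products about the means
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(experimental, predicted))
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")

    return (sxy * sxy) / (sxx * syy)


# Toy values for illustration only; not melting points from MPM002.
print(pearson_r_squared([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]))
```

A value of 1 means the predictions are perfectly linearly related to the experimental values; it does not by itself guarantee that they fall on the line of identity.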


Andrew Lang has described in much more detail the rationale for selecting the Random forest approach over a linear model in MPM001. He has also compared the performance of CDK descriptors with that of descriptors from a commercial program for a small set of drug melting points in MPM003.

The Random forest model (MPM002) is now also available as a web service, which is queried by entering the ChemSpider ID (CSID) of a compound in a URL. See this example for benzoic acid. If experimental results exist, they will appear on top, and a link to obtain the predicted melting point will appear underneath.

Note that the current web service for predicting melting points can be slow - it may take a minute to process.

Additional web services for melting point data will be listed on the ONS web services wiki.
