Wednesday, May 25, 2011

More Open Melting Points from EPI and other sources: on the path to ultimate curation

As recently as 2008, Hughes et al. published a paper asking: "Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR of Solubility, Melting Point, and Log P". From that paper:
"The question then is: why do QSPR models consistently perform significantly worse with regard to melting point? In the Introduction, we proposed three reasons for the failure of QSPR models: problems with the data, the descriptors, or the modeling methods. We find issues with the data unlikely to be the only source of error in Log S, Tm, and Log P predictions. Although the accuracy of the data provides a fundamental limit on the quality of a QSPR model, we attempted to minimize its influence by selecting consistent, high quality data... With regards to the accuracy of Tm and Log P data, both properties are associated with smaller errors than Log S measurement. Moreover, the melting point model performed the worst, yet it is by far the most straightforward property to measure...We suggest that the failure of existing chemoinformatics descriptors adequately to describe interactions in the crystalline solid phase may be a significant cause of error in melting point prediction."
Indeed, I have often heard that melting point prediction is notoriously difficult. This paper attempted to discover why and suggested that it is more likely that the problem is related to a deficiency in available descriptors rather than data quality. The authors seem to argue that taking a melting point is so straightforward that the resulting dataset is almost self-evidently high quality.

I might have thought the same before we started collecting melting point datasets.

It turns out that validating melting points can be very challenging and we have found enormous errors - even cases where the same compound in the same dataset is assigned very different melting points. Under such conditions it is mathematically impossible to obtain high correlations between predicted and "measured" values.

Since we have no additional information to go on (no spectral proof of purity, reports of heating rate, observations of melting behavior, etc.) the only way we can validate data points is to look for strong convergence from multiple sources. For example, consider the -130 C value for the melting point of ethanol (as discussed previously in detail). It is clearly an outlier from the very closely clustered values near -114 C.
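To make the convergence idea concrete, here is a minimal sketch of how an outlier like the -130 C ethanol value could be flagged automatically. It is not how the melting point explorer works; the function name, the 5 C tolerance and the clustered values near -114 C are all assumptions made up for illustration (only the -130 C outlier comes from the discussion above).

```python
from statistics import median

def flag_outliers(values_c, tolerance_c=5.0):
    """Pair each value with True if it lies more than tolerance_c from the median.

    The median is used as the reference point because a single wild value
    barely moves it, unlike the mean. The tolerance is an arbitrary assumption.
    """
    centre = median(values_c)
    return [(v, abs(v - centre) > tolerance_c) for v in values_c]

# Hypothetical collection: four closely clustered values plus the -130 C outlier.
ethanol_mp_c = [-114.1, -114.5, -114.0, -113.9, -130.0]
flags = flag_outliers(ethanol_mp_c)

for value, is_outlier in flags:
    print(value, "OUTLIER" if is_outlier else "ok")
```

A fixed tolerance like this only works when most values really do converge. For a compound where the sources split into several clusters, a human still has to decide which cluster, if any, to believe.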


This outlier value is now highlighted in red to indicate that it has been explicitly excluded from the calculation of the average. Andrew Lang has now updated the melting point explorer to provide a convenient way to select or deselect outliers and indicate a reason (service #3). For large separate datasets - such as the Alfa Aesar collection - this can be done right on the melting point explorer interface with a click. For values recorded in the Chemical Information Validation sheet, one has to update the spreadsheet directly.

This is the same strategy that we used for our solubility data - in that case by marking outliers with "DONOTUSE". This way, we never delete data, so anyone can question our decision to exclude a data point. Also, by not deleting data, meaningful statistical analyses of the quality of currently available chemical information can be performed for a variety of applications.
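The flag-but-never-delete strategy can be sketched in a few lines. This is only an illustration: the record layout, the source labels and the numeric values are hypothetical, and only the "DONOTUSE" marker itself comes from our solubility sheet.

```python
from statistics import mean

# Hypothetical records; every measurement is kept, excluded ones carry a flag.
records = [
    {"source": "source A", "mp_c": -114.0, "flag": ""},
    {"source": "source B", "mp_c": -114.2, "flag": ""},
    {"source": "source C", "mp_c": -114.4, "flag": ""},
    {"source": "source D", "mp_c": -130.0, "flag": "DONOTUSE"},
]

def usable_mean(rows):
    """Average only the unflagged values; also report how many rows were kept."""
    kept = [r["mp_c"] for r in rows if r["flag"] != "DONOTUSE"]
    return mean(kept), len(kept), len(rows)

average, n_kept, n_total = usable_mean(records)
print(f"mean of {n_kept}/{n_total} values: {average:.1f} C")
```

Because the flagged row is still in the list, anyone can recompute the average with it included, or count how often a given source ends up flagged.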

The donation of the Alfa Aesar dataset to the public domain was instrumental in allowing us to start systematically validating or excluding data points for practical or modeling applications. We have also just received confirmation that the entire EPI (PhysProp) melting point dataset can be used as Open Data. Many thanks to Antony Williams for coordinating this agreement and for approval and advice from Bob Boethling at the EPA and Bill Meylan at SRC.

In the best case scenario, most of the melting point values will quickly converge as in the ethanol case above. However, we have also observed cases where convergence simply doesn't happen.

Consider the collection of reported melting points for benzylamine.


One has to be careful when determining how many "different" values are in this collection. Identical values are suspicious since they may very well originate from the same ultimate source. Convergence for the ethanol value above is credible because most of the values are very close but not completely identical, suggesting truly independent measurements.

In this case the values actually diverge into clusters near +10 C, -10 C, -30 C and about -45 C. If you want to play the "trusted source" game, which do you trust more: the Sigma-Aldrich value at +10 C or the Alfa Aesar value at -43 C?
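The caution about identical values can be sketched as a simple count: collapse exact repeats before deciding how many independent reports there are. The list below is hypothetical and only loosely modeled on the benzylamine clusters; which sources share a value is an assumption for illustration.

```python
from collections import Counter

def summarize_values(values_c):
    """Return the distinct values (sorted) and any value reported more than once.

    Exact repeats are treated as possibly copied from one ultimate source,
    so each distinct value counts at most once toward independent support.
    """
    counts = Counter(values_c)
    distinct = sorted(counts)
    repeated = {value: n for value, n in counts.items() if n > 1}
    return distinct, repeated

# Hypothetical benzylamine-like collection (degrees C).
benzylamine_mp_c = [10.0, 10.0, -10.0, -30.0, -43.0, -46.0]
distinct, repeated = summarize_values(benzylamine_mp_c)

print("distinct values:", distinct)
print("possibly copied:", repeated)
```

Six entries collapse to five distinct values here, and none of the five agree with each other. That is the opposite of the ethanol case, where nearly all values are close but not identical.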

Let's try looking at the peer-reviewed literature. A search on SciFinder gives the following ranges:


The lowest melting point listed there is the +10 C value we already have in our collection, but these references are to other databases. The lowest value from a peer-reviewed paper is 37-38 C.

This is strange because I have a bottle of benzylamine in my lab and it is definitely a liquid. Investigating the individual references reveals a variety of errors. In one, benzylamine is listed as a product but from the context of the reaction it should be phenylbenzylamine:


(In a strange coincidence, the actual intermediate - benzalaniline - is the imine that Evan Curtain has synthesized recently in order to measure its solubility.)

In another example, the melting point of a product is incorrectly associated with the reactant benzylamine:

The erroneous melting points range all the way up to 280 C and I suspect that many of these are for salts of benzylamine, as I reported previously for the strychnine melting point results from SciFinder.

With no other obvious recourse from the literature to resolve this issue, Evan attempted to freeze a sample of benzylamine from our lab (UC-EXP265).


Unfortunately, the benzylamine sample proved to be too impure (<85% by NMR) and didn't solidify even down to -78 C. We'll have to try again with a much purer source. It would be useful to get reports from a few labs that happen to have benzylamine handy and can provide proof of purity by NMR and a pic to demonstrate solidification.

As most organic chemists will attest, amines are notorious for appearing as oils below their melting points in the presence of small amounts of impurities. I wonder if the divergence of melting points in this case is due to this effect. By providing NMR data from various samples subjected to freezing, it might be possible to quantify the effect of purity on the apparent freezing point. I think the images of the solidification are also important because some may mistake very high viscosity for actual formation of a solid. At -78 C we observed the sample to exhibit a viscosity similar to that of syrup.

Our model predicts a melting point of about -38 C for benzylamine and so I suspect that the values of -43 C and -46 C are most likely to be close to the correct range. Let's find out.

2 Comments:

At 12:23 AM, Blogger Egon Willighagen said...

"We suggest that the failure of existing chemoinformatics descriptors adequately to describe interactions in the crystalline solid phase may be a significant cause of error in melting point prediction."

Molecular descriptors, such as those calculated with the CDK, Dragon, JOELib, OpenTox, and many others, reflect properties of the molecule. There is no information in such descriptors about the solid. The crystal packing is one aspect that determines melting point, however. As such, molecular descriptors do not capture any information about that.

 
At 10:23 AM, Blogger Jean-Claude Bradley said...

Egon - you raise a very interesting point. You are right that 2D descriptors alone can't account for 3D crystal packing - but couldn't 3D descriptors tackle that? Nevertheless, using only 2D descriptors (from the CDK) Andy's random forest Model002 does pretty well (R2 0.8) when the dataset is highly curated ( http://onschallenge.wikispaces.com/MeltingPointModel002 ).

However, we do see prediction problems with some compounds like fumaric and maleic acids - Model002 predicts the same mp and these are significantly different experimentally.

 

Creative Commons Attribution Share-Alike 2.5 License