An initial evaluation by Andy found that the Alfa Aesar collection yielded better correlations with selected molecular descriptors than the Karthikeyan dataset (originally from MDPI), an open collection of melting points used by several researchers to build predictive melting point models. This suggested that the quality of the Alfa Aesar dataset might be higher.
Inspection of the Karthikeyan dataset did reveal some anomalies that may account for the poor correlations. First, there were several duplicates: identical compounds with different melting points, sometimes radically different (up to 176 C). A total of 33 duplicates (66 measurements) were found with a difference in melting points greater than 10 C (see the ONSMP008 dataset). Here are some examples.
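The duplicate check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used to build ONSMP008: the function name, the record layout and the sample values are all hypothetical, and it assumes each compound has already been reduced to a single canonical identifier so that identical compounds group together.

```python
from collections import defaultdict

def find_conflicting_duplicates(records, threshold=10.0):
    """Return {compound_id: (lowest_mp, highest_mp)} for compounds whose
    reported melting points differ by more than `threshold` degrees C.

    `records` is an iterable of (compound_id, melting_point_C) pairs.
    Assumption: compound_id is already a canonical identifier.
    """
    by_compound = defaultdict(list)
    for compound_id, mp in records:
        by_compound[compound_id].append(mp)

    conflicts = {}
    for compound_id, mps in by_compound.items():
        # Only compounds with at least two measurements can conflict.
        if len(mps) > 1 and max(mps) - min(mps) > threshold:
            conflicts[compound_id] = (min(mps), max(mps))
    return conflicts

# Made-up records for illustration only; not taken from any dataset.
records = [
    ("compound_A", 100.0),
    ("compound_A", 276.0),
    ("compound_B", 50.0),
    ("compound_B", 52.0),
    ("compound_C", 80.0),
]
conflicts = find_conflicting_duplicates(records)
print(conflicts)
```

The 10 C threshold is the one quoted above. In practice the hard part is the grouping step, since the same compound can appear under different SMILES strings.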
A second problem we ran into involved difficulty processing the SMILES in the Karthikeyan collection. Most of these involved SO2 groups. An attempt to view the following SMILES string in ChemSketch ends up with two extra hydrogens on the sulfur.
[S+2]([O-])([O-])(OCC#N)c1ccc(C)cc1
Other SMILES strings render with 5 bonds on a carbon, and ChemSketch draws these with a red X on the problematic atom. See for example this SMILES string:
O=C(OC=1=C2C=CC=CC2=NC=1c1ccccc1)C
Note that the sulfur compounds appear to render correctly on Daylight's Depict site:
In total 311 problematic SMILES from the Karthikeyan collection were removed (see ONSMP009).
With the accumulation of melting point sources, overlapping coverage is revealing likely incorrect values. For example, 5 measurements are reported for phenylacetic acid.
Four of the values cluster very close to 77 C and the other - from the Karthikeyan dataset - is clearly an outlier at 150 C.
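One simple way to flag this kind of outlier automatically is to compare each measurement with the median of all measurements for that compound. The sketch below is an illustration under stated assumptions: the function name and the source labels `source_1` to `source_4` are hypothetical, the four clustered values are placeholders near 77 C rather than the actual reported numbers, and the 5 C tolerance is an arbitrary choice.

```python
from statistics import median

def flag_outliers(measurements, tolerance=5.0):
    """Return the (source, mp) pairs that lie more than `tolerance`
    degrees C away from the median of all reported values."""
    center = median(mp for _, mp in measurements)
    return [(src, mp) for src, mp in measurements
            if abs(mp - center) > tolerance]

# Placeholder values: four near 77 C plus the 150 C Karthikeyan entry.
phenylacetic_acid = [
    ("source_1", 76.5),
    ("source_2", 77.0),
    ("source_3", 77.0),
    ("source_4", 77.5),
    ("Karthikeyan", 150.0),
]
outliers = flag_outliers(phenylacetic_acid)
print(outliers)
```

The median is used rather than the mean because a single wild value like 150 C would drag the mean well away from the cluster. This only works when the majority of sources agree and are truly independent.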
In order to predict the temperature dependence for the solutes in our database, Andy collected the EPI experimental melting points, which are listed under the predicted properties tab in ChemSpider (ultimately from the EPA). (There are predicted EPI values there too, but we only used the ones marked exp.)
This collection of 150 compounds was then listed in a spreadsheet (ONSMP010) and each entry was marked as having only an EPI value (44 compounds) or having at least one other measurement from another source (106 compounds). Of those having at least one more value, 10 showed significant differences (> 5 C) between the measurements. Upon investigation, many of these point strongly to the error lying with the EPI dataset. For example, the EPI melting point for phenyl salicylate is over 85 C higher than that reported by both Sigma-Aldrich and Alfa Aesar.
These preliminary results suggest that as much as 10% of the EPI experimental melting point dataset is significantly in error. Only a systematic analysis over time will reveal the full extent of the deficiencies.
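The arithmetic behind that estimate, using only the counts given above, works out as follows. The variable names are mine; the numbers are the ones from the ONSMP010 spreadsheet as described in this post.

```python
# Counts as stated in the post.
total_compounds = 150
epi_only = 44        # no second source, so these cannot be checked
multi_source = 106   # at least one other measurement available
discrepant = 10      # differ by more than 5 C between sources

# The two groups should account for every compound.
assert epi_only + multi_source == total_compounds

# Only compounds with a second source can reveal a discrepancy,
# so the rate is taken over those 106, not over all 150.
rate = discrepant / multi_source
print(f"{rate:.1%}")  # 9.4%
```

This is an upper bound on the EPI error rate for the checkable compounds, since not every one of the 10 discrepancies necessarily lies with EPI. It also says nothing about the 44 compounds with no second source.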
So far the Alfa Aesar dataset has not produced many outliers when other sources are available for comparison. However, even here, there are some surprising results. One of the most well studied organic compounds - ethanol - is listed with a melting point of -130 C by Alfa Aesar, clearly an outlier from the other values clustered around -114 C.
When downloading the Karthikeyan dataset from Cheminformatics.org, a Trust Level field indicates: "High - Original Author Data".
It would be nice if it were that simple. Unfortunately there are no shortcuts. There is no place for trust in science. The best we can do is to collect several measurements from truly independent sources and look for consensus over time. Where consensus is not obvious and information sources are exhausted, performing new measurements will be the only option left to progress.
The idea that a dataset has been validated - and can be trusted completely - simply because it is attached to a peer-reviewed paper is a dangerous one. This is perhaps the rationale used by projects such as Dryad, where datasets are not accepted unless they are associated with a peer-reviewed paper. Peer review was not designed to validate datasets - even if we wanted it to, reviewers don't typically have access to enough information to do so.
The usefulness of a measurement depends much more on the details in the raw data provided by following the chain of provenance (when available) than on where it is published. To be fair, in the case of melting point measurements, there really isn't that much additional experimental information to provide, except perhaps an NMR of the sample to prove that it was completely dry. In such a case, we have no choice but to use redundancy until a consensus number is finally reached.
Thanx for this post. This problem has long been known, but the impact is enormous. It is better if an organic chemist mentions it, as it will have more impact then than if a cheminformatician says it.
Thanks for the comment Egon - certainly the organic chemistry community has a vested interest in this, even if they don't participate in model building very often.
There still exists some error in the melting point dataset in the opennotebook Excel sheets for over 20 molecules.
M - which molecules?