A First General Solubility Model from ONS Challenge Data
After about a year, the Open Notebook Science Solubility Challenge has produced over 680 measurements, plus about 100 more from the literature. After averaging repeated measurements, discarding some erroneous results and considering only organic solids (so far all of our liquid solutes have proven to be miscible in our solvents), we are left with 244 unique values.
Andrew Lang has created a general model (Model003) to predict solubility based on molecular descriptors of both the solutes and the solvents. Previous models, such as Rajarshi Guha's Model002, were built only for selected solvents.
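As a rough illustration of the clean-up step described above, here is a minimal Python sketch that averages repeated measurements for the same solute and solvent pair and skips records flagged as erroneous. The record layout, the `flagged` set and all the numbers are made up for the example; this is not the actual script used on the Challenge data.

```python
from collections import defaultdict
from statistics import mean

def unique_values(measurements, flagged=frozenset()):
    """Average repeated (solute, solvent) measurements, skipping flagged record ids."""
    grouped = defaultdict(list)
    for record_id, solute, solvent, molar in measurements:
        if record_id in flagged:
            continue  # record judged erroneous, leave it out
        grouped[(solute, solvent)].append(molar)
    return {pair: mean(values) for pair, values in grouped.items()}

# Made-up records: (id, solute, solvent, concentration in M)
records = [
    (1, "solute_A", "methanol", 0.50),
    (2, "solute_A", "methanol", 0.70),
    (3, "solute_A", "toluene", 0.02),
    (4, "solute_B", "methanol", 9.99),  # pretend this one was flagged
]
averaged = unique_values(records, flagged={4})
```

Each key in `averaged` is one unique solute and solvent pair, which is the kind of value counted in the 244 above.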
Predictions can be made from this web page by entering the SMILES of the solute and optionally the SMILES, dipole moment and dielectric constant of any solvent (convenient sources for these are Wolfram Alpha and Wikipedia). Boc-glycine with diethyl ether as an optional solvent is shown here.
The prediction service then looks up the relevant molecular descriptors from the CDK (Chemistry Development Kit) and makes predictions for some common solvents, plus the optional one if requested.
If the name of the solute is entered, the service will also report all of the experimental measurements for that solute from the ONS Challenge, with links to the lab notebook pages.
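To make the idea of a combined solute and solvent model more concrete, here is a small Python sketch. It assumes a simple linear model over concatenated descriptors; the actual form of Model003 is not described in this post, and the solute descriptors, weights and intercept below are invented. The solvent values are approximate dipole moments and dielectric constants.

```python
def build_features(solute_desc, solvent_desc):
    """Concatenate solute and solvent descriptors into one input row."""
    return list(solute_desc) + list(solvent_desc)

def predict(features, weights, intercept):
    """Placeholder linear model; a stand-in, not the real Model003."""
    return intercept + sum(w * x for w, x in zip(weights, features))

# Hypothetical solute descriptors (stand-ins for values looked up from the CDK)
solute = [2.0, 1.0]
# Solvent descriptors: (dipole moment in D, dielectric constant), approximate
solvents = {"toluene": (0.36, 2.38), "methanol": (1.70, 32.7)}
weights = [0.1, 0.2, 0.3, 0.01]  # made-up coefficients
intercept = -0.5

predictions = {
    name: predict(build_features(solute, desc), weights, intercept)
    for name, desc in solvents.items()
}
# Solvents ordered from highest to lowest predicted value
ranked = sorted(predictions, key=predictions.get, reverse=True)
```

Because the solvent enters only through its descriptors, the same function can be called for a solvent that was never in the training set, which is what the optional solvent field relies on.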
There are a few objectives in making this public.
First, we think that it might provide some ideas about possible good or bad solvents for a given solute. The dataset is certainly not large enough to provide a truly general prediction of solubility in absolute terms. However, comparing relative values might be helpful in many cases. In the example above for Boc-glycine, the model predicts that toluene would be the poorest solvent, which matches the order of the experimental values, even if the absolute values are not a close match. DMSO, THF, methanol and ethanol are predicted to be good solvents, and this is reflected in the measurements.
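One way to check relative rather than absolute agreement is a rank correlation between predicted and measured values. The Python sketch below computes Spearman's rank correlation (the simple formula, valid when there are no ties) on invented numbers; these are not the Boc-glycine results.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for equal-length lists without ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up predicted and measured solubilities (M) for one solute
predicted = {"toluene": 0.1, "ethanol": 1.5, "THF": 2.0, "DMSO": 3.0}
measured = {"toluene": 0.02, "ethanol": 0.9, "THF": 2.5, "DMSO": 1.8}

names = list(predicted)
rho = spearman([predicted[n] for n in names], [measured[n] for n in names])
poorest_predicted = min(predicted, key=predicted.get)
poorest_measured = min(measured, key=measured.get)
```

A value of `rho` near 1 means the model orders the solvents much as the experiments do, even when the absolute numbers differ.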
Second, we want to make the model and data public so that other researchers with experience in this area can contribute their own models. We have been working with Marcin Wojnarski from TunedIT to make it much easier for models to be submitted. Andy has just converted our dataset to ARFF format and it is available here. We should have more to report on this shortly.
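For readers who have not seen ARFF before, it is a plain-text format with a header declaring the attributes followed by comma-separated data rows. The Python sketch below writes a tiny file of that shape. The attribute names, the solute placeholder and the concentration values are assumptions for illustration and do not reflect the layout of the actual dataset file.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text; attributes is a list of (name, type) pairs."""
    lines = [f"@RELATION {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@ATTRIBUTE {name} {kind}")
    lines += ["", "@DATA"]
    for row in rows:
        cells = []
        for (_, kind), value in zip(attributes, row):
            if kind == "STRING":
                # Quote strings so characters in SMILES cannot break the row
                escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
                cells.append("'" + escaped + "'")
            else:
                cells.append(str(value))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"

# Hypothetical column layout
attributes = [
    ("solute", "STRING"),
    ("solvent_smiles", "STRING"),
    ("solvent_dipole", "NUMERIC"),
    ("solvent_dielectric", "NUMERIC"),
    ("concentration_M", "NUMERIC"),
]
# Made-up rows; "CO" is the SMILES for methanol, "Cc1ccccc1" for toluene
rows = [
    ("solute_A", "CO", 1.7, 32.7, 0.6),
    ("solute_A", "Cc1ccccc1", 0.36, 2.38, 0.02),
]
arff_text = to_arff("solubility_example", attributes, rows)
```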
By using molecular descriptors from the solvents, we should be able to make predictions for solvent mixtures as well. At some point perhaps we can even include temperature.
The current model fits the measurements with this type of distribution:
If we are able to build models automatically in real time after the addition of each data point, we should be able to set up automatic solubility measurement requests to minimize the amount of work it takes to improve each model. This is a step in that direction.
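As a sketch of what an automatic measurement request could look like, the Python below picks the unmeasured solute and solvent pair on which a set of models disagree most, on the assumption that such a measurement is likely to be informative. This selection rule, the function name `next_request` and the toy models are all hypothetical; nothing like this has been implemented yet.

```python
from statistics import pstdev

def next_request(candidates, models, measured):
    """Return the unmeasured pair on which the models disagree most."""
    unmeasured = [pair for pair in candidates if pair not in measured]
    if not unmeasured:
        return None  # nothing left to request

    def spread(pair):
        # Standard deviation of the predictions across all models
        return pstdev([model(pair) for model in models])

    return max(unmeasured, key=spread)

# Two toy "models" backed by lookup tables of made-up predictions (M)
table_a = {("s1", "toluene"): 0.1, ("s1", "methanol"): 1.0, ("s2", "methanol"): 2.0}
table_b = {("s1", "toluene"): 0.2, ("s1", "methanol"): 1.9, ("s2", "methanol"): 2.6}
models = [table_a.get, table_b.get]

candidates = list(table_a)
measured = {("s1", "methanol")}  # already in the dataset, so never requested
choice = next_request(candidates, models, measured)
```

After each new data point the models would be rebuilt and the selection repeated, so the request list always reflects the current state of the dataset.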