Monday, September 28, 2009

A First General Solubility Model from ONS Challenge Data

After about a year, the Open Notebook Science Solubility Challenge has resulted in over 680 measurements, plus about 100 more from the literature. After averaging repeated measurements, discarding some erroneous results and considering only organic solids (so far all of our liquid solutes have proven to be miscible in our solvents), we are left with 244 unique values.

Andrew Lang has created a general model (Model003) to predict solubility based on molecular descriptors of both the solutes and solvents. Previous models, such as Rajarshi Guha's Model002, were built only for selected solvents.

Predictions can be made from this web page by entering the SMILES of the solute and optionally the SMILES, dipole moment and dielectric constant of any solvent (convenient sources for these are Wolfram Alpha and Wikipedia). Boc-glycine with diethyl ether as an optional solvent is shown here.
The prediction service then looks up the relevant molecular descriptors from the CDK and makes predictions for some common solvents and the optional one if requested.
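
To illustrate the idea of a single model that takes descriptors from both the solute and the solvent, here is a minimal sketch. It assumes a linear form, which the post does not state for Model003, and every descriptor value, coefficient and solvent name below is a made-up placeholder rather than anything taken from the CDK or from the actual model.

```python
# Hypothetical sketch: one linear model over solute + solvent descriptors.
# All numbers are illustrative placeholders, not Model003 parameters.

def feature_vector(solute_desc, solvent_desc):
    """Concatenate solute and solvent descriptors into one model input."""
    return list(solute_desc) + list(solvent_desc)

def predict(coefs, intercept, features):
    """Evaluate a linear model: intercept + sum(coef_i * feature_i)."""
    if len(coefs) != len(features):
        raise ValueError("coefficient/feature length mismatch")
    return intercept + sum(c * x for c, x in zip(coefs, features))

# Two made-up solute descriptors.
solute = [1.2, 3.0]

# Made-up solvent descriptors, e.g. (dipole moment, dielectric constant).
solvents = {
    "solvent_A": [2.0, 30.0],
    "solvent_B": [0.5, 2.5],
}

# Made-up coefficients: two for the solute part, two for the solvent part.
coefs = [0.1, -0.2, 0.3, 0.01]
intercept = 0.5

# The same model is reused for every solvent; only the solvent part changes.
predictions = {
    name: predict(coefs, intercept, feature_vector(solute, desc))
    for name, desc in solvents.items()
}

for name, value in predictions.items():
    print(name, round(value, 3))
```

Because the solvent enters only through its descriptors, adding a new solvent means supplying its descriptor values rather than fitting a new model.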

If the name of the solute is entered, the service will also report all of the experimental measurements for that solute from the ONS Challenge, with links to the lab notebook pages.

There are a few objectives in making this public.

First, we think that it might provide some ideas about possible good or bad solvents for a given solute. The dataset is certainly not large enough to provide a truly general prediction of solubility in absolute terms. However, comparing relative values might be helpful in many cases. In the example above for Boc-glycine, the model predicts that toluene would be the poorest solvent, which matches the order of the experimental values, even if the absolute values are not a close match. DMSO, THF, methanol and ethanol are predicted to be good solvents and this is reflected in the measurements.
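
One way to check relative agreement without relying on absolute values is a rank correlation between predicted and measured solubilities. The sketch below computes Spearman's rho with the no-ties formula. The numbers are placeholders chosen for illustration and are not ONS measurements or Model003 predictions.

```python
# Hypothetical sketch: compare the ordering of predictions and measurements.

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation using the no-ties formula (needs n >= 2)."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Placeholder values for four solvents, in the same solvent order.
predicted = [0.1, 1.5, 0.9, 2.0]
measured = [0.02, 0.8, 1.1, 3.0]

rho = spearman(predicted, measured)
print(round(rho, 3))
```

A rho close to 1 means the model orders the solvents much as the experiments do, even when the absolute numbers differ.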

Second, we want to make the model and data public so that other researchers with experience in this area can contribute their own models. We have been working with Marcin Wojnarski from TunedIT to make it much easier for models to be submitted. Andy has just converted our dataset to ARFF format and it is available here. We should have more to report on this shortly.
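
For readers unfamiliar with ARFF, it is a plain-text format with a header of `@relation` and `@attribute` lines followed by comma-separated rows under `@data`. The sketch below writes a tiny numeric table in that layout. The relation name, attribute names and values are hypothetical and do not reflect the columns of the actual ONS file, and the function does not handle quoting of string values.

```python
# Hypothetical sketch: serialize a small numeric table to ARFF text.

def to_arff(relation, attributes, rows):
    """Return ARFF text.

    attributes: list of (name, type) pairs, e.g. ("solubility", "numeric").
    rows: list of value lists, one value per attribute.
    Note: no quoting or escaping is done, so this suits numeric data only.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"

# Made-up attribute names and values.
attributes = [
    ("solute_descriptor", "numeric"),
    ("solvent_descriptor", "numeric"),
    ("solubility", "numeric"),
]
rows = [
    [1.2, 2.0, 0.92],
    [1.2, 0.5, 0.2],
]

arff_text = to_arff("solubility_demo", attributes, rows)
print(arff_text)
```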

By using molecular descriptors from the solvents, we should be able to make predictions for solvent mixtures as well. At some point perhaps we can even include temperature.

The current model fits the measurements with this type of distribution:

If we are able to build models automatically in real time after the addition of each data point, we should be able to set up automatic solubility measurement requests to minimize the amount of work it takes to improve each model. This is a step in that direction.



At 12:34 AM, Blogger Egon Willighagen said...

Happy to see a new model, but I am not quite convinced this model is any better than that of Rajarshi. Instead, I'd say use this model the same way you used Rajarshi's model.

Andrew, can you please provide prediction values for an independent test set? 44 test structures out of the 244 should be fine, and plot those. I would also like to hear the R^2 after y-randomization, and the RMSE for a simple model y_pred = y_mean_train.

I would also very much like to know if you can color the plot by class, to see if any of the four classes actually does worse than the others, like acetonitrile.

Moreover, this is an MLR model, no? Have you tried PLS or other modeling methods?
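
The three checks requested in this comment (a held-out test set of 44 out of 244, the RMSE of the trivial model y_pred = y_mean_train, and R^2 after y-randomization) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, and a one-descriptor least-squares line stands in for the real model, whose form and descriptors are not given here.

```python
import random

# Hypothetical sketch on synthetic data; not the ONS dataset or Model003.

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

rng = random.Random(0)

# Synthetic stand-in data: 244 points with a linear trend plus noise.
x = [i / 10 for i in range(244)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

# Independent test set: 44 randomly chosen points, the rest for training.
idx = list(range(len(x)))
rng.shuffle(idx)
test_idx, train_idx = idx[:44], idx[44:]
x_tr, y_tr = [x[i] for i in train_idx], [y[i] for i in train_idx]
x_te, y_te = [x[i] for i in test_idx], [y[i] for i in test_idx]

# Fit on the training set only, then score on the held-out points.
a, b = fit_line(x_tr, y_tr)
model_rmse = rmse(y_te, [a + b * xi for xi in x_te])

# Baseline: always predict the training-set mean.
y_mean_train = sum(y_tr) / len(y_tr)
baseline_rmse = rmse(y_te, [y_mean_train] * len(y_te))

# y-randomization: shuffle the training responses, refit, recompute R^2.
y_shuffled = y_tr[:]
rng.shuffle(y_shuffled)
a_r, b_r = fit_line(x_tr, y_shuffled)
r2_random = r_squared(y_shuffled, [a_r + b_r * xi for xi in x_tr])
r2_real = r_squared(y_tr, [a + b * xi for xi in x_tr])

print("test RMSE (model):   ", round(model_rmse, 3))
print("test RMSE (baseline):", round(baseline_rmse, 3))
print("train R^2 (real y):  ", round(r2_real, 3))
print("train R^2 (shuffled):", round(r2_random, 3))
```

A model worth keeping should beat the mean baseline on the test set, and its R^2 should collapse toward zero once the responses are shuffled. In practice the shuffle is usually repeated many times rather than once.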

At 12:17 PM, Blogger Hiro Sheridan said...

Egon, I'll see what I can do. I was hoping I could get you, Rajarshi, Noel, etc. to improve on it - I think you guys could do better but I think it works as a proof of concept. :)

I agree that the model is not better than Rajarshi's but different. It includes descriptors from both the solute and the solvent, allowing for the prediction of solubility values of any solute in any solvent.

I think that's pretty cool.

At 1:30 PM, Blogger Egon Willighagen said...

Yeah, I will try to find some time and improve the model. While there still is a lot of variance for the number of variables, I think at least PLS should do somewhat better...

I do like the single model for all solvents too!

At 9:21 AM, Blogger Jean-Claude Bradley said...

Egon - I am glad you like the all solvents model - in initial discussions with Rajarshi we thought it was too ambitious. Now we have more measurements, including solvent mixtures, so at least we can start working on it.



Creative Commons Attribution Share-Alike 2.5 License