Sunday, March 27, 2011

Towards the automated discovery of useful solubility applications

Last week, I came across (via David Bradley) a paper by an MIT group regarding the desalination of water using a very clever application of solubility behavior:
Anurag Bajpayee, Tengfei Luo, Andrew Muto and Gang Chen, "Very low temperature membrane-free desalination by directional solvent extraction", Energy Environ. Sci., 2011 (article, summary)
The technique simply involves heating saltwater with molten decanoic acid to 40-80 C. Some water dissolves into the decanoic acid, leaving the salt behind. The layers are then separated and, upon cooling to 34 C, sufficiently pure water separates out. Any traces of decanoic acid are inconsequential since this compound is already present in many foods at higher levels.

From a technological standpoint, I can't think of a reason why this solution could not have been discovered and implemented 100 years ago. It makes you wonder how many other elegant solutions to real problems could be uncovered by connecting the right pieces together.

To me, this is where the efforts of Open Science and the automation of the scientific process will pay off first. For this to happen on a global level, two key requirements must be met:
1) Information must be freely available, optimally as a web service (measurements if possible - otherwise a predicted value, preferably from an Open Model)
2) There has to be a significantly automated way of identifying what is important enough to be solved.
Since we have been working on fulfilling the first requirement for solubility data, I first looked at our available services to see if there was anything there that could have pointed towards this solution.

Although we have a measured (0.0004 M) and predicted (0.001 M) room temperature solubility of decanoic acid in water, our best prediction service can't do the opposite: the solubility of water in decanoic acid. For that we would need the Abraham descriptors for decanoic acid as a solvent and those are not yet available as far as I'm aware.
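
For readers unfamiliar with the Abraham approach, it is usually written as a linear free energy relationship, log10(S_solvent / S_water) = c + e·E + s·S + a·A + b·B + v·V, where E, S, A, B and V describe the solute and c, e, s, a, b and v are coefficients fitted for each solvent. The Python sketch below only shows how such a prediction would be assembled. The class and function names are my own for illustration, and no coefficients for decanoic acid are supplied, since those are exactly the missing piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoluteDescriptors:
    E: float  # excess molar refraction
    S: float  # dipolarity/polarizability
    A: float  # hydrogen-bond acidity
    B: float  # hydrogen-bond basicity
    V: float  # McGowan characteristic volume


@dataclass(frozen=True)
class SolventCoefficients:
    # Fitted per solvent; not available here for decanoic acid.
    c: float
    e: float
    s: float
    a: float
    b: float
    v: float


def log_partition(solute, solvent):
    """Return log10(S_solvent / S_water) from the Abraham equation."""
    return (
        solvent.c
        + solvent.e * solute.E
        + solvent.s * solute.S
        + solvent.a * solute.A
        + solvent.b * solute.B
        + solvent.v * solute.V
    )


def log_solubility_in_solvent(log_s_water, solute, solvent):
    """Shift a known aqueous log10 solubility into the target solvent."""
    return log_s_water + log_partition(solute, solvent)
```

Given solute descriptors and a known aqueous solubility, the only extra input needed for a new solvent is its six coefficients, which is why a missing coefficient set blocks the whole prediction.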

Also, we use a model to predict solubility at different temperatures - but it assumes that the solute is miscible with the solvent at the solute's melting point. This is probably a reasonable assumption for the most part but it fails when the solute and the solvent are radically dissimilar (e.g. water/hydrophobic organic compounds). In this particular application, decanoic acid melts at 31 C and the process occurs in the 34-80 C range.
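
To make the miscibility assumption concrete, here is a minimal Python sketch. It assumes a van 't Hoff-style straight line of ln(x) against 1/T, anchored at one known mole-fraction solubility and forced to x = 1 at the solute's melting point. This functional form is my assumption for illustration and may differ from the model behind the actual service; the function name is hypothetical as well.

```python
import math


def mole_fraction_at(temp_c, ref_temp_c, ref_mole_fraction, melting_point_c):
    """Estimate solute mole fraction at temp_c.

    ln(x) is interpolated linearly in 1/T between the reference point
    (ref_temp_c, ref_mole_fraction) and x = 1 at the melting point.
    """
    if not 0.0 < ref_mole_fraction <= 1.0:
        raise ValueError("reference mole fraction must be in (0, 1]")
    if ref_temp_c >= melting_point_c:
        raise ValueError("reference temperature must be below the melting point")
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    t_m = melting_point_c + 273.15
    if t <= 0.0:
        raise ValueError("temperature must be above absolute zero")
    if t >= t_m:
        # The assumption under discussion: fully miscible once molten.
        return 1.0
    fraction = (1.0 / t - 1.0 / t_m) / (1.0 / t_ref - 1.0 / t_m)
    return math.exp(math.log(ref_mole_fraction) * fraction)


if __name__ == "__main__":
    # Decanoic acid in water, using figures from this post (0.0004 M, m.p. 31 C).
    # Assumed here: "room temperature" = 25 C and about 55.5 mol of water per litre.
    x_ref = 0.0004 / 55.5
    for temp in (25.0, 28.0, 31.0):
        print(temp, mole_fraction_at(temp, 25.0, x_ref, 31.0))
```

Under this sketch the estimate climbs to complete miscibility at 31 C, whereas the desalination process depends on water and molten decanoic acid remaining two separate layers well above that temperature.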

But even if we had the necessary models (and corresponding web services) for the decanoic acid/water/NaCl system, could it have been flagged in an automated way as being potentially "useful" or even "interesting"?

For utility assessment, humans are still the best source. Luckily, they often record this information tagged with common phrases in the introductory paragraphs of scientific documents. (In fact, this is the origin of the UsefulChem project.) For example, if we run a Google search for "there is a pressing need for" AND solubility, most of the results provide reasonable answers to the question of what a useful application of solubility might be. I have summarized the initial results in this sheet.
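
The same phrase matching can be approximated locally once candidate documents are in hand. The following is only a sketch in Python using the standard `re` module. The function names are my own for illustration, and the sample text pairs the Macromolecules quote discussed in this post with a made-up second sentence.

```python
import re

NEED_PHRASE = "there is a pressing need for"
# Capture what follows the phrase, up to the end of the clause.
NEED_PATTERN = re.compile(
    re.escape(NEED_PHRASE) + r"\s+([^.;\"\n]+)", re.IGNORECASE
)


def is_candidate(text, keyword="solubility"):
    """Mimic the search: the phrase AND the keyword must both occur."""
    lowered = text.lower()
    return NEED_PHRASE in lowered and keyword.lower() in lowered


def extract_needs(text):
    """Return the stated needs that follow each occurrence of the phrase."""
    return [match.group(1).strip() for match in NEED_PATTERN.finditer(text)]


# First sentence quoted from the post; second sentence is hypothetical.
sample = (
    "there is a pressing need for new materials for efficient CO2 separation. "
    "Gas solubility in the material is one property of interest."
)
if is_candidate(sample):
    print(extract_needs(sample))
```

A human still has to map each extracted need onto a problem type and a modeling challenge; the sketch only automates collecting the raw statements.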

The first result is:
"there is a pressing need for new materials for efficient CO2 separation" from a Macromolecules article in 2005. The general problem needing solving would correspond to "global warming/CO2 sequestration" and the modeling challenge would be "gas solubility".

Analyzing the first 9 results in this way gives us the following problem types:
  1. global warming/CO2 sequestration
  2. fire control
  3. global warming/refrigeration fluid
  4. AIDS prevention
  5. Iron absorption in developing countries
  6. agriculture/making phosphate from rock bioavailable
  7. water treatment/flocculation
  8. natural gas purification/environmental
  9. waste water treatment
and the following modeling challenges:
  1. gas solubility
  2. polymer solubility
  3. hydrofluoroether solubility
  4. solubility of drug in gels
  5. inorganics
  6. inorganics/pH dependence of solubility
  7. polymer solubility/flocculation/colloidal dispersions
  8. gas solubility
  9. inorganics
These preliminary results are instructive. The problem types are broad and varied - and I think they will be helpful to keep in mind as we continue to work on solubility. The modeling challenges can be compared directly with our existing services - and none of them overlap at this time! All of these involve either gases, polymers, gels, salts, inorganics or colloids while our services are strictly for small, non-ionic organic compounds in liquid solvents.
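
The comparison above can also be kept as data so that it is easy to rerun as more search results are tagged. This is a small Python sketch using the nine results listed in this post. The `SUPPORTED` label is my own shorthand for what our services currently cover, and splitting combined challenges on "/" is an assumption about how to count them.

```python
from collections import Counter

# (problem type, modeling challenge) pairs, in the order listed above.
RESULTS = [
    ("global warming/CO2 sequestration", "gas solubility"),
    ("fire control", "polymer solubility"),
    ("global warming/refrigeration fluid", "hydrofluoroether solubility"),
    ("AIDS prevention", "solubility of drug in gels"),
    ("Iron absorption in developing countries", "inorganics"),
    ("agriculture/making phosphate from rock bioavailable",
     "inorganics/pH dependence of solubility"),
    ("water treatment/flocculation",
     "polymer solubility/flocculation/colloidal dispersions"),
    ("natural gas purification/environmental", "gas solubility"),
    ("waste water treatment", "inorganics"),
]

# Hypothetical tag describing current service coverage.
SUPPORTED = {"small non-ionic organic compounds in liquid solvents"}


def challenge_counts(results):
    """Count each modeling challenge, splitting combined entries on '/'."""
    counts = Counter()
    for _problem, challenge in results:
        counts.update(part.strip() for part in challenge.split("/"))
    return counts


counts = challenge_counts(RESULTS)
overlap = set(counts) & SUPPORTED

for challenge, n in counts.most_common():
    print(n, challenge)
print("covered by existing services:", sorted(overlap))
```

Ranking the challenges by frequency gives a rough priority order for which kinds of solubility models a federated service would need first.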

Part of the reason for our focus on these types of compounds relates to our ulterior objective of assessing and synthesizing drug-like compounds. But a more important consideration is what type of information is available and what can be processed with cheminformatics tools. Currently most cheminformatics tools deal only with organic chemicals, with essential sources such as ChemSpider and the CDK providing measurements, models, descriptors, etc.

Even though some inorganic compounds are on ChemSpider, most of the properties are unavailable. Consider the example of sodium chloride:

This doesn't mean that the situation is hopeless but it does make the challenge much more difficult. Solubility measurements and models for inorganic salts do exist (for example see Abdel-Halim et al.) but they are much more fragmented.

With the feedback we obtain from this search phrase approach - and hopefully help from experts in the chemistry community - we can piece together a federated service to provide reasonable estimates for most types of solubility behavior.

I think that this desalination solution will prove to be a good test for automated (or at least semi-automated) scientific discovery in the realm of open solubility information. In order to pass the test, the phrase searching algorithm should eventually identify desalination as a "useful problem to solve" and should connect it with the predicted behavior of water/NaCl/decanoic acid (or another similar compound).

Luckily we have Don Pellegrino on board. His expertise on automated scientific discovery should prove quite valuable for this approach.


At 2:13 AM, Blogger Egon Willighagen said...

Great post!

Have you considered setting up a competition like the one for solubility measurements, but then for missing CDK pieces? Like the Abraham descriptors?

The fact that I don't have time to work on these descriptors right now doesn't mean I cannot assist in supervising students working on such CDK code, as I support the community like this already...

At 6:00 AM, Blogger Jean-Claude Bradley said...

Egon - we don't have any prize money for missing CDK pieces, like we do for solubility (via RSC currently). But of course there is an "open crowdsourcing call" for anyone to contribute on anything we work on. I try to talk to my students about these things as I learn their interests and skills. I will definitely ask them to help with the phrase searching and identifying modeling challenges and problem types (maybe we can even develop a formal ontology for these). I appreciate all your help so far!



Creative Commons Attribution Share-Alike 2.5 License