Sunday, July 25, 2010

General Transparent Solubility Prediction using Abraham Descriptors

Making solubility estimates for most organic compounds in a wide range of solvents freely available has always been a central long-term objective of the Open Notebook Science Solubility Challenge. With current expertise and technology, it should be as easy to obtain a solubility estimate as it is to get driving directions off the web.

Obviously this won't be attained purely by exhaustive measurement, although we have focused on strategic measurements over the past two years. In parallel, we have been continually evaluating the available solubility models for suitability.

Although there are several solubility models available for non-aqueous solvents, our additional requirement for transparent model building has proved surprisingly difficult to satisfy.

From this search, the Abraham solubility model [Abraham2009] floated to the top, an important factor being that Abraham has made available extensive compilations of descriptors for solutes and solvents. In addition, the algorithms used to convert solubility measurements into Abraham descriptors (requiring a minimum of 5 different solvents per solute) have allowed us to generate our own Abraham descriptors automatically, simply by recording new measurements in our SolSum Google Spreadsheet. These descriptors can be obtained in real time as well.
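
To make the later discussion concrete, here is a minimal sketch of the Abraham linear free energy relationship in Python. All numbers below are placeholders for illustration, not the fitted coefficients behind the services described in this post.

```python
# Sketch of the Abraham linear free energy relationship (LFER):
#   log10(S_solvent / S_water) = c + e*E + s*S + a*A + b*B + v*V
# E, S, A, B, V are solute descriptors; c, e, s, a, b, v are
# solvent-specific coefficients obtained by regression.

def log_solubility_ratio(solute, solvent):
    """Predicted log10(S_solvent / S_water) from Abraham descriptors."""
    return (solvent["c"]
            + solvent["e"] * solute["E"]
            + solvent["s"] * solute["S"]
            + solvent["a"] * solute["A"]
            + solvent["b"] * solute["B"]
            + solvent["v"] * solute["V"])

# Placeholder values for illustration only -- not fitted coefficients.
solute = {"E": 0.73, "S": 0.90, "A": 0.59, "B": 0.40, "V": 0.93}
solvent = {"c": 0.3, "e": -0.1, "s": -0.6, "a": 0.1, "b": -3.2, "v": 3.5}
print(log_solubility_ratio(solute, solvent))
```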

This approach permitted us to provide predictions for a limited number of solutes in a wide range of solvents and we have included these predictions in the past two editions (2nd and 3rd) of the ONS Challenge Solubility Book.

Coming at the problem from a different approach, Andrew Lang has also been trying to predict solubility using only open molecular descriptors, mainly relying on the CDK. Since our most commonly used solvent has been methanol, Andy recently generated a web service to predict solubility in that solvent.

By combining these two approaches, Andy has now created a modeling system that not only predicts solubility in a wide range (70+) of solvents but also provides related data that can be used to model other phenomena, such as the intestinal absorption of a drug or crossing of the blood-brain barrier. [Stovall 2007]

The idea is to use a Random Forest approach to select freely available descriptors and use them to predict the Abraham descriptors of any solute. A separate service then generates predicted solubilities for a wide range of solvents based on these Abraham descriptors. I'm using the term "freely available" because, although the CDK descriptors and VCCLab services are open, the model requires two descriptors that are available only from ChemSpider (ultimately from ACD/Labs).
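
To make the idea concrete, here is a minimal sketch of how such descriptor prediction might be set up with scikit-learn. The file name, column names and parameters are hypothetical; this is not Andy's actual pipeline, just the general shape of a Random Forest regression per Abraham descriptor.

```python
# Sketch: train one Random Forest per Abraham descriptor (E, S, A, B, V)
# from a table of open molecular descriptors. The file "training_set.csv"
# and its column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv("training_set.csv")      # one row per solute
feature_cols = [c for c in data.columns if c.startswith("desc_")]

models = {}
for target in ["E", "S", "A", "B", "V"]:
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(data[feature_cols], data[target])
    models[target] = rf

# Predict Abraham descriptors for a new solute, then feed them into a
# solvent-by-solvent LFER (as sketched above) to get predicted solubilities.
new_solute = data[feature_cols].iloc[[0]]   # stand-in for a new compound
predicted = {t: float(m.predict(new_solute)[0]) for t, m in models.items()}
print(predicted)
```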

Here is an example with benzoic acid. As long as the common name resolves to a single entry on ChemSpider, entering it is enough: the rest of the fields are populated automatically and then used by the service to generate the Abraham descriptors.

Hitting the prediction link above will automatically populate the second service and generate predicted solubilities for over 70 solvents.

This approach of allowing people to access these components separately can be useful. It can be instructive to play with the Abraham descriptors directly and see how the predicted solubilities are affected. There are also situations where one already has experimentally determined Abraham descriptors for a solute and wants to bypass the descriptor prediction step.

However, for those who prefer to cut to the chase, a convenient web service is available where the common name (or SMILES) of the solute is entered and the list of available solvents appears as a drop-down menu.

Now here is where I think the real payoff comes for accelerating science with openness. Andy has also created a web service that returns the predicted solubility in molar (mol/L) as a bare number, given common names (or SMILES) for the solute and solvent in the URL. For example, click this for benzoic acid in methanol. The advantage is that solubility prediction can then be integrated as a web service call from intuitive interfaces such as a Google Spreadsheet, enabling even non-programmers to make use of the data. Notice that the web service provided in the fourth column, which returns the average of the measured solubility values, offers an easy way to check the accuracy of specific predictions.
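
For anyone wanting to script against such a service, a minimal sketch of a URL-based call follows. The URL pattern is hypothetical - substitute the address of the actual service.

```python
# Sketch of calling a URL-based prediction service. The URL below is a
# placeholder, and we assume the service returns a bare number in mol/L.
import urllib.parse
import urllib.request

def predicted_solubility(solute, solvent):
    """Return the predicted solubility (mol/L) as a float."""
    query = urllib.parse.urlencode({"solute": solute, "solvent": solvent})
    url = "http://example.org/solubility/predict?" + query  # hypothetical
    with urllib.request.urlopen(url) as response:
        return float(response.read().decode().strip())

print(predicted_solubility("benzoic acid", "methanol"))
```

From a Google Spreadsheet, the equivalent would be a single IMPORTDATA formula pointing at the same URL, which is what makes this accessible to non-programmers.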

Such web services could also be integrated with data from ChemSpider or with custom systems. If those who use these services feed their processed data back to the open web, it could take us a step closer to automated reaction design - consider, for example, the custom application to select solvents for the Ugi reaction. Model builders could also use the web services for predicted and measured solubility directly.

A while back we explored using Taverna workflows (shared on MyExperiment) to create virtual libraries of SMILES. Unfortunately we ran into issues getting workflows developed on Macs to run on our PCs. This might be worth revisiting as a means of filtering virtual libraries through different thresholds of predicted solubility, as sketched below.
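
The core filtering step is simple enough to sketch in a few lines, reusing the hypothetical predicted_solubility() helper from above:

```python
# Sketch: filter a virtual library of SMILES by predicted solubility,
# using the hypothetical predicted_solubility() helper sketched earlier.
def filter_library(smiles_list, solvent="methanol", threshold=0.1):
    """Keep candidates predicted to dissolve above `threshold` (mol/L)."""
    return [smi for smi in smiles_list
            if predicted_solubility(smi, solvent) >= threshold]
```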

Andy has described his model in detail in a fully transparent way - the model itself, how it was generated and the entire dataset can be found here. We would welcome improvements to the model, as well as completely new models built from our dataset using only freely available tools.

It should be noted that when I use the term "general", it refers to the ability of the model to generate a number for most compounds listed on ChemSpider. Obviously, compounds that most closely resemble the training set are more likely to generate better estimates. Because of our synthetic objectives involving the Ugi reaction, we have mainly focused on collecting solubility data for carboxylic acids, aldehydes and amines, either from new measurements or from the literature.

Another important point concerns the main intended application of the model: organic synthesis. Generally, the range of interest for such applications is about 0.01-3 M. This might be very different for other applications, such as the aqueous solubility of a drug, where distinctions between much lower solubilities may be important.

For a typical organic synthesis, a solubility of 0.001 M or 0.005 M will probably translate to effectively insoluble. This might be a desired property for a product intended to be isolated by filtration. At the other end of the scale, knowing whether a solubility is 4 M or 6 M will not usually have an impact on reaction design; it is enough to know that a reactant will have good solubility in a particular solvent.

Given the above considerations about intended applications, and the likelihood that the current model is far from optimized, the predictions should be used cautiously. We suggest that the model is best used as a "flagging device". For example, if a reaction is to be carried out at 0.5 M, one might place a threshold at 0.4 M on the predicted values for reactants during solvent selection, recognizing that a predicted 0.4 M may correspond to an actual 0.55 M. A similar threshold approach can be used for the product, where in this case the lowest solubility is desired. A practical example of this is the shortlisting of solvent candidates for the Ugi reaction.
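
As a sketch of this flagging strategy, again reusing the hypothetical predicted_solubility() helper from earlier, solvent shortlisting might look like this:

```python
# Sketch of the "flagging device": shortlist solvents where every reactant
# is predicted soluble above one threshold and the product below another.
# predicted_solubility() is the hypothetical web-service call sketched above.
REACTANT_MIN = 0.4   # M, for a reaction to be run at 0.5 M
PRODUCT_MAX = 0.01   # M, so that the product precipitates

def shortlist_solvents(reactants, product, solvents):
    keep = []
    for solvent in solvents:
        reactants_ok = all(
            predicted_solubility(r, solvent) >= REACTANT_MIN
            for r in reactants)
        product_ok = predicted_solubility(product, solvent) <= PRODUCT_MAX
        if reactants_ok and product_ok:
            keep.append(solvent)
    return keep
```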

Another example of flagging involves identifying the outliers in the model. These can be inspected for experimental errors and possibly remeasured. Alternatively outliers may shed light on the limitations of the model. For example we have found that the solubility of solutes with melting points near room temperature can be greatly underestimated by the current model. This may be an opportunity to develop other models which incorporate melting point or enthalpy of fusion.[Rohani 2008]

Although it is possible that better models and more data will improve the accuracy of the predictions, this can only be true if the training set itself is accurate enough. Based on conversations I've had with researchers who deal with solubility, on reading modeling papers, and on our own experience with the ONS Challenge, I am starting to suspect that much of the available data just isn't accurate enough for high-precision modeling. Models built from literature data are, I think, especially vulnerable. Take a look at this unsettling comparison between new measurements and literature values (not to mention the model) for common compounds. [Loftsson 2006] Here is a subset:

I have also made the point in detail for the aqueous solubility of EGCG. Could this be the reason that so many different solubility models, based on different physical chemistry principles, have evolved and continue to co-exist?

The situation reminds me a lot of the discussions taking place in the molecular docking community.[Bissantz 2010] The differences in calculated binding energies are often small in comparison with the uncertainties involved. But docking can still be used as one tool among others to find drug candidates by flagging a collection of compounds above a certain threshold binding energy.


Wednesday, July 21, 2010

Resveratrol Thesis on Reaction Attempts

A few days ago Andrew Lang suggested to Dustin Sprouse that he submit his thesis to the Reaction Attempts database. Like many undergraduates, Dustin put a lot of time and effort into doing experiments and writing up his results, but didn't have quite enough time to obtain everything that would have been required for a traditional publication.

A thesis is an unusual document within the context of scientific communication. Unlike a peer-reviewed paper, it may contain a large number of "failed experiments" and a substantial amount of speculation. Although it is not quite as detailed as a lab notebook, there is often plenty of raw data and detail about how failed or ambiguous experiments proceeded.

In Dustin's case we felt that there was enough information provided to include his thesis in Reaction Attempts. In addition, his thesis was accepted by Nature Precedings, thus providing a convenient means of citation.

The first component of the Reaction Attempts project is to quickly abstract the most basic information from synthetic organic chemistry reactions. This includes the ChemSpiderIDs and SMILES of the reactants and target products, and brief notes about conditions and outcomes. We are especially interested in failed or ambiguous experiments because these have almost no chance of being communicated and indexed in the traditional systems. When attempting to carry out a reaction, it can be just as useful to know what doesn't work - and more specifically how it doesn't work.
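
To illustrate roughly what gets abstracted, here is a sketch of a single record as a Python dictionary. The field names, identifiers and values are illustrative, not the actual database schema.

```python
# Sketch of the minimal record abstracted per reaction attempt.
# CSIDs, SMILES and field names below are placeholders for illustration.
reaction_attempt = {
    "reactants": [
        {"csid": 111111, "smiles": "O=Cc1ccccc1"},
        {"csid": 222222, "smiles": "NCc1ccccc1"},
    ],
    "target_product": {"csid": 333333, "smiles": "CC(=O)NCc1ccccc1"},
    "conditions": "methanol, room temperature, 24 h",
    "outcome": "ambiguous - precipitate formed, product not confirmed by NMR",
    "citation": "thesis section and Nature Precedings DOI",
}
```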

The second component of the project is dissemination. Because the information is encoded semantically, it can be automatically converted to both human- and machine-readable formats.

One human interface consists of a PDF book (also available as a hard copy), with the option of selecting specific reactions by listing the CSIDs of reactants in the URL. For example, Dustin's reactions can be presented selectively here. We also have a Reaction Explorer, where reactants or products can be selected from a drop-down menu or via a substructure search.


We also provide live XML feeds so that others can easily create applications from machine-readable data. For example, one could assemble reaction chains automatically - something that becomes possible whenever we enter reactions from multi-step syntheses like Dustin's route toward resveratrol.
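
As a sketch of what a consumer of such a feed might do, the snippet below chains reactions by matching the product of one step to a reactant of another. The XML element and attribute names are invented for illustration; the real feed format may differ.

```python
# Sketch: build reaction chains from a machine-readable feed by matching
# the product of one reaction to a reactant of another. Element and
# attribute names are invented; the real feed may differ.
import xml.etree.ElementTree as ET

SAMPLE = """<reactions>
  <reaction id="1"><reactant csid="A"/><reactant csid="B"/><product csid="C"/></reaction>
  <reaction id="2"><reactant csid="C"/><reactant csid="D"/><product csid="E"/></reaction>
</reactions>"""

def load_reactions(xml_text):
    root = ET.fromstring(xml_text)
    return [{"reactants": {r.get("csid") for r in rxn.iter("reactant")},
             "product": rxn.find("product").get("csid")}
            for rxn in root.iter("reaction")]

def chains(reactions):
    """Yield (i, j) index pairs where reaction i's product feeds reaction j."""
    for i, a in enumerate(reactions):
        for j, b in enumerate(reactions):
            if i != j and a["product"] in b["reactants"]:
                yield (i, j)

print(list(chains(load_reactions(SAMPLE))))   # -> [(0, 1)]
```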

I know that Peter Murray-Rust has been very active in automatically abstracting information from chemistry theses. It would be interesting to see how that approach would work for this thesis, especially for the failed experiments. Manually reducing a page or two of text to only the most salient bits of information required a level of judgement that I imagine would be tricky to reproduce automatically.


Sunday, July 11, 2010

Secrecy in Astronomy and the Open Science Ratchet

Probably because of the visibility of the GalaxyZoo project, several of my colleagues and I have been under the impression that astronomy is a somewhat more open field than chemistry or molecular biology. It was easy to rationalize such a position because patents are not an issue there, as they clearly are in fields that rely more on invention than discovery. However, after reading "The Case for Pluto" by Alan Boyle, I am left with a much different impression.

This book does an excellent job of covering the recent debate over Pluto's designation as a true planet. A key trigger for this debate has been the discovery of dwarf planets with sizes very close to that of Pluto. However, these discoveries did not occur without controversy.

The story of the controversy regarding the discovery of Haumea is a particularly good example (starts on p. 108 of the book - a good summary also on Wikipedia). Starting in December 2004 Michael Brown at Caltech discovered a series of new dwarf planets. Instead of immediately reporting his team's discoveries, he worked in secrecy until July 20, 2005 when he posted an online abstract indicating the discoveries would be announced at a conference that September. However, on July 27, 2005 a Spanish team led by José Luis Ortiz Moreno filed a claim with the Minor Planet Center for priority in discovering one of these dwarf planets. This forced Brown's hand in disclosing his team's other discoveries within days - much sooner than he had anticipated.

Apparently this stirred up a great controversy in the community, and officially no discoverer's name was attached to the discovery, although the Spanish team's telescope at the Sierra Nevada Observatory was recognized as the location of the discovery. However, Brown was allowed to select the name Haumea for the dwarf planet.

Even though the Minor Planet Center accepted Ortiz's submission, most reports seem to side with Brown. The main accusation is nothing less than academic fraud on Ortiz's part, because he accessed public telescope logs and found some of Brown's data. It was as simple as Googling the identifier that Brown had inserted in his public abstract.

If Ortiz had hacked into a private computer belonging to Brown's team, I could understand the charge of fraud. But is it fraud to access public databases? We chemists do that all the time - reading abstracts from upcoming conferences to try to glean what our competitors are up to. That hasn't stopped anyone from submitting a paper or patent.

Secrecy only works if everyone competing follows the same rules. If there were a rule that planet discoveries must be made at conferences or by formal publication, then this could not have happened: had such a rule existed, Ortiz's submission to the Minor Planet Center should have been rejected. If there were a rule that telescope logs should not be accessed, then why make them public and indexed by Google?

Now there may exist field-specific conventions. I don't know what they are in the case of discoveries such as these, but here is an interesting quote from Michael Brown's Wikipedia page:
When asked about this online activity, Ortiz responded with an email to Brown that suggested Brown was at fault for "hiding objects," and said that "the only reason why we are now exchanging e-mail is because you did not report your object."[3] Brown says that this statement by Ortiz contradicts the accepted scientific practice of analyzing one's research until one is satisfied that it is accurate, then submitting it to peer review prior to any public announcement. However, the MPC only needs precise enough orbit determination on the object in order to provide discovery credit, and Ortiz et al. not only provided the orbit, but "precovery" images of the body in 1957 plates.
It seems to me that there is a clash over what the conventions of the field actually are. Certainly the Minor Planet Center did not recognize any convention of peer review before public disclosure; it only required sufficient proof of the discovery.

One way to look at this story is that Ortiz acted more openly than Brown by disclosing information before peer review. This action forced Brown to disclose scientific results much more quickly than he had anticipated.

In a sense this is a type of Open Science Ratchet. The actions of the scientists who are most open set the pace for everyone else working on that particular project, regardless of their views on how secretive science should be.

Imagine how the scenario would have played out if one of the groups had used an Open Notebook. On December 28, 2004, everyone with a stake in the search for planets would have had the opportunity to know that a very significant find had been made. There would still have been details to work out - and the Brown group might not have been the first to do all the calculations needed to completely characterize the discovery. Certainly it would have affected what other researchers did - even those completely opposed to the concept of Open Science.

Essentially, secrecy in this context is an all-or-nothing gamble. Everyone is free not to disclose their work until after peer-reviewed publication. In some cases the discoverer will get full credit for the discovery and the complete analysis. But in other cases another group working in parallel will publish first and leave nothing to claim.

As scientists become more open, it is likely that their ability to claim sole priority for all aspects of a discovery will be reduced. However, they will retain priority for the observations and calculations that they made first.

The more open the science, the faster it happens. And because of the Open Science Ratchet, a few Open Scientists scattered across various fields could have a larger hand than expected in speeding up science.


Thursday, July 08, 2010

Methanol Solubility Prediction Model 4 for Ugi reactions in the literature

Since non-aqueous solubility measurements have not become part of the standard characterization of organic compounds, it is not surprising that all the data we have for Ugi products originate from measurements that we made on our own compounds.

Since methanol is our most common solvent, Andrew Lang has combined the measurements we have made with values from the literature for a range of compounds, including our Ugi products, to generate a web service that returns a predicted solubility for a submitted SMILES string. The model (Model 4) was derived using a Random Forest algorithm, with molecular descriptors supplied by the CDK and VCCLab.

It would be nice to be able to test the model's ability to predict what will happen when a Ugi reaction is carried out in methanol. Although the actual solubility of Ugi products is typically not reported in the literature, reading the experimental sections of papers can still provide some validation of the model in certain cases.

For example, consider the following Ugi products recently synthesized by Lezinska (Tetrahedron 2010).


Note that these images represent the azide group in a form that does not follow the octet rule. It is necessary to represent the structures as SMILES without charges because the CDK and VCCLab web services used by the model do not process charges correctly. Stereochemistry also cannot be used; it can be removed from the SMILES simply by deleting the slashes. Thus, for the two molecules above, the SMILES to be submitted to the prediction web service are:

O=C(NC1CCCCC1)C(Cc2ccc(C)cc2)N(c4ccccc4C(=O)c3ccccc3)C(=O)C(Cc5ccccc5)N=N#N
AND
O=C(NC1CCCCC1)C(C(=O)c2ccccc2)N(Cc3ccc(C)cc3)C(=O)C(C)CCN=N#N

The predicted methanol solubilities are 0.004 M and 0.03 M, respectively.
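
As a rough illustration of the SMILES preparation described above, the stereochemistry slashes can be stripped with a trivial text operation. The input SMILES here is a made-up example with double-bond stereo markers, not one of the compounds above.

```python
# Crude sketch of the SMILES preparation described above: delete the
# slashes that encode double-bond stereochemistry before submission.
def strip_stereo_slashes(smiles):
    return smiles.replace("/", "").replace("\\", "")

print(strip_stereo_slashes(r"O=C(O)/C=C/c1ccccc1"))  # -> O=C(O)C=Cc1ccccc1
```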

Now if we look at the details in the experimental section, both of these Ugi products were synthesized in methanol at a limiting reactant concentration of about 0.1 M. Even though this is much more dilute than the 0.5-2.0 M generally recommended for Ugi reactions (Domling 2000), the products still precipitated and could be filtered off. This is consistent with the predicted solubilities above, and the model would have suggested ahead of time that methanol might be a good solvent for isolating the products by precipitation.

So far these are just anecdotal results, but they do illustrate that solubility models can be evaluated even without explicit solubility determinations in the literature.


Creative Commons Attribution Share-Alike 2.5 License