Wednesday, July 29, 2009

Iterating a 5D solubility space

About three weeks ago I described how we are mapping a 5D solubility space (mixtures of 4 solvents and temperature). Andrew Lang has been re-running his code to populate the DoSol request sheet with the most useful next measurements. After a few iterations of Marshall Moritz running experiments, and after combining his results with existing data from the literature, we now have 76 measurements for the solubility of 4-nitrobenzaldehyde in mixtures of chloroform, acetonitrile, toluene and THF within the temperature range of -25 to 40 C.

We are now working on ways of quantifying how well we have covered the space and how confident we are of specific predictions. At some point we would like to generalize the predictions based on molecular descriptors of the solvents.

The existing dataset can be sliced in some interesting ways. For example, using Mathematica, Andy has created a plot of the solvent combinations giving the highest possible solubilities of 4-nitrobenzaldehyde at a given temperature. At room temperature this corresponds to a mixture of 38% chloroform and 62% acetonitrile (molar ratio). Below 10C, toluene enters the mix to obtain maximum solubility. At no temperature does THF help.
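
As a rough illustration of this kind of slice (Andy's actual plot was made in Mathematica, and this is not his code), the sketch below scans a grid of 4-solvent compositions and keeps the one that a model predicts to be most soluble at a given temperature. The names `compositions`, `best_mixture` and `toy_model` are hypothetical, and the toy model's numbers are arbitrary placeholders, not our regression or our data.

```python
from itertools import product

SOLVENTS = ("chloroform", "acetonitrile", "toluene", "THF")


def compositions(steps):
    """Yield every 4-solvent mole-fraction tuple on a grid with the given number of steps."""
    for i, j, k in product(range(steps + 1), repeat=3):
        last = steps - i - j - k
        if last >= 0:  # the four fractions must sum to 1
            yield (i / steps, j / steps, k / steps, last / steps)


def best_mixture(model, temperature_c, steps=20):
    """Return the grid composition with the highest predicted solubility."""
    return max(compositions(steps), key=lambda x: model(x, temperature_c))


def toy_model(x, temperature_c):
    """Placeholder model (an assumption for illustration only): peaks at an arbitrary target mixture."""
    target = (0.5, 0.25, 0.25, 0.0)
    penalty = sum((xi - ti) ** 2 for xi, ti in zip(x, target))
    return 1.0 + 0.01 * temperature_c - penalty


best = best_mixture(toy_model, temperature_c=20.0)
for name, fraction in zip(SOLVENTS, best):
    print(f"{name}: {fraction:.2f}")
```

Swapping `toy_model` for a real fitted function and looping over temperatures would give one optimal mixture per temperature, which is the information the plot conveys.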

Thursday, July 23, 2009

BrightTALK on Open Notebook Science

Today (July 23, 2009) I presented "Open Notebook Science for Collaborative Drug Discovery" for the BrightTALK series. It was a good opportunity to include some of the new content on bots that I was not able to record at the AI conference I attended in Pasadena a few weeks ago due to a computer crash.

The screencast is already available. They have a nifty viewer where you can hover over the timeline to get a preview of the slide under discussion.

The slides are also available:

Wednesday, July 22, 2009

CombiUgi virtual library generation via Google Spreadsheet

Andrew Lang has just created a service that lets anyone create a virtual library of Ugi products by entering the SMILES of the starting materials in a Google Spreadsheet.

First copy this template sheet (you must use File -> create a copy - copying and pasting cells will not work). Then publish the Google Spreadsheet under the Share tab.

Next add the key of your new Spreadsheet (as it appears in the URL) to a URL of this form:

The resulting page, which could take a long time to load for large libraries, can then be saved as a CSV file. On Firefox this is done by selecting Save As Text File.

If you put a CSV extension in the name you can then open the file directly in Excel:

All the results are in SMILES format and can be sorted or filtered, even by reactant, using the usual Excel tools. One could also copy and paste to another Google Spreadsheet to manipulate the dataset.

This service replaces the one Rajarshi Guha had set up a while back at Indiana University. A key difference with this service is that it requires SMILES to be constructed as shown in the template sheet:
  • N to the left for amines
  • C(=O)O to the right for carboxylic acids
  • O=C to the left for aldehydes
  • [C-]#[N+] to the left for isocyanides
This requires a little knowledge of SMILES, especially for aromatic rings. I left a few examples with polysubstituted aromatics to show how this is done.
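
One plausible reason for this convention is that it lets a product SMILES be assembled by plain string manipulation. The sketch below is only an illustration under that assumption and is not Andrew's actual implementation: `ugi_product` and `ugi_library` are hypothetical names, the function assumes primary amines, and it does no chemical validation beyond checking the four patterns listed above.

```python
from itertools import product

ISOCYANIDE_PREFIX = "[C-]#[N+]"


def ugi_product(acid, amine, aldehyde, isocyanide):
    """Assemble the Ugi bis-amide R1C(=O)N(R2)C(R3)C(=O)NR4 from reactant SMILES."""
    if not acid.endswith("C(=O)O"):
        raise ValueError("carboxylic acid SMILES must end with C(=O)O")
    if not amine.startswith("N"):
        raise ValueError("amine SMILES must start with N")
    if not aldehyde.startswith("O=C"):
        raise ValueError("aldehyde SMILES must start with O=C")
    if not isocyanide.startswith(ISOCYANIDE_PREFIX):
        raise ValueError("isocyanide SMILES must start with [C-]#[N+]")

    acyl = acid[:-1]                          # drop the OH oxygen: R1C(=O)
    r2 = amine[1:]                            # substituent on the amine nitrogen
    r3 = aldehyde[3:]                         # substituent on the aldehyde carbon
    r4 = isocyanide[len(ISOCYANIDE_PREFIX):]  # substituent on the isocyanide nitrogen

    n_part = f"N({r2})" if r2 else "N"
    c_part = f"C({r3})" if r3 else "C"
    return f"{acyl}{n_part}{c_part}C(=O)N{r4}"


def ugi_library(acids, amines, aldehydes, isocyanides):
    """Enumerate every reactant combination together with its product SMILES."""
    return [
        (combo, ugi_product(*combo))
        for combo in product(acids, amines, aldehydes, isocyanides)
    ]


# Acetic acid + methylamine + acetaldehyde + methyl isocyanide
print(ugi_product("CC(=O)O", "NC", "O=CC", "[C-]#[N+]C"))
# prints CC(=O)N(C)C(C)C(=O)NC
```

Because each reactant's ring-closure digits are opened and closed within its own fragment, aromatic substituents can be dropped into the branches without renumbering.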

Wednesday, July 15, 2009

Report from IJCAI09 conference - ONS and AI

On July 12, 2009 I presented a talk at the IJCAI'09 Workshop on Abductive & Inductive Knowledge Development in Pasadena, CA. While I was waiting to speak my computer permanently failed to reboot so I was not able to record my talk. Luckily I did have a copy on a flash drive and the slides are available on SlideShare.

The workshop had a mix of theoretical and practical talks about using automated reasoning. Lorenzo Magnani gave an interesting view of "manipulative abduction", where the thinker creates a hypothesis from interacting physically with the world without a preconceived plan. That generated some more discussion during the panel session and the question of whether a machine is capable of such activity was explored but not resolved. It sounded to me a lot like play.

Other talks that I particularly enjoyed included Deborah Chasman's presentation on using abductive logic programming in bioinformatics to understand how the Brome Mosaic Virus infects and replicates. Bassel Habib demonstrated the reconstruction of Claude Bernard's curare experiments on frogs from his original lab notebooks. He noted that Bernard did not usually explicitly state his hypotheses in his notebook - perhaps he was operating on manipulative abduction...

The most interesting talk for me was certainly Oliver Ray's presentation. He described his work on writing the logic behind Ross King's robot scientist. Basically the code creates hypotheses, checks them against the data obtained from the robot's experiments, then reports those that seem to be valid. In this application the robot was trying to work out the metabolic pathways of yeast and some of the strongest hypotheses were strange. For example, indole was correlating with yeast growth in an unexpected way. It turned out that tryptophan was a contaminant in the indole supply. This was really a great example of how machine intelligence can amplify the ability of humans to reason. I'm hoping that Oliver can collaborate with us to apply similar tools to our solubility and synthesis projects.

Friday, July 10, 2009

Spectral Game paper live on the Journal of Cheminformatics

Our paper on the Spectral Game is now published:
Jean-Claude Bradley, Robert J Lancashire, Andrew SID Lang and Antony J Williams. "The Spectral Game: leveraging Open Data and crowdsourcing for education." Journal of Cheminformatics 2009, 1:9. doi:10.1186/1758-2946-1-9
This has been an especially gratifying collaboration because of the enthusiasm and vision of my co-authors. The philosophy behind the game is deeply rooted in openness and as a result it is an open ended evolving project. Any new NMR spectra uploaded to ChemSpider and marked as Open Data will continue to be automatically incorporated into the pool of problems. Teachers and students from around the world can play the game and flag problems or errors as they arise. This blurs the line between content creators and consumers and I think reflects a powerful trend that is occurring in education.

Another aspect of openness relating to this endeavor is the communication of our progress. Our paper was written on a public wiki. Not only were we able to discuss our progress on recorded talks and blog posts, but we were also able to cite these as regular references in the paper. And of course the Journal of Cheminformatics is itself an Open Access peer-reviewed publication so there is no limitation to sharing the final product.

Controversy still rages in the blogosphere about the wisdom of blogging research results prior to publication in peer-reviewed journals. It is true that this practice limits where articles can be submitted. Since many of our references are from the Journal of Chemical Education, we contacted the editors to see if they would accept our paper. Unfortunately their current pre-print policy did not allow them to do so.

If more authors begin to see the value of early disclosure it may just start to tip the balance towards journals such as the Journal of Cheminformatics.

Andrew Lang and I have just completed another paper, on Chemistry in Second Life, written in the same way. That one has just been submitted to Chemistry Central Journal.

Thursday, July 09, 2009

ChemADVISOR promotes ONS Challenge

I was quite pleased to discover this morning that ChemADVISOR has posted a notice about our Open Notebook Science Solubility Challenge in their newsletter.

I had a nice chat with Matt Kaus this afternoon about possible ways we can work together to further our common objectives. This seems to be a win-win situation for many stakeholders, including the students participating in our ONS Solubility Challenge who are looking for employment opportunities. Our solubility data is also apparently in demand from their subscribers.

Let's see how this plays out, but I am certainly excited about the possible projects going forward.

Sunday, July 05, 2009

Regression of 5D solubility space and distributed automation

I recently reported on the plotting of a solubility surface in 3D. Marshall Moritz has now extended his measurements of the solubility of 4-nitrobenzaldehyde in 2 more mixed solvent systems (ONSC-EXP114), giving us 4 solvents and temperature. The results are stored in the SolSumMix spreadsheet.

Andrew Lang has performed a quadratic regression analysis of this space and we have pretty good agreement with the experimental data points (see the "predicted solubility" column in the above SolSumMix spreadsheet).

Although we can't easily represent the entire 5D space intuitively, we can take 3D slices of the regression to assess the fit. For example, consider the plot of mol fraction % chloroform vs. acetonitrile keeping other solvents at zero concentration. What we observe is a nice saddle shape similar to the plot we did earlier with the original data points.

Now consider a slice of mol fraction % toluene vs THF keeping the other 2 solvents at zero. For temperatures above about 0 C we observe an expected rise in solubility with temperature. However, going below 0 C the curve reverses and solubility is predicted to go up a bit. This is clearly not right and it simply means that we are missing key data points in that area. A quadratic fit will insert parabolic elements giving this inversion. It is very important to understand that these models will probably do fairly well for interpolating within our experimental range (about -25 C to 40 C) but will not be very helpful for extrapolating beyond this region.
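
The inversion is easy to reproduce in one dimension. The sketch below is an illustration only: it fits a quadratic in temperature by least squares to synthetic values (invented numbers that rise steadily with temperature, not our measurements) sampled only at 0 C and above, then looks at what the fit predicts at colder temperatures. `fit_quadratic` and `predict` are hypothetical helper names, and the real regression is over all 5 dimensions rather than temperature alone.

```python
def fit_quadratic(temps, values):
    """Least-squares fit of values ~ a + b*T + c*T**2; returns (a, b, c)."""
    # Normal equations for the basis 1, T, T**2, stored as an augmented 3x4 matrix.
    power_sums = [sum(t ** k for t in temps) for k in range(5)]
    rhs = [sum(v * t ** k for t, v in zip(temps, values)) for k in range(3)]
    rows = [[power_sums[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]

    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, 3):
            factor = rows[r][col] / rows[col][col]
            for k in range(col, 4):
                rows[r][k] -= factor * rows[col][k]

    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(rows[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (rows[i][3] - tail) / rows[i][i]
    return tuple(coeffs)


def predict(coeffs, t):
    """Evaluate the fitted quadratic at temperature t."""
    a, b, c = coeffs
    return a + b * t + c * t * t


# Synthetic, monotonically increasing "solubility" sampled only from 0 C upward.
temps = [0, 10, 20, 30, 40]
values = [0.1, 0.2, 0.4, 0.8, 1.6]

coeffs = fit_quadratic(temps, values)
a, b, c = coeffs
vertex = -b / (2 * c)  # temperature at which the fitted parabola turns around

print(f"parabola minimum at {vertex:.2f} C")
print(f"predicted at   0 C: {predict(coeffs, 0):.3f}")
print(f"predicted at -25 C: {predict(coeffs, -25):.3f}")
```

With these made-up values the fitted parabola bottoms out a few degrees above 0 C, so the model predicts a higher value at -25 C than at 0 C even though the underlying trend never reverses. Only measurements in the cold region can correct that.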

Since it is difficult to manually inspect every possible slice of this 5D space, Andy has created a service that returns recommendations for the most needed points to be measured next to generate a better model. We already have a DoSol spreadsheet that instructs ONS Challenge students as to the most urgent next solubility measurement to make. This additional "bot" (not yet fully automated, but it will be soon) integrates nicely with the collection we already have.
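
How Andy's service ranks candidate points is not described here. Purely as an illustration, one simple heuristic that could be swapped in is a space-filling (maximin) rule: recommend the candidate that lies farthest from every existing measurement. Everything in the sketch below is an assumption made for the example, including the names `candidate_points` and `recommend_next`, the grid spacing, the candidate temperatures and the scaling of temperature onto the same 0 to 1 range as the mole fractions.

```python
from itertools import product
from math import dist

T_MIN, T_MAX = -25.0, 40.0  # experimental temperature range in C


def to_unit(point):
    """Rescale temperature to 0..1 so it is comparable to the mole fractions."""
    *fractions, temp = point
    return (*fractions, (temp - T_MIN) / (T_MAX - T_MIN))


def candidate_points(steps=4, temps=(-25.0, 0.0, 20.0, 40.0)):
    """Grid of (x1, x2, x3, x4, T) points with mole fractions summing to 1."""
    points = []
    for i, j, k in product(range(steps + 1), repeat=3):
        last = steps - i - j - k
        if last < 0:
            continue
        for t in temps:
            points.append((i / steps, j / steps, k / steps, last / steps, t))
    return points


def recommend_next(measured, candidates):
    """Return the candidate whose nearest measured neighbour is farthest away."""
    scaled = [to_unit(m) for m in measured]

    def gap(candidate):
        return min(dist(to_unit(candidate), s) for s in scaled)

    return max(candidates, key=gap)


# Example: two existing measurements, both at 40 C.
measured = [(1.0, 0.0, 0.0, 0.0, 40.0), (0.0, 1.0, 0.0, 0.0, 40.0)]
print(recommend_next(measured, candidate_points()))
```

The same nearest-neighbour distance could also serve as a crude coverage measure for the space. A model-based criterion, such as asking where the regression is least constrained, would be a natural refinement.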

We aim to show that such open distributed mechanisms to request and execute measurements are a viable way to efficiently leverage crowdsourcing to automate parts of the scientific method. If this can be applied to solubility, it can be applied to other problems.

Wednesday, July 01, 2009

Marshall Moritz is July09 Submeta ONS Award Winner

Marshall Moritz, a chemistry and math student at Syracuse University, working under the supervision of Jean-Claude Bradley at Drexel University over the summer, is the July 2009 Submeta Open Notebook Science Challenge Award winner. He wins a cash prize from Submeta.

Marshall started out using NMR to measure solubility and recently has made some important contributions to the Challenge by using the sequential precipitation technique to obtain solubilities in different solvent mixtures at various temperatures. See his experiments here:

Three more Submeta ONS Awards will be made during 2009. Submissions from students in the US and the UK are still welcome.
For more information see:

Creative Commons Attribution Share-Alike 2.5 License