Thursday, January 03, 2008

Modularizing Results and Analysis in Chemistry

Chemical research has traditionally been organized in either experiment-centric or molecule-centric models.

This makes sense from the chemist's standpoint.

When we think about doing chemistry, we conceptualize experiments as the fundamental unit of progress. This is reflected in the laboratory notebook, where each page is an experiment, with an objective, a procedure, the results, their analysis and a final conclusion that ideally answers the stated objective directly.

When we think about searching for chemistry, we generally imagine molecules and transformations. This is reflected in the search engines that are available to chemists, with most allowing at least the drawing or representation of a single molecule or class of molecules (via substructure searching).

But these are not the only perspectives possible.

What would chemistry look like from a results-centric view?

Let's see with a specific example. Take EXP150, where we are trying to synthesize a Ugi product as a potential anti-malarial agent and identify Ugi products that crystallize from their reaction mixture.

If we extract the information contained here based on individual results, something very interesting happens. By using some standard representation for actions we can come up with something that looks like it should be machine readable without much difficulty:
  • ADD container (type=one dram screwcap vial)
  • ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)
  • WAIT (time=15 min)
  • ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)
  • VORTEX (time=15 s)
  • WAIT (time=4 min)
  • ADD phenanthrene-9-carboxaldehyde (InChIKey=QECIGCMPORCORE-UHFFFAOYAE, mass=103.1 mg)
  • VORTEX (time=4 min)
  • WAIT (time=22 min)
  • ADD crotonic acid (InChIKey=LDHQCZJRKDOVOX-JSWHHWTPCJ, mass=43.0 mg)
  • VORTEX (time=30 s)
  • WAIT (time=14 min)
  • ADD tert-butyl isocyanide (InChIKey=FAGLEPBREOXSAC-UHFFFAOYAL, volume=56.5 ul)
  • VORTEX (time=5.5 min)
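
As a rough illustration of how machine readable these lines are, here is a minimal Python sketch that turns each one into a structured record. The `Action` class, the `parse_action` function and the regular expression are hypothetical names made up for this example, not part of any existing tool, and the sketch assumes that parameter values never contain commas or parentheses.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Action:
    verb: str                      # e.g. ADD, WAIT, VORTEX
    target: str = ""               # e.g. "methanol"; empty for WAIT and VORTEX
    params: dict = field(default_factory=dict)


# An upper-case verb, an optional target, then a parenthesised key=value list.
# A leading bullet character is tolerated so lines can be pasted as-is.
LINE = re.compile(r"^\s*(?:•\s*)?([A-Z]+)\s*([^()]*?)\s*\((.*)\)\s*$")


def parse_action(line: str) -> Action:
    """Parse one action line into an Action record (assumed format only)."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised action line: {line!r}")
    verb, target, body = match.groups()
    params = {}
    for pair in body.split(","):
        key, _, value = pair.partition("=")
        params[key.strip()] = value.strip()
    return Action(verb, target, params)


log = [
    "ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)",
    "WAIT (time=15 min)",
    "ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)",
    "VORTEX (time=15 s)",
]
actions = [parse_action(line) for line in log]
```

Quantities are kept as plain strings such as "1 ml"; converting them to numbers with units would be a separate step.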

It turns out that for this CombiUgi project very few commands are required to describe all possible actions:
  • ADD
  • WAIT
  • VORTEX

By focusing on each result independently, it no longer matters if the objective of the experiment was reached or if the experiment was aborted at a later point.

Also, if we recorded chemistry this way we could do searches that are currently not possible:
  • What happens (pictures, NMRs) when an amine and an aromatic aldehyde are mixed in an alcoholic solvent for more than 3 hours with at least 15 s vortexing after the addition of both reagents?
  • What happens (pictures, NMRs) when an isonitrile, amine, aldehyde and carboxylic acid are mixed in that specific order, with at least 2 vortexing steps of any duration?

I am not sure if we can get to that level of query control, but ChemSpider will investigate representing our results in a database in this way to see how far we can get.
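
To give a feel for what such a query could look like, here is a small Python sketch over an in-memory list of steps. It is only an assumption about how the data might be shaped: the `Step` record, the compound class labels ("amine", "aldehyde" and so on) and the `matches` function are hypothetical, since the action list itself records InChIKeys rather than compound classes. The example steps follow the order of additions in EXP150.

```python
from typing import NamedTuple


class Step(NamedTuple):
    verb: str            # ADD, WAIT or VORTEX
    compound_class: str  # hypothetical label such as "amine"; "" if not a reagent
    seconds: float       # duration of the step; 0 for ADD


def matches(steps, required_order, min_vortex_steps):
    """True if the required classes were added in exactly this order
    and the log contains at least min_vortex_steps VORTEX steps."""
    added = [s.compound_class for s in steps
             if s.verb == "ADD" and s.compound_class in required_order]
    vortex_count = sum(1 for s in steps if s.verb == "VORTEX")
    return added == list(required_order) and vortex_count >= min_vortex_steps


# Steps in the order listed for EXP150 (container omitted, times in seconds).
exp150 = [
    Step("ADD", "alcohol", 0), Step("WAIT", "", 900),
    Step("ADD", "amine", 0), Step("VORTEX", "", 15), Step("WAIT", "", 240),
    Step("ADD", "aldehyde", 0), Step("VORTEX", "", 240), Step("WAIT", "", 1320),
    Step("ADD", "carboxylic acid", 0), Step("VORTEX", "", 30), Step("WAIT", "", 840),
    Step("ADD", "isonitrile", 0), Step("VORTEX", "", 330),
]

actual_order = ["amine", "aldehyde", "carboxylic acid", "isonitrile"]
queried_order = ["isonitrile", "amine", "aldehyde", "carboxylic acid"]

found_actual = matches(exp150, actual_order, min_vortex_steps=2)
found_queried = matches(exp150, queried_order, min_vortex_steps=2)
```

With these steps the first check succeeds and the second fails, because EXP150 adds the isonitrile last. A real database would express the same conditions as a query rather than a Python loop.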

Note that we can't represent everything using this approach. For example, observations made in the experiment log don't show up here, nor does anything unexpected. Therefore, at least as long as we have human beings recording experiments, we're going to continue to use the wiki as the official lab notebook of my group. But hopefully I've shown how we can translate from freeform to structured format fairly easily.

Now one reason I think that this is a good time to generate results-centric databases is the inevitable rise of automation. It turns out that it is difficult for humans to record an experiment log accurately. (Take a look at the lab notebooks in a typical organic chemistry lab - can you really reproduce all those experiments without talking to the researcher?)

But machines are good at recording dates and times of actions and all the tedious details of executing a protocol. This is something that we would like to address in the automation component of our next proposal.

Does that mean that machines will replace chemists in the near future? Not any more than calculators have replaced mathematicians. I think that automating result production will leave more time for analysis, which is really the test of a true chemist (as opposed to a technician).

Here is an example of an analysis module making a simple point, useful to the chemistry community, and linking back to result modules that ultimately link back to the original experiment in the online laboratory notebook:
Context: obtaining precipitates in the CombiUgi project

Ugi reactions in methanol where the solution is supersaturated with Ugi product may give false negatives for precipitation. For example, a Ugi product rapidly crystallized at the 17th hour (RESULT0003) after addition of all reagents, while appearing as a clear solution at the 15th hour (RESULT0002). It is therefore recommended that the vials be submitted to vortexing (15 s) prior to taking a picture.
We'll be recording these analysis and result modules on UsefulChem wiki pages.
We'll be using InChIKeys for compact unambiguous identification of molecules (and convenient indexing in Google) and the terms in this post for action options. Anyone is free to automatically incorporate these in a database, as long as attribution is provided. (If anyone knows of any accepted XML for experimental actions let me know and we'll adopt that.)

I think this takes us a step from freeform Open Notebook Science toward the chemical semantic web, something that Cameron Neylon and I have been discussing for a while now.



At 1:22 AM, Blogger Rajarshi said...

A very interesting idea. It sounds like it would benefit from an ontology of chemistry (or maybe an ontology for chemical reactions). Such an ontology could be done in RDF, and you'd be able to build a network of concepts. This would be an interesting representation, as it would allow you to start from some known step and find other experiments utilizing that step and so on.

At 7:38 AM, Blogger Egon Willighagen said...

It's a start... mind chemical names like "ADD 4743" [1], "Vortex" [2] and "Take 20" [3] :)

Oh, and the InChIKey is not an "unambiguous identification of molecules"... it's close, but not the full thing.


At 9:57 AM, Blogger Jean-Claude Bradley said...

I think it would need an ontology of "experimental actions". From what I understand, "chemical reactions" would involve the reactant to product construct. These result pages do not contain any information about products, just the raw data. The analysis pages involve interpreting these results.

Especially following Peter Murray-Rust's discussion about RDF triples I think that might make sense. You would take a chemical, perform an action then obtain a result (an image, NMR spectrum, etc.) But the action would be a complicated set of steps and I don't know if that would work on a practical level. But certainly I look forward to any suggestions in this direction.

At 10:03 AM, Blogger Jean-Claude Bradley said...

What do you mean by "mind chemical names"?

Under what circumstances would the InChIKey fail? A major flaw with InChI is that long strings are not properly indexed in Google. InChIKey should fix that for any size molecule. The idea is to put a representation of the molecule precise enough that a script could read, and convert to anything else. Unless I am missing something here, I don't expect problems with the script hitting ChemSpider with the InChIKey and generating InChI, SMILES, MW, etc. - especially since we are using ChemSpider to generate the InChIKey in the first place.

At 10:11 AM, Blogger Egon Willighagen said...

Re:"mind chemical names"
... that chemical names might look like operations. Consequently, the example syntax you gave is not directly machine readable. OK, they are rather obscure names :)

The InChIKey is very unlikely (something like 1 in 10^12 IIRC) to be the same for two compounds, but not impossible. Effectively, you have a pretty high chance of understanding what the molecule is, but the InChIKey is not unique. It's more human-readable, at the cost of a tiny imprecision. Mind that the odds of having a hash clash (two compounds with the same key), compared to all possible chemical structures in drug-like space, are far from zero.
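
The point about clashes can be made concrete with the standard birthday-problem approximation, P ≈ 1 - exp(-n(n-1)/(2N)), for n keys drawn uniformly from N possible values. The Python sketch below is only illustrative: it takes the recalled "1 in 10^12" figure at face value as N, which may not match the actual design of the key, and the function name is made up for this example.

```python
import math


def collision_probability(n_items: int, n_hash_values: float) -> float:
    """Approximate probability that at least two of n_items share a hash,
    assuming hashes are uniformly distributed over n_hash_values values."""
    exponent = -n_items * (n_items - 1) / (2.0 * n_hash_values)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is tiny.
    return -math.expm1(exponent)


ASSUMED_SPACE = 1e12  # assumption: the "1 in 10^12" figure quoted above

p_thousand = collision_probability(1_000, ASSUMED_SPACE)
p_million = collision_probability(1_000_000, ASSUMED_SPACE)
```

Under that assumption a collection of a thousand compounds is very unlikely to contain a clash, while a collection of a million compounds has a substantial chance of at least one, which is why the risk grows quickly with database size even when any single pair is very unlikely to clash.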

However, one did not find clashes in PubChem, or in any other database so far... nevertheless, not impossible. For archival purposes, I would always use the InChI itself. The InChIKey may be a fair complement for human interaction.

About Googling... I think I noticed that Google does not handle the '-' in the key very nicely! Will have to check that...

At 10:14 AM, Blogger Rajarshi said...

Jean-Claude, yes it'd be more precise to state that it should be an ontology of experimental chemistry procedures - which would, at one point, be a part of an overall chemistry ontology.

At 10:50 AM, Blogger Jean-Claude Bradley said...

Well right now we're only using the operations listed in the blog post. The script should recognize ADD then expect either a container or a molecule. To be consistent, we can write it as:
ADD compound(common name=" ", InChIKey=" ", mass=" ")
The common name is just there to make it human readable.
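
A short Python sketch of how a script might read this quoted form is given below. It is an assumption about the syntax rather than a settled format: the regular expressions and the `parse_statement` function are hypothetical. Because every value sits inside double quotes, a common name that happens to look like an operation (such as "ADD 4743") is read as plain text.

```python
import re

# Assumed shape: VERB object(key="value", key="value", ...)
STATEMENT = re.compile(r'^(?P<verb>[A-Z]+)\s+(?P<object>\w+)\s*\((?P<fields>.*)\)$')
# One key="value" pair; the value may contain spaces and commas but no quotes.
FIELD = re.compile(r'\s*(?P<key>[^=,"]+?)\s*=\s*"(?P<value>[^"]*)"\s*(?:,|$)')


def parse_statement(text: str):
    """Return (verb, object, fields) for one statement in the assumed syntax."""
    match = STATEMENT.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised statement: {text!r}")
    fields = {f.group("key"): f.group("value")
              for f in FIELD.finditer(match.group("fields"))}
    return match.group("verb"), match.group("object"), fields


verb, obj, fields = parse_statement(
    'ADD compound(common name="crotonic acid", '
    'InChIKey="LDHQCZJRKDOVOX-JSWHHWTPCJ", mass="43.0 mg")'
)

# A name that looks like an operation stays a plain value inside the quotes.
_, _, tricky = parse_statement('ADD compound(common name="ADD 4743", InChIKey="", mass="")')
```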

Yes Google has trouble with indexing + and -, which I think are used for cis and trans by the InChI. For our experiment pages this has not been a huge problem because InChIs of large molecules (500 dalton) are still fairly uncommon so doing a Google search of our Ugi product InChIs gives correct, though sloppy, hits most of the time. But going forward that is not good enough so I'm going to put more importance on the InChIKey. The nice thing about it too is that the first part specifies just the connectivity so searches that exclude stereochemistry specifications can be done easily. This is how ChemSpider has links set up from the InChIKey and that is mighty handy.
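
As a minimal sketch of the connectivity-only comparison described above, the Python snippet below splits a key at the hyphen and compares first parts. The function names are hypothetical, and the second crotonic acid key uses a placeholder second block (XXXXXXXXXX) invented purely for illustration; it is not a real key.

```python
def split_inchikey(key: str):
    """Split a key of the form FIRSTBLOCK-SECONDBLOCK at the first hyphen."""
    first_block, _, second_block = key.partition("-")
    return first_block, second_block


def same_connectivity(key_a: str, key_b: str) -> bool:
    """Compare only the first blocks, ignoring whatever follows the hyphen."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


crotonic_acid = "LDHQCZJRKDOVOX-JSWHHWTPCJ"   # from the post
placeholder = "LDHQCZJRKDOVOX-XXXXXXXXXX"     # hypothetical second block
methanol = "OKKJLVBELUTLKV-UHFFFAOYAX"        # from the post

first_block, second_block = split_inchikey(crotonic_acid)
```

Here `same_connectivity(crotonic_acid, placeholder)` is true while `same_connectivity(crotonic_acid, methanol)` is false, so a search on the first block alone would group the first two entries together.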

At 10:56 AM, Blogger Egon Willighagen said...

The combo of name and InChIKey should be unique enough for practical purposes. Also, the new suggested syntax looks much better. I suggest writing a formal grammar, e.g. using EBNF [1].


At 12:21 PM, Blogger ChemSpiderman said...

JC, what you've outlined is all about workflow representation. While at ACD/labs we worked on workflow management, specifically the management of sample and workflow information with analytical data (read that as NMR, MS, IR, Chrom etc... but images can be analytical data too). What you are describing is all about the status of a sample at different time points: states and transitions. Then the data associated with those various states at different time points. Been there... done that.

I agree that a standard manner of representing the workflow content would be appropriate.

It may be feasible to utilize Knime or Taverna, with modifications, to deal with this but Egon is probably best positioned to comment.

Personally my interest is in investigating Microsoft Workflow Foundation for the immediate future. I think that settling on standards for representing the workflow is the challenge, not the tools to do so.

At 5:47 PM, Blogger ChemSpiderman said...

PMR has posted a comment to your post. I have suggested "Since we will be working to support JC through his project I welcome copies of any publications or reports that you can point me to on your approaches. The info above is useful but more is better."

I think it would be good to use his group's work in this area for sure. He/they have clearly invested time in this area, so why reinvent the world?

Peter is a big advocate of Openness and as we discussed today your reactions/details would be open so maybe he will offer us the assistance of his experience to help make this happen on ChemSpider.

At 6:53 PM, Blogger ChemSpiderman said...

PMR's provided me a couple of papers and information re CMLReact so that we can review the capabilities. I'll need some time to review, and you and I should discuss what cannot be supported in there. An initial review suggests it's promising, but an extension may be required for your "output"... spectra and images. However, that was a QUICK review.

At 2:36 AM, Blogger Egon Willighagen said...

Taverna serializes workflows in quite detailed XML files. Recently, a site was set up which acts as a SourceForge (aka open repository) for workflows. It's got tagging, versioning, etc.

It is perfectly feasible to write custom workflow nodes (steps, processes), like ADD, VORTEX, etc, and therefore save and share the workflow. An example of a plugin with custom nodes is CDK-Taverna (server was down at the time of writing). The source code is in CDK's SVN repository.

Not sure how serialization in KNIME works... Thorsten, might you comment on that?

At 3:53 AM, Anonymous Anonymous said...

When using INCHIKEYS for defining chemicals, you can easily lookup the existence of their C/O/N/F/P/B/Si-spectra within the NMRSHIFTDB/CSEARCH/SPECINFO/CHEMGATE/KNOWITALL/KNOWITALL_U/NMRPREDICT/NMRPREDICT_ONLINE-collections using the following URL:, where the string 'XXXXXX' has to be replaced by the first part of the INCHIKEY (before the hyphen !)


Crotonic acid has the INCHIKEY 'LDHQCZJRKDOVOX-JSWHHWTPCJ' - the link-collection to its NMR-data
can be therefore found on
If you receive a 'Page not found' - Error, the data are not available within the above mentioned systems. Also upcoming CSEARCH-data will be indexed immediately when available within my internal development environment.

At 2:34 PM, Blogger Jean-Claude Bradley said...

We tried to use Taverna a while back to process lists of SMILES but we ran into a bug on our PCs that didn't exist on the Macs that our collaborators were using. See this thread.

At 2:48 AM, Anonymous Anonymous said...

Regarding KNIME: The workflow structure is saved as a single XML file. However, for loading a workflow you will need more than this single XML file. Each node has its own directory where its settings are stored. The data at the output of each node is saved in a serialized binary format (proprietary, but you may have a look at the source code).

Regarding the workflow repository: I haven't looked at it yet. But we are investigating something similar for KNIME nodes and workflows. Nothing is set up yet, though.

Hope that helps.

At 8:35 AM, Blogger Cameron Neylon said...

Jean-Claude, we definitely need to get you over to the UK to talk to Jeremy Frey and others about some of this. This is actually very close to another ELN developed at Southampton which aims to capture information in a similar kind of way. It's much more structured but has some advantages. I will try to post something to link the two together sometime soon.

At 9:04 AM, Blogger Jean-Claude Bradley said...

Cameron - Yes I spoke with Jeremy about his system to capture experiments a while back. His feedback would be much appreciated and it would be interesting to see if our entire experiment workflows can be represented in his system.



Creative Commons Attribution Share-Alike 2.5 License