Wednesday, July 21, 2010

Resveratrol Thesis on Reaction Attempts

A few days ago Andrew Lang suggested to Dustin Sprouse that he submit his thesis to the Reaction Attempts database. Like many undergraduates, Dustin put a lot of time and effort into doing experiments and writing up his results but didn't have quite enough time to obtain everything that would have been required for a traditional publication.

A thesis is an unusual document within the context of scientific communication. Unlike a peer-reviewed paper, it may contain a large number of "failed experiments" and a substantial amount of speculation. Although it is not quite as detailed as a lab notebook, there is often plenty of raw data and details about how failed or ambiguous experiments proceeded.

In Dustin's case we felt that there was enough information provided to include his thesis in Reaction Attempts. In addition, his thesis was accepted by Nature Precedings, thus providing a convenient means of citation.

The first component of the Reaction Attempts project is to quickly abstract the most basic information from synthetic organic chemistry reactions. This includes the ChemSpiderIDs and SMILES of the reactants and target products, plus brief notes about conditions and outcomes. We are especially interested in failed or ambiguous experiments because these have almost no chance of being communicated and indexed in the traditional systems. When attempting to carry out a reaction, it can be just as useful to know what doesn't work - and more specifically how it doesn't work.

The second component of the project is dissemination. Because the information is encoded semantically, it can be automatically converted to both human-readable and machine-readable formats.
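As a rough illustration of these two components, here is a minimal Python sketch of how one abstracted reaction might be stored and then rendered for people and for machines. The class names, field names, CSID values, conditions and outcomes are all assumptions made up for this example; they are not the actual Reaction Attempts schema or real ChemSpider lookups, and JSON simply stands in for any machine-readable format.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class Compound:
    csid: int    # ChemSpider ID; the values used below are placeholders
    smiles: str


@dataclass
class ReactionAttempt:
    reactants: list
    products: list   # target products, whether or not they were obtained
    conditions: str  # brief free-text note
    outcome: str     # "success", "failed" or "ambiguous"

    def to_text(self):
        """Human-readable one-line summary."""
        lhs = " + ".join(c.smiles for c in self.reactants)
        rhs = " + ".join(c.smiles for c in self.products)
        return f"{lhs} -> {rhs} [{self.outcome}] {self.conditions}"

    def to_json(self):
        """Machine-readable form of the same record."""
        return json.dumps(asdict(self), sort_keys=True)


def failed_or_ambiguous(attempts):
    """Keep only the attempts that traditional systems rarely index."""
    return [a for a in attempts if a.outcome in ("failed", "ambiguous")]


# Illustrative records only: hypothetical IDs, conditions and outcomes.
attempts = [
    ReactionAttempt(
        reactants=[Compound(1001, "O=Cc1ccccc1")],
        products=[Compound(1002, "OCc1ccccc1")],
        conditions="hypothetical: NaBH4, MeOH",
        outcome="success",
    ),
    ReactionAttempt(
        reactants=[Compound(1001, "O=Cc1ccccc1")],
        products=[Compound(1003, "OC(=O)c1ccccc1")],
        conditions="hypothetical: air, room temperature",
        outcome="failed",
    ),
]

for attempt in failed_or_ambiguous(attempts):
    print(attempt.to_text())
    print(attempt.to_json())
```

Because both outputs come from the same structured record, adding another output format only means adding another method.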

One human interface consists of a PDF book (also available as a hard copy), with the option of selecting reactions by listing the CSIDs of reactants in the URL. For example, Dustin's reactions can be presented selectively here. We also have a Reaction Explorer, where reactants or products can be selected from a dropdown menu or via a substructure search.


We also provide live XML feeds so that others can easily create applications from machine-readable data. For example, one could create reaction chains automatically. These arise whenever we enter reactions from multi-step syntheses like Dustin's, which is based on the synthesis of resveratrol.
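To make the reaction chain idea concrete, here is a small Python sketch that links reactions whenever a product of one appears as a reactant of another. It assumes the feed has already been parsed into a dictionary of reactant and product CSID lists; the step names and CSIDs are placeholders, and this is not how the Reaction Attempts feeds themselves are structured.

```python
def link_reactions(reactions):
    """Map each reaction to the reactions that consume one of its products.

    `reactions` maps a reaction name to a (reactant_csids, product_csids) pair.
    """
    links = {name: [] for name in reactions}
    for name, (_, products) in reactions.items():
        for other, (reactants, _) in reactions.items():
            if other != name and set(products) & set(reactants):
                links[name].append(other)
    return links


def chains_from(start, links, path=None):
    """List every chain of linked reactions beginning at `start`."""
    path = (path or []) + [start]
    # Skip reactions already on the path so cycles cannot loop forever.
    next_steps = [n for n in links[start] if n not in path]
    if not next_steps:
        return [path]
    chains = []
    for step in next_steps:
        chains.extend(chains_from(step, links, path))
    return chains


# Hypothetical three-step synthesis; the CSIDs are placeholders.
reactions = {
    "step1": ([101, 102], [103]),
    "step2": ([103, 104], [105]),
    "step3": ([105], [106]),
}

links = link_reactions(reactions)
print(chains_from("step1", links))
```

Matching on identifiers such as CSIDs rather than on compound names is what makes this kind of linking automatic, since the same compound always carries the same ID regardless of how it was named in the original text.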

I know that Peter Murray-Rust has been very active in automatically abstracting information from chemistry theses. It would be interesting to see how that approach would work for this thesis, especially with the failed experiments. Manually reducing a page or two of text to only the most salient bits of information required a level of judgement that I imagine would be tricky to replicate automatically.

4 comments:

  1. It might be interesting to describe the system in operation at Imperial College. Candidates submit a (normally Word) document for examination. This is taken and printed (sic) for the examiners, who receive their copies through the regular post. After examination and correction, the candidate again submits online, and the now final accepted version is deposited into our digital repository. This system has some interesting attributes:

    * The only document deposited is PDF, which is really tough to mine for chemistry. In fact, when we tried (a project jointly with Peter), we pretty much gave up and used the original Word (when it was available). The XML-based .docx format would in fact be so much better for mining, but its use is still patchy (most journals do not accept it, for example).

    * When the system was introduced about a year ago, neither the PhD candidates nor their supervisors were quite aware of what would happen. One supervisor, idly googling their topic of expertise, was astonished to find a really relevant hit, only to discover it was their own student's thesis! They had been expecting to write up the work at leisure, and to think about patents within the window of about 2-4 years that has been the norm in the past.

    * If you do go looking at the Imperial Spiral repository for these theses you will not find any from chemistry! The outcry over the above means that the procedure is being rethought. Meanwhile, darkness.

    * SORD is a commercial organisation that harvests such theses, extracting reactions, and possibly failed reactions, from the content. Dick Wife, who founded SORD, famously wrote about 10 years ago that 80% of the molecules and their reactions described in theses never get published. Think about it: CAS has indexed about 51 million molecules. The real number made may be closer to 200 million! And it could be argued that the real diversity would have been found in those missing 80%.

  2. Thanks for the feedback about SORD, Henry! When I looked into it, it seemed to be focussed on successful reactions only, right? I tried various search terms like acetone, Ugi, benzaldehyde but didn't get any hits - can you give me a search term where I can see how it is organized?

  3. I have experience of working on SORD from when I was at ACD/Labs, because we supported it in the Web Librarian application... we can discuss. Will you be at ACS Boston?

  4. I look forward to your feedback on SORD, Tony, but I won't be in Boston - let's chat by phone soon though.
