Tuesday, January 30, 2007

Sub-Structure and Similarity Searches on UsefulChem Molecules

Sub-structure and similarity searches can now be performed on the Useful Chemistry Molecules database. The relevant links are http://showme.physics.drexel.edu:8080/SubStructureSearch/ and http://showme.physics.drexel.edu:8080/SimilaritySearch/.

These web applications use Java servlet technology, running on the Apache Tomcat web server. The code for the sub-structure search can be downloaded from here, and the code for the similarity search can be downloaded from here.

Both web applications are additional examples of what can be done with open-source software packages for cheminformatics. The sub-structure search uses the babel command-line processor to extract entries from the usefulchem database, using the SMARTS string entered on the form. The similarity search app uses the CDK library to perform Tanimoto coefficient calculations on the entered SMILES string versus the database.
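To make the similarity score concrete, here is a minimal Python sketch of the Tanimoto coefficient itself. It is not the CDK code used by the web application: it assumes each molecule has already been reduced to a fingerprint, represented here as a set of "on" bit positions, and the function names and bit sets are made up for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    shared = len(a & b)
    total = len(a | b)
    # Two empty fingerprints share nothing; report 0.0 rather than divide by zero.
    return shared / total if total else 0.0


def rank_by_similarity(query_bits, database):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, tanimoto(query_bits, bits)) for name, bits in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up bit sets for illustration only; these are not real fingerprints.
database = {
    "molecule_1": {1, 4, 7, 9},
    "molecule_2": {2, 3, 8},
}
query = {1, 4, 9, 12}
for name, score in rank_by_similarity(query, database):
    print(name, round(score, 2))
```

The score is the number of bits the two fingerprints share divided by the number of bits set in either one, so it runs from 0 (nothing in common) to 1 (identical fingerprints).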

C&E News Article on Chemistry Blogs

Bethany Halford just posted an article on chemistry blogs, "Bloggers Anonymous", in Chemical and Engineering News. It is another indicator of how blogging is taking shape in the mainstream scientific consciousness. Two of our blogs, UsefulChem and UsefulChem-Molecules, got listed.

Sunday, January 28, 2007

Automated Reaction Kinetics using Excel VBA and JCAMP

Dave has created the required files and code to substantially automate the integration processing of NMR files in order to extract kinetic information. Two weeks ago I suggested that we use CML to do this but for now it is a bit more convenient to use JCAMP BLOCK files.

Here is the procedure:

1) Uncompress the JCAMP jdx files obtained from the NMR by loading them one by one in the standalone version of JSpecView and exporting as JCAMP XY format. In the TITLE field (first line) replace the default name with the number of minutes after the addition of the last reagent or the name of the starting material.

2) Create a new txt file and affix a .jdx extension. At the start of the file insert the following lines:
##TITLE= Reaction Profile Title
##JCAMP-DX= 5.01
##DATA TYPE= LINK

and at the end of the file add:
##END= $$end of BLOCKs

3) In between these lines insert the uncompressed JCAMP files created above. Empty lines to separate the spectra are ok.

4) Download and extract the following zip folder and modify the JCAMPNMR.ini file according to the number and location of ranges. The integration ranges should be selected carefully to ensure that the peaks will not drift out of range while being narrow enough to minimize baseline drift issues. (Leave the Frequency at 500 for now even if using the 300 MHz instrument - it works as it is.)

5) Open the JCAMPNMR.xls Excel workbook in the same folder as the ini file. You should have your Excel security settings set to allow macros. If it is working ok Excel will ask you to choose a file. Browse to the BLOCK file created above. Give it a few minutes to process and it should produce a table of all integrations for the ranges set in the ini file for all spectra in the BLOCK file.
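Steps 2 and 3 above could also be scripted. The following Python sketch is not part of Dave's Excel code; it only wraps already uncompressed JCAMP text in the header and footer lines quoted in the procedure. The function name and the placeholder spectra are hypothetical.

```python
def build_block_file(title, spectra):
    """Join uncompressed JCAMP spectra (given as text) into one BLOCK file string."""
    header = "\n".join([
        "##TITLE= " + title,
        "##JCAMP-DX= 5.01",
        "##DATA TYPE= LINK",
    ])
    footer = "##END= $$end of BLOCKs"
    # Trim stray whitespace around each spectrum; an empty line between spectra is ok.
    body = [text.strip() for text in spectra]
    return "\n\n".join([header] + body + [footer]) + "\n"


# Placeholder text standing in for two exported spectra (not real NMR data).
spectrum_4min = "##TITLE= 4\n##END="
spectrum_13min = "##TITLE= 13\n##END="
block_text = build_block_file("Reaction Profile Title", [spectrum_4min, spectrum_13min])
print(block_text)
```

The resulting text would then be saved with a .jdx extension and opened from the Excel workbook as described in step 5.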

The exp054ImineProfile.xls file included in the zip folder is an example of a processed BLOCK file with 11 monitoring spectra and 2 starting materials for the formation of an imine in EXP054. The plot in the file shows the decrease in the concentration of the aldehyde and amine as the imine is formed. It appears that a 50% excess of amine was added in this experiment, so the aldehyde concentration goes to zero while the amine ends at 250 mM. The assumption used in this plot is that the only significant process occurring here is the formation of the imine.

So there is still some thinking and common chemical sense required but the researcher does not have to manually perform the integrations of several peaks on 13 spectra, which I can attest is tedious and prone to error. I am hoping that providing these data openly in such a format will encourage collaboration and further automated processing. For example I am a bit rusty on fitting second order kinetics to reactions with different starting concentrations of reactants. Anybody want to help with that?
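As a possible starting point for that question, here is a hedged Python sketch. It assumes the simplest textbook case: an irreversible reaction A + B -> P with rate = k[A][B] and unequal starting concentrations, for which the integrated rate law is ln([B][A]0 / ([A][B]0)) = ([B]0 - [A]0) k t. Imine formation is reversible, so this may well be too simple for our data. The function names and the numbers are invented for illustration and are not taken from EXP054.

```python
import math


def second_order_k(times, conc_a, conc_b):
    """Estimate k for A + B -> P with rate = k[A][B] and [A]0 != [B]0.

    The first point of each sequence is taken as the starting condition.
    Units follow the inputs: mol/L and minutes give k in L/(mol*min).
    """
    t0, a0, b0 = times[0], conc_a[0], conc_b[0]
    if a0 == b0:
        raise ValueError("this form needs unequal starting concentrations")
    xs, ys = [], []
    for t, a, b in zip(times, conc_a, conc_b):
        if a <= 0 or b <= 0:
            continue  # the logarithm is undefined once a reactant reaches zero
        xs.append(t - t0)
        ys.append(math.log((b * a0) / (a * b0)))
    # Least-squares slope of a line forced through the origin.
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return slope / (b0 - a0)


def simulate(a0, b0, k, times):
    """Generate ideal concentrations from the same rate law, for checking only."""
    conc_a, conc_b = [], []
    for t in times:
        e = math.exp((b0 - a0) * k * t)
        x = a0 * b0 * (e - 1) / (b0 * e - a0)  # amount reacted at time t
        conc_a.append(a0 - x)
        conc_b.append(b0 - x)
    return conc_a, conc_b


# Invented example: 0.50 M and 0.75 M starting concentrations, k = 0.02 L/(mol*min).
times = [0, 10, 20, 40, 80]
a, b = simulate(0.50, 0.75, 0.02, times)
print(second_order_k(times, a, b))
```

With real integrations the points will scatter, and a plot of the logarithm term against time is a quick check on whether the simple second-order assumption holds at all.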

The resulting workbook also contains each spectrum with its integration as a separate worksheet. There appears to be more drift in the baseline than observed with JSpecView so I'll have to contact Robert Lancashire to see if there is some trick to that. We are just doing a simple summation of data points. That is why there are small negative numbers in the integration dataset.
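To illustrate what a simple summation means here, the following Python sketch adds up the intensities of all points that fall within a ppm range. It is a stand-in for the idea, not the VBA in the workbook, and the data points are made up.

```python
def integrate_range(points, low_ppm, high_ppm):
    """Sum the intensities of all (ppm, intensity) points with low_ppm <= ppm <= high_ppm."""
    return sum(y for x, y in points if low_ppm <= x <= high_ppm)


# Made-up points: one peak near 9.7 ppm and an empty region whose baseline sits just below zero.
spectrum = [
    (9.9, -0.02), (9.8, 0.01), (9.7, 1.50), (9.6, 0.03),
    (8.5, -0.01), (8.4, -0.02),
]
peak_area = integrate_range(spectrum, 9.0, 10.0)
empty_area = integrate_range(spectrum, 8.0, 9.0)
print(peak_area, empty_area)
```

A range that contains no peak simply sums whatever the baseline happens to be, which is how a slightly negative baseline turns into a small negative integration.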

In principle the BLOCK files should be viewable with JSpecView as overlays. However, with 13 spectra, that maxes out the Java memory, even with the enhancement to 200 Meg. If some spectra are deleted the overlay should work at some point, depending on the computer's Java memory.

The most time consuming task now is uncompressing the JCAMP files. If we learn how to do that automatically and if the code could read the time off of the filenames and write in the titles, then creating the BLOCK file could be as simple as dragging a bunch of jdx files in a folder and clicking a button to activate an Excel macro.

At some point if we (or anyone) can add intelligent peak picking, we may be able to have some bots analyze the reaction course of ANY reaction profile as a routine matter and flag kinetic behaviors of potential interest.

Thursday, January 25, 2007

Bill Hooker on Open Science: Applications

The future of science is open Part 3: An Open Science world on 3QuarksDaily is a must-read for anyone involved in Open Science:

In Parts one and two, I talked about the scholarly practice of Open Access publishing, and about how the central concept of "openness", or knowledge as a public good, is being incorporated into other aspects of science. I suggested that the overall practice (or philosophy, or movement) might be called Open Science, by which I mean the process of discovery at the intersection of Open Access (publishing), Open Data, Open Source (software), Open Standards (semantic markup) and Open Licensing.

Here I want to move from ideas to applications, and take a look at what kinds of Open Science are already happening and where such efforts might lead. Open Science is very much in its infancy at the moment; we don't know precisely what its maturity will look like, but we have good reason to think we'll like it.


UsefulChem on ManyEyes

There has been some discussion lately on using free and hosted services for scientific data. On an Element List post Swivel was mentioned. Deepak commented on that post about ManyEyes and I put up some of our data from EXP046 on the imine formation and last step in the Ugi reaction. Here is a sample scatter plot.

1) One limitation of Swivel is that the original data is not available, only the plots. ManyEyes does not have that limitation.
2) As I have done above, it is possible to link to specific views or data in ManyEyes but the user can quickly run a different visualization on the fly just by picking a graph type and variable. This is really where I see the usefulness - that it should be possible to quickly get new insights on experiments by viewing the relationships between variables that one would normally not bother to plot and upload.
3) Make sure to create a dummy first column called something like observations - ManyEyes does not include the first column in the list of variables.
4) It is possible to show the relationship of 3 variables - x, y and point size.
5) Scatter plots do not currently allow multiple lines, trendline calculation or zooming in - that would be really useful.

Tuesday, January 23, 2007

JSpecView Java Memory Issue

One of the puzzling issues we have run into while trying to extend our use of JSpecView overlays is that some of the overlays could be viewed on some computers but not others. For the computers that could not view the overlays, this could be corrected by removing a few of the spectra in the BLOCK file, suggesting a memory issue.

Khalid investigated this and not only confirmed that the problem was related to the Java memory setting but also found a program (SetJavaMemory) that fixed the problem for several of the computers with one of the troublesome overlays.

On the program's page is also a link to a Java Memory test. After running the program my memory increased from 67M to 187M.

Even with this increase, clearly there is a fairly low limit to the number of spectra that JSpecView can display simultaneously. Ideally, we would like JSpecView to dynamically display only a few of a large collection of spectra that would not have to be all held in Java memory at the same time.

One hack that would probably work in the short term is to use Excel VBA to easily generate BLOCK files from a large menu of JCAMP or CML files on disk. We could then use the standalone version of JSpecView to quickly look at the overlays from the BLOCK files.

Monday, January 22, 2007

Back From NC Science Blogging Conference

The North Carolina Science Blogging Conference turned out to be a much needed opportunity to meet in person a lot of the people with whom I had only interacted online. Most notably, I finally got to meet Bill Hooker, author of Open Reading Frame and a strong supporter of the open science movement. We discussed concrete ways of collaborating and I look forward to continuing the discussion online.

Based on the discussion during my Open Source/Open Notebook Science session, there appeared to be significant interest in ways of doing science more openly and of understanding the consequences of doing so. The typical issues came up: intellectual property, recognition, archiving and getting scooped. I had planned on updating a wiki page with ideas generated from the session, the way Dave Warlick had done at PodcasterCon last year. However, I found that there was not enough time to do that and engage in the discussion. Next time I'll try asking someone to take notes, like Dave did.

My presentation was recorded and is available here.

Wednesday, January 17, 2007

Science Blogging Conference in 3 Days

There are still a few more seats available for the Science Blogging Conference in Chapel Hill, NC this Saturday Jan 20, 2007.

Here is a rough agenda for my breakout session on Open Notebook/Open Source Science:

This session will cover the dissemination of primary scientific information via blogs, wikis and other non-traditional vehicles.

Types of information.

* raw experimental data (Open Notebook Science)
* analyzed data
* hypotheses
* “failed” experiments
* generalized protocols
* traditional article format

Issues.

* Intellectual Property
* Referencing and claims to priority
* Academic Validation
* Peer Review – mandatory and elective

Opportunities.

* Increasing productivity in terms of universally usable knowledge units
* Making explicit the nature and quantity of work in collaborations
* Using semantically rich formats and automation at zero publication cost – is this the way to the technological singularity?

The Misbehaving Isonitrile

For some time now we have been keen on using 2-morpholinoethyl isonitrile in our Ugi reaction attempts. The main advantage is that it does not stink, as do most isonitriles. We were encouraged by the report of its successful use (over 60% yield) in at least one Ugi reaction (Harriman 1997).

In the process of trying to debug our reactions we mixed it only with boc-glycine and found that it was consumed within minutes (EXP049). The isonitrile functionality is particularly convenient to track by NMR because the two-bond H-N and one-bond C-N couplings provide characteristic triplets with peaks of equal height. In the case of the methylene group next to the isonitrile, this shows up as a pretty triplet of triplets (see below).

Unfortunately the reaction seems essentially intractable but the NMRs are available if anyone has a hypothesis to test. The tertiary amine has been reported (Polyakov 1983) to cyclize to a spiro structure by attacking the carbon end of the isonitrile in the presence of tosic or hydrochloric acid. But I would not think that a carboxylic acid like boc-glycine is strong enough for that and, even if it were, the NMRs would be much simpler if that were the dominant process.

This behavior does seem limited to this particular isonitrile, since Khalid just showed that benzyl isonitrile is stable in the presence of boc-glycine in methanol (EXP050) for at least a day.

I guess that means back to the stink...




Tags
InChI=1/C7H12N2O/c1-8-2-3-9-4-6-10-7-5-9/h2-7H2
2-morpholinoethyl isonitrile

Monday, January 15, 2007

Egon's Delicious Feed

I have been meaning to post about this for a while. Egon has a del.icio.us feed of items that interest him. Very high signal to noise for cheminformatics, open science and related issues.

I particularly liked the Second Life/Real Life interface post this morning. This may be another pathway to open scientific experimentation.

Sunday, January 14, 2007

CML and Reaction Monitoring

Egon's quick follow-up on my previous post about automating the analysis of our NMR monitoring spectra is prompting me to get more specific.

Here is a file that contains the 4 and 13 minute H NMR scans after mixing veratraldehyde and 5-methylfurfurylamine in methanol-d4 (EXP046). The CML was generated using JSpecView and just appended. In lieu of a better solution I have put the time information in the title. For example, EXP046A004min and EXP046A013min. We can use VBA in Excel to start crunching numbers with that.
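Until there is a better way to encode the time, titles of this form are at least easy to parse. The Python sketch below assumes that a title is always "EXP" plus digits, then a single capital letter, then the minutes, then "min". That reading of the pattern is an assumption based on the two examples above, and the function name is hypothetical.

```python
import re

# Assumed layout: experiment code, one-letter label, minutes, literal "min".
TITLE_PATTERN = re.compile(r"^(EXP\d+)([A-Z])(\d+)min$")


def parse_title(title):
    """Split a title such as 'EXP046A004min' into (experiment, label, minutes)."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None  # the title does not follow the assumed pattern
    experiment, label, minutes = match.groups()
    return experiment, label, int(minutes)


for title in ["EXP046A004min", "EXP046A013min"]:
    print(parse_title(title))
```

The same split could be done in Excel VBA with string functions; the point is only that the elapsed time can be recovered from the title without manual editing.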

If there is a better way to express "4 minutes after mixing these 2 compounds" then I would like to adopt it to make our reactions as semantically rich and as universally usable as possible. Egon suggested CMLReact would be useful but we need a little help to get started on the right track.

Peter, any advice?

Tags
InChI=1/C6H9NO/c1-5-2-3-6(4-7)8-5/h2-3H,4,7H2,1H3
5-methylfurfurylamine
InChI=1/C9H10O3/c1-11-8-4-3-7(6-10)5-9(8)12-2/h3-6H,1-2H3
veratraldehyde

Anatomy of a Ugi reaction

Further extending our use of JSpecView to monitor our reactions by NMR, we have been able to generate good kinetics information for the first step of the Ugi reaction, when an imine forms.

We learned that imine formation is about ten times faster in methanol than in chloroform and that our aromatic aldehydes react about 3 orders of magnitude slower than phenylacetaldehyde. Furthermore, phenylacetaldehyde suffers from so many side reactions on the time scale of our experiments that we probably won't revisit it until we are comfortable with the well behaved aromatic aldehydes.

Even though we have not yet obtained our Ugi products, this information certainly could be leveraged for other purposes. In fact there is so much information in these NMR monitoring spectra that it reminds me of the situation in bioinformatics where large volumes of data (e.g. genomic) are generated and shared for unlimited analysis by the larger community.

In a typical organic chemistry paper, the experimental section only reports on the quantity of starting materials, a summary of the reaction conditions, the purification and the yield and characterization of the final products. Additional information is usually omitted if it does not fit into the arguments of the paper deemed essential by the author and reviewers.

In order to be able to understand a reaction it is necessary to be able to dissect and interrogate as much of its information space as possible. A low yield for a particular reaction is very little to go on when it is not possible to test hypotheses of possible side reactions.

A good example of a surprising (to me at least) insight on our attempts at the Ugi reaction is shown in the image below (from EXP046). This is a plot of the concentration of imine and aldehyde after addition of the acid and isonitrile. Even though the imine formation was confirmed to be near complete (85%), addition of the last two components actually reversed the ratio of aldehyde to imine over the following 2 hours, only to be reversed again over the course of several days. The aldehyde concentration not only changed relative to the imine but actually increased in absolute terms over the first 2 hours. Furthermore, the combined quantity of imine and aldehyde remained steady after the first 2 hours, indicating that if the Ugi product did form to some extent, it did not continue to form after this initial brief period.

In order to generate this plot, we had to manually read the integration of the aldehyde and imine peaks and relate those to the integration of the entire spectrum. This is tedious and probably fairly prone to human error. However, since the raw data are openly available, I would hope that any significant errors would get flagged at some point.

For that reason, as well as the amount of data being generated, this would seem to be a good place to seriously interface with automation. We are currently representing the NMR spectra as compressed JCAMP files, human readable using the JSpecView applet. What I am thinking is that we could stack them all in one file using CML. The standalone JSpecView has the ability to convert JCAMP to several formats, including CML, although there is no command line interface to do this yet (Robert Lancashire is working on it). We would add the basic necessary metadata, such as the chemicals that were used and the elapsed time for each spectrum.

These reaction profile files could be accessed automatically by anyone for any purpose. The first task that I would like to see performed is the plot of the integration for a given spectral area relative to the integration of the entire spectrum. For example, in the case of the spectra below, the aldehyde profile would be generated from the 9.0-10.0 ppm range and the imine from the 8.0-9.0 ppm range.
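A minimal Python sketch of that first task follows, assuming each spectrum is available as a list of (ppm, intensity) points and keyed by elapsed minutes. The data are invented and the function name is hypothetical. The ranges are treated as half-open so that a point at exactly 9.0 ppm is not counted in both.

```python
def relative_integral(points, low_ppm, high_ppm):
    """Integration of low_ppm <= ppm < high_ppm as a fraction of the whole spectrum."""
    total = sum(y for _, y in points)
    if total == 0:
        raise ValueError("the whole spectrum integrates to zero")
    in_range = sum(y for x, y in points if low_ppm <= x < high_ppm)
    return in_range / total


# Invented spectra keyed by minutes after mixing (not real EXP046 data).
spectra = {
    4: [(9.8, 6.0), (8.3, 2.0), (3.9, 12.0)],
    13: [(9.8, 2.0), (8.3, 6.0), (3.9, 12.0)],
}
profile = {
    minutes: {
        "aldehyde": relative_integral(points, 9.0, 10.0),
        "imine": relative_integral(points, 8.0, 9.0),
    }
    for minutes, points in spectra.items()
}
for minutes in sorted(profile):
    print(minutes, profile[minutes])
```

Dividing by the integration of the entire spectrum gives a unitless fraction that can be compared between scans.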

I suspect that in the near future it will be easier for organic chemists to just automate the execution of experiments with rich data outputs rather than walk the minefield of copyrighted materials and commercial databases for a few measly data points.



Tags
InChI=1/C6H9NO/c1-5-2-3-6(4-7)8-5/h2-3H,4,7H2,1H3
5-methylfurfurylamine

InChI=1/C9H10O3/c1-11-8-4-3-7(6-10)5-9(8)12-2/h3-6H,1-2H3
veratraldehyde

Creative Commons Attribution Share-Alike 2.5 License