Thursday, November 05, 2009

Sixth Cheminfo Retrieval class: What is the m.p. of strychnine?

It would seem to be a simple task to find the melting point of a well-known alkaloid like strychnine. Our quest in class to answer that question, and to find other simple properties, using both freely available and commercial databases reveals how treacherous it can be. In the end we don't find an unambiguous answer but we uncover enough information for many applications.

The take home message is that chemists need to be constantly paranoid that their information - whether from their lab or the most prestigious journals - can easily be wrong. Strategies such as finding multiple sources and investigating the experimental details provided in the primary sources are demonstrated to diminish uncertainty. But this is often not easy or quick.

Here is a summary of the lecture:

This is the lecture from the sixth Chemical Information Retrieval class at Drexel University on October 29, 2009. It starts with a review of some of the new questions answered by students from the chemistry publishing FAQ, which covers patent information and accessing electronic journals at Drexel. Tony Williams submitted a puzzle to resolve conflicting structures in ChemSpider, which is too difficult to be a regular assignment. It requires re-analyzing spectroscopic data in papers where stereochemical assignments are determined. An example is paromomycin, which has three entries. The regular assignment for the week is then introduced: it involves obtaining five different sources for each of five different properties of a molecule of the student's choosing. To demonstrate how to do the assignment, strychnine is chosen as an example. Melting point information is obtained from ChemSpider (ultimately an MSDS), Wikipedia, Wolfram Alpha and a JACS article found via SciFinder. By investigating primary sources, several errors are found in SciFinder, where the recorded melting points correspond to salts of the alkaloid. Difficulties in finding primary sources for the melting point from Wikipedia are highlighted. For LD50 information Wikipedia did not even provide proper units (mg instead of mg/kg, and no animal or route specified). The importance of ChemSpider predicted values for density and boiling point is demonstrated as a corroborating tool. In the end the reported melting point range of strychnine from the JACS paper did not even overlap with the reference to which it was compared. The exercise is meant to highlight the importance of caution in obtaining values from all available sources. Even the seemingly simple question of determining the melting point of a well-known alkaloid cannot be answered definitively.
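The overlap check mentioned in the summary is easy to automate once several reported ranges have been collected. What follows is only a minimal sketch: the `MeltingRange` type, the function names and the source labels are made up for illustration, and the numbers are placeholders, not the values found for strychnine in class.

```python
from typing import List, NamedTuple, Tuple


class MeltingRange(NamedTuple):
    """One reported melting point range, in degrees Celsius."""
    source: str
    low_c: float
    high_c: float


def overlaps(a: MeltingRange, b: MeltingRange) -> bool:
    """Two closed ranges overlap if each one starts before the other ends."""
    return a.low_c <= b.high_c and b.low_c <= a.high_c


def non_overlapping_pairs(ranges: List[MeltingRange]) -> List[Tuple[str, str]]:
    """Return every pair of sources whose reported ranges do not overlap."""
    pairs = []
    for i, a in enumerate(ranges):
        for b in ranges[i + 1:]:
            if not overlaps(a, b):
                pairs.append((a.source, b.source))
    return pairs


# Placeholder numbers for illustration only; these are NOT measured values.
reports = [
    MeltingRange("source A", 100.0, 102.0),
    MeltingRange("source B", 101.5, 103.0),
    MeltingRange("source C", 110.0, 112.0),
]

conflicts = non_overlapping_pairs(reports)
print(conflicts)
```

A pair that shows up in `conflicts` is a signal to go back to the primary sources, for example to check whether one of the values actually belongs to a salt.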


Wednesday, November 04, 2009

Glatiramer Acetate Cheminformatics Problem and Fifth ChemInfo Retrieval Class

It started out innocently enough. One of my students picked the multiple sclerosis drug glatiramer acetate for his project in my Chemical Information Retrieval class. This ultimately resulted in the removal of this substance from ChemSpider.

The problem is that this drug is a polymer but it is represented in many places as a simple mixture of acetic acid and 4 amino acids (L-Ala, L-Glu, L-Lys, and L-Tyr). See for example Wikipedia, PubChem and DrugBank.


The SMILES representation is entered as 5 molecules joined by periods:
CC(O)=O.C[C@H](N)C(O)=O.NCCCC[C@H](N)C(O)=O.N[C@@H](CCC(O)=O)C(O)=O.N[C@@H](CC1=CC=C(O)C=C1)C(O)=O
This is probably the source of all subsequent miscalculations - such as a molecular weight of 623.7 (it actually has an average MW one order of magnitude larger), molecular formula C25H45N5O13, Topological Polar Surface Area of 374, Rotatable Bond Count 13, a 3D structure that is nowhere near reality, etc.
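A quick way to see where these numbers come from is to add up the five separate components. The sketch below uses only the standard library, with the component formulas and rounded standard atomic masses typed in by hand; the function names are illustrative and it is not the calculation any of these databases actually runs.

```python
from collections import Counter

# Rounded standard atomic masses (g/mol).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Molecular formulas of the five components, in the order of the SMILES string.
COMPONENTS = {
    "acetic acid": {"C": 2, "H": 4, "O": 2},
    "L-alanine": {"C": 3, "H": 7, "N": 1, "O": 2},
    "L-lysine": {"C": 6, "H": 14, "N": 2, "O": 2},
    "L-glutamic acid": {"C": 5, "H": 9, "N": 1, "O": 4},
    "L-tyrosine": {"C": 9, "H": 11, "N": 1, "O": 3},
}

SMILES = (
    "CC(O)=O.C[C@H](N)C(O)=O.NCCCC[C@H](N)C(O)=O."
    "N[C@@H](CCC(O)=O)C(O)=O.N[C@@H](CC1=CC=C(O)C=C1)C(O)=O"
)


def combined_formula(components):
    """Sum the atom counts over all components of the mixture."""
    total = Counter()
    for atoms in components.values():
        total.update(atoms)
    return dict(total)


def formula_string(atoms):
    """Write the atom counts in C, H, N, O order."""
    return "".join(f"{el}{atoms[el]}" for el in ("C", "H", "N", "O") if atoms.get(el))


def molecular_weight(atoms):
    """Weight of the formula from the rounded atomic masses."""
    return sum(ATOMIC_MASS[el] * n for el, n in atoms.items())


mixture = combined_formula(COMPONENTS)
print(len(SMILES.split(".")))               # number of dot-separated fragments
print(formula_string(mixture))              # C25H45N5O13
print(round(molecular_weight(mixture), 1))  # 623.7
```

The listed formula and molecular weight are simply those of one acetic acid plus one of each amino acid, which says nothing about the polymer itself.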

Glatiramer acetate is reported to bind to MHC molecules. If these molecular descriptors are used in any type of QSAR analysis this will just add noise to the models.

ChemSpider does not keep track of polymers, except perhaps for some well defined oligopeptides that can be represented by a single SMILES. Consequently glatiramer acetate was removed from the database.

It is difficult to apply common cheminformatics tools to this substance. It might be tempting to try to place it in polypeptide/protein databases such as BioPD. But it does not have a well defined length or composition. In fact it is a random co-polymer so it cannot even be represented by a repeating structure, such as one might do for polystyrene.

In order to generate meaningful molecular descriptors for QSAR applications I suppose one strategy would be to generate a collection of SMILES representing the average composition of the drug in terms of ratios of amino acids and molecular weights. Each structure would generate molecular descriptors and 3D structures that are far more realistic than those currently listed. Perhaps it would turn out that only some of these polymer structures interact with MHC molecules. (If this has already been done please forgive the oversight - I didn't research this thoroughly. By the end of the term we should know more from the student's report.)
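As a rough sketch of what the first step of that strategy might look like, the code below draws random residue sequences from a given composition. Everything here is an assumption made for illustration: the function name, the chain length, and especially the equal fractions, which are placeholders and not the real composition of the drug. Turning each sequence into a SMILES string would still require a cheminformatics toolkit.

```python
import random
from typing import Dict, List


def sample_copolymers(fractions: Dict[str, float], length: int,
                      count: int, seed: int = 0) -> List[str]:
    """Draw `count` random sequences of `length` residues.

    Each position is chosen independently, weighted by the given
    fractions, which is one simple model of a random co-polymer.
    """
    rng = random.Random(seed)  # seeded so the ensemble is reproducible
    residues = list(fractions)
    weights = [fractions[r] for r in residues]
    return ["".join(rng.choices(residues, weights=weights, k=length))
            for _ in range(count)]


# One-letter codes for Ala, Glu, Lys, Tyr. The equal fractions below are
# PLACEHOLDERS and must be replaced by the measured composition.
PLACEHOLDER_FRACTIONS = {"A": 0.25, "E": 0.25, "K": 0.25, "Y": 0.25}

chains = sample_copolymers(PLACEHOLDER_FRACTIONS, length=50, count=3)
for chain in chains:
    print(chain)
```

Descriptors could then be computed for each member of the ensemble and reported as a distribution rather than as a single value.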

The chronological summary of the lecture is as follows:

The fifth Chemical Information Retrieval class on October 22, 2009 started out with covering the new 3D structure viewer introduced recently at PLoS ONE to provide ideas for students doing a multimedia project this term. The current student answers to the chemistry publishing FAQ are then discussed. The reason for removing glatiramer acetate from ChemSpider is explained and a few databases (Wikipedia, PubChem, DrugBank) are visited that still contain the incorrect SMILES, 3D structure and related properties. An overview of an Open Access site (OAD) suggested by Bill Hooker is provided to suggest additional questions for the FAQ. Examples of questions discussed include primary and secondary sources, peer review, article level metrics (a PLoS ONE article on malaria is used as an example), citation searching, Impact Factors and whether one should use one's real name in the blogosphere. Databases Scirus, Web of Science and PubMed are also reviewed.


Sunday, February 01, 2009

Maintaining Solubility Data Provenance from Wikipedia to Lab Notebook

I recently noticed a bunch of incoming clicks from the Wikipedia entry for benzoic acid to our Open Notebook Science solubility challenge wiki on the ONSC sitemeter.

It turns out that Andrew Lang added a link from the Properties section covering the solubility of benzoic acid in a few organic solvents.

Instead of pointing directly to the individual notebook pages for each measurement, clicking on the reference takes us to a page summarizing all of the solubility measurements of benzoic acid in various solvents. The values are averaged and a standard deviation is provided.
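The summary statistics on that page are straightforward to reproduce from the individual measurements. This is a minimal sketch and not Andy's code: the `summarize` helper is hypothetical, the numbers are placeholders rather than real solubility values, and the use of the sample standard deviation (instead of the population one) is my assumption.

```python
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, Optional[float]]:
    """Return (average, sample standard deviation).

    The standard deviation is None when there is only one measurement,
    because it is undefined in that case.
    """
    if len(values) < 2:
        return mean(values), None
    return mean(values), stdev(values)


# Placeholder numbers for illustration only; NOT real solubility data.
measurements = [1.0, 2.0, 3.0]

average, spread = summarize(measurements)
print(average, spread)
```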

Clicking on one of these links takes us to a summary page of each measurement for one of the solvents.
Clicking a link from that collection takes us to the laboratory notebook page (for example EXP005) on the wiki and ultimately to a Google Spreadsheet with all of the calculations for that measurement.

The beautiful thing is that the original url can't be any easier to create:
http://oru.edu/cccda/sl/solubility/allsolvents.php?solute=benzoic acid

And this same link will always link to the best possible values as more measurements are made and erroneous values are removed if found. With Andy's current code, if measurements at different temperatures are made, a plot is provided.
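For anyone who does want to generate these links from code, the url pattern shown above can be assembled with the Python standard library. This is only a sketch: the helper name is made up, and encoding the space as %20 is my assumption about what the server accepts (a browser will normally encode the space in the hand-typed url on its own).

```python
from urllib.parse import quote, urlencode

# Base address taken from the url shown above.
BASE_URL = "http://oru.edu/cccda/sl/solubility/allsolvents.php"


def solubility_summary_url(solute: str) -> str:
    """Build the summary-page url for a solute name.

    quote_via=quote encodes spaces as %20 instead of the default '+'.
    """
    query = urlencode({"solute": solute}, quote_via=quote)
    return f"{BASE_URL}?{query}"


print(solubility_summary_url("benzoic acid"))
```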

Now we're getting into some interesting territory. It is now so simple to refer to the solubility data of the ONS challenge that people who don't know the first thing about wikis, blogs or coding can start to participate in the use of ONS measurements.

The Open Notebook can be thought of like insurance. We don't want to have to use it - but if there is a problem we have the ability to trace the chain of provenance all the way to the source.

By the way, notice that this is the only link in the Properties section of this Wikipedia article with a reference...


Wednesday, October 08, 2008

Open Notebook Science on Wikipedia

Andy Lang re-created the Open Notebook Science page on Wikipedia a few days ago. The last time we tried, over a year ago, the page was quickly deleted as a neologism and redirected to the Open Data page.

The page initially got marked for deletion again but this time strong support from the FriendFeed crowd saved us. We still have to work it a bit but I think it should stay. Many thanks to Cameron Neylon, Michael Nielsen, Richard Akerman, Deepak Singh, Bill Hooker, Neil Saunders, Daniel Mietchen and others.


Creative Commons Attribution Share-Alike 2.5 License