Friday, November 20, 2009

CAS curates strychnine m.p. - ChemInfo Class 9

What is going to distinguish chemistry databases as we move forward in this Web2.0 world?

If I was unsure of it when I started teaching Chemical Information Retrieval 2 months ago, I certainly got my answer yesterday afternoon. Cristian Dumitrescu from CAS contacted me to discuss the problems I had encountered when attempting to use SciFinder to find the melting point of strychnine. He had read my blog post and wanted to make sure he understood the problem. So I had a conference call with him and a CAS colleague and I explained that several m.p. values corresponded to strychnine salts instead of the free base. They agreed to rectify the situation.

Apparently Cristian stays on top of what is being said about CAS products from various sources, including the blogosphere. I think that what will distinguish chemistry databases as we move forward is precisely this type of proactivity and responsiveness.

There is a plethora of databases out there to search for chemical information. Most of them contain a surprising amount of incorrect data. My students are in the process of demonstrating that with their assignment on finding 5 sources for 5 properties of a chemical of their choice. When they are done in 2 weeks I'll post about that, perhaps with a list of the 10 worst data points.

CAS is an example of a commercial database. But the same principle applies to free databases as well.

Consider the glatiramer acetate problem I reported on previously. ChemSpider immediately removed the entry because a random polymer was being incorrectly represented as a physical mixture of amino acids. As far as I know no other free databases have corrected the problem, although contact information for people running various databases was provided by Michael Kuhn and Egon Willighagen on FriendFeed.

I spoke with Cristian about the problem and he said he would look into it. A search for glatiramer acetate on SciFinder shows that the entry currently has a problem. The text correctly explains that this is a polymer, but the empirical formula looks like a physical mixture of amino acids, with an extra H2O per unit that should not be there after amide formation. But this is minor compared to the problems I reported on previously - for example there were no incorrectly calculated molecular properties, although the images did not represent the structure of the polymer.
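
The "extra H2O per unit" point can be checked with simple formula arithmetic: each amide bond formed releases one water, so a unit inside the chain is the free amino acid minus H2O. Here is a minimal sketch of that bookkeeping. It assumes the four amino acids usually listed for glatiramer (alanine, glutamic acid, lysine, tyrosine), which the post itself does not name, and it ignores chain ends and the acetate counterion. The helper names are made up for illustration.

```python
from collections import Counter

# Molecular formulas of the free amino acids (assumed composition, see lead-in).
AMINO_ACIDS = {
    "alanine": Counter({"C": 3, "H": 7, "N": 1, "O": 2}),
    "glutamic acid": Counter({"C": 5, "H": 9, "N": 1, "O": 4}),
    "lysine": Counter({"C": 6, "H": 14, "N": 2, "O": 2}),
    "tyrosine": Counter({"C": 9, "H": 11, "N": 1, "O": 3}),
}
WATER = Counter({"H": 2, "O": 1})


def residue_formula(free_acid):
    """Formula of a unit inside the chain: free amino acid minus one H2O."""
    residue = Counter(free_acid)  # copy so the table above is not modified
    residue.subtract(WATER)
    return residue


def hill(formula):
    """Format a formula with C first, H second, then other elements alphabetically."""
    others = sorted(e for e in formula if e not in ("C", "H") and formula[e])
    order = [e for e in ("C", "H") if formula[e]] + others
    return "".join(e + (str(formula[e]) if formula[e] != 1 else "") for e in order)


for name, free_acid in AMINO_ACIDS.items():
    print(f"{name}: {hill(free_acid)} (free) vs {hill(residue_formula(free_acid))} (in chain)")
```

A formula built by summing the "free" column is what a physical mixture would give; a polymer entry should be built from the "in chain" column.
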

This has been a good week for curation. Yesterday Nick successfully completed the evaluation of the stereochemistry of nargenicin and submitted the corrected SMILES to ChemSpider. Tony Williams has already incorporated the fix and now a search for nargenicin on ChemSpider gives just one entry.

Tony has provided several such puzzles for my students and a few are close to resolving the structures. The main problem is that the structures were entered into ChemSpider with at least one undefined stereocenter. Finding the correct structure from the primary literature can be very challenging for structures of this complexity but it certainly puts the chemical information retrieval methods I am teaching my students to good use.

The class itself was short and mainly covered details of student assignments, since we won't have much time for a workshop during the last class on December 3, 2009. Rajarshi Guha and Tony Williams will be my guest lecturers on that day.

Thursday, November 05, 2009

Sixth Cheminfo Retrieval class: What is the m.p. of strychnine?

It would seem to be a simple task to find the melting point of a well-known alkaloid like strychnine. Our quest to answer that question - and to find other simple properties - in class using both freely available and commercial databases reveals how treacherous it can be. In the end we don't find an unambiguous answer but we uncover enough information for many applications.

The take home message is that chemists need to be constantly paranoid that their information - whether from their lab or the most prestigious journals - can easily be wrong. Strategies such as finding multiple sources and investigating the experimental details provided in the primary sources are demonstrated to diminish uncertainty. But this is often not easy or quick.
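
The "multiple sources" strategy can be made a little more systematic by recording where each value came from and whether it was traced back to a primary reference. The following is only a rough sketch of that idea: the record layout, the function names and the tolerance are assumptions, and the numbers in the usage example are placeholders, not real strychnine data.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Reported:
    source: str                        # where the value was looked up
    value_c: float                     # melting point in degrees Celsius
    primary_ref: Optional[str] = None  # original measurement, if it was traced


def outliers(reports: List[Reported], tolerance_c: float) -> List[Reported]:
    """Values further than tolerance_c from the median of all reported values."""
    mid = median(r.value_c for r in reports)
    return [r for r in reports if abs(r.value_c - mid) > tolerance_c]


def untraced(reports: List[Reported]) -> List[Reported]:
    """Values for which no primary reference has been found yet."""
    return [r for r in reports if not r.primary_ref]


# Placeholder values for illustration only.
reports = [
    Reported("database A", 100.0),
    Reported("database B", 101.0, primary_ref="journal article (hypothetical)"),
    Reported("database C", 150.0),
]
print([r.source for r in outliers(reports, tolerance_c=5.0)])
print([r.source for r in untraced(reports)])
```

A flagged outlier is not necessarily wrong. It is a prompt to read the experimental details, for instance to check whether the value belongs to a salt rather than the free base.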

Here is a summary of the lecture:

This is the lecture from the sixth Chemical Information Retrieval class at Drexel University on October 29, 2009. It starts with a review of some of the new questions answered by students from the chemistry publishing FAQ, which covers patent information and accessing electronic journals at Drexel. Tony Williams submitted a puzzle to resolve conflicting structures in ChemSpider, which is too difficult to be a regular assignment. It requires re-analyzing spectroscopic data in papers where stereochemical assignments are determined. An example is paromomycin, which has three entries. The regular assignment for the week is then introduced and it involves obtaining 5 different sources each for 5 different properties for a molecule of the student's choosing. To demonstrate how to do the assignment, strychnine is chosen as an example. Melting point information is obtained from ChemSpider (ultimately an MSDS sheet), Wikipedia, Wolfram Alpha and in a JACS article via SciFinder. By investigating primary sources several errors are found in SciFinder, where the recorded melting points correspond to salts of the alkaloid. Difficulties in finding primary sources for the melting point from Wikipedia are highlighted. For LD50 information Wikipedia did not even provide proper units (mg instead of mg/kg and no animal or route specified). The importance of ChemSpider predicted values for density and boiling point is demonstrated as a corroborating tool. In the end the reported melting point range of strychnine from the JACS paper did not even overlap with the reference to which it was compared. The exercise is meant to highlight the importance of caution in obtaining values from all available sources. Even the seemingly simple question of determining the melting point of a well-known alkaloid cannot be answered definitively.
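
The last check in the summary, whether a reported melting point range overlaps with the reference range, is a simple interval comparison. A minimal sketch follows; the function name is invented here and the ranges in the usage lines are placeholders, not the values from the JACS paper.

```python
def ranges_overlap(a, b):
    """Return True if two (low, high) melting point ranges share at least one point.

    Each range is a pair of numbers in the same unit. The pairs are sorted
    first, so (102, 100) is treated the same as (100, 102).
    """
    (a_lo, a_hi), (b_lo, b_hi) = sorted(a), sorted(b)
    return a_lo <= b_hi and b_lo <= a_hi


# Placeholder ranges for illustration only.
print(ranges_overlap((100.0, 102.0), (101.0, 103.0)))  # ranges share 101-102
print(ranges_overlap((100.0, 102.0), (103.0, 105.0)))  # gap between 102 and 103
```

Touching endpoints count as overlap in this sketch. A stricter comparison could also take the stated uncertainty of each measurement into account.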

Creative Commons Attribution Share-Alike 2.5 License