Monday, November 23, 2009

Communicating Chemistry

In October 2008, I participated in an NSF workshop on eChemistry: New Models for Scholarly Communication in Chemistry. Theresa Velden and Carl Lagoze have now published their reports. Here are the details from their press release:
Public Release of White Paper: The Value of New Scientific Communication Models for Chemistry

Ithaca, NY, November 23, 2009 – The results of a National Science Foundation sponsored workshop in October 2008 are now available in a publicly accessible white paper, 'The Value of New Scientific Communication Models for Chemistry'. An article, 'Communicating Chemistry', summarizing this white paper is published in the December issue of Nature Chemistry.

This white paper is intended as a starting point for discussion on the possible future of scientific communication in chemistry, the value of new models of scientific communication enabled by web-based technologies, and the necessary future steps to achieve the benefits of those new models. It opens with an overview of publishing reform and e-science initiatives in other disciplines, such as open access, data publishing, and preprint servers. Following this, it reviews the scientific communication system in chemistry, including the established system of journals and databases, and recent web-based innovations and experiments. Next, it analyzes the distinguishing aspects of chemistry that may influence its communication practices and have an impact on the manner in which science communication in chemistry will further evolve.

The white paper concludes with a call for a more comprehensive symposium on this subject. In recognition that the analysis presented in the white paper is yet incomplete, and provides only a starting point for discussion, the proposed international symposium would engage a broad range of participants who would expand on the subjects introduced in the white paper and issue calls for actions and research initiatives. Work on finding funding for this symposium is now in progress.

Members of the chemistry community and other interested parties are encouraged to join in a critical and constructive assessment of the content of the white paper and the issues it addresses. An online forum has been set up for this community discussion, and other venues for discussion at conferences and workshops are being planned.

Friday, November 20, 2009

CAS curates strychnine m.p. - ChemInfo Class 9

What is going to distinguish chemistry databases as we move forward in this Web2.0 world?

If I were unsure of it when I started teaching Chemical Information Retrieval 2 months ago, I certainly got my answer yesterday afternoon. Cristian Dumitrescu from CAS contacted me to discuss the problems I had encountered when attempting to use SciFinder to find the melting point of strychnine. He had read my blog post and wanted to make sure he understood the problem. So I had a conference call with him and a CAS colleague and explained that several m.p. values corresponded to strychnine salts instead of the free base. They agreed to rectify the situation.

Apparently Cristian stays on top of what is being said about CAS products from various sources, including the blogosphere. I think that what will distinguish chemistry databases as we move forward is precisely this type of proactivity and responsiveness.

There is a plethora of databases out there to search for chemical information, and most of them contain surprisingly significant amounts of incorrect data. My students are in the process of demonstrating that with their assignment on finding 5 sources for each of 5 properties of a chemical of their choice. When they are done in 2 weeks I'll post about the results, perhaps with a top 10 of the worst data points.

CAS is an example of a commercial database. But the same principle applies to free databases as well.

Consider the glatiramer acetate problem I reported on previously. ChemSpider immediately removed the entry because a random polymer was being incorrectly represented as a physical mixture of amino acids. As far as I know no other free databases have corrected the problem, although contact information for people running various databases was provided by Michael Kuhn and Egon Willighagen on FriendFeed.

I spoke with Cristian about the problem and he said he would look into it. A search for glatiramer acetate on SciFinder shows that there is currently a problem: the text correctly explains that this is a polymer, but the empirical formula corresponds to a simple physical mixture of amino acids, with an extra H2O per unit that should not be there after amide formation. Still, this is minor compared to the problems I reported on previously - for example, there were no incorrectly calculated molecular properties, although the images did not represent the structure of the polymer.
This has been a good week for curation. Yesterday Nick successfully completed the evaluation of the stereochemistry of nargenicin and submitted the corrected SMILES to ChemSpider. Tony Williams has already incorporated the fix and now a search for nargenicin on ChemSpider gives just one entry.

Tony has provided several such puzzles for my students and a few are close to resolving the structures. The main problem is that the structures were entered into ChemSpider with at least one undefined stereocenter. Finding the correct structure from the primary literature can be very challenging for structures of this complexity but it certainly puts the chemical information retrieval methods I am teaching my students to good use.

The class itself was short - covering mainly details of student assignments - since the last class on December 3, 2009 will be devoted to a workshop and we won't have much time for those details then. Rajarshi Guha and Tony Williams will be my guest lecturers on that day.


Tuesday, November 17, 2009

Cheminfo Retrieval 8th class

This is the lecture from the 8th Chemical Information Retrieval class at Drexel University on November 12, 2009. It starts with a demonstration of how to use ChemSketch and ChemSpider to display and manipulate chemical structures, especially those with complicated stereochemistry. Technical issues with exchanging SMILES between the two platforms are addressed, as are optimization of 3D structures and inversion of chiral centers. Microsoft Paint is used to process screen captures into images that can be uploaded to Wikispaces. ChemSpider is also used to generate predicted properties, and SDBS is used to retrieve NMR and other spectroscopic data.
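For intuition, here is a naive sketch (my own assumption, not how ChemSketch or ChemSpider handle stereochemistry internally) of what inverting all chiral centers amounts to at the SMILES level: swapping the '@' and '@@' chirality tokens.

```python
# Naive sketch: invert every tetrahedral stereocenter in a SMILES string
# by swapping the '@' and '@@' chirality tokens. This ignores double-bond
# stereo markers (/ and \) and is for illustration only.
def invert_stereocenters(smiles):
    # three-step swap via a placeholder character so that the '@@' and
    # '@' replacements do not clobber each other
    return smiles.replace("@@", "\0").replace("@", "@@").replace("\0", "@")

# flips one alanine enantiomer to its mirror image
print(invert_stereocenters("C[C@@H](N)C(=O)O"))  # C[C@H](N)C(=O)O
```

A real toolkit does this on the perceived molecular graph rather than the raw string, but the string view makes the operation easy to see.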

Sunday, November 15, 2009

Mel Reichman's Drug Discovery Talk

Mel Reichman gave an outstanding presentation at Drexel on November 12, 2009. I think many of our faculty and students benefited from his unique perspective on high throughput drug discovery and the story of Vioxx from both chemistry and intellectual property considerations.

Unfortunately the screen resolution was changed during the presentation because the projector was not working properly, so some of the screen capture video is off-center. I'm embedding the slides as well so you can see all the details.
Mel Reichman, senior investigator and director of the LIMR Chemical Genomics Center at the Lankenau Institute for Medical Research, presents at the chemistry department at Drexel University on November 12, 2009. Introduction by Jean-Claude Bradley.

Modern drug discovery by high-throughput screening (HTS) begins with testing hundreds of thousands of compounds in biological assays. The confirmed hit rate for typical HTS is less than 0.5%; therefore, 99.5% of the costs of HTS are for generating null data. Orthogonal convolution of compound libraries (OCL) is 500% more efficient than present HTS practice. The OCL method combines 10 compounds per well. An advantage of this method is that each compound is represented twice in two separately arrayed pools. The potential for the approach to better enable academic centers of excellence to validate medicinally relevant biological targets is discussed.
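As a toy illustration of the pooling idea (my own sketch, not Reichman's exact OCL protocol): arranging 100 compounds in a 10x10 grid and pooling each row and each column puts every compound in exactly two pools, replaces 100 singleton wells with 20 pooled wells (the 5x efficiency gain), and lets a hit be identified by intersecting the active row and column pools.

```python
# Toy orthogonal pooling: 100 compounds on a 10x10 grid, identified by
# (row, col). Pool each row and each column, so every compound appears
# in exactly two pools and 20 pooled wells replace 100 singleton wells.
N = 10
row_pools = [[(r, c) for c in range(N)] for r in range(N)]
col_pools = [[(r, c) for r in range(N)] for c in range(N)]

def deconvolute(active_rows, active_cols):
    """Intersect active row pools and column pools to name hit compounds."""
    return [(r, c) for r in active_rows for c in active_cols]

# a hit in row pool 3 and column pool 7 points to compound (3, 7)
print(deconvolute([3], [7]))  # [(3, 7)]
```

With several simultaneous hits the intersection becomes ambiguous, which is presumably why a real scheme like OCL also relies on careful library arraying and confirmation assays.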


Thursday, November 12, 2009

Liz Lyon on Open Science at web-scale

Liz Lyon from UKOLN has just published a JISC report on Open science at web-scale: Optimising participation and predictive potential. This is a very thorough 45 page document that will serve the Open Science community well as a reference for supporting open initiatives. UsefulChem and Open Notebook Science are covered in what I think is a balanced way.
This report has attempted to draw together and synthesise evidence and opinion associated with data-intensive open science from a wide range of sources. The potential impact of data-intensive open science on research practice and research outcomes is both substantive and far-reaching. There are implications for funding organisations, for research and information communities and for higher education institutions.

The original specification for the work was highly selective in its choice of areas to study, and this Report addresses only three of these areas in any depth:

* open science, including open notebook science: making methodologies, data and results available on the Internet through transparent working practices
* citizen science, including volunteer computing: where volunteers who may not have scientific training perform or manage research-related tasks such as observation, measurement or computation
* predictive science: data-driven science which enables the forecasting, anticipation or prediction of specific outcomes.


Wednesday, November 11, 2009

Mel Reichman on Pool Shark’s Cues for More Efficient Drug Discovery

The Drexel Department of Chemistry Seminar Series presents "Pool Shark’s Cues for More Efficient Drug Discovery" on Thursday, November 12, 2009 at 4:30 p.m. in Disque Hall room 109 (32nd Street between Market and Chestnut Streets). Mel Reichman, senior investigator and director of the LIMR Chemical Genomics Center, the Lankenau Institute for Medical Research, is the guest speaker.
Modern drug discovery by high-throughput screening (HTS) begins with testing hundreds of thousands of compounds in biological assays. The confirmed hit rate for typical HTS is less than 0.5%; therefore, 99.5% of the costs of HTS are for generating null data. Orthogonal convolution of compound libraries (OCL) is 500% more efficient than present HTS practice. The OCL method combines 10 compounds per well. An advantage of this method is that each compound is represented twice in two separately arrayed pools. We will discuss results and the potential for the approach to better enable academic centers of excellence to validate medicinally relevant biological targets.

Thursday, November 05, 2009

Sixth Cheminfo Retrieval class: What is the m.p. of strychnine?

It would seem to be a simple task to find the melting point of a well known alkaloid like strychnine. Our quest in class to answer that question - and to find other simple properties - using both freely available and commercial databases reveals how treacherous it can be. In the end we don't find an unambiguous answer, but we uncover enough information for many applications.

The take home message is that chemists need to be constantly paranoid that their information - whether from their lab or the most prestigious journals - can easily be wrong. Strategies such as finding multiple sources and investigating the experimental details provided in the primary sources are demonstrated to diminish uncertainty. But this is often not easy or quick.
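One part of the multiple-sources strategy can be sketched in code: compare reported melting-point ranges pairwise and flag the ones that fail to overlap. The numbers below are placeholders for illustration, not actual literature values for strychnine.

```python
# Sketch: flag pairs of reported melting-point ranges that fail to
# overlap. The ranges below are hypothetical placeholders, not data.
def ranges_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

reports = {
    "source A": (268.0, 270.0),  # hypothetical range, degrees C
    "source B": (269.0, 271.0),  # hypothetical
    "source C": (284.0, 286.0),  # hypothetical
}
names = sorted(reports)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        if not ranges_overlap(reports[x], reports[y]):
            print(f"conflict: {x} {reports[x]} vs {y} {reports[y]}")
```

A conflict flagged this way is only a prompt to go back to the primary sources; it may reflect a salt versus free base mix-up, as with the strychnine values in SciFinder.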

Here is a summary of the lecture:

This is the lecture from the sixth Chemical Information Retrieval class at Drexel University on October 29, 2009. It starts with a review of some of the new questions answered by students from the chemistry publishing FAQ, which covers patent information and accessing electronic journals at Drexel. Tony Williams submitted a puzzle about resolving conflicting structures in ChemSpider, which is too difficult to be a regular assignment: it requires re-analyzing spectroscopic data in papers where stereochemical assignments are determined. An example is paromomycin, which has three entries.

The regular assignment for the week is then introduced; it involves obtaining 5 different sources each for 5 different properties of a molecule of the student's choosing. To demonstrate how to do the assignment, strychnine is chosen as an example. Melting point information is obtained from ChemSpider (ultimately an MSDS sheet), Wikipedia, Wolfram Alpha and, via SciFinder, a JACS article. By investigating primary sources, several errors are found in SciFinder, where the recorded melting points correspond to salts of the alkaloid. Difficulties in finding primary sources for the melting point from Wikipedia are highlighted. For LD50 information Wikipedia did not even provide proper units (mg instead of mg/kg, with no animal or route specified). The usefulness of ChemSpider predicted values for density and boiling point as a corroborating tool is demonstrated. In the end, the melting point range of strychnine reported in the JACS paper did not even overlap with the reference to which it was compared. The exercise is meant to highlight the importance of caution in obtaining values from all available sources: even the seemingly simple question of determining the melting point of a well-known alkaloid cannot be answered definitively.


Wednesday, November 04, 2009

Glatiramer Acetate Cheminformatics Problem and Fifth ChemInfo Retrieval Class

It started out innocently enough. One of my students picked the multiple sclerosis drug glatiramer acetate for his project in my Chemical Information Retrieval class. This ultimately resulted in the removal of this substance from ChemSpider.

The problem is that this drug is a polymer but it is represented in many places as a simple mixture of acetic acid and 4 amino acids (L-Ala, L-Glu, L-Lys, and L-Tyr). See for example Wikipedia, PubChem and DrugBank.

The SMILES representation is entered as 5 molecules joined by periods.
This is probably the source of all subsequent miscalculations - such as a molecular weight of 623.7 (it actually has an average MW one order of magnitude larger), molecular formula C25H45N5O13, Topological Polar Surface Area of 374, Rotatable Bond Count 13, a 3D structure that is nowhere near reality, etc.
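To see where those numbers come from, here is a short check (my own sketch) showing that naively summing the formulas of acetic acid and the four free amino acids reproduces C25H45N5O13 and MW 623.7 exactly - with no water loss from amide bond formation and no notion of chain length.

```python
# Sum the molecular formulas of acetic acid plus the four free amino
# acids, as a dot-disconnected SMILES record implicitly does, and show
# that this yields the erroneous C25H45N5O13 / MW 623.7 values.
from collections import Counter

components = {
    "acetic acid": {"C": 2, "H": 4, "O": 2},
    "L-Ala":       {"C": 3, "H": 7, "N": 1, "O": 2},
    "L-Glu":       {"C": 5, "H": 9, "N": 1, "O": 4},
    "L-Lys":       {"C": 6, "H": 14, "N": 2, "O": 2},
    "L-Tyr":       {"C": 9, "H": 11, "N": 1, "O": 3},
}
weights = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

total = Counter()
for formula in components.values():
    total.update(formula)
mw = sum(weights[el] * n for el, n in total.items())

print(dict(total))   # {'C': 25, 'H': 45, 'O': 13, 'N': 5}
print(round(mw, 1))  # 623.7
```

Every downstream descriptor calculated from that record inherits the same mistake.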

Glatiramer acetate is reported to bind to MHC molecules. If these molecular descriptors are used in any type of QSAR analysis this will just add noise to the models.

ChemSpider does not keep track of polymers, except perhaps for some well defined oligopeptides that can be represented by a single SMILES. Consequently the glatiramer acetate entry was removed from the database.

It is difficult to apply common cheminformatics tools to this substance. It might be tempting to try to place it in polypeptide/protein databases such as BioPD. But it does not have a well defined length or composition. In fact it is a random co-polymer so it can not even be represented by a repeating structure, such as one might do for polystyrene.

In order to generate meaningful molecular descriptors for QSAR applications I suppose one strategy would be to generate a collection of SMILES representing the average composition of the drug in terms of ratios of amino acids and molecular weights. Each structure would generate molecular descriptors and 3D structures that are far more realistic than those currently listed. Perhaps it would turn out that only some of these polymer structures interact with MHC molecules. (If this has already been done please forgive the oversight - I didn't research this thoroughly. By the end of the term we should know more from the student's report)
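A minimal sketch of that strategy follows, with residue fractions that are illustrative placeholders rather than the drug's published molar ratios.

```python
import random

# Sample hypothetical chains of a random copolymer from assumed residue
# fractions. The fractions below are illustrative placeholders only.
RESIDUES = ["A", "E", "K", "Y"]         # Ala, Glu, Lys, Tyr (one-letter)
FRACTIONS = [0.43, 0.14, 0.34, 0.09]    # assumed molar fractions

def sample_chain(length, rng):
    """Draw one plausible residue sequence of the given length."""
    return "".join(rng.choices(RESIDUES, weights=FRACTIONS, k=length))

rng = random.Random(42)                  # seeded for reproducibility
library = [sample_chain(50, rng) for _ in range(10)]
# each sequence could then be converted to a peptide SMILES and fed to a
# descriptor calculator, giving far more realistic values than the
# mixture record
print(len(library), len(library[0]))     # 10 50
```

Averaging descriptors over such a sampled library, or screening its members individually, seems more defensible than trusting numbers computed from a five-molecule mixture.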

The chronological summary of the lecture is as follows:

The fifth Chemical Information Retrieval class on October 22, 2009 started with coverage of the new 3D structure viewer recently introduced at PLoS ONE, to provide ideas for students doing a multimedia project this term. The current student answers to the chemistry publishing FAQ are then discussed. The reason for removing glatiramer acetate from ChemSpider is explained, and a few databases (Wikipedia, PubChem, DrugBank) that still contain the incorrect SMILES, 3D structure and related properties are visited. An overview of an Open Access site (OAD) suggested by Bill Hooker is provided to suggest additional questions for the FAQ. Examples of questions discussed include primary and secondary sources, peer review, article-level metrics (a PLoS ONE article on malaria is used as an example), citation searching, Impact Factors and whether one should use one's real name in the blogosphere. The databases Scirus, Web of Science and PubMed are also reviewed.


Creative Commons Attribution Share-Alike 2.5 License