Glatiramer Acetate Cheminformatics Problem and Fifth ChemInfo Retrieval Class
It started out innocently enough. One of my students picked the multiple sclerosis drug glatiramer acetate for his project in my Chemical Information Retrieval class. This ultimately resulted in the removal of this substance from ChemSpider.
The problem is that this drug is a polymer but it is represented in many places as a simple mixture of acetic acid and 4 amino acids (L-Ala, L-Glu, L-Lys, and L-Tyr). See for example Wikipedia, PubChem and DrugBank.
The SMILES representation is entered as 5 molecules joined by periods:
CC(O)=O.C[C@H](N)C(O)=O.NCCCC[C@H](N)C(O)=O.N[C@@H](CCC(O)=O)C(O)=O.N[C@@H](CC1=CC=C(O)C=C1)C(O)=O
This is probably the source of all subsequent miscalculations, such as a molecular weight of 623.7 (it actually has an average MW one order of magnitude larger), molecular formula C25H45N5O13, Topological Polar Surface Area of 374, Rotatable Bond Count 13, a 3D structure that is nowhere near reality, etc.
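To see where the listed numbers come from, here is a minimal sketch that simply adds up the five free components (acetic acid plus the four amino acids) as if they were one molecule. The component formulas and rounded standard atomic masses are typed in by hand, and the helper names (parse_formula, molecular_weight) are my own and not taken from any database's code.

```python
import re
from collections import Counter

# Rounded standard atomic masses for the four elements involved.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Molecular formulas of the five dot-separated components in the SMILES.
COMPONENTS = {
    "acetic acid": "C2H4O2",
    "L-alanine": "C3H7NO2",
    "L-lysine": "C6H14N2O2",
    "L-glutamic acid": "C5H9NO4",
    "L-tyrosine": "C9H11NO3",
}


def parse_formula(formula):
    """Turn a simple formula such as 'C3H7NO2' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def molecular_weight(counts):
    """Sum atomic masses over the element counts."""
    return sum(ATOMIC_MASS[el] * n for el, n in counts.items())


# Add the components together exactly as a dot-separated SMILES would.
total = Counter()
for formula in COMPONENTS.values():
    total += parse_formula(formula)

summed_formula = "".join(f"{el}{total[el]}" for el in "CHNO")
summed_mw = molecular_weight(total)
print(summed_formula, round(summed_mw, 1))  # C25H45N5O13 623.7
```

The sum reproduces both the molecular formula and the molecular weight quoted above, which supports the idea that those values describe one copy of each monomer plus acetic acid rather than the polymer.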
Glatiramer acetate is reported to bind to MHC molecules. If these molecular descriptors are used in any type of QSAR analysis this will just add noise to the models.
ChemSpider does not keep track of polymers, except perhaps for some well defined oligopeptides that can be represented by a single SMILES. Consequently glatiramer acetate was removed from the database.
It is difficult to apply common cheminformatics tools to this substance. It might be tempting to try to place it in polypeptide/protein databases such as BioPD, but it does not have a well defined length or composition. In fact it is a random co-polymer, so it cannot even be represented by a repeating structure, as one might do for polystyrene.
In order to generate meaningful molecular descriptors for QSAR applications, I suppose one strategy would be to generate a collection of SMILES representing the average composition of the drug in terms of ratios of amino acids and molecular weights. Each structure would generate molecular descriptors and 3D structures that are far more realistic than those currently listed. Perhaps it would turn out that only some of these polymer structures interact with MHC molecules. (If this has already been done please forgive the oversight; I didn't research this thoroughly. By the end of the term we should know more from the student's report.)
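A rough sketch of that strategy, under loudly stated assumptions: the molar fractions and the 5,000 to 9,000 Da window below are illustrative values I am assuming for the sake of the example, not figures from this post, and the function names are hypothetical. The chains are plain linear L-peptides with free termini and neutral side chains; acetate counterions are ignored.

```python
import random

# Side chains written so that N[C@@H](R)C(=O)... gives the L-amino acid.
SIDE_CHAIN = {"A": "C", "E": "CCC(=O)O", "K": "CCCCN", "Y": "Cc1ccc(O)cc1"}

# Average residue masses (amino acid minus water), in Da.
RESIDUE_MASS = {"A": 71.08, "E": 129.12, "K": 128.17, "Y": 163.18}
WATER = 18.015

# ASSUMPTION: illustrative molar fractions and molecular weight window.
MOLAR_FRACTION = {"A": 0.427, "E": 0.141, "K": 0.338, "Y": 0.095}
TARGET_MW_RANGE = (5000.0, 9000.0)


def peptide_mw(seq):
    """Average molecular weight of a linear peptide with free termini."""
    return sum(RESIDUE_MASS[r] for r in seq) + WATER


def peptide_smiles(seq):
    """Build a linear peptide SMILES from one-letter codes (A, E, K, Y)."""
    return "".join(f"N[C@@H]({SIDE_CHAIN[r]})C(=O)" for r in seq) + "O"


def random_chain(rng):
    """Grow one random chain until it reaches a randomly drawn target MW."""
    target = rng.uniform(*TARGET_MW_RANGE)
    letters = list(MOLAR_FRACTION)
    weights = list(MOLAR_FRACTION.values())
    seq = []
    while peptide_mw(seq) < target:
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq)


rng = random.Random(0)  # fixed seed so the ensemble is reproducible
ensemble = [random_chain(rng) for _ in range(5)]
for seq in ensemble:
    print(len(seq), round(peptide_mw(seq), 1), peptide_smiles(seq)[:40] + "...")
```

Each SMILES in the ensemble could then be passed to whatever descriptor or 3D-structure tool one prefers. Drawing the target weight uniformly is only a placeholder; a real molecular weight distribution would be a better choice if one is available.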
The chronological summary of the lecture is as follows:
The fifth Chemical Information Retrieval class on October 22, 2009 started by covering the new 3D structure viewer introduced recently at PLoS ONE, to provide ideas for students doing a multimedia project this term. The current student answers to the chemistry publishing FAQ were then discussed. The reason for removing glatiramer acetate from ChemSpider was explained, and a few databases (Wikipedia, PubChem, DrugBank) that still contain the incorrect SMILES, 3D structure and related properties were visited. An overview of an Open Access site (OAD) suggested by Bill Hooker was provided to suggest additional questions for the FAQ. Examples of questions discussed included primary and secondary sources, peer review, article level metrics (a PLoS ONE article on malaria was used as an example), citation searching, Impact Factors and whether one should use one's real name in the blogosphere. The databases Scirus, Web of Science and PubMed were also reviewed.