Saturday, January 02, 2010

Dangerous Data: Lessons from my Cheminfo Retrieval Class

I'm not sure what my students expected before taking my Chemical Information Retrieval class this fall. My guess is that most just wanted to learn how to use databases to quickly find "facts". From what I can gather much of their education has consisted of teachers giving them "facts" to memorize and telling them which sources to trust.
Trust your textbook - don't trust Wikipedia.
Trust your encyclopedia - don't trust Google.
Trust papers in peer reviewed journals - don't trust websites.
If I did my job correctly they should have learned that no sources should be trusted implicitly. Unfortunately squeezing useful information from chemistry sources is a lot of work and hopefully they learned some tools and attitudes that will prove helpful no matter how chemistry data is delivered in the future.

I have previously discussed how trust should have no part in science. It is probably one of the most insidious factors infesting the scientific process as we currently use it.

To demonstrate this, I had students find 5 different sources for properties of chemicals of their choice. Some of the results demonstrate how difficult it can be to obtain measurements with confidence.

Here are my favorite findings from this assignment as a top 3 countdown:

#3 The density of resveratrol on 3DMET

Searching for chemical property information on Google quickly reveals the plethora of databases indexed on the internet with a broken chain of provenance. These range from academic exercises of good will to company catalogs, presumably there to sell products. Although it is usually not possible to find out the source of the information, you can sometimes infer the origin by seeing identical numbers showing up in multiple places.

But sometimes the results are downright bizarre - consider the number 1.009384166 as the density of resveratrol from 3DMET, which looks like a Japanese government site. First of all, no units are given, but let's assume this is in g/ml. The number of significant figures is curious and suggests the result of a calculation, perhaps a prediction. In this case the value comes from the MOE software. This is clearly a different algorithm from the one used by ACDLabs, which comes in at 1.356 g/ml, much more realistic when put up against all 5 sources:
  • 1.359 g/cm3 ChemSpider predicted
  • 1.36 g/cm3 (20 C) Chemical Book MSDS
  • 1.009384166 (no units given) 3DMET
  • 1.41 g/cm3 (-30.15C) DOI (found with the aid of Beilstein)
  • 1.359 g/cm3 LookChem
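
The comparison above can be sketched in a few lines of Python. This is only an illustration, not a method used in the class: the function names and the 10% tolerance are assumptions made here, the 3DMET value is assumed to be in g/cm3, and the temperature differences between sources are ignored.

```python
from collections import defaultdict
from statistics import median

# Densities of resveratrol from the five sources listed above.
# Assumption: all values are in g/cm3 (the 3DMET entry gives no units).
densities = {
    "ChemSpider (predicted)": 1.359,
    "Chemical Book MSDS": 1.36,
    "3DMET": 1.009384166,
    "DOI via Beilstein": 1.41,
    "LookChem": 1.359,
}


def flag_outliers(values, tolerance=0.10):
    """Return sources whose value deviates from the median by more than
    `tolerance` (a fraction; 0.10 is an arbitrary choice for illustration)."""
    mid = median(values.values())
    return [name for name, v in values.items() if abs(v - mid) / mid > tolerance]


def shared_values(values):
    """Group sources that report exactly the same number, which hints at
    (but does not prove) a common origin."""
    groups = defaultdict(list)
    for name, v in values.items():
        groups[v].append(name)
    return {v: names for v, names in groups.items() if len(names) > 1}


print(flag_outliers(densities))
print(shared_values(densities))
```

With these inputs only the 3DMET value is flagged, and ChemSpider and LookChem share an identical number. Agreement among sources says nothing about which value is right, only which ones deserve a closer look at their provenance.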

#2 The melting point for DMT depends on the language

I have to admit being really surprised by this. Even though I knew that Wikipedia pages in different languages were not exact translations I would have assumed that the chemical infoboxes would not be recreated. Interestingly, the German edition has a reference but I was not able to access it since it is a commercial database. The English edition has no specific references. Here is a list of sources:

#1 Solubility of EGCG in water

This is by far my favorite because it most clearly demonstrates the dangers of the concept of a "trusted source". From the compilation prepared by the student, this paper (Kwang08) reported the solubility of EGCG at 521.7 g/l:

This is from a paper that spent 5 months undergoing peer review with a well respected publisher. Also it appeared recently so one would expect the benefit of the best instruments and comparison with historical values. But even beyond all of this, the numbers are in the opposite order to the point explained in the paragraph. In our system of peer review we don't expect reviewers to verify every data point - but we do expect the text to be evaluated as logically consistent.

Now if we follow the reference provided for this paragraph we find the following paper (Liang06), with this:

We can now see what happened: the 21.7 was accidentally duplicated from the caffeine measurement and appended to the 5 g/l for EGCG. This is a lot more reasonable, even though I am not clear about where that number comes from in this second paper.

We can get some idea of the potential source of this information from the Specification Sheet for EGCG on Sigma-Aldrich:

Notice that this does not state that the maximum solubility of EGCG in water is 5 mg/ml - just that a solution of that concentration can be made. This value is repeated elsewhere, such as this NCI document, which references Sigma-Aldrich:
From here the situation gets muddled. Another search reveals this peer reviewed paper (Moon06), which appeared in 2006:
Expressed in mM this translates to about 2.3 g/l. Clearly this value is inconsistent with the Sigma-Aldrich report of being able to make a clear solution at 5 g/l.
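
To put the reported values on a common footing, here is a minimal sketch of the unit conversion. The molar mass of EGCG (458.37 g/mol for C22H18O11) is not stated in the sources quoted here and is supplied as an assumption, and the function names are made up for illustration.

```python
# Assumption: molar mass of EGCG (C22H18O11), not given in the post.
EGCG_MOLAR_MASS = 458.37  # g/mol


def g_per_l_to_mM(conc_g_per_l, molar_mass=EGCG_MOLAR_MASS):
    """Convert a mass concentration in g/l to millimolar."""
    return conc_g_per_l / molar_mass * 1000.0


def mM_to_g_per_l(conc_mM, molar_mass=EGCG_MOLAR_MASS):
    """Convert a millimolar concentration to g/l."""
    return conc_mM * molar_mass / 1000.0


# Values in g/l as quoted in this post (5 mg/ml is the same as 5 g/l).
reported = {
    "Kwang08": 521.7,
    "Sigma-Aldrich (clear solution)": 5.0,
    "Moon06": 2.3,
}

for source, g_l in reported.items():
    print(f"{source}: {g_l} g/l = {g_per_l_to_mM(g_l):.1f} mM")
```

Under that assumed molar mass, 2.3 g/l corresponds to roughly 5 mM and 5 g/l to roughly 11 mM, so the two lower values differ by about a factor of two, while the Kwang08 figure is two orders of magnitude higher.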

Luckily, in this case we have some details of the experiments:

The measurements were done in triplicate and averaged. Unfortunately this does not reveal any sources of systematic error. One clue as to why these values are contradictory might be the method of dissolution. One hour of sonication at room temperature might just not be enough to make a saturated solution for this compound. (Although one might expect the error to lie on the high side because the samples were diluted before being filtered.) What would answer this definitively are the experimental details of how the Sigma-Aldrich source prepared the 5 g/l solution. If it went in within a few minutes without much agitation, that would be inconsistent with this hypothesis of insufficient mixing. In that case we would want to look at the HPLC traces in this paper for another type of systematic error.

Unfortunately, the chain of information provenance ends here. Just based on the data provided so far, there is significant uncertainty in the aqueous solubility of EGCG, similar to our uncertainty about the melting point of strychnine.

As long as scientists don't provide - and are not required to provide by publishers - the full experimental details recorded in their lab notebooks, this type of uncertainty will continue to plague science and make the communication of knowledge much more difficult than it need be.

Unfortunately the concept of "trusted sources" is being used as a building block of some major chemical information projects currently underway - WolframAlpha and the chemical infobox data of Wikipedia are prime examples. Ironically, MSDS sheets are listed as a reliable "trusted source" for the infoboxes, when they have been shown to be very unreliable (see my previous post about this with statistics). They are probably among the most dangerous sources of information because they appear to be trustworthy - coming from chemical companies and the government - and are often found on university websites. Combine that with the absence of references or experimental details and the potential for replication of errors is very high, and the errors are very difficult to correct.

WolframAlpha does have a mechanism to provide information about sources but it requires submitting a reason and personal information.
To see how this works in practice I made a request for the source of an entry with erroneous data - glatiramer acetate:
I submitted this 10 days ago and still don't know the source.

Rapid access to specific sources is important for maximizing the usefulness of databases. Without that it becomes very difficult to assess the meaning of reported measurements and compare with results from other databases.

It is not possible to remove all errors from scientific publication. But that's only a problem when it is difficult to determine that there are errors in the first place because insufficient information is provided.

Scientists can handle ambiguity. If you look at the discussion over the blogosphere concerning the JACS NaH oxidation paper, much of it was constructive. The publication of that paper was not a failure of science. Quite the opposite - we learned some valuable lessons about handling this reagent. As far as I can tell the paper was a truthful reporting of their results.

Where this was a failure lies in the way conventional scientific channels handled the matter. There was no mechanism to comment directly on the website where the paper was posted. That would have been the logical place for the community to ask questions and have the authors respond. Instead the paper was withdrawn without explanation.


Tuesday, October 20, 2009

Fourth Cheminfo Retrieval class: ChemSpider and Beilstein Databases

Peggy Dominy, our chemistry librarian at Drexel, was kind enough to teach my third class while I was at NERM. She demonstrated RefWorks - including how to copy and paste the proper formats to Wikispaces - and how to use our ILL (Inter-Library Loan) process.

I'm including a recording of the fourth class on Chemical Information Retrieval on Oct 15, 2009 at Drexel University. It starts with some tips on removing formatting from Wikispaces pages, the Drexel Cisco VPN client for accessing paid subscriptions off campus and how to link to a DOI. The first two assignments for the class are then described. The first involves summarizing each paragraph of an article and an option to use AcaWiki is demonstrated. The second involves filling in an FAQ for publishing in chemistry. FriendFeed is then presented as a resource to help answer questions followed by an extensive overview of available information on ChemSpider, covering SMILES, InChIs, InChIKeys, experimental and predicted properties, linked databases and contributed spectra. Finally a demonstration of Beilstein Crossfire/DiscoveryGate is presented with an emphasis on doing substructure searching.


Tuesday, October 06, 2009

Second ChemInfo Retrieval Class

We had our second class on Chemical Information Retrieval on October 1, 2009 (see screencast here). I spent some time on technical aspects of Wikispaces, then introduced topics relating to publishing in chemistry - and science in general. This included primary and secondary/tertiary sources, Open Access, copyright and Web 2.0. The associated wiki page is currently just an outline and will get filled with details as students do assignments.

The student projects are coming along nicely. They have been recording the progress of their research on log pages, which I'm finding useful to give feedback. Among the interesting projects being fleshed out are green tea, DMT and trace amine receptors, caffeine, cytochrome C and liposome binding, beer and the psychoactive ingredients in chocolate. I love classes where the instructor learns as much as the students!

If anyone has suggestions for good information sources on these topics please feel free to leave comments.


Friday, September 25, 2009

Cheminfo Retrieval First Class FA09

I gave my first lecture yesterday (Sept 24, 2009) for my Chemical Information Retrieval course at Drexel. One of my main objectives for the course is to provide the most current information about how to best find and review chemical information.

To this end, I set up a wiki (http://getcheminfo.wikispaces.com) which should become considerably enriched over the course of the term. I invited students to help contribute useful links to the resource page - and even before I finished giving the first lecture they added several really good ones. I also invite any chemists or librarians to add links to resources we may have missed. Just request to join the wiki to contribute.

The wiki will also be used for students to write a report on a chemical topic making use of cheminfo resources. Right after the lecture I made sure the students joined the wiki and created two pages: one for their report and one for a "research log". The idea is that students will report significant steps in conceptualizing their projects and how they are searching databases. I can then comment directly on their log pages for quick guidance. I suppose anyone with helpful suggestions that I missed could also comment - again just request an account on the wiki.

This class has traditionally required a written report. This term I'm adding a twist: the minimum number of words can be reduced somewhat if students elect to incorporate a multimedia or other creative component. To provide examples of what that might look like I visited Drexel Island on Second Life and demonstrated 3D molecules, interactive NMR spectra and a chemistry museum (from Sandy Adam). There is a lot of chemistry possible on Second Life (see Lang & Bradley). At the end of the tour on the island we visited a wildlife area recently built by Robert Brulle for a project related to environmental science (more on this in a later post). I got a hug from a panda and got sprayed when I tried to pet a skunk - just to give a taste of what kind of fun things can be constructed in a virtual world. Other projects could involve screencasts, Jmol, games, Facebook, etc. As long as it requires students to access chemical information, I am pretty open to ideas. Students will work through their ideas on their log page and the final product will also be available on the wiki. These projects could provide interesting examples for others interested in the topic of chemical information.

At the end of my lecture I provided a brief overview of the NaH oxidation controversy. There really could not be a better example of the importance of staying on top of new communication channels to follow and participate in chemical research. This year the most important of these new tools are probably blogs, wikis and FriendFeed. Next year it might be something else - Google Wave?



Creative Commons Attribution Share-Alike 2.5 License