This blog chronicles the research of the UsefulChem project in the Bradley lab at Drexel University. The main project currently involves the synthesis of novel anti-malarial compounds. The work is done under Open Notebook Science conditions with the actual detailed lab notebook located at usefulchem.wikispaces.com. More general comments posted here relate to Open Science, especially when associated with chemistry.
Wednesday, January 27, 2010
ONS talk at UPenn Library
On January 21, 2010 I presented a talk at the van Pelt Library at the University of Pennsylvania about "Open Notebook Science and other Science2.0 Approaches to Communicate Research". I had a very interesting chat with some of the folks there who work with my host Shawn Martin.
The role of librarians is certainly changing. When I was in grad school, the main interaction I remember was our librarian running STN searches. Electronic databases were new and very expensive at that time, and the librarian's skill was required to query the database efficiently to minimize cost. Students were not allowed direct access.
I remember when I was allowed to ask for a substructure search of cyclobutene systems for my thesis - it felt magical and the results were like gold. There was no way for me to do this using index books. Now this type of search is so routine that students usually look bored doing it for their projects.
The issues for librarians now are completely different. The definition and meaning of scholarship is shifting. Instead of scarce resources, there is an abundance of tools, content and social networks out there - much of it free. Openness in all of its forms is becoming possible and is forcing people to take a position.
I think Shawn's approach makes a lot of sense. He tries to provide options for his faculty and students to choose from without imposing a particular philosophy. My talk was just another example to present.
[My only regret is that I used a new computer without checking the audio settings first and it was set a little too high so the audio quality isn't perfect. Hopefully it is still intelligible for the most part.]
I had the pleasure of meeting face to face for the first time people I have gotten to know quite well over the blogosphere: Steve Koch, Hope Leman, Walter Jessen, Pawel Szczęsny and Andy Farke. This is probably the best conference for me to catch up with friends and collaborators - Bill Hooker, Tony Williams, Cameron Neylon, Deepak Singh, Carmen Drahl, Dorothea Salo, Christina Pikas and several others.
My Second Life session on Saturday didn't work out so well. We had major connectivity problems both on the conference side (bandwidth maxing out and even the router getting unplugged at one point!) and on the Second Life side. We spent quite a bit of time before the session trying to get things under control, but SL voice failed for everyone there after working briefly. I also got kicked out repeatedly and had trouble teleporting. I did manage to follow Max Chatnoir to her always impressive Genome Island but only saw her type a few lines of chat.
That was very disappointing and I'm not sure I'll attempt another live demo like this again. After so many years in operation the Second Life servers really should be reasonably stable, given the annual fee we pay for our islands. I think a better use of the technology might be a parallel but separate track held only on Second Life, where some of the presenters could display their posters for several days and visitors could leave comments or arrange to meet at certain times. This is what Andy Lang and I did for ACS Island a while back and it worked fairly well.
The session on Open Notebook Science that I co-chaired with Cameron Neylon and Steve Koch on Sunday went a lot better. I provided context by demonstrating the utility of ONS in resolving the NaH oxidation controversy, followed by the example of the aqueous solubility of EGCG, where the lack of access to raw data in the literature and company catalogs leads to an unnecessarily confusing situation. At the end I mentioned the case where simply reading the lab notebook of Alexander Graham Bell exposed a scandal, detailed in Seth Shulman's new book "The Telephone Gambit".
Within that framework I provided an overview of the ONSChallenge and the Wikispaces/Google Spreadsheet/Blogger system we use in my lab. Cameron then spoke a bit about the LaBLog system he uses and the broader scope of incorporating automation in the creation of the notebook records. Finally Steve reflected on his experience with OpenWetWare in both a teaching lab and his research group. He displayed some positive comments he received about ONS in a recent grant application. The discussion afterward moved into the challenge of archiving large amounts of data. I mentioned that we are still looking for a library partner for our ONSarchive project.
On Saturday night during dinner there was an "Ignite" style session where speakers were given about 5 minutes to go through their slides, which changed automatically every 20 seconds. I presented with Tony on Games in Chemistry. It turned out to be an eclectic collection of talks that will be worth watching when they are posted.
I enjoyed Jonathan Eisen's session on Open Access and Peter Binfield's on PLoS ONE article-level metrics. I learned that the DOI must be used for the blog citation metric to work properly and that all the statistics can be downloaded as an Excel file. The scientific world would operate much more smoothly if the mainstream adopted even a fraction of the philosophies espoused in these sessions.
My favorite session was Andy Farke's demo of the Open Dinosaur Project, which crowdsources the measurement of bones. It was exciting to see that his data management system using Google Spreadsheets is similar to our ONS Solubility Challenge. It is possible that he could use the code that Andy Lang wrote to activate bots that flag discrepancies, and perhaps semi-automatically publish a book summarizing the results in a similar way to what we do. Instead of pictures of molecules his entries would have images of dinosaurs. We'll follow up to see what is feasible.
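Andy Lang's actual bot code is not reproduced here, but the core idea is simple enough to sketch. Below is a minimal Python illustration of a discrepancy-flagging pass over a crowdsourced measurement sheet exported as CSV; the column names ("specimen", "length_mm"), the file name and the tolerance are hypothetical, not taken from either project:

```python
# Minimal sketch (not the actual ONSChallenge or Open Dinosaur Project code)
# of a bot that flags specimens whose replicate measurements disagree.
import csv
from collections import defaultdict
from statistics import mean

def flag_discrepancies(path, tolerance=0.10):
    """Return specimens whose measurements deviate from their mean
    by more than `tolerance` (as a fraction of the mean)."""
    groups = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Hypothetical column names, for illustration only.
            groups[row["specimen"]].append(float(row["length_mm"]))
    flagged = {}
    for specimen, values in groups.items():
        m = mean(values)
        if m and any(abs(v - m) / m > tolerance for v in values):
            flagged[specimen] = values
    return flagged

if __name__ == "__main__":
    for specimen, values in flag_discrepancies("measurements.csv").items():
        print(f"{specimen}: inconsistent entries {values}")
```

In practice such a bot would read the live Google Spreadsheet rather than a CSV export and write its flags back to the sheet, but the comparison logic would look much the same.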
I just noticed that our number of views on JoVE is almost at 8000. After a little over a year the views are still coming in at a fairly steady pace. Article-level metrics are one of the best things to have come along for authors in the scientific publication process.
Dangerous Data: Lessons from my Cheminfo Retrieval Class
I'm not sure what my students expected before taking my Chemical Information Retrieval class this fall. My guess is that most just wanted to learn how to use databases to quickly find "facts". From what I can gather much of their education has consisted of teachers giving them "facts" to memorize and telling them which sources to trust.
Trust your textbook - don't trust Wikipedia. Trust your encyclopedia - don't trust Google. Trust papers in peer reviewed journals - don't trust websites.
If I did my job correctly they should have learned that no source should be trusted implicitly. Unfortunately, squeezing useful information from chemistry sources is a lot of work, but hopefully they learned some tools and attitudes that will prove helpful no matter how chemistry data is delivered in the future.
I have previously discussed how trust should have no part in science. It is probably one of the most insidious factors infesting the scientific process as we currently use it.
To demonstrate this, I had students find 5 different sources for properties of chemicals of their choice. Some of the results demonstrate how difficult it can be to obtain measurements with confidence.
Here are my favorite findings from this assignment as a top 3 countdown:
#3 The density of resveratrol
Searching for chemical property information on Google quickly reveals the plethora of databases indexed on the internet with a broken chain of provenance. These range from academic exercises of good will to company catalogs, presumably there to sell products. Although it is usually not possible to find out the source of the information, you can sometimes infer the origin by seeing identical numbers showing up in multiple places.
But sometimes the results are downright bizarre - consider the number 1.009384166 given as the density of resveratrol on what looks like a Japanese government site, 3DMET. First of all, no units are given, but let's assume this is in g/ml. The number of significant figures is curious and suggests the result of a calculation, perhaps a prediction. In this case the source is the MOE software. This is clearly a different algorithm from the one used by ACDLabs, which comes in at 1.356 g/ml - much more realistic when put up against all 5 sources:
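To make those two heuristics concrete - spurious precision hinting at a raw calculation, and disagreement between sources - here is a toy Python sketch. Only the two resveratrol densities quoted above are real; the source labels, thresholds and the placeholder for the remaining sources are made up for illustration:

```python
# Toy cross-check of property values gathered from several databases.
# Flags (1) suspiciously many decimal places, suggesting an unrounded
# calculation, and (2) large deviation from the median of all sources.
from statistics import median

reported = {  # density of resveratrol, assumed g/ml; labels illustrative
    "3DMET": 1.009384166,
    "ACDLabs": 1.356,
    # ...the remaining sources from the student compilation would go here...
}

def decimal_places(x):
    s = repr(x)
    return len(s.split(".")[1]) if "." in s else 0

typical = median(reported.values())
for source, value in reported.items():
    notes = []
    if decimal_places(value) > 4:
        notes.append("suspicious precision - likely an unrounded prediction")
    if abs(value - typical) / typical > 0.2:
        notes.append(f"outlier relative to median {typical:.3f}")
    if notes:
        print(f"{source}: {value} ({'; '.join(notes)})")
```

Run against the full five-source compilation, a check like this would single out the 3DMET value on precision alone, before any chemistry is even considered.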
#2 The melting point for DMT depends on the language
I have to admit being really surprised by this. Even though I knew that Wikipedia pages in different languages were not exact translations, I would have assumed that the chemical infoboxes would not be independently recreated. Interestingly, the German edition has a reference, but I was not able to access it since it is a commercial database. The English edition has no specific references. Here is a list of sources:
#1 The aqueous solubility of EGCG
This is by far my favorite because it most clearly demonstrates the dangers of the concept of a "trusted source". From the compilation prepared by the student, this paper (Kwang08) reported the solubility of EGCG at 521.7 g/l:
This is from a paper that spent 5 months undergoing peer review with a well respected publisher. It also appeared recently, so one would expect the benefit of the best instruments and comparison with historical values. But even beyond all of this, the numbers run in the opposite order to the point explained in the paragraph. In our system of peer review we don't expect reviewers to verify every data point - but we do expect the text to be evaluated for logical consistency.
Now if we follow the reference provided for this paragraph we find the following paper (Liang06), with this:
We can now see what happened: the 21.7 was accidentally duplicated from the caffeine measurement and appended to the 5 g/l value for EGCG. That value is a lot more reasonable, even though I am not clear about where it comes from in this second paper.
We can get some idea of the potential source of this information from the Specification Sheet for EGCG on Sigma-Aldrich:

Notice that this does not state that the maximum solubility of EGCG in water is 5 mg/ml - just that a solution of that concentration can be made. This value is repeated elsewhere, such as in this NCI document, which references Sigma-Aldrich:

From here the situation gets muddled. Another search reveals this peer reviewed paper (Moon06), which appeared in 2006:

The value there is expressed in mM, which translates to about 2.3 g/l. Clearly this is inconsistent with the Sigma-Aldrich report of being able to make a clear solution at 5 g/l.
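As a quick sanity check on that conversion: only the g/l equivalent is quoted here, so I am assuming the Moon06 figure was reported as roughly 5 mM; the molar mass of EGCG (C22H18O11) is about 458.4 g/mol:

```python
# Back-of-the-envelope check of the mM -> g/l conversion above.
# Assumes the Moon06 solubility was reported as ~5 mM (an assumption -
# only the g/l equivalent is quoted in the text).
mw_egcg = 458.4      # molar mass of EGCG (C22H18O11), g/mol
reported_mm = 5.0    # assumed reported solubility, mmol/l
g_per_l = reported_mm * mw_egcg / 1000.0
print(f"{reported_mm} mM of EGCG = {g_per_l:.2f} g/l")  # -> 2.29 g/l
```

which is consistent with the "about 2.3 g/l" quoted above.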
Luckily, in this case we have some details of the experiments:
The measurements were done in triplicate and averaged. Unfortunately this does not reveal any sources of systematic error. One clue as to why these values are contradictory might be the method of dissolution: one hour of sonication at room temperature might just not be enough to make a saturated solution of this compound. (Although one might expect the error to lie on the high side because the samples were diluted before being filtered.) What would answer this definitively are the experimental details of how the Sigma-Aldrich source prepared the 5 g/l solution. If the compound dissolved within a few minutes without much agitation, that would be inconsistent with this hypothesis of insufficient mixing. In that case we would want to look at the HPLC traces in this paper for another type of systematic error.
Unfortunately, the chain of information provenance ends here. Just based on the data provided so far, there is significant uncertainty in the aqueous solubility of EGCG, similar to our uncertainty about the melting point of strychnine.
As long as scientists don't provide - and are not required by publishers to provide - the full experimental details recorded in their lab notebooks, this type of uncertainty will continue to plague science and make the communication of knowledge much more difficult than it need be.
Unfortunately the concept of "trusted sources" is being used as a building block of some major chemical information projects currently underway - WolframAlpha and the chemical infobox data of Wikipedia are prime examples. Ironically, MSDS sheets are listed as a reliable "trusted source" for the infoboxes, when they have been shown to be very unreliable (see my previous post about this with statistics). These are probably among the most dangerous sources of information because they appear trustworthy - coming from chemical companies and the government - and are often found on university websites. Combine that with the absence of references or experimental details, and the potential for replication of errors is very high, and such errors are very difficult to correct.
WolframAlpha does have a mechanism for requesting information about sources, but it requires submitting a reason and personal information. To see how this works in practice, I made a request for the source of an entry with erroneous data - glatiramer acetate:

I submitted this 10 days ago and still don't know the source.
Rapid access to specific sources is important for maximizing the usefulness of databases. Without that it becomes very difficult to assess the meaning of reported measurements and compare with results from other databases.
It is not possible to remove all errors from scientific publication. But errors are only a real problem when insufficient information is provided to determine that they are there in the first place.
Scientists can handle ambiguity. If you look at the discussion across the blogosphere concerning the JACS NaH oxidation paper, much of it was constructive. The publication of that paper was not a failure of science. Quite the opposite - we learned some valuable lessons about handling this reagent. As far as I can tell the paper was a truthful reporting of the authors' results.
Where this was a failure is in the way conventional scientific channels handled the matter. There was no mechanism to comment directly on the website where the paper was posted. That would have been the logical place for the community to ask questions and have the authors respond. Instead the paper was withdrawn without explanation.