Sunday, February 01, 2009

Maintaining Solubility Data Provenance from Wikipedia to Lab Notebook

I recently noticed a bunch of incoming clicks from the Wikipedia entry for benzoic acid to our Open Notebook Science solubility challenge wiki on the ONSC sitemeter.

It turns out that Andrew Lang added a link from the Properties section covering the solubility of benzoic acid in a few organic solvents.

Instead of pointing directly to the individual notebook pages for each measurement, clicking on the reference takes us to a page summarizing all of the solubility measurements of benzoic acid in various solvents. The values are averaged and a standard deviation is provided.
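To make the averaging step concrete, here is a minimal sketch of how a summary page like this could compute a mean and standard deviation per solvent. It is not Andy's actual code: the solvent labels, the numbers and the `summarize` function are all made up for illustration, and I am assuming the sample standard deviation is the one reported.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (solvent, solubility in mol/L) pairs; illustrative values only,
# not real ONS Challenge measurements.
measurements = [
    ("solvent A", 2.0),
    ("solvent A", 2.2),
    ("solvent A", 2.4),
    ("solvent B", 0.5),
]

def summarize(rows):
    """Group measurements by solvent and return (mean, sample std dev, count)."""
    by_solvent = defaultdict(list)
    for solvent, value in rows:
        by_solvent[solvent].append(value)

    summary = {}
    for solvent, values in by_solvent.items():
        # A standard deviation needs at least two measurements.
        sd = stdev(values) if len(values) > 1 else None
        summary[solvent] = (mean(values), sd, len(values))
    return summary

for solvent, (avg, sd, n) in summarize(measurements).items():
    print(solvent, avg, sd, n)
```

A solvent with a single measurement gets no standard deviation here rather than a misleading zero.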

Clicking on one of these links takes us to a summary page of each measurement for one of the solvents.
Clicking a link from that collection takes us to the laboratory notebook page (for example EXP005) on the wiki and ultimately to a Google Spreadsheet (for example) with all of the calculations for that measurement.

The beautiful thing is that the original url can't be any easier to create: acid

And this same link will always link to the best possible values as more measurements are made and erroneous values are removed if found. With Andy's current code, if measurements at different temperatures are made, a plot is provided.

Now we're getting into some interesting territory. It is now so simple to refer to the solubility data of the ONS challenge that people who don't know the first thing about wikis, blogs or coding can start to participate in the use of ONS measurements.

The Open Notebook can be thought of like insurance. We don't want to have to use it - but if there is a problem we have the ability to trace the chain of provenance all the way to the source.

By the way, notice that this is the only link in the Properties section of this Wikipedia article with a reference...

At 1:09 AM, Blogger Egon Willighagen said...

Jean-Claude, I am starting to feel a need for versioning of the spreadsheet... if WP points to the original data, and a new measurement for some solvent is made, the average is going to change, invalidating the reference! That is, someone who will check the reference will find that Andrew messed up, by copying the wrong value, or so it will seem...

Have you considered publishing the outcome of the experiment, say, quarterly on N Precedings?

BTW, I do like the idea! Andrew, did you make links for all of them?

At 8:13 AM, Anonymous Anonymous said...

Hey, I have a lab coming up and was wondering if you know how to separate benzoic acid from calcium carbonate?

At 8:38 AM, Blogger Jean-Claude Bradley said...

Because it works like a wiki you certainly could point to a version of the Spreadsheet - nothing gets deleted. But the idea behind the link is that it points to all available information dynamically. So it may not be necessary to put all the numbers in Wikipedia - maybe just a link to "non-aqueous solubilities". Or we could have a bot periodically populate the properties section of chemicals.

This is still much better than how most properties in Wikipedia are put in without references. And what happens when there is a correction in a regular paper? Eventually the author publishes an Erratum (if we're lucky) and it is left to the one searching to find it.

Also remember that nothing stops you from citing each measurement to the original lab notebook page directly. This is just a really convenient way to link to all current values with one link.

Yes, we will likely publish select experimental pages in Precedings and ChemSpider Journal for example - the bottleneck is getting the students to address all the judges' questions and clean up the discussion section - dot the i's and cross the t's. In the meantime we can still use the data.

At 8:39 AM, Blogger Jean-Claude Bradley said...

Anonymous - sure you could use solubility to separate these. Pick one of the solvents that we are listing here that is not miscible with water and do an extraction.

At 10:09 AM, Blogger Egon Willighagen said...

I fully agree this is a major step forward for physical properties in Wikipedia! And I am really happy to see the link!

I was just thinking ahead, and wondered what the link would mean to me if I saw it in 5 years from now... or 25.

At 10:16 AM, Blogger Jean-Claude Bradley said...

Five years is a long time - Wikipedia is likely to evolve a lot in that time. In the meantime this link gives chemists the information they are looking for immediately I think. If you were citing it in a paper you might give the access date or archive the page. I suppose it all depends on how and where the link is used.

At 4:40 PM, Blogger Cameron Neylon said...

There is a deeper point here more broadly about the dating of citations and provenance, especially when you are referring to an aggregation of dynamic data. The ideal would be for the webservice to allow the data to be extracted from a specific version of the spreadsheet. Is this possible through the API?

I think this is going to become a very general problem - as someone said to me the other day, it's a shame that URLs don't have a timestamp in them...

At 6:52 PM, Blogger Jean-Claude Bradley said...

You could always state the date of the Google Spreadsheet version and people could go back to that if they wanted (just go to file->revision history). But I doubt many people would be interested in an outdated version, just like you don't look up a Wikipedia past version when you're given a Wikipedia reference - although nothing stops you.

This is just a dynamic link to the most recent information. You could link to individual lab notebook pages if you preferred for a given application.

At 5:13 AM, Blogger Cameron Neylon said...

It depends a lot on what statement is being cited. In the past citations have generally been of the form "the solubility of A is B [citation]". When the data is dynamic this breaks, but if we believe citation is important then it is important to be able to cite that specific figure. Otherwise there is no way to tell whether a mistake is honest or deliberate, or even to check whether a mistake has been made.

Or to put it another way without that level of granularity the citer can't actually provide evidence of what they are saying.

Now one can imagine in the future one would like to cite in this way "the solubility of A is [dynamic link to dataset][citation]" which is much more powerful but it doesn't fit with the current notion of citations as part of a paper, which is an object fixed at a specific time.

At 5:41 AM, Blogger Jean-Claude Bradley said...

If you are talking about citing solubilities in a paper then you would not just give the dynamic link - like usual you would link to specific references, whether lab notebook pages or other papers. I would probably use a format like this: Our measurements are consistent with other solubility measurements (1-4). Refs 1 to 3 would be individual references and 4 would be "A continuously updated list of measurements can be found at - then put the dynamic link".

Another point is that nothing is deleted from the spreadsheet or the lab notebook wiki. If a measurement is subsequently labeled as DONOTUSE that will be in the version history of the Google Spreadsheet and the wiki.
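As a rough illustration of that idea, here is a sketch where flagged measurements stay in the record but are left out of the current average. The row layout, the `flag` column name and the experiment labels are assumptions made up for the example; only the DONOTUSE label itself comes from our spreadsheet practice.

```python
from statistics import mean

# Hypothetical rows; the "flag" column name and all values are illustrative only.
rows = [
    {"experiment": "EXP-A", "value": 1.0, "flag": ""},
    {"experiment": "EXP-B", "value": 9.0, "flag": "DONOTUSE"},
    {"experiment": "EXP-C", "value": 3.0, "flag": ""},
]

def usable(all_rows):
    """Return the rows not marked DONOTUSE; the input list is left untouched."""
    return [r for r in all_rows if r["flag"].strip().upper() != "DONOTUSE"]

def current_average(all_rows):
    """Average only the usable rows, or None if every row is flagged."""
    kept = usable(all_rows)
    return mean(r["value"] for r in kept) if kept else None

print(current_average(rows))  # flagged row is ignored in the average
print(len(rows))              # but it is still present in the full record
```

The flagged row is never removed, so anyone tracing the provenance can still see what was excluded.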

Another approach would be to take a snapshot of one of the solubility results pages, store it as a separate web page and link to that. Hopefully our project with the Drexel library will soon generate some examples of how libraries can help by crawling our ONS projects and regularly taking snapshots for third party archiving.
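A snapshot is most useful for citation if it records when it was taken and a fingerprint of what the page said at that moment. Here is a minimal sketch of such a record, assuming the page content has already been fetched as text. The `snapshot_record` function, the field names and the example.org address are hypothetical and not part of any existing archiving service.

```python
import hashlib
from datetime import datetime, timezone

def snapshot_record(url, content, taken_at=None):
    """Describe a snapshot: source URL, UTC retrieval time and a content hash."""
    taken_at = taken_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "retrieved": taken_at.isoformat(),
        # The hash lets a later reader check that an archived copy is unchanged.
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }

# Placeholder URL and page text, for illustration only.
record = snapshot_record(
    "https://example.org/solubility-summary",
    "placeholder page text",
    taken_at=datetime(2009, 2, 1, tzinfo=timezone.utc),
)
print(record["retrieved"], record["sha256"][:12])
```

If the summary page changes after a new measurement, a fresh snapshot of it gets a different hash, which gives a citer something fixed to point at alongside the dynamic link.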

The nice thing about these tools is their flexibility. It is up to the person citing and the editors and reviewers of an article making use of these technologies to decide what is appropriate. It certainly can't be a step back from our current system :)


