Monday, December 29, 2008

Mechanical Turk Does Solubility on Google Spreadsheet

This has been in the works for several weeks but I think we finally have a practical use for the Amazon Mechanical Turk to help with processing solubility data for the Open Notebook Science Challenge. Many thanks to Tenesha Gleason for the assistance over the phone and online - and to Deepak Singh for getting the ball rolling.

The original idea was to try to crowdsource the measurement of solubility in the laboratory using Mechanical Turk. That probably isn't realistic with the current workforce but I can definitely see that it could be done with the pre-training of select groups around the world. If done properly this could be a very efficient way to get all types of scientific research executed. But I suspect it would be quite difficult to allocate government agency funds for this purpose in a proposal. Luckily other more flexible organizations (like Submeta) are popping up to fill the gap.

So what is realistic then?

We tried different task descriptions and price points to see if we could at least get workers to look up non-aqueous solubility data. There were a few takers - but only one actually found a valid non-aqueous measurement (benzaldehyde in ethanol). All the others either gave aqueous solubility or other irrelevant data like molecular weight or solvent density. Perhaps once we have trained some people to do this properly it will make sense to use Mechanical Turk this way. But right now there is just too much chemistry knowledge required to handle queries of this format.

One of the advantages of Open Notebook Science is that the work is naturally modular because it has to be shared in as close to real time as possible. That doesn't mean that it works flawlessly but it does make it more likely that information that has to be shared will be easier to understand and use by anyone who wishes to move the project forward. True open crowdsourcing depends on this.

So let's come back to comparing solubility measurements made by our students with those in the literature. But now, instead of asking a high level query, we take the work we've already done on our side to find papers and ask the Turk workers to extract the data from selected tables and figures.

I tried this yesterday and it worked like a charm. I grabbed a snapshot of the table of interest and asked for it to be converted to a Google Spreadsheet format that I can readily incorporate into the Solubility Summary master sheet. The Turk worker started within minutes and was done in a quarter of an hour - total cost 46 cents.

This enabled me to add about 70 new entries into the master spreadsheet and allows us now to do quick comparisons using Rajarshi's web query interface. For example, it is pretty clear that the 4-nitrobenzaldehyde solubility measurements by Maccarone, E; Perrini, G. Gazzetta Chimica Italiana vol 112 p 447 (1982) in chloroform are consistent with our measurements in EXP212.
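Once the entries are in one flat table, comparisons like the one above reduce to a simple filter on solute and solvent. Here is a minimal sketch of that kind of query; the records and concentration values are hypothetical placeholders, not our actual measurements from the master spreadsheet:

```python
# Hypothetical solubility records -- illustrative values only,
# not actual data from the master spreadsheet.
records = [
    {"solute": "4-nitrobenzaldehyde", "solvent": "chloroform", "conc_M": 0.72, "source": "EXP212"},
    {"solute": "4-nitrobenzaldehyde", "solvent": "chloroform", "conc_M": 0.70, "source": "literature"},
    {"solute": "4-nitrobenzaldehyde", "solvent": "methanol",   "conc_M": 0.11, "source": "EXP212"},
    {"solute": "4-nitrobenzaldehyde", "solvent": "methanol",   "conc_M": 0.35, "source": "literature"},
]

def query(solute, solvent):
    """Return all measurements for a given solute/solvent pair."""
    return [r for r in records
            if r["solute"] == solute and r["solvent"] == solvent]

# Compare our values against the literature for one pair
for r in query("4-nitrobenzaldehyde", "methanol"):
    print(r["source"], r["conc_M"])
```

A web query interface like Rajarshi's does essentially this against the live spreadsheet, so new Turk-extracted rows become queryable as soon as they land in the master sheet.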

Now this is clearly NOT the case for the solubility of 4-nitrobenzaldehyde in methanol. Our 3 values from EXP212 and EXP205 are much lower than the values from the literature. Reading the experimental section we find that Maccarone and Perrini let their solutions equilibrate over 24 hours while we vortexed for only a few minutes. It is interesting that our chloroform values were consistent. Perhaps there are very large differences in solubility kinetics between solvents. Obviously we have to do this measurement again with much longer mix times.

On a more general note, using a Google Spreadsheet to collect Mechanical Turk results has the advantage that progress can be monitored in real time and it is possible to chat with the worker. The person who worked on this project was obviously familiar with Google Spreadsheets and approached the task much as I would have. Here is a video of that process sped up tenfold:


Wednesday, December 17, 2008

NSF proposal: Crowdsourcing Chemistry and Modeling using Open Notebook Science

On December 8, 2008 I submitted the pre-proposal "Crowdsourcing Chemistry and Modeling using Open Notebook Science" with Rajarshi Guha and Antony Williams to the NSF CDI program.

Last year we submitted to the same initiative and the reviewer comments were positive for the most part. The main criticism was the lack of a more fully developed computational component. I think we've addressed that this year by including Rajarshi and his plans to carry out modeling of the non-aqueous solubility data and Ugi reaction optimization.

We also have the ONS Challenge in place and the sponsorship by Submeta, Nature and Sigma-Aldrich should help.

I posted the PDF version of the proposal on Scribd, linked to it from Noam Harel's SCIEnCE wiki and put up a text version on the ONSC wiki. In some ways proposals can be more important than papers to connect up collaborators and gain an appreciation of where science is headed. Ironically the only people to see proposals (the reviewers) are typically a research group's closest competitors. So making them public makes sense. It could also help funding agencies connect up with researchers.

I think it would be helpful to have a Web2.0 database of research proposals. The SCIEnCE project aims to do this but doesn't currently have a structured interface. I created a "Research Proposal" group on Scribd that is open for anyone to drop in proposals. That gives us the standard Web2.0 functionalities like commenting, visitor count, favorites, etc. One of the most convenient features of this strategy is that it provides an RSS feed for new submissions. I've added this feed to my FriendFeed account.
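The RSS feed is what makes this strategy machine-readable: anything that can parse RSS 2.0 can watch the group for new proposals. A minimal sketch with the Python standard library, using a made-up feed snippet (the titles and links are illustrative, not actual Scribd output):

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 snippet of the kind a group feed would emit;
# titles and links are made up for illustration.
rss = """<rss version="2.0"><channel>
  <title>Research Proposals</title>
  <item><title>Crowdsourcing Chemistry</title><link>http://example.org/1</link></item>
  <item><title>Another Proposal</title><link>http://example.org/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
for item in root.iter("item"):
    # Each <item> is one new submission to the group
    print(item.findtext("title"), "->", item.findtext("link"))
```

In practice a reader like FriendFeed does this polling for you, but the point is that the format is simple enough for any script to consume.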


Sunday, December 14, 2008

Crowds, Solubility and the Future of Organic Chemistry

This week I participated in a Social Media Day at NIST. During my talk I provided an overview of our current work in using Web2.0 tools for doing Open Notebook Science in fields related to chemical synthesis and drug discovery.

During my talks I generally try to place our work in context and give the audience a sense of where I see science evolving. I often start with the increasingly important role of openness and at some point follow up with this slide showing the shift of scientific communication from human-to-human to machine-to-machine. My position is that we are entering a middle phase of human-to-machine and machine-to-human communication. This is essentially what the semantic web (Web3.0) is all about and the social web (Web2.0) is the natural gateway.

Giving a talk at NIST was particularly meaningful for me as an organic chemist. This is an organization that has always been associated with authoritative and reliable measurements. In chemistry, many properties of compounds are deemed important enough to be measured and recorded in databases.

Given that the vast majority of organic chemistry reactions are carried out in non-aqueous solvents, isn't it surprising that the solubility of readily available commercial compounds in common solvents is not considered a basic property, like melting point or density? You won't routinely find these values in NIST databases or ChemSpider or even toll-access databases like Beilstein and SciFinder. Tim Bohinsky has reviewed the literature to provide an idea of what is available for a few classes of compounds.

I think that the reason for this lack of interest in solubility measurements relates to the way synthetic organic chemists have learned to think about their workflows. Generally, the researcher sets up an experiment with the intention of preparing and isolating a specific compound. The role of the solvent is usually just to solubilize the reactants - it is then commonly evaporated for chromatographic purification of the product. Even in combinatorial chemistry experiments where products are not purified for a rough screening, the expectation is that compounds of interest will be purified and characterized at some point.

The advantage of this approach is that it is relatively reliable. Column chromatography and HPLC may be time consuming and expensive but these purification techniques will work most of the time. However, they are difficult to scale up and are not environmentally friendly.

Sometimes, during the course of a synthesis, a compound crystallizes either from the reaction itself or by a recrystallization attempt. When this happens, it is a lucky day. The problem is that you can't routinely guarantee purification of compounds this way. In the academic labs where I worked, that was always the case, although there were rumors of gurus with magic hands that could get crystals more often than most. Before chromatography became widely adopted I am sure chemists were much more adept at recrystallization by necessity.

But now technology is allowing for different ways of thinking about organic chemistry. Instead of attempting to make a specific compound, why not think about making any compound that meets certain criteria? If the objective is to inhibit an enzyme, then docking or QSAR predictions would be the first criterion. But we can add to that the requirement for the compound to be made from cheap starting materials using convenient reaction conditions and that it be purifiable by crystallization.

This last requirement would be predictable if we had robust models for non-aqueous solubility. We can only do that if we gather enough solubility measurements - and that is the point of the Open Notebook Science solubility challenge. We want it to become as easy to look up or predict the solubility of any compound in any solvent at any temperature as it is to Google. For a taste of things to come, play with Rajarshi Guha's chemistry Google Spreadsheet that calls web services to calculate weights and volumes of compounds based only on the common name and number of desired millimoles.
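The calculation behind a spreadsheet like that is simple once the name is resolved to a molecular weight: mass in grams is just MW × mmol / 1000. Here is a minimal sketch; a small local lookup table stands in for the web service that resolves common names, and the idea (not the code itself) is what Rajarshi's spreadsheet implements:

```python
# Molecular weights (g/mol) for a few common compounds -- a local
# lookup table standing in for a name-resolution web service.
MW = {
    "benzaldehyde": 106.12,
    "4-nitrobenzaldehyde": 151.12,
    "methanol": 32.04,
}

def mass_for_mmol(name, mmol):
    """Grams of compound needed for the requested number of millimoles."""
    return MW[name] * mmol / 1000.0

print(round(mass_for_mmol("benzaldehyde", 5), 3))  # 5 mmol -> 0.531 g
```

Extending the same lookup with measured solubilities would let the sheet also answer "how much solvent do I need to dissolve this?" - which is exactly why the challenge dataset matters.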

In thinking about what the future will look like it is tempting to imagine complex extrapolations of current concepts caught in the first part of the hype cycle - nanotechnology and artificial intelligence are good recent examples.

But much of the real progress is reflected by a simplification that is almost invisible and underestimated in its power to change the way things are done. Blogs, wikis and RSS are a wonderful example of that. Technically, these are very simple software components - probably not something people in the 1980s would have predicted to be the "advanced technology" to explode in the 21st century. But it is precisely that simplicity, coupled with reliable free hosted services, that accounts for the explosive adoption of these tools.

Similarly, as a graduate student in the late 1980s, I imagined that by now complex chemical reactions in academia would be carried out routinely by robots and synthetic strategies would be designed by advanced AI. From a purely technical standpoint, academic research probably could have evolved in that way but it didn't. That vision simply was not a priority for funding agencies, researchers and companies.

But now the Open Science movement, fueled by near zero communication costs, can add diversity to the way research is done. I'm predicting it will favor simplification in organic chemistry. Here's why:

In a fully Open Crowdsourcing initiative, where all responses and requested tasks are made public in real time, the numbers will dictate what gets done. There will always be more people with access to minimal resources than people with access to the most well equipped labs. Thus contributions to the solution of a task will likely be dominated by clever use of simple technology. This is what we expect to see for the ONS solubility project. As long as competent judges are available to evaluate the contributions and strictly rely on proof, the quality of the generated dataset should remain high.

Note that this may not be the expected outcome for crowdsourcing projects where the responses are closed (e.g. Innocentive). Responses to RFP's from traditional funding agencies, where funds are allocated to specific groups before work is done, are also unlikely to yield simplicity. Even to be eligible to receive those funds generally requires being part of an institution with a sizeable infrastructure. I'm not saying that traditional funding will disappear - just that new mechanisms will sprout to fund Open Science. The sponsorship of our ONS challenge with prizes from Submeta and Nature is a good example.

For synthetic organic chemistry, it doesn't get much simpler than mix and filter. We've already shown that such simple workflows can be automated with relatively low-cost solutions, with the results posted in real time to the public. Add to this crowdsourcing of the modeling of these reactions and we start to approach the right side of the diagram at the top of this post. See Rajarshi's initial model of our solubility data to date to see how it adds one more piece to the puzzle.


Monday, December 01, 2008

NIST Social Media Day

I'll be participating in this NIST Social Media Day on December 11, 2008. Hopefully I'll be able to liveblog some of it on FriendFeed.
On December 11, Technology Services will sponsor a Social Media Day to explore web 2.0 technologies and how they can benefit our staff, in particular our scientists. Two keynote speakers, as well as a Guest Panel, have been confirmed for the event:

* Dr. Jean-Claude Bradley, Drexel University, will address "Open Notebook Science" in which experimental data is published in real time to blogs, wikis and other web tools. (10 AM, Red Auditorium)

* Emma Antunes (NASA Goddard), Scott Horvath and David Hebert (US Geological Survey), Don Burke (CIA), Gail Porter and Leon Gerskovic (NIST/PBA) and Doug Ward (USMS/NIST) will participate in a "Web 2.0 in Government" Panel Discussion (11 AM, Red Auditorium)

* Mr. Don Burke, CIA, will discuss "Intellipedia," the wiki created for federal intelligence agencies. (1 PM, Red Auditorium)

Other events feature the social media resources and tools used for the benefit of NIST scientists by the US Measurement System Office and the Information Services Division.


Creative Commons Attribution Share-Alike 2.5 License