Sunday, May 08, 2011

Breast Cancer Coalition talk on ONS and Taxol solubility

On May 1, 2011 I presented "Accelerating Discovery by Sharing: a case for Open Notebook Science" at the National Breast Cancer Coalition Annual Advocacy Conference in Arlington, VA. This was the first year where they had a session on an Open Science related theme and the organizers invited me to highlight some of the tools and practices in chemistry which might be applicable to cancer research.

I was really touched by the passion from those in the audience as well as the other speakers and conference participants I met afterward. For many, their deep connection with the cause was rooted in personal experience, either as breast cancer survivors themselves or through their loved ones. Several expressed frustration with the current system of sharing results from scientific studies. They felt that knowledge sharing is much slower than it needs to be and that potentially useful "negative" results are generally not disclosed at all.

The NBCC has ambitiously set 2020 as the deadline to end breast cancer (including a countdown clock). It seems reasonable to me that encouraging transparency in research is a good strategy to accelerate progress. Of course, great care must be exercised wherever patient confidentiality is a factor. But health care researchers are already experienced with following protocols to anonymize datasets for publication. Opting to work more openly would not change that but it might affect when and how results are shared. Also there is a great deal of science related to breast cancer that does not directly involve human subjects.

One initiative that particularly impressed me was The Susan G. Komen for the Cure Tissue Bank, presented by Susan Clare from Indiana University and moderated by Virginia Mason from the Inflammatory Breast Cancer Research Foundation. As a result of this effort, thousands of women have donated healthy breast tissue to create a comprehensive database richly annotated with donor genetics and medical history. The idea of trying to tackle a disease state by first understanding normal functioning in great detail was apparently somewhat of a paradigm shift for the cancer research community and it was challenging to implement. According to Dr. Clare, data from the Tissue Bank have shown that the common practice of using apparently unaffected tissue adjacent to a tumor as a control may not be valid.

This example highlights one of the key principles of Open Science: there is value in everyone knowing more - even if it isn't immediately clear how that knowledge will prove to be useful.

In my experience, this is a fundamental point that distinguishes those who are likely to favor Open Science from those who reject its value. If two researchers are discussing Open Science and only one of them views this philosophy as being self-evident the conversation will likely be about why someone would want (or not want) to share more and the focus will fall on extrinsic motivators such as academic credit, intellectual property, etc. If both researchers view this philosophy as self-evident the conversation will probably gravitate towards how and what to share.

I refer to this philosophy as being self-evident because I don't think people can become convinced through argumentation (I've never seen that happen). Within the realm of Open Notebook Science I have been involved in countless discussions about the value of sharing all experimental details - even when errors are discovered. I can think of a few ways in which this is useful - for example telegraphing a research direction to those in the field or providing data for researchers who study how science is actually done (such as Don Pellegrino). But even if I couldn't think of a single application I believe that there is value in sharing all available data.

A good example of this philosophy at work is the Spectral Game. Researchers who uploaded spectral data to ChemSpider as Open Data did not anticipate how their contribution would be used. They didn't do it for extrinsic motives such as traditional academic credit. Assuming that their motivation was similar to our group's, they did it because they believed it was an obviously useful thing to do. It is only much later - after a critical mass of open spectra were collected - that the idea arose to create a game from the dataset.

With this mindset, I explored what contribution we might make to breast cancer research by performing a phrase search strategy. Doing a simple Google search for "breast cancer" solubility generated mainly two types of results.

The first type involves the solubility behavior of biomolecules within the cellular environment. An example would be the observed increased solubility of gamma-tubulin in cancerous cells.

The second type addresses the difficulty in preparing formulations for cancer drugs due to solubility problems. A good example of this is Taxol (paclitaxel), where existing excipients are not completely satisfactory - in the case of Cremophor EL some patients experience a hypersensitivity.

Since our modeling efforts thus far have focused on non-aqueous solubility, there is possibly an opportunity to contribute by exploring the solubility behavior of paclitaxel. By inputting solubility data from a paper by Singla 2002 into our solubility database, Abraham descriptors for paclitaxel are automatically calculated and the solubilities in over 70 solvents are predicted.

In addition, by simply adding the melting point of paclitaxel, we automatically predict its solubility at any temperature where these solvents are liquids (see for example water).

Because of the way we expose our results to the web, a Google search for "paclitaxel solubility acetonitrile" now returns the actual value in the Google summary on the first page of results (currently 7th on the first page). The other hits have all 3 keywords somewhere in the document but one has to click on each link then perform a search within the document to find out if the acetonitrile solubility for paclitaxel is actually reported. (Note that clicking on our link ultimately takes you to the peer-reviewed paper with the original measurement.)

To be clear about what we are doing here - we are not claiming to be the first to predict the solubility of paclitaxel in these solvents using Abraham descriptors or any other method. Nor are we claiming that we have directly made a dent in the formulation problem of paclitaxel. We are not even indicating that we have done a thorough search of the literature - that would take a lot more time than we have had given the enormous amount of work on paclitaxel and its derivatives.

All we are doing is fleshing out the natural interface between the knowledge space of the UsefulChem/ONS Challenge projects and that of breast cancer research - AND - we are exposing the results of that intersection through easily discoverable channels. By design, these results are exposed as self-contained "smallest publishable units" and they are shared as quickly (and as automatically) as possible. The traditional publication system does not have a mechanism to disseminate this type of information. (Of course, when enough of these are collected and woven into a narrative that fits the criteria for a traditional paper, they can and should be submitted for peer-reviewed publication.)

Here is a scenario for how this could work in this specific instance. A graduate student (who has never heard of Open Science or UsefulChem, the ONS Challenge, etc.) is asked to look for new formulations for paclitaxel (or other difficult to solubilize anti-cancer agents). They do a search on commercial databases offered by their university for various solubilities of paclitaxel and cannot find a measurement for acetonitrile. They then do a search on Google and find a hit directly answering their query, as I detailed above. This leads them to our prediction services and they start using those numbers in their own models.

That is a good outcome - and that is exactly what has been happening (see the gold nanodot paper and the phenanthrene soil contamination study as examples). But the real paydirt would come from the graduate student recognizing that we've done a lot of work collecting measurements and building models for solubility and melting points, and contacting us about a collaboration. As long as they are comfortable with working openly we would be happy to actively work together.

I'm using the formulation of paclitaxel as an example but I'm sure that there are many more intersections between solubility and breast cancer research. With a bit of luck I hope we can find a few researchers who are open to this type of collaboration.

As another twist to this story, I will briefly mention here too that Andrew Lang has started to screen our Ugi product virtual library for docking with the site where paclitaxel binds to gamma-tubulin (D-EXP018). This might shed some light on some much cheaper alternatives to the extremely expensive paclitaxel and derivatives. The drug binds through 3 hydrogen bonds, shown below - rendered in 2D and 3D representations (obtained from the PDB ligand viewer)


The slides and recording of my talk are embedded below:



Saturday, May 07, 2011

Evan Curtin is the May 2011 RSC ONS Challenge Winner

Evan Curtin, a freshman chemistry student working under the supervision of Jean-Claude Bradley at Drexel University, is the May 2011 Royal Society of Chemistry Open Notebook Science Challenge Award winner. He wins a cash prize from the RSC.

Evan's primary focus has centered on synthesizing aromatic imines and measuring their solubility in a number of organic solvents. This will allow us to generate Abraham descriptors for this class of compounds in order to predict their solubility in 70+ solvents. Coupled with our new model to include temperature dependent solubility, this should greatly facilitate optimal solvent prediction for this and related reactions.

Imine formation is of particular interest to the UsefulChem group because it is the first step of the Ugi reaction, which we have used to synthesize compounds with anti-malarial activity. But it is also a simple convenient reaction in itself to test our Solvent Selector's ability to predict optimal conditions (solvent and temperature) for isolation of products by precipitation.

Evan's synthesis experiments are available here:
http://usefulchem.wikispaces.com/Exp263
http://usefulchem.wikispaces.com/Exp262
http://usefulchem.wikispaces.com/Exp261


and his solubility experiments are listed here:

http://onschallenge.wikispaces.com/Exp207
http://onschallenge.wikispaces.com/Exp206
http://onschallenge.wikispaces.com/Exp205
http://onschallenge.wikispaces.com/Exp204
http://onschallenge.wikispaces.com/Exp201
http://onschallenge.wikispaces.com/Exp198
http://onschallenge.wikispaces.com/Exp197

Three more RSC ONS Awards will be made during 2011. Submissions from students in the US and the UK are still welcome.
For more information see:
http://onschallenge.wikispaces.com
http://onschallenge.wikispaces.com/RSCAwards2010


Thursday, October 07, 2010

Drexel Chemistry Mini-Symposium on Bradley Lab

Every year the chemistry department at Drexel gives faculty the opportunity to present their research to incoming students in 10-minute slots. On September 30, 2010 I presented on "Open Notebook Science for Malaria Drug Discovery and Solubility Modeling". I think such a short format is good for keeping student attention. Recording it also provides a handy link to use for other purposes. Most people just don't have time for 30-60 minute presentations.



Monday, August 23, 2010

ChemTaverna Workflows of ONS Web Services now on MyExperiment

I'm pleased to report that one of the collaborations initiated at the Berkeley Open Science conference last month is progressing very well.

Carole Goble introduced me to Peter Li who runs the ChemTaverna project. The idea was to use Taverna to construct workflows using the web services developed by Andrew Lang for our Open Notebook Science projects: UsefulChem and the ONS Solubility Challenge.

Peter quickly created several workflows to demonstrate what is possible. Here is a workflow that uses a Google Spreadsheet as input. SMILES for amines, carboxylic acids, aldehydes and isonitriles are entered in the appropriate columns. The workflow first creates a virtual library of Ugi products from all possible combinations of reactants. Then each product is submitted to a web service that predicts the solubility in methanol, the most common solvent for Ugi reactions.
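
The "all possible combinations" step can be sketched in a few lines of Python. This is only an illustration of the enumeration idea, not the Taverna workflow itself: the function name and the reactant SMILES below are placeholders I chose for the example, and the sketch stops at listing reactant combinations rather than building the product SMILES.

```python
from itertools import product


def enumerate_ugi_combinations(amines, acids, aldehydes, isonitriles):
    """Return every (amine, acid, aldehyde, isonitrile) combination as a tuple."""
    return list(product(amines, acids, aldehydes, isonitriles))


# Placeholder reactant SMILES, chosen only for illustration.
amines = ["NCc1ccccc1", "NC1CCCCC1"]        # benzylamine, cyclohexylamine
acids = ["CC(=O)O", "OC(=O)c1ccccc1"]       # acetic acid, benzoic acid
aldehydes = ["O=Cc1ccccc1"]                 # benzaldehyde
isonitriles = ["[C-]#[N+]C(C)(C)C"]         # tert-butyl isocyanide

library = enumerate_ugi_combinations(amines, acids, aldehydes, isonitriles)
print(len(library))  # number of virtual Ugi products to score
```

The size of the library is simply the product of the column lengths, so it grows quickly as reactants are added to the spreadsheet.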

The resulting spreadsheet can then be sorted by predicted solubility to recommend products that are more likely to precipitate from the reaction mixture. In this particular example Ugi products derived from boc-glycine are predicted to have a low solubility in methanol. The least soluble compound is predicted to have a solubility of only 0.07 M.

In this library, Ugi products derived from boc-methionine are predicted to be too soluble to precipitate. For example this Ugi product has a predicted solubility of 3.7 M.
(note: ChemSpider has a tendency to draw the minor tautomer for some amides and carbamates)

There are a few issues to take into consideration in order to use this particular workflow:

1) This will only work on Taverna Workbench 2.1.2 with these plug-ins installed. At some point it will be made to work on Taverna Workbench 2.2 and uploaded onto MyExperiment. The workflow used here is currently available here.

2) The SMILES in the input Google Spreadsheet must be written in the format of the current example (aldehyde, amine and isonitrile groups on the left and carboxylic acid groups on the right)

3) All of the Ugi products in the virtual library must already exist in ChemSpider. Otherwise, the solubility predictions will fail because of missing descriptors as discussed previously.

Peter has uploaded simpler workflows onto MyExperiment that are compatible with the current version of Taverna Workbench (v2.2).

First, the generation of Ugi product libraries from reactant SMILES in a Google Spreadsheet is available here.

Another workflow handles the prediction of Abraham descriptors.
This workflow processes the prediction of solubility for a given solute and solvent.

The main rationale for incorporating web services derived from our Open Notebook Science projects into Taverna is leverage. MyExperiment already benefits from a vigorous community of developers in the bioinformatics arena. With the growth of the ChemTaverna initiative, the integration of cheminformatics and bioinformatics workflows should become seamless.

Making our solubility and chemical reaction web services available in formats that are convenient for others to use increases the chances that our work will actually be useful. It also makes it easier for us to leverage the resources made available by others for our own applications in drug discovery and reaction design.

Essentially this means that we have extended the reach of the information cascade triggered by the recording of an experiment in a laboratory notebook and a very simple abstraction process to represent that experiment in a semantically addressable format.


Sunday, July 25, 2010

General Transparent Solubility Prediction using Abraham Descriptors

Making solubility estimations for most organic compounds in a wide range of solvents freely available has always been a main long term objective for the Open Notebook Science Solubility Challenge. With current expertise and technology, it should be as easy to obtain a solubility estimate as it is now to get driving directions off the web.

Obviously this won't be attained purely by exhaustive measurements, although we have been focused on strategic measurements over the past two years. In parallel, we have been constantly evaluating the various solubility models out there for suitability.

Although there are several solubility models available for non-aqueous solvents, our additional requirement for transparent model building has proved surprisingly difficult to satisfy.

From this search, the Abraham solubility model [Abraham2009] floated to the top, with an important factor being that Abraham has made available extensive compilations of descriptors for solutes and solvents. In addition, the algorithms used to convert solubility measurements to Abraham descriptors (a minimum of 5 different solvents per solute) have allowed us to generate our own Abraham descriptors automatically simply by recording new measurements into our SolSum Google Spreadsheet. These can be obtained in real time as well.

This approach permitted us to provide predictions for a limited number of solutes in a wide range of solvents and we have included these predictions in the past two editions (2nd and 3rd) of the ONS Challenge Solubility Book.

Coming at the problem from a different approach, Andrew Lang has also been trying to predict solubility using only open molecular descriptors, mainly relying on the CDK. Since our most commonly used solvent has been methanol, Andy recently generated a web service to predict solubility in that solvent.

By combining these two approaches, Andy has now created a modeling system that can not only generally predict solubility in a wide range (70+) of solvents - but it can also provide related data that can be used for modeling other phenomena such as intestinal absorption of a drug or crossing the blood-brain barrier.[Stovall 2007]

The idea is to use a Random Forest approach to select freely available descriptors to predict the Abraham descriptors for any solute. A separate service then generates predicted solubilities for a wide range of solvents based on these Abraham descriptors. I'm using the term "freely available" because - although the CDK descriptors and VCCLab services are open - the model requires 2 descriptors only available from ChemSpider (ultimately from ACD/Labs).
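
To make the second step more concrete, here is a minimal sketch of how a solubility could be computed once Abraham descriptors are in hand. It assumes the usual linear form of the Abraham model as I understand it, log10(S_solvent / S_water) = c + e*E + s*S + a*A + b*B + v*V, which is not spelled out in this post. The class names, function name and all numbers are placeholders for illustration, not values from our services or from Abraham's compilations.

```python
from dataclasses import dataclass


@dataclass
class SoluteDescriptors:
    """Abraham solute descriptors (E, S, A, B, V)."""
    E: float
    S: float
    A: float
    B: float
    V: float


@dataclass
class SolventCoefficients:
    """Abraham coefficients (c, e, s, a, b, v) for one solvent."""
    c: float
    e: float
    s: float
    a: float
    b: float
    v: float


def log_solubility(solute, solvent, log_s_water):
    """Return log10 of the molar solubility in the solvent.

    Assumed model: log10(S_solvent / S_water) = c + e*E + s*S + a*A + b*B + v*V
    """
    log_ratio = (solvent.c + solvent.e * solute.E + solvent.s * solute.S
                 + solvent.a * solute.A + solvent.b * solute.B
                 + solvent.v * solute.V)
    return log_s_water + log_ratio


# Placeholder numbers only; NOT real descriptors or coefficients.
solute = SoluteDescriptors(E=1.0, S=1.0, A=0.5, B=0.5, V=1.0)
solvent = SolventCoefficients(c=0.0, e=0.5, s=-1.0, a=0.0, b=-3.0, v=4.0)

log_s = log_solubility(solute, solvent, log_s_water=-2.0)
print(10 ** log_s)  # predicted solubility in mol/L
```

Keeping the descriptors and the solvent coefficients as separate inputs mirrors the split described above: one service predicts the descriptors, another turns them into solubilities for each solvent.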

Here is an example with benzoic acid. As long as the common name resolves to a single entry on ChemSpider, it is enough to enter it and it automatically populates the rest of the fields, which are then used by the service to generate the Abraham descriptors.

Hitting the prediction link above will automatically populate the second service and generate predicted solubilities for over 70 solvents.

This approach of allowing people to access these components separately can be useful. It can be instructive to manually play with the Abraham descriptors directly to see how predicted solubilities are affected. There are also situations where one has experimentally determined Abraham descriptors for a solute and bypassing the descriptor prediction step is required.

However, for those who prefer to cut to the chase, a convenient web service is available where the common name (or SMILES) of the solute is entered and the list of available solvents appears as a drop down menu.

Now here is where I think the real payoff comes for accelerating science with openness. Andy has also created a web service that returns the predicted solubility in molar as a number from common names (or SMILES) for solute and solvent via the URL. For example click this for benzoic acid in methanol. The advantage here is that solubility prediction can be easily integrated as a web service call from intuitive interfaces such as a Google Spreadsheet to enable even non-programmers to make use of the data. Notice that the web service provided in the fourth column for the average of measured solubility values enables an easy way to explore the accuracy of specific predictions.

Such web services could also be integrated with data from ChemSpider or custom systems. If those who use these services feed back their processed data to the open web, it could take us a step closer to automated reaction design. For example consider the custom application to select solvents for the Ugi reaction. Model builders could also use the web services for predicted and measured solubility directly.

A while back we explored using Taverna for MyExperiment to create virtual libraries of SMILES. Unfortunately we ran into issues with getting the applications developed on Macs to run on our PCs. This might be worth revisiting as a means of filtering virtual libraries through different thresholds of predicted solubility.

Andy has described his model in detail in a fully transparent way - the model itself, how it was generated and the entire dataset can be found here. We would welcome improvements of the model as well as completely new models based on our dataset using only freely available tools.

It should be noted that when I use the term "general" it refers to the ability of the model to generate a number for most compounds listed in ChemSpider. Obviously compounds that most closely resemble the training set are more likely to generate better estimates. Because of our synthetic objectives using the Ugi reaction we have mainly focused on collecting solubility data for carboxylic acids, aldehydes and amides either from new measurements or from the literature.

Another important point concerns the main intended application of the model: organic synthesis. Generally the range of interest for such applications is about 0.01 - 3M. This might be very different for other applications - such as the aqueous solubility of a drug, where distinctions between much lower solubilities may be important.

For a typical organic synthesis, a solubility of 0.001M or 0.005M will probably translate as effectively insoluble. This might be a desired property for a product intended to be isolated by filtration. On the other end of the scale knowing that a solubility is either 4M or 6M will not usually have an impact on reaction design. It is enough to know that a reactant will have good solubility in a particular solvent.

Given the above considerations for intended applications and the likelihood that the current model is far from optimized, the predictions should be used cautiously. We suggest that the model is best used as a "flagging device". For example, if a reaction is to be carried out at 0.5M, one may place a threshold at 0.4M for the predicted values of reactants during solvent selection, with the recognition that a predicted 0.4M may be an actual 0.55M. A similar threshold approach can be used for the product, where in this case the lowest solubility is desired. A practical example of this is the shortlisting of solvent candidates for the Ugi reaction.
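
The threshold idea can be written as a simple filter. The sketch below is illustrative only: the function name is mine and the predicted solubilities are made-up placeholder numbers, not output from the model.

```python
def flag_solvents(predicted_molar, threshold_molar):
    """Return the solvents whose predicted reactant solubility meets the threshold."""
    return sorted(
        solvent
        for solvent, solubility in predicted_molar.items()
        if solubility >= threshold_molar
    )


# Placeholder predictions (mol/L) for one reactant; NOT real model output.
predicted = {"methanol": 1.2, "THF": 0.45, "hexane": 0.05}

# Reaction planned at 0.5 M, so flag solvents predicted at 0.4 M or above.
shortlist = flag_solvents(predicted, threshold_molar=0.4)
print(shortlist)
```

For the product, the comparison would be reversed to keep the solvents with the lowest predicted solubility.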

Another example of flagging involves identifying the outliers in the model. These can be inspected for experimental errors and possibly remeasured. Alternatively outliers may shed light on the limitations of the model. For example we have found that the solubility of solutes with melting points near room temperature can be greatly underestimated by the current model. This may be an opportunity to develop other models which incorporate melting point or enthalpy of fusion.[Rohani 2008]

Although it is possible that better models and more data will improve the accuracy of the predictions, this can be true only if the training set is accurate enough. Based on conversations I've had with researchers who deal with solubility, reading modeling papers and our own experience with the ONS Challenge I am starting to suspect that much of the available data just isn't accurate enough for high precision modeling. Models using data from the literature are especially vulnerable I think. Take a look at this unsettling comparison between new measurements and literature values (not to mention the model) for common compounds.[Loftsson 2006] Here is a subset:
I have also made the point in detail for the aqueous solubility of EGCG. Could this be the reason that so many different solubility models using different physical chemistry principles have evolved and continue to co-exist?

The situation reminds me a lot of the discussions taking place in the molecular docking community.[Bissantz 2010] The differences in calculated binding energies are often small in comparison with the uncertainties involved. But docking can still be used as one tool among others to find drug candidates by flagging a collection of compounds above a certain threshold binding energy.


Thursday, July 08, 2010

Methanol Solubility Prediction Model 4 for Ugi reactions in the literature

Since non-aqueous solubility measurements have not become part of the standard characterization of organic compounds, it is not surprising that all the data we have for Ugi products originate from measurements that we made on our own compounds.

Since methanol is our most common solvent, Andrew Lang has collected the measurements that we have with values from the literature for a range of compounds, including our Ugi products, to generate a web service returning a predicted solubility based on a submitted SMILES string. The model (Model 4) was derived from a Random Forest algorithm, using molecular descriptors supplied by the CDK and VCC.

It would be nice to be able to test the model's ability to predict what will happen if a Ugi reaction is carried out in methanol. Although the actual solubility of Ugi products in the literature is typically not reported, reading the experimental sections in papers can still provide some validation of the model in some cases.

For example, consider the following Ugi products synthesized recently by Lezinska (Tetrahedron 2010)


Note that these images represent the azide group not following the octet rule. It is necessary to represent the structure SMILES without charges because the CDK and VCC web services used by the model do not process charges correctly. Stereochemistry also cannot be used and this can be removed from the SMILES simply by deleting slashes. Thus for the two molecules above the SMILES to be submitted to the prediction web service are:

O=C(NC1CCCCC1)C(Cc2ccc(C)cc2)N(c4ccccc4C(=O)c3ccccc3)C(=O)C(Cc5ccccc5)N=N#N
AND
O=C(NC1CCCCC1)C(C(=O)c2ccccc2)N(Cc3ccc(C)cc3)C(=O)C(C)CCN=N#N

The predicted methanol solubilities are respectively 0.004 M and 0.03 M.
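
Going back to the SMILES preparation step, deleting the slashes is a one-liner. This sketch assumes that both "/" and "\" bond markers should be dropped, and it deliberately does nothing about charges or other stereo markers; the function name and the test molecule are placeholders for illustration.

```python
def strip_bond_stereo(smiles):
    """Remove the '/' and '\\' double-bond stereo markers from a SMILES string."""
    return smiles.replace("/", "").replace("\\", "")


# Placeholder example: (E)-1,2-difluoroethene loses its stereo markers.
print(strip_bond_stereo("F/C=C/F"))
```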

Now if we look at the details in the experimental section, both of these Ugi products were synthesized in methanol at a limiting reactant concentration of about 0.1 M. Even though this is much more dilute than the usual 0.5-2.0 M generally recommended for Ugi reactions (Domling 2000), the products still precipitate and can be filtered off. This is consistent with the predicted solubilities above and the model would have suggested ahead of time that methanol might be a good solvent for isolation of the products by precipitation.
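
As a back-of-the-envelope check, one can estimate how much product would be expected to come out of solution. The sketch assumes complete conversion of the limiting reactant, no change in volume and no effect of the other species on solubility, none of which is claimed in the paper; the function name is mine.

```python
def expected_precipitated_fraction(product_conc_molar, solubility_molar):
    """Fraction of product exceeding its solubility, assuming complete conversion."""
    if solubility_molar >= product_conc_molar:
        return 0.0  # everything stays dissolved
    return 1.0 - solubility_molar / product_conc_molar


# 0.1 M limiting reactant, predicted solubilities of 0.004 M and 0.03 M.
for solubility in (0.004, 0.03):
    fraction = expected_precipitated_fraction(0.1, solubility)
    print(f"{solubility} M -> {fraction:.0%} expected to precipitate")
```

Under these assumptions roughly 96% and 70% of the two products would be expected to precipitate, which is in line with both being isolated by filtration.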

So far these are just anecdotal results but they do illustrate that solubility models can be evaluated without explicit determination of solubility in the literature.


Friday, March 19, 2010

RSC Sponsors Open Notebook Science Challenge

I am very pleased to report that the Royal Society of Chemistry is sponsoring 5 new $500 awards for the Open Notebook Science Solubility Challenge.

The previous round of 10 awards was sponsored by Submeta, Nature and Sigma-Aldrich. With the final award of that round having been made in December 2009, this is very good timing.

The criteria and rules for the contest have not changed. Students from the US and the UK are generally eligible to participate. See the Rules and Application Form for full details:
http://onschallenge.wikispaces.com/RSCAwards2010

All of the solubility measurements will continue to be compiled and distributed in several formats, including a book where biographies and pictures of all the award winners can be found. The most recent edition - with all 10 previous winners - is available here:
http://precedings.nature.com/documents/4243/version/3

I am very grateful to Antony Williams for being instrumental in making this happen.


Wednesday, March 10, 2010

Updated Chemistry Web Services - now with Density

I mentioned a while back the web services that Rajarshi Guha had set up for us. We are often in need of molecular weight and density data for both solutes and solvents since we rely on an assumption of volume additivity when calculating concentration.
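
To show why both quantities are needed, here is a minimal sketch of a concentration calculation under the volume additivity assumption: the volume of the solution is taken as the sum of the volumes of solute and solvent, each computed from mass and density. The function name and the numbers are placeholders for illustration, not measured values.

```python
def molarity_volume_additive(mass_solute_g, mw_solute, density_solute,
                             mass_solvent_g, density_solvent):
    """Concentration in mol/L, assuming solute and solvent volumes simply add.

    Densities are in g/mL and the molecular weight is in g/mol.
    """
    moles_solute = mass_solute_g / mw_solute
    volume_ml = mass_solute_g / density_solute + mass_solvent_g / density_solvent
    return moles_solute / (volume_ml / 1000.0)


# Placeholder numbers: 0.1 mol of solute (10 mL) in 100 mL of solvent.
conc = molarity_volume_additive(
    mass_solute_g=12.2, mw_solute=122.0, density_solute=1.22,
    mass_solvent_g=79.0, density_solvent=0.79,
)
print(round(conc, 3))  # mol/L
```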

Since Rajarshi moved to the NIH, the location of the services has changed. We now have the CDK installed on a Drexel server so some of the simple services like MW and SMILES generation are still available there.

However density has been challenging to provide as a service. Experimental density values for solvents are commonly available but the calculated densities of solids are hard to find. ChemSpider is one of the few sources where calculated densities of solids and liquids are freely available. Unfortunately there are currently no ChemSpider density web services.

As an interim solution for the UsefulChem and ONSChallenge projects we have set up a look-up table as a Google Spreadsheet (SolventLookUp) for most solvents of potential interest. Solutes added to our SolubilitiesSum sheet are automatically added to a SoluteLookUp SQL database running at Oral Roberts University and the ChemSpider densities are added there via an automated but slow process.

Andrew Lang has used these resources to provide web services returning densities and other properties or descriptors. These data sources are especially important for the nearly automated production of new editions of the ONS Challenge Solubility Book. This is not a general solution since it only includes compounds of interest to our group and would not scale (at least for licensing reasons) to millions of compounds.

But it does come in handy for us because we can quickly call these services within a Google Spreadsheet to do a variety of useful calculations, minimizing the possibility of copy-and-paste errors.

As an example see the following ChemServices sheet. Enter the common name for a solvent or solute and the number of millimoles and the sheet will automatically calculate the corresponding number of milligrams or microliters. [Note that Google Spreadsheets can only handle a maximum of 50 web service calls at a time - a useful trick is to highlight cells after the calculations then copy and "paste as values". Make sure to keep some cells with the web service calls in case you need to do more calculations in the future]
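
The arithmetic behind the sheet is straightforward once molecular weight and density are available. The sketch below shows the conversions in Python rather than as spreadsheet formulas; the function names and numbers are placeholders, and in the sheet the molecular weight and density would come from the web services rather than being typed in.

```python
def mmol_to_mg(mmol, mw_g_per_mol):
    """Mass in milligrams (g/mol is numerically the same as mg/mmol)."""
    return mmol * mw_g_per_mol


def mmol_to_microliters(mmol, mw_g_per_mol, density_g_per_ml):
    """Volume in microliters for a liquid (g/mL is numerically mg/uL)."""
    return mmol_to_mg(mmol, mw_g_per_mol) / density_g_per_ml


# Placeholder values: 2 mmol of a liquid with MW 100 g/mol and density 0.8 g/mL.
print(mmol_to_mg(2.0, 100.0))
print(mmol_to_microliters(2.0, 100.0, 0.8))
```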


Friday, February 12, 2010

ONS Solubility Book: Edition 3 with Notebook Archive

Edition 3 (2010-02-11) of the ONS Solubility Challenge book is now available.

We've been trying for some time to find a way to conveniently take a snapshot of our Open Notebooks and all associated raw data files. This could serve as a way to back up all of our work as well as provide a means of finding out the state of knowledge for a project at a given moment in time. There is also a tremendous benefit to confidently using the best of free hosted Web2.0 services out there (e.g. GoogleDocs and Wikispaces) without being concerned with changes in policies or access down the road.

Our recent use of the ONS Challenge Solubility book to periodically create releases of summarized data has opened up a convenient opportunity. And yesterday the last piece of the puzzle fell into place. Through a combination of fairly quick manual and automated tasks, Andrew Lang and I are able to push out a full snapshot of all relevant files and lab notebook pages and associate it with an edition of the book.

As described below, the archive is accessible interactively on a server, as a zip download or as a CD from LuLu. Perhaps we can also find a home on library servers in the future.

More details are provided in the preface for Edition 3 (2010-02-11):
This is the first edition to include a full archive of the ONS Challenge notebook. A space export from Wikispaces provides an initial version of all the HTML pages in the notebook with local hyperlinks to copies of all images and files uploaded onto the wiki. All of the Google Spreadsheets are automatically downloaded as Excel spreadsheets and placed in the same "files" folder as the images. NMR spectra, stored as JCAMP-DX files, are placed in the "spectra" folder. All of the HTML pages are reformatted to provide local references to both Excel spreadsheets and the JCAMP-DX files.

The notebook archive is meant to represent a snapshot of the state of all source documents at the time of the publication of an edition of this book. When used from a server with web services running, clicking on links to the spectra will allow interaction via a browser interface, including zooming in or out and integration of the NMR spectrum. When accessed in stand-alone mode after downloading or directly from a CD, everything will work the same, except that JCAMP-DX files must be opened from JSpecView running on the desktop. Excel files will retain any calculations in the cells of the original Google Spreadsheets but dynamic values generated from calling web services - such as the script that automatically integrates NMR spectra - will be frozen as simple values. However, the link to the web service used will be stored in the cell as a comment. Links to external websites are not crawled and embedded Google Spreadsheets or videos are not copied. These will work but will reflect live data on the web.
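The step where HTML pages are reformatted to point at local copies can be pictured with a small sketch. This is not the script actually used to build the archive: the function name, the example URL and the local path below are all hypothetical, and a real export would need to handle many more link styles.

```python
def localize_links(html, url_to_local):
    """Replace href targets that point to hosted files with local paths.

    url_to_local maps each remote URL to the relative path of its local copy.
    """
    for url, local_path in url_to_local.items():
        html = html.replace('href="%s"' % url, 'href="%s"' % local_path)
    return html


# Hypothetical page fragment and mapping, for illustration only
page = '<a href="http://example.org/sheet?key=ABC">raw data</a>'
mapping = {"http://example.org/sheet?key=ABC": "files/ABC.xls"}
print(localize_links(page, mapping))
```

Links that are not in the mapping are left untouched, which matches the behavior described above for external websites.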

The February 11, 2010 version of the notebook archive is available on a hosted site, on a CD or by download.


Tuesday, December 29, 2009

ONS Solubility Book: Edition 2 - with Predicted Values

The Second Edition (2009-12-27) of the Open Notebook Science Solubility Challenge book is now available. The issues with some missing text have been resolved and the references in the PDF version now have clickable links.

However, the main difference is the addition of a new section on solubility predictions. The book is now somewhat larger than the first edition, coming in at 129 pages but still very affordable at $8.16 (covers printing and shipping costs from LuLu).

This was added to the preface:
Predicted Solubilities

In this edition, a new section is added to provide predicted solubility values for selected solutes in a range of solvents. Specifically, solutes are included when measurements from at least 5 different solvents are available. A method using Abraham descriptors depends on the experimental solubility measurements from several solvents to make predictions, which is detailed in that section of the book. For this reason, this edition also includes some aqueous solubility measurements, which are generally available from the literature. The focus of this collection remains on non-aqueous solubility.

Consistent with how the experimental measurements are made available, the predicted solubility values are provided as a work in progress. The purpose in providing them is to suggest solvents of interest for various applications. The boiling point of each solvent is also listed in the table to allow a convenient selection. When available, experimental measurements are listed next to the predicted values. This information can be helpful to gauge the usefulness of the model to some extent but does not guarantee its reliability for the other solvents. As more measurements are collected the reliability of the predictions is likely to increase and this will be reflected in future editions of this book.
Andrew Lang has been busily learning about building models using Abraham descriptors. As luck would have it, Michael Abraham just published an extensive collection of his descriptors for many solvents in a recent publication:
Abraham M.H.; Smith R.E.; Luchtefeld R.; Boorem A.J.; Luo R.; Acree Jr. W.E. Prediction of solubility of drugs and other compounds in organic solvents. J. Pharm. Sci. Early View Sept. 22 (2009) http://dx.doi.org/10.1002/jps.21922
This is an important step for the ONS Challenge project, taking us closer to the eventual goal of providing chemists with an open tool for anticipating the solubility behavior of their reactants and products in a particular solvent. Researchers might think of trying new solvents after perusing their measured or predicted solubilization potential for a given solute.
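For readers unfamiliar with Abraham descriptors, the general form of the equation can be sketched as follows. This is only an illustration of the commonly used linear free energy relationship, log10(S_solvent / S_water) = c + eE + sS + aA + bB + vV, and not the model used in the book. The function names are made up and every number below is a placeholder chosen for easy arithmetic, not a published coefficient or descriptor.

```python
def log_solubility_ratio(solute, solvent):
    """Abraham-type equation: log10 of the solubility ratio solvent/water.

    solute holds the descriptors E, S, A, B, V.
    solvent holds the fitted coefficients c, e, s, a, b, v.
    """
    return (solvent["c"]
            + solvent["e"] * solute["E"]
            + solvent["s"] * solute["S"]
            + solvent["a"] * solute["A"]
            + solvent["b"] * solute["B"]
            + solvent["v"] * solute["V"])


def predict_molar_solubility(log_s_water, solute, solvent):
    """Predicted molar solubility in the solvent from the aqueous value."""
    return 10 ** (log_s_water + log_solubility_ratio(solute, solvent))


# Placeholder values only (NOT real descriptors or coefficients)
example_solute = {"E": 1.0, "S": 1.0, "A": 0.5, "B": 0.5, "V": 1.0}
example_solvent = {"c": 0.2, "e": 0.5, "s": -1.0, "a": 0.0, "b": -3.0, "v": 4.0}
print(predict_molar_solubility(-3.0, example_solute, example_solvent))
```

The aqueous solubility enters as the reference point, which is consistent with the preface noting that some aqueous measurements were added in this edition.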

We don't know how good the predictions will turn out but we will certainly find out in the coming months and report as we go. Even though the Submeta awards have all been distributed we still welcome measurement contributions.




Saturday, December 12, 2009

First Edition of ONS Solubility Challenge Book

Andrew Lang and I have been working on a book version of the Open Notebook Science Solubility Challenge database. The timing is good since we just awarded the last ONS Challenge Submeta award this month. All of the students, the judges and the educational partner are included as co-authors. A biography and picture of each co-author is included in the book.
Jean-Claude Bradley, Associate Professor of Chemistry at Drexel University
Cameron Neylon, Senior Scientist at the ISIS Pulsed Neutron Source, Rutherford Appleton Laboratory and Lecturer in Chemical Biology at the School of Chemistry at the University of Southampton
Rajarshi Guha, Research Scientist at the NIH Chemical Genomics Center
Antony Williams, Vice President of Strategic Development, ChemSpider at the Royal Society of Chemistry
Bill Hooker, Postdoctoral Researcher in Molecular Biology
Andrew Lang, Professor of Mathematics at Oral Roberts University
Brent Friesen, Associate Professor of Chemistry at Dominican University
and
Tim Bohinski, David Bulger, Matthew Federici, Jenny Hale, Jenna Mancinelli, Khalid Mirza, Marshall Moritz, Daniel Rein, Cedric Tchakounte, and Hai Truong
We selected LuLu as a convenient mechanism to distribute copies. This 6 x 9 inch black and white soft cover edition is available for $5.96, which just covers the printing and shipping charges. Other formats are possible - such as a larger hardcover in color - but these are much more expensive. We thought it would be good to start with the most affordable version and look at other options later. The electronic version of the book is available for free on LuLu.

We were inspired by the style of the solubility book published by Atherton Seidell in 1919, freely available on Google Books. The compound entries are listed in alphabetical order, with tables of compound data and solubilities. We included data that we found to be useful for practical applications, including predicted density, room temperature phase and the solubility in molarity, mole fraction and g/100g solvent. References link to lab notebook pages or literature references.

Andy found a way to create the fully formatted book in an almost completely automated way, pulling the data directly from the Solubilities Summary and other Google spreadsheets and querying ChemSpider. The preface and biographies of the students, judges and educational partner are also automatically pulled in from Google Docs. With this system in place, it will be straightforward to publish future editions with the most updated information frequently.

This was also a good opportunity to make use of the WebCite service. It enables us to link the book to a frozen version of the Solubilities Summary sheet archived as an Excel spreadsheet. This format retains all the formulas and hyperlinks in the original Google Spreadsheet.

The preface further explains the scope of the book and project:

The Open Notebook Science Solubility Challenge

Solubility is an important consideration for many chemistry applications. Synthetic chemists usually use a solvent to perform reactions and knowledge of the solubility of the starting materials or products can be very useful to pick an appropriate solvent. Analytical chemists can use solubility to design separation techniques and factor in dynamic range considerations. Physical chemists can create and evaluate their models of how molecules interact in the solubilization and precipitation processes.

Solubility data can be obtained from a variety of online and offline sources. As with all chemical data, it can be a challenge to evaluate reported measurements. Some databases offer no references while others provide citations to peer reviewed journal articles. Given the choice, more weight is generally given to the latter. This is reasonable in most cases because more information about the purity of compounds and the methods used are available in peer-reviewed articles.

However, the information for how a specific measurement was obtained within a journal article is not generally provided. General methods are provided but the raw data for a specific measurement are typically not published. Peer review is not intended to validate individual measurements - its function is to ensure that the authors made appropriate conclusions based on their processed datasets and the state of knowledge in the field.

The Open Notebook Science Challenge was initiated in the fall of 2008 as the result of a discussion on a train in the UK between Jean-Claude Bradley and Cameron Neylon.[1,2] The concept was very simple: create a crowdsourcing opportunity for the chemistry community to contribute solubility measurements under Open Notebook Science conditions. This method of publication entails providing immediate public access to the chemist's laboratory notebook, as well as all raw data used to compute the measurements.[3,4]

On Sept 3, 2008 the first ONSC measurements were recorded by Bradley and Neylon at the University of Southampton in Neylon's laboratory.[5] The project was soon sponsored by Submeta, offering ten $500 awards for students in the US or the UK who best recorded how they performed their experiments.[6] Furthermore, the first 3 winners also received one year subscriptions to Nature magazine, thanks to a sponsorship from the Nature Publishing Group.[7] Sigma-Aldrich supported the contest by donating chemicals upon request.[8]

Students were evaluated by a group of judges who convened once a month to deliberate the next award. Judges also provided feedback to the students by commenting on their lab notebook pages directly on the wiki. Their expertise ranged from chemistry to mathematics, spectroscopy and molecular biology.

Techniques

Participants in the ONS Challenge were not required to use a specific method to measure solubility - although they were required to properly document their experiments and analyses. Due to its simplicity, most measurements in the past year were made using the SAMS NMR technique, requiring no volume measurement or calibration curves.[9] Two assumptions are made with this method. The first is that the volume of solute and solvent are additive, with the error becoming negligible at low solubility values. The second is that NMR integration values are proportional to the amount of solvent and solute. Some deviations from this have been observed for default NMR parameters and in later experiments long relaxation times are introduced into the protocol (D1 = 50s).[10]
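The two assumptions above are enough to turn a pair of NMR integrals into a concentration. The following is a minimal sketch of that arithmetic, not the actual SAMS script (see reference [9] for that). The function names are made up, the solute molar mass and density in the example are hypothetical, and the densities are assumed to be in g/mL.

```python
def molar_ratio(solute_integral, solute_protons, solvent_integral, solvent_protons):
    """Moles of solute per mole of solvent, assuming integrals scale with protons."""
    return (solute_integral / solute_protons) / (solvent_integral / solvent_protons)


def molarity_from_ratio(ratio, solvent_mw, solvent_density, solute_mw, solute_density):
    """Molarity of the solute, assuming solute and solvent volumes are additive.

    The calculation is done per mole of solvent; densities are in g/mL.
    """
    solvent_volume_ml = solvent_mw / solvent_density
    solute_volume_ml = ratio * solute_mw / solute_density
    total_volume_l = (solvent_volume_ml + solute_volume_ml) / 1000.0
    return ratio / total_volume_l


# Example: methanol (32.04 g/mol, about 0.792 g/mL) with a hypothetical solute
# (150 g/mol, 1.2 g/mL); a 2-proton solute peak against the 3-proton methyl peak.
ratio = molar_ratio(1.0, 2, 30.0, 3)
print(molarity_from_ratio(ratio, 32.04, 0.792, 150.0, 1.2))
```

Because only a ratio of integrals is needed, no volume measurement or calibration curve enters the calculation.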

Data Curation

Since an Open Notebook approach is used in this work, those interested in the validity of the measurements can assess the methods used - both for the preparation of saturated solutions and the raw data from the measurements. Over time, values in the database are likely to improve and possibly some errors may be uncovered and corrected. However, on the whole, we feel that the values provided in this work should be of use to chemists trying to gain an appreciation of solubility for most applications. This is especially the case for values that are not obtainable from any other source.

When clearly erroneous data points are discovered, they are flagged in the database as "DONOTUSE". This way interfaces with the dataset can ignore these values while allowing anyone to investigate why the data points were flagged. This might happen when early experiments did not allow for sufficient mixing or NMR D1 relaxation times were not long enough to fully integrate peaks of interest. Out of 681 reported measurements, 51 are currently marked in this way. A shared Google Spreadsheet is used to collect and curate the dataset. This allows easy data entry while providing a simple way to interrogate the database for visualization applications via the Google API.[11]
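An interface that ignores flagged values could work along these lines. This is a sketch under assumptions: the column names ("solute", "solvent", "molarity", "flag") and the example rows are hypothetical and do not reflect the actual layout or contents of the spreadsheet.

```python
from collections import defaultdict


def average_usable(rows):
    """Average molarity per (solute, solvent) pair, skipping DONOTUSE rows."""
    grouped = defaultdict(list)
    for row in rows:
        if row.get("flag") == "DONOTUSE":
            continue  # the row stays in the dataset, it is just not used here
        grouped[(row["solute"], row["solvent"])].append(row["molarity"])
    return {pair: sum(values) / len(values) for pair, values in grouped.items()}


# Made-up rows for illustration only
rows = [
    {"solute": "X", "solvent": "methanol", "molarity": 1.0, "flag": ""},
    {"solute": "X", "solvent": "methanol", "molarity": 1.2, "flag": ""},
    {"solute": "X", "solvent": "methanol", "molarity": 9.9, "flag": "DONOTUSE"},
]
print(average_usable(rows))
```

Filtering at read time rather than deleting rows keeps the flagged measurements available for anyone who wants to investigate them.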

Literature data and format conversions

An additional 400 solubility measurements from the literature are included in the database. These generally correspond to compounds that are structurally identical or similar to the compounds measured by the ONS Challenge participants. These values are averaged in with the values from the participants, with appropriate references provided. In order to compare values, conversions from molar fraction or g solute/100g solvent to molarity were made by assuming that the volumes are additive and obtaining the density of the solutes in most cases from the predicted values in ChemSpider.[12]

For the convenience of chemists with diverse applications, all three formats are provided. For the cases where solutes are miscible with the solvent, the molarity reported is simply the solute's density. The practical interpretation of this is that solutions of any molarity below the solute's density can be prepared.
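The conversions between the three formats follow directly from the volume additivity assumption. The sketch below shows one way to write them. It is not the code used for the database: the function names are made up, densities are assumed to be in g/mL and molar masses in g/mol.

```python
def molarity_from_mole_fraction(x, solute_mw, solute_density, solvent_mw, solvent_density):
    """Molarity from mole fraction x, per one mole of solution, additive volumes."""
    volume_ml = x * solute_mw / solute_density + (1.0 - x) * solvent_mw / solvent_density
    return x / (volume_ml / 1000.0)


def molarity_from_g_per_100g(grams, solute_mw, solute_density, solvent_density):
    """Molarity from g solute per 100 g solvent, additive volumes."""
    moles = grams / solute_mw
    volume_ml = grams / solute_density + 100.0 / solvent_density
    return moles / (volume_ml / 1000.0)


def molarity_of_pure_liquid(solute_mw, solute_density):
    """Moles of pure solute per liter: the value reported for miscible solutes."""
    return 1000.0 * solute_density / solute_mw


# Example: 10 g of a hypothetical solute (100 g/mol, 1.0 g/mL) per 100 g of a
# solvent with density 1.0 g/mL
print(molarity_from_g_per_100g(10.0, 100.0, 1.0, 1.0))
```

In the limit of a mole fraction of 1, the first function returns the same value as the last one, which is the upper bound described above for miscible solutes.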

In the process of converting units and averaging heterogeneous data sources, no attempt has been made to track significant figures. Those interested in any information about the precision of measurements should consult each individual data source. This may not be an easy task for measurements only carried out once and where factors such as the quality of spectral peaks and baselines are not optimal.

This collection will be most valuable for those who do not require highly precise measurements for their applications. For example, synthetic chemists can easily use rough estimates of solubility to select appropriate solvents for a reaction. In any case, one would be wise to consider all measurements as provisional, regardless of the source. As more data are collected, subsequent editions of this book will adjust values accordingly.

Searching the database

The values in this database can be accessed and filtered in various ways. More information is available at the ONS Challenge wiki[13] and Chapter 16 of the book "Beautiful Data".[14]

Database version

Archived as Excel Spreadsheet by WebCite on December 11, 2009.[15]

References

[1] Bradley, JC Open Notebook Science Challenge, UsefulChem blog (2008) http://usefulchem.blogspot.com/2008/09/open-notebook-science-challenge.html
[2] Open Notebook Science Challenge Wikipedia entry http://en.wikipedia.org/wiki/Open_Notebook_Science_Challenge
[3] Bradley, JC Open Notebook Science, Drexel CoAS E-Learning Blog (2006) http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html
[4] Open Notebook Science Wikipedia entry http://en.wikipedia.org/wiki/Open_Notebook_Science
[5] Bradley, JC; Neylon, C UsefulChem Experiment 207 http://usefulchem.wikispaces.com/Exp207
[6] Bradley, JC Submeta Open Notebook Science Awards, UsefulChem Blog (2008) http://usefulchem.blogspot.com/2008/11/submeta-open-notebook-science-awards.html
[7] Bradley, JC Nature Sponsors Open Notebook Science, UsefulChem Blog (2008) http://usefulchem.blogspot.com/2008/11/nature-sponsors-open-notebook-science.html
[8] Bradley, JC Sigma-Aldrich First Official Sponsor of Open Notebook Science Challenge, UsefulChem Blog (2008) http://usefulchem.blogspot.com/2008/09/sigma-aldrich-first-official-sponsor-of.html
[9] Bradley, JC Semi-Automated Measurement of Solubility, UsefulChem Blog (2009) http://usefulchem.blogspot.com/2009/03/semi-automated-measurement-of.html
[10] Bradley, JC NMR Integration Progress for Solubility Measurements, UsefulChem Blog (2009) http://usefulchem.blogspot.com/2009/06/nmr-integration-progress-for-solubility.html
[11] Bradley, JC Interactive Visualization of ONS Solubility Data, UsefulChem Blog (2009) http://usefulchem.blogspot.com/2009/01/interactive-visualization-of-ons.html
[12] ChemSpider database http://www.chemspider.com
[13] ONS Challenge List of Experiments Page http://onschallenge.wikispaces.com/list+of+experiments
[14] Bradley, J.-C.; Guha, R.; Lang, A.S.I.D.; Lindenbaum, P; Neylon, C.; Williams, A.J. & Willighagen, E. Chapter 16: Beautifying Data in the Real World from Beautiful Data. O'Reilly Media, Eds: Segaran, T. & Hammerbacher, J. (2009)
[15] Bradley, Jean-Claude; Lang Andrew. Solubilities Summary Sheet. Open Notebook Science Challenge. 2009-12-11. URL:http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=xls. Accessed: 2009-12-11. (Archived by WebCite® at http://www.webcitation.org/5lx5ry3BV)



Tuesday, December 01, 2009

Hai Truong is Dec09 Submeta ONS Award Winner

Hai Truong, working under the supervision of Jean-Claude Bradley at Drexel University, is the December 2009 Submeta Open Notebook Science Challenge Award winner. He wins a cash prize from Submeta.

Hai mainly collaborated with Khalid Mirza to try to understand co-solute effects for Ugi products in benzene. See his experiments here:
http://onschallenge.wikispaces.com/list+of+experiments

This was the final Submeta ONS Award for 2008-9. We would like to thank all the sponsors - Submeta, Nature Publishing Group and Sigma-Aldrich - for making this project a reality. A summary of the results from the past year will be published shortly.

For more information see:
http://onschallenge.wikispaces.com
http://onschallenge.wikispaces.com/submetaawards08


Thursday, October 15, 2009

NERM 09 session on Chemistry on the Web

Last week, on October 9, 2009 I presented at the ACS NERM conference. Martin Walker hosted a session on Publishing and Promoting Chemistry in the Internet Age. All of the talks were quite interesting and fit perfectly with the topic:
Martin Walker: Chemistry on the Internet
Elizabeth Brown: The Chemist's Toolkit for Publishing and Promoting Your Work On the Internet
Antony Williams: Navigating the Complex Web of Chemistry Using ChemSpider
Jean-Claude Bradley: Leveraging Transparency and Crowdsourcing in Chemistry Using Open Notebook Science
My talk consisted of an overview of Open Notebook Science with some new content on solubility prediction algorithms written by Andrew Lang and a few examples of students taking a Chemical Information Retrieval class at Drexel University using research logs on a wiki to flesh out their projects.




Monday, September 28, 2009

A First General Solubility Model from ONS Challenge Data

After about a year, the Open Notebook Science Solubility Challenge has resulted in over 680 measurements, with about an additional 100 from the literature. Taking into account averaged repeated measurements, discarding some erroneous results and considering only organic solids (so far all of our liquid solutes have proven to be miscible in our solvents), that leaves us with 244 unique values.

Andrew Lang has created a general model (Model003) to predict solubility based on molecular descriptors of both the solutes and solvents. Previous models, such as Rajarshi Guha's Model002, were built only for selected solvents.

Predictions can be made from this web page by entering the SMILES of the solute and optionally the SMILES, dipole moment and dielectric constant of any solvent (convenient sources for these are Wolfram Alpha and Wikipedia). Boc-glycine with diethyl ether as an optional solvent is shown here.
The prediction service then looks up the relevant molecular descriptors from the CDK and makes predictions for some common solvents and the optional one if requested.

If the name of the solute was entered, the service will also report all of the experimental measurements for that solute from the ONS Challenge with links to the lab notebook pages.

There are a few objectives in making this public.

First, we think that it might provide some ideas about possible good or bad solvents for a given solute. The dataset is certainly not large enough to provide a truly general prediction of solubility in absolute terms. However, comparing relative values might be helpful in many cases. In the example above for boc-glycine, the model predicts that toluene would be the poorest solvent, which matches the order of the experimental values, even if the absolute values are not a close match. DMSO, THF, methanol and ethanol are predicted to be good solvents and this is reflected in the measurements.
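Comparing relative rather than absolute values amounts to comparing the order of the solvents. A minimal sketch of that comparison follows. The function name is made up and the numbers are invented placeholders, not the model output or the measurements for boc-glycine.

```python
def rank_solvents(values):
    """Return solvent names ordered from poorest to best (ascending value)."""
    return sorted(values, key=values.get)


# Invented placeholder molarities, for illustration only
predicted = {"toluene": 0.1, "ethanol": 1.5, "THF": 2.0}
measured = {"toluene": 0.02, "ethanol": 0.9, "THF": 1.8}

same_order = rank_solvents(predicted) == rank_solvents(measured)
same_poorest = rank_solvents(predicted)[0] == rank_solvents(measured)[0]
print(same_order, same_poorest)
```

A model can be useful for picking solvents when the orders agree, even if the absolute values are far apart.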

Second, we want to make the model and data public so that other researchers with experience in this area can contribute their own models. We have been working with Marcin Wojnars from TunedIT to make it much easier for models to be submitted. Andy has just converted our dataset to ARFF format and it is available here. We should have more to report on this shortly.
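For those who have not seen ARFF before, it is a plain text format with a header declaring each attribute followed by comma-separated data rows. The sketch below writes a tiny file in that layout. It is not the conversion Andy used: the function names, the attribute names and the example row are hypothetical and do not describe the actual dataset.

```python
def format_value(value):
    """Quote strings (SMILES contain special characters); leave numbers bare."""
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'%s'" % escaped
    return repr(value)


def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and rows of values."""
    lines = ["@relation %s" % relation, ""]
    for name, kind in attributes:
        lines.append("@attribute %s %s" % (name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(format_value(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and a made-up row, for illustration only
attributes = [("solute_smiles", "string"), ("solvent_smiles", "string"),
              ("molarity", "numeric")]
rows = [("CC(=O)O", "CO", 0.5)]
print(to_arff("solubility", attributes, rows))
```

A real export would carry one numeric attribute per molecular descriptor in place of the SMILES columns, since that is what a model-building tool consumes.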

By using molecular descriptors from the solvents we should be able to do predictions for solvent mixtures as well. At some point perhaps we can even include temperature.

The current model fits the measurements with this type of distribution:

If we are able to build models automatically in real time after the addition of each data point, we should be able to set up automatic solubility measurement requests to minimize the amount of work it takes to improve each model. This is a step in that direction.


Wednesday, September 02, 2009

Jenna Mancinelli is Sept09 Submeta ONS Award Winner

Jenna Mancinelli, working under the supervision of Jean-Claude Bradley at Drexel University, is the September 2009 Submeta Open Notebook Science Challenge Award winner. She wins a cash prize from Submeta.

Jenna used both NMR and the sequential precipitation technique to obtain solubility data. See her experiments here:
http://onschallenge.wikispaces.com/list+of+experiments

One more Submeta ONS Award will be made during 2009. Submissions from students in the US and the UK are still welcome.
For more information see:
http://onschallenge.wikispaces.com
http://onschallenge.wikispaces.com/submetaawards08


Monday, August 17, 2009

My first talk at ACS09 fall meeting on Crowdsourcing Solubility and ONS

Yesterday (August 16, 2009) I gave my first talk at the ACS meeting in Washington. It was part of an outstanding session on Chemical Text Mining and Public Molecular Databases, organized by Antony Williams and Alex Tropsha.
9:00 AM 1 U.S. EPA computational toxicology programs: Central role of chemical-annotation efforts and molecular databases
Ann M. Richard, Maritja A. Wolf, ClarLynda R. Williams-Devane, Richard Judson
9:25 AM 2 Linking public and commercial chemical data: ChemSpider and SureChem
Nicko Goncharoff
9:50 AM 3 Building an integrated system for chemistry markup and online publishing integrated to online chemistry resources
A J Williams
10:30 AM 4 Turning mining inside out
Colin R Batchelor
10:55 AM 5 Chemreader: A tool for extracting chemical structure information from digital raster images
Jungkap Park, Kazu Saitou, Kerby Shedden, Gus R. Rosania
11:20 AM 6 Exploiting a hidden treasure: Automated chemical entity recognition in Chemisches Zentralblatt
Valentina Eigner-Pitto, Heinz Saller, Peter Loew
1:30 PM 12 Online chemical modeling environment: database
Sergii Novotarskyi, Iurii Sushko, Robert Körner, Anil Kumar Pandey, Igor V. Tetko
1:55 PM 13 Public molecular databases: How can their value be increased by generation of additional data in silico?
Vladimir V. Poroikov, Dmitry Filimonov, Marc C. Nicklaus
2:20 PM 14 Chemical space management of large libraries for new active small molecules selection for prostate cancer treatment
Andrew V. Scorenko, Andrei A. Gakh, Andrey V. Sosnov, Mikhail Yu. Krasavin
2:45 PM 15 Crowdsourcing nonaqueous solubility and synthesis using Open Notebook Science
Jean-Claude Bradley, Khalid Mirza, Rajarshi Guha, Andrew Lang, A. Williams
3:25 PM 16 ChemXSeer: A cyberinfrastructure for environmental chemical kinetics
Karl T. Mueller, William J. Brouwer, C. Lee Giles, Prasenjit Mitra, Carl Lagoze
4:15 PM 18 Reliable reactions and stable structures
Jonathan M Goodman
Many of the presentations highlighted the use of ChemSpider or full collaborations (such as the integration with SureChem patent data). The acquisition of ChemSpider by RSC was repeatedly discussed and this seems to have accelerated such collaborative projects. Colin Batchelor from the RSC provided a great talk on their approach of using ontologies to better leverage the power of chemistry publications. [The presentations were judged and Colin won first prize - I won second, which was pretty cool :) and won me a ticket to the CINF lunch on Tuesday]

I also got to meet Gus Rosania in person for the first time. We had met via the blogosphere a while back over our interests in malaria and Open Notebook Science. Gus was there to share his results from ChemReader, a software package he developed to automatically read chemical structures from images.

I started my presentation by detailing the recent events surrounding the report of the oxidation of secondary alcohols using NaH. The timing of this was perfect because it really showed how useful it can be to immediately share the full data of experiments. This is the type of thing that would have been extremely helpful during the initial reports of Cold Fusion but the tools for sharing in such a detailed way were just not available. Carmen Drahl just wrote an article about this for the August 17, 2009 issue of Chemical & Engineering News (subscriber access).




Creative Commons Attribution Share-Alike 2.5 License