This blog chronicles the research of the UsefulChem project in the Bradley lab at Drexel University. The main project currently involves the synthesis of novel anti-malarial compounds. The work is done under Open Notebook Science conditions with the actual detailed lab notebook located at usefulchem.wikispaces.com. More general comments posted here relate to Open Science, especially when associated with chemistry.
Friday, November 16, 2012
Matthew McBride wins Nov 2012 ONS Challenge Award
Matthew McBride, an undergraduate chemistry major at Drexel University working in the Bradley Laboratory, was awarded the November 2012 Open Notebook Science Challenge Award sponsored by the Royal Society of Chemistry. ChemSpider founder Antony Williams presented Matt the award on behalf of the RSC. Matt is exploring the synthesis and solubility characteristics of dibenzalacetone derivatives.
One of my former Ph.D. students, Patrick Ndungu (now at the University of KwaZulu-Natal, South Africa), will be speaking at Drexel University on Friday August 19, 2011 at 12:30 in Disque 109.
Some Interesting Perspectives on the Integration of Nanomaterials with Energy and Water Treatment Technologies
As part of various key concerns in a developing economy, clean energy and access to potable water are integral to most strategic visions for sustainable socio-economic development. Of particular interest is the search for greener energy solutions, including R&D into hydrogen energy technologies and devices that utilize solar energy, whilst clean water concerns centre on indigenous, cost-effective, and relatively simple technologies that can be easily deployed in remote or off-grid areas. Within this framework, this presentation will look at the evolution of a select body of work that has focused on the integration of carbon nanomaterials into systems for hydrogen storage, fuel cells, and photo-catalytic materials for water treatment.
Andrew Lang will be in Philadelphia next week and we will be running a workshop on Leveraging Google Spreadsheets with Scripts for Research and Teaching. Now that our institution is no longer providing Microsoft Office for students in the fall term, it seems like a good time to explore converting some assignments and projects relying on Excel to freely available Google Spreadsheets. (Resources available here)
Andrew Lang (Department of Mathematics at Oral Roberts University) and Jean-Claude Bradley (Department of Chemistry at Drexel University) will host a workshop on Google Apps Scripts from 10:30 to 12:00 on Tuesday August 23, 2011 at the Hagerty Library in room L13C. They will demonstrate how users with no programming experience can easily add functions and drop-down menus to a Google Spreadsheet. Some chemistry examples will be detailed, such as inter-converting compound identifiers (common name, SMILES, CAS number, etc.) and reporting properties (melting points, solubility, density, etc.) with a single click. Participants are encouraged to suggest applications in other fields to explore during the workshop.
Even though we have melting points for about 20,000 unique compounds, most of these are from single sources. Unless we can get another major donation of melting points (not using any of the sources we already have), progress in curating single values manually will take time.
As described in the abstract:
This book represents a PDF version of Dataset ONSMP029 (2706 unique compounds, 7413 measurements) from a project to collect and curate melting points made available as Open Data. This particular collection was selected from the application of a threshold to favor the likelihood of reliability. Specifically, the entire range of averaged values for a data point was set to 0.01 C to 5 C, with at least two different measurements within this range. Measurements were pooled and processed from the following sources: Alfa Aesar, MDPI, Bergstrom, PhysProp, DrugBank, Bell, Oxford MSDS, Hughes, Griffiths and the Chemical Information Validation Spreadsheet. Links to all the information sources and web services are available from the Open Melting Point Resource page: http://onswebservices.wikispaces.com/meltingpoint
This filtering of double-validated melting point measurements within a range of 5 C is an attempt to provide a "reasonably" good source. It is imperative to understand that this is not a "trusted source" - as I've mentioned several times, there is no such thing. However, since absolute trusted sources do not exist, this double-validated dataset of 2706 compounds is probably the best we can do for now. In fact, use of this double-validated dataset to build melting point models has led to some excellent models, which are far superior to models constructed from the entire database of 20,000 compounds.
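The double-validation threshold quoted in the abstract can be pictured as a simple filter. This is an illustrative JavaScript sketch, not the actual code used to build ONSMP029, and the data shape is assumed:

```javascript
// Illustrative sketch of the double-validation filter: a compound's pooled
// measurements pass if there are at least two of them and their total spread
// falls between 0.01 C and 5 C (the thresholds quoted in the abstract).
// A spread below 0.01 C would mean identical values, which more likely
// reflect copies of a single source than independent confirmation.
function passesDoubleValidation(valuesC) {
  if (valuesC.length < 2) return false;
  const range = Math.max(...valuesC) - Math.min(...valuesC);
  return range >= 0.01 && range <= 5;
}

// e.g. methanol's three close literature values would pass:
passesDoubleValidation([-98, -97.6, -97.53]); // true
```

The lower bound is the interesting design choice: it deliberately rejects "agreement" that is too perfect, since exact duplicates across sources usually trace back to one original measurement.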
Rapid analysis of melting point trends and models using Google Apps Scripts
I recently reported on how Google Apps Scripts can be used to facilitate the recording and calculations associated with a chemistry laboratory notebook. (also see resource page)
I will demonstrate here how these scripts can be used to rapidly discover trends in the melting points of analogs for the curation of data and the evaluation of models. The two melting point services that Andrew Lang created under the gONS menu were used to keep track of the measured and predicted melting points for all reactants and product as part of a "dashboard view" of the reaction being performed.
For looking at melting point trends, the following template sheet can be used.
For reasons explained previously, the template sheet has no active scripts in the page (except for the images). The cells contain just the values generated from running the scripts corresponding to the column headings on the common names. To use it for another series of compounds, just make a copy of the entire Google Spreadsheet (File->Make a Copy), then enter the new list and pick the desired script to run from the menus. Once the values are computed, remember to copy and paste as values.
It is important to understand that our melting point service is not a "trusted source" - it simply reports the average of all recorded data sources, ignoring values marked as DONOTUSE. That means that not all data points are equal and it is up to the user to determine a threshold of some type to decide how to use a particular data point.
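As a sketch of what the service reports, the rule is just to pool every value not flagged DONOTUSE and average the rest. The record shape ({valueC, flag}) is assumed here for illustration and is not the service's actual schema:

```javascript
// Sketch of the averaging rule: pool all recorded values, drop any flagged
// DONOTUSE, and report the mean of what remains. Returns null if nothing
// is left after filtering.
function averageMeltingPoint(records) {
  const kept = records
    .filter(r => r.flag !== "DONOTUSE")
    .map(r => r.valueC);
  if (kept.length === 0) return null;
  return kept.reduce((sum, v) => sum + v, 0) / kept.length;
}
```

Note that nothing in this rule weights values by reliability; that judgment is deliberately left to the user, which is the point of the paragraph above.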
In this investigation, I have marked in green averaged experimental values where at least 3 different values are clustered within a few degrees. A link in column H is automatically generated from the CSID to provide a very convenient way to evaluate the data sources. For example the link for methanol has 3 very close but different melting point values: -98 C, -97.6 C and -97.53 C. The -98 C value is repeated 7 times because this resulted from the automatic merging of several Open Collections.
In general we don't manually add values that are identical from different sources because it is likely that these all originate from the same source. We have to make that assumption because proper data provenance is usually lacking in chemical information sources today. A Google search will often return the same one or two melting points from dozens of sites, which may turn out to be an outlier when compared with other independent sources. (CAS numbers are generated in the template sheet because they are useful for searching Google for melting points - for example see here for methanol)
In another scenario, where there are 3 or more different but close values and a few clearly marked outliers, I considered these averages as having passed my threshold and colored them green as well. A good example is ethanol, which I have previously used to illustrate our curation method.
It turns out that for the series of n-alcohols from methanol to 1-decanol, I was able to mark in green every experimental melting point average, making the confidence level of the following plot about as high as it can get from current chemical information sources.
It is particularly gratifying to note that the predicted melting points based on Andrew Lang's random forest Model002 perform very well here, even predicting a melting point minimum at 3 carbons. Note that this model is Open Source and uses Open Descriptors derived from the CDK. It does not yet include the results of our most recent curation efforts. Any new models incorporating improved datasets will be listed here.
Extending the analysis to n-alkyl carboxylic acids from formic acid to decanoic acid provides the following plot, with the same confidence for the experimental averages.
For this series, the random forest model not only predicts that the lowest melting point is for the 5 carbon analog but it also appears to take the shape of a zig-zag pattern, especially for the first 6 acids. Since this alternating pattern has been attributed to the way that carboxylic acid dimer bilayers pack in 3D (Bond2004), it is hard to imagine how simple 2D descriptors from the CDK can predict this. We will have to investigate this in more detail.
More generally, molecular symmetry can greatly affect the melting point via the way that crystals pack in 3D (see Carnelley's Rule, Brown2000). At some point we would like to incorporate this factor in our models. The current model should not be able to make predictions based on symmetry or stereochemistry.
We can also explore the melting point patterns of cyclic systems. Going from cyclopropane to cyclohexane, there is a large jump in melting point between the 5- and 6-membered rings, and this is roughly reflected in the model:
Cycloalkanones behave similarly to cycloalkanes, showing a jump from 5- to 6-membered rings, which agrees well with the model going from cyclobutanone to cyclohexanone:
These examples show that provenance information is a critical dimension in the analysis of trends in melting point data. The Google Apps Scripts and associated Google Spreadsheet template presented here offer a quick and convenient way to access both averaged values and the information needed to assess confidence in them. Performing these tasks manually is generally too time-consuming to encourage researchers to follow such a practice. This is perhaps the reason that the current peer-review process accepts a single "trusted source" in analyses of this kind, even though such a practice inevitably leads to misinterpretations and errors that cascade through the scientific literature.
The most problematic aspect of Google Apps Scripts running within Google Spreadsheets turns out to be the way caching and refreshing operate. There does not appear to be an obvious way to refresh a single cell. So if a script times out or fails, Google stores that failed output on their servers and will not run it again until some time has elapsed (which seems to be on the order of about an hour). Typing in a new input for that cell will cause the script to run again but entering a previously entered input will only retrieve the cached output, even a failed output. For example, if you have a cell calculating the MW from "benzene" entered in another cell and the script fails for any reason, typing in "ethanol" will get it to run again for the new input, but going back to "benzene" will just pull up the cached output of "Failed".
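The behaviour can be pictured with a toy model of a memo table that records every output, including failures, and replays them on repeated input. This is a simplified illustration of the symptom, not how Google's servers actually work:

```javascript
// Simplified model of the caching problem: every output is stored, including
// "Failed", and a repeated input replays the stored output instead of
// re-running the script.
function makeCachedService(fetchFn) {
  const cache = new Map();
  return function (input) {
    if (cache.has(input)) return cache.get(input); // replays failures too
    let output;
    try {
      output = fetchFn(input);
    } catch (e) {
      output = "Failed";                           // the failure gets cached
    }
    cache.set(input, output);
    return output;
  };
}
```

In this model, a new input always runs the underlying function, but re-entering an input whose first attempt failed just retrieves the cached "Failed" - exactly the benzene/ethanol behaviour described above.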
Nevertheless, I did come across some tricks to force a refresh indirectly. If you insert a row or column then re-enter the desired scripts in the new cells, they will run again. You simply need to then delete the old column with failed outputs. This is fine for simple sheets but it can be a headache for sheets that have several calculation dependencies between cells.
Another way to force a refresh is to duplicate the entire sheet, delete the old sheet and then rename the new one to the original name. The problem now is that it will refresh all the cells, not just those that had failed outputs. And if there are a large number of scripts on that sheet the odds are good that at least one will fail on that particular attempt, especially if several are hitting the same web server.
As a result of all these problems, I would not recommend using these services as I had initially hoped, where a researcher would enter data into a template sheet loaded with scripts to automatically generate a series of calculated outputs. There is a way to achieve this end but it requires thinking about the scripts in a slightly different way.
As I mentioned above, there are tricks for refreshing an entire sheet or a column or row. In order to avoid re-running the scripts that already returned desired outputs, we need to lock them in. This can be done by highlighting the completed cells, copying them (either control-c or Edit->Copy) then pasting them as values (from the Edit menu). Now refreshing will only be done on the cells with failed outputs and these can be locked in as well as soon as they complete.
The downside of this approach is that you lose the information about which script was run to generate the output values. And to change an input requires re-selecting the desired script. But in practice it is so convenient to open a drop-down menu and hit getMW (for example) that this downside is quite minimal, especially when contrasted with the upside of knowing that others will see your information reliably, independent of how the services are running at a particular time.
Over the past few weeks we have found that some services fail more often than others and it would be advantageous to have some redundancies. This has been particularly problematic for the cactus services recently, which we often use for resolving common names. By using ChemSpiderIDs (CSIDs), the cactus services can be bypassed for several of the gONS services. So a good practice for any application is to generate and lock in SMILES and CSIDs right away from the common name. CAS numbers can be used too but the gChem service that Rich has created sometimes yields multiple CAS numbers and these will fail as input for a subsequent script.
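The redundancy idea can be sketched as trying a list of resolver services in order and keeping the first usable answer. The resolver functions below are hypothetical stand-ins, not the real cactus or ChemSpider calls:

```javascript
// Sketch of service redundancy: try each resolver in turn and return the
// first result that is neither an error nor a failure marker. The individual
// resolver functions are assumed to be provided by the caller.
function resolveWithFallback(resolvers, input) {
  for (const resolve of resolvers) {
    try {
      const result = resolve(input);
      if (result != null && result !== "Failed") return result;
    } catch (e) {
      // service timed out or is down; fall through to the next one
    }
  }
  return "Failed";
}
```

Locking in SMILES and CSIDs right away plays the same role manually: once those identifiers are pasted as values, later scripts never need to touch the flakier name-resolution services again.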
We now have a chemistry Google Apps Scripts spreadsheet to keep track of which inputs are allowed for all the available services, along with information about the output, creator and description. We also keep track of requests and plans for new scripts, marked as "pending" under the status field.
Surprisingly, pasting images "as values" within a Google Spreadsheet cell does not ensure that they will appear consistently - often the cells are just blank upon loading. This makes the idea of using an embedded sheet to display reaction schemes within a wiki lab notebook page not practical. However, using the scripts and a template to generate the scheme by just typing the name, SMILES or CSID for the reactants and product is a very efficient way to generate a consistent look for schemes within a notebook. It only requires a final step of taking the image of the screen and cropping using Paint. For example, here is a scheme thus generated for UC-EXP269.
Taking into account all of these factors, the reaction template sheet we provide does not have by default any scripts running within cells (except for the images). However, it is set up to quickly adapt to other reactions for planning amounts of reactants (by weight or volume), calculating concentrations, yields, melting points (experimental and predicted), solubilities, links to ChemSpider, 2D rendering of structures (including full schemes) and links to interactive NMR spectra using ChemDoodle. It simply requires users to hit one of the 3 drop-down menus (gChem, gCDK or gONS) and select the appropriate script for a particular cell.
Even if the user does not want to use this particular reaction template it still makes sense to make a copy of the template sheet because it is an easy way to copy all of the necessary Google Scripts without opening the editor.
On April 6, 2011 I presented at the HUBzero Conference in Indianapolis on "Open Notebook Science: Does Transparency Work?".
This presentation will first describe Open Notebook Science, the practice of making the laboratory notebook and all associated raw data available to the public in real time. Examples of current applications in organic chemistry - solubility and chemical reactions - will be detailed. Key details of the current technical implementation will be described and possible applicability to nanotechnology projects will be explored. Finally, the implications for Intellectual Property protection, claims of priority, subsequent publication in peer-reviewed journals and the eventual automation of the scientific process will be discussed.
I learned a great deal at the conference about how researchers from various fields use the HUBzero software to manage and share their data. As described on their website:
HUBzero® is a platform used to create dynamic web sites for scientific research and educational activities. With HUBzero, you can easily publish your research software and related educational materials on the web.
Although the system is not primarily designed for completely Open sharing, I did get the impression that for some applications there was significant interest in making data and processes more Open. There is certainly an enthusiastic user community around HUBzero - check out the recordings for some of the other talks here.
With the information available thus far from our experiments (UC-EXP266), we think it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C. The patent reports that solidification of some viscous mixtures took up to a full week but we did not observe an appreciable increase in viscosity for 4-benzyltoluene at -15 C. But in order to be sure we will first freeze the sample again below -40 C and let it warm up to -15 C in the freezer and confirm that it melts completely.
But when we took the sample out of the freezer after 16 days it was completely frozen!
This now effectively ruled out the -30 C value and re-opened the possibility that the +4.6 C value could be the best estimate. Learning from our previous failed attempt to observe a temperature plateau when heating the sample, this time we let it warm as slowly as possible by leaving it in an ice water bath inside of a Styrofoam container. This worked much better as the sample warmed a few degrees over several hours. This time Evan observed a clear transition from the solid to the liquid phase in the 4-6 C range (UC-EXP266).
The curation record for the melting point of 4-benzyltoluene now looks like this:

When I introduce the concept of Open Notebook Science in my talks I usually make the point that there are no facts - just measurements embedded within assumptions.
The 4-benzyltoluene melting point story is a really good example of this principle. When I stated that I thought that "it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C", it was not the measurement that was in error - it was the interpretation. And when new information came to light, an experiment was proposed to either challenge or further support that interpretation. There were never any "facts" in this story (nor is the +4.6 C value a "fact" from these results).
I think that this is how science functions best and most efficiently. Unfortunately we don't usually have access to all pertinent raw measurements, assumptions and interpretations. I would be extremely interested in seeing how the -30 C value was determined. This is actually the value provided by the company that sold us this batch of material (as well as the PhysProp entry in the image above). Because of slow crystallization, I can see how this could happen if the temperature was dropped until solidification was observed. In our observations, the -30 C to -35 C range is roughly where we observed rapid solidification upon cooling. (UC-EXP266)
Google Apps Scripts for an intuitive interface to organic chemistry Open Notebooks
Rich Apodaca recently demonstrated how Google Apps Scripts can be added to Google Spreadsheets to enable simple calling of web services for chemistry applications (gChem). Although we have been using web service calls from within a Google Spreadsheet for some time (solubility calculation by NMR link #3 and misc chem conversions link #1), the process wasn't as intuitive as it could be because one had to find and then paste lengthy URLs.
Rich's approach enables simply clicking the desired web service from a menu on Google Spreadsheets and these functions have simple names like getSMILES. Andrew Lang has now added several web services from our ONS projects and the CDK. There are now 3 menus to choose from: gChem, gCDK and gONS.
To demonstrate the power of these tools consider the rapid construction of a customized interface to an experiment in a lab notebook (in this example UC-EXP263).
1) Because Andy has added a gONS service to render images of molecules from ChemSpider, consistent reaction schemes can now be constructed from this template by simply typing the name of the reactants and products then embedding in the wiki.
2) Planning of the reaction to calculate reactant amounts and product yield can then be processed by simply typing the name of the chemicals. Services calling molecular weight and density are automatic based on the chemical name as input.
3) Typing the name of the solvent then allows easy access to the solubility properties of the reaction components. The calculated concentrations of the reactants and product can be directly compared with their measured maximum solubility. In this experiment the observed separation of the product from the solution is consistent with these measurements.
4) Both experimental and predicted melting points (using Model002) can then be lined up for comparison. A large discrepancy between the two would flag a possible error - in this case good agreement is found. Noting that the product's melting point is near room temperature (53 C) explains why two layers were observed to form during the course of the reaction and why cooling to 0 C induced the product to precipitate. Links to the melting measurements are also provided in column N for easy exploration.
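The arithmetic behind steps 2 and 4 is straightforward once the lookups have returned. A sketch of the planning calculations (the numbers in the usage note are illustrative, not values from UC-EXP263):

```javascript
// Sketch of the planning calculations a reaction sheet performs once the
// molecular weight and density lookups (e.g. via getMW) have returned.
function moles(massG, mwGPerMol) {
  return massG / mwGPerMol;                 // mol from grams and g/mol
}
function massFromVolume(volumeML, densityGPerML) {
  return volumeML * densityGPerML;          // grams from a measured volume
}
function percentYield(actualMolProduct, theoreticalMolProduct) {
  return 100 * actualMolProduct / theoreticalMolProduct;
}
```

For example, 0.79 g of a reactant with MW 79 g/mol is 0.01 mol, and recovering 0.008 mol of product against a 0.01 mol theoretical maximum is an 80% yield. A melting-point sanity check like the one in step 4 is then just a comparison of the experimental and predicted values against some tolerance.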
About a year ago, I wrote about Mike Brown and the controversy about the discovery of Haumea stemming from a competitor's more aggressive data dissemination practice. In that post I speculated that we could expect accelerated data sharing over time due to the Open Science Ratchet, where the actions of scientists that are most open set the pace for everyone else working on that particular project, regardless of their views on how secretive science should be.
I don't know if Mike Brown has changed his views on data sharing - or if he has always felt this way but thought it was too risky until now. Either way, he certainly is taking the lead at this point to demonstrate how radical openness can be done in astronomy!
The first panel was on the "International Year of Chemistry: Perils and Promises of Modern Communication in the Sciences". My colleague Laurence Souder from the Department of Culture and Communications at Drexel presented on "Trust in Science and Science by Blogging", using as an example the NASA press release on arsenic replacing phosphorus in bacteria and subsequent controversy taking place in the blogosphere. (see post in Scientific American blog today)
The second panel was on "New Forms of Scholarly Communications in the Sciences". Don Hagen from the National Technical Information Service presented on "NTIS Focus on Science and Data: Open and Sustainable Models for Science Information Discovery" and Dorothea Salo discussed the evolving role of libraries and institutional repositories on scholarly communication and archiving.
More on 4-benzyltoluene and the impact of melting point data curation and transparency
There are many motivations for performing scientific research. One of these is the desire to advance public scientific knowledge.
This is a difficult concept to quantify or even qualitatively assess. One can try to use literature citations and impact factors but that captures only a small fraction of the true scientific impact. For example, one formal citation of our solubility dataset doesn't represent the 100,000 anonymous solubility queries made directly to our database. And of these the actual impact will depend on exactly how the information was used. Egon Willighagen has identified this as a problem for the Chemistry Development Kit (CDK) as well: many more people use the CDK than reflected simply by the number of citations to the original paper.
There are a few of us who believe that curating chemistry data is a high impact activity. Antony Williams spends a considerable amount of time on this activity and frequently uncovers very serious errors from a number of data sources. Andrew Lang and I have put in a similar effort in collecting and curating solubility measurements openly - and recently (with Antony) we have been doing the same for melting points.
Although attempting to estimate the total impact of the curation activity isn't really practical, we can look at a specific and representative example to capture the scope.
I recently exposed the situation with the melting point measurements of 4-benzyltoluene. In brief, the literature provided contradictory information that could not be resolved without performing an experiment. Although an exact measurement was not found, a limit was determined that ruled out all measurements except for one.
Ironically it turns out that the melting point of this compound is its most important property for industrial use! Derivatives of diphenylmethane were sought out to replace PCBs as electrical insulating oils for capacitors because of toxicity concerns. As described in this patent (US5134761), for this application one requires the oil to remain liquid down to -50 C. Another key requirement is the ability to absorb hydrogen gas liberated at the electrode surface (a solubility property). Since this is optimal for smaller alkyl groups on the rings, it places benzyltoluene isomers at the focal point of research for this application.
The patent states: "According to references, the melting points of the position isomers of benzyltoluenes are as follows..." but does not give a specific reference. However, by comparing the numbers with other sources we can presume that the reference is the Lamneck1954 paper I discussed previously.
The patent then uses these melting points to calculate the melting behavior of mixtures of these isomers, as they are obtained, without further purification, from a Friedel-Crafts reaction.
If our results are correct and the melting point of 4-benzyltoluene is not +4.6 C but well below -15 C, then the calculated properties in the patent may be significantly in error as well. With the information available thus far from our experiments (UC-EXP266), we think it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C. The patent reports that solidification of some viscous mixtures took up to a full week but we did not observe an appreciable increase in viscosity for 4-benzyltoluene at -15 C. But in order to be sure we will first freeze the sample again below -40 C and let it warm up to -15 C in the freezer and confirm that it melts completely.
It is in light of this analysis that I make the case that open curation of melting point data is likely to be a high impact activity relative to the amount of time required to perform it. The problem is that errors such as these cascade through the scientific record and likely retard scientific progress by causing confusion and wasted effort. Consider the total cost in terms of research and legal fees for just one patent. As I discussed previously, consider the effect of compromised and contradictory data now known to exist within training sets on the pace of developing reliable melting point models (cascading down to solubility models dependent upon melting point predictions or measurements - and ultimately cascading to the efficiency of drug design).
It is important to note that the benefits of curation would be greatly diminished without the component of transparency. We are not claiming to provide a "trusted source" of melting point data. There is no such thing - and operating under the illusion of the trusted source model has resulted in the mess we are in now - with multiple melting point values for the same compound cascading and multiplying to different databases (a good and still unresolved example is benzylamine).
What we are doing is reporting all the sources we can find and marking some sources as DONOTUSE so they are not included in the calculation of the average - with an explanation. We never delete data, so users can make informed choices and are not in a position of having to trust our judgement. If someone does not agree with me that failure to freeze after 2 days at -15 C rules out the +4.6 C value for the melting point of 4-benzyltoluene, they are free to use it.
Using a trusted source model, all values within a collection are equally valid. In the transparency model not all values are equal - we are justifiably more confident in a melting point value near -114 C for ethanol than for a melting point with a single source (like this compound).
And finally, an important factor for having an impact on science is discoverability. It is likely that someone doing research involving the melting behavior of 4-benzyltoluene would perform at least a quick Google search. What they are likely to find is not just a simple number without provenance but rather a collection of results capturing the full subtlety of the situation under discussion. This is a natural outcome of working transparently.
The quest to determine the melting point of 4-benzyltoluene
I recently reported that we are attempting to curate the open melting point measurements collected from multiple sources such as Alfa Aesar, PhysProp (EPIsuite) and several smaller collections. I mentioned that some values - like benzylamine - simply don't converge and the only way to resolve the issue is to actually get a high purity sample and do a measurement.
Since that report, we found another non-converging situation with 4-benzyltoluene. As shown below, reported measurements range from -30 C to 125 C.
The values in red have been removed from the calculation of the average based on evidence we obtained from ordering the compound from TransWorld Chemicals and observing its behavior when exposed to various temperatures. The details can be found from UC-EXP266 (which I performed with Evan Curtin).
Immediately after opening the package it was clear that the compound was a liquid and thus the 125 C and 98.5 C values became improbable enough to remove.
First Evan Curtin and I dropped the still sealed bottle into an ice bath (0 C) and after 10 minutes there was no trace of solidification.
At this point, this does not necessarily rule out the values near 5 C because of the short time in the bath.
We then used an acetone/dry ice bath and did see a rapid and clear solidification after reaching -30 C to -35 C.
Letting the bath temperature rise, it was difficult to tell what was happening, but there seemed to be some liquefaction around -12 C.
In order to get a more precise measurement, we transferred about 2 mL of the sample into a test tube and introduced the thermometer directly in contact with the substance. After quickly freezing the contents in a dry ice/acetone bath, the sample was removed and its behavior was observed over time, as shown below.
I was expecting to see the internal temperature rise, then plateau at the melting point until all the solid disappeared, and then finally observe a second temperature rise. This comes from experience in making 0 C baths within minutes by simply throwing ice into pure water.
As shown above, that is not at all what happened. The liquid formed gradually starting at about -9 C and never reached a plateau even up to +7 C, where there was still much solid left.
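The plateau I was looking for can be stated operationally: a run of consecutive readings whose spread stays within a small tolerance while time advances. A toy version of that check (illustrative only; the tolerance and run length are arbitrary choices, not values from the experiment):

```javascript
// Toy check for a melting plateau: scan equally spaced temperature readings
// for a run of at least minRun consecutive points whose spread stays within
// toleranceC. Returns the start index of the first such run, or -1 if the
// curve never flattens out (as in our 4-benzyltoluene warming curve).
function findPlateau(tempsC, toleranceC, minRun) {
  for (let i = 0; i + minRun <= tempsC.length; i++) {
    const window = tempsC.slice(i, i + minRun);
    if (Math.max(...window) - Math.min(...window) <= toleranceC) return i;
  }
  return -1;
}
```

On an idealized ice-water curve this finds a flat stretch at 0 C; on a curve that rises steadily through the melting range, it returns -1.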
If we look at the method used to generate the 4.58 C value (Lamneck1954), we find that a similar method was cited - but not actually described there. The actual curves are not available either. However, this paper provides melting points for several compounds within a series, which is often useful for spotting possible errors - unless of course these are systematic errors. In this particular case it doesn't help much: the 2-methyl derivative is similar, but the 3-methyl analogue is very close to the -30 C value listed in our sources.
Notice that one of the "melting points" (3-methyldicyclohexylmethane) is not even measurable because it forms a glass. It is easy to see how melting points below room temperature can generate very different values - and very difficult to assess if the full experimental details of the measurements are not reported.
To get at more details, let's look at the referenced paper (Goodman1950). Indeed, the researchers determined the melting point by plotting the temperature over time as the sample is heated and looking for a plateau. The obvious difference is that their heating rate is about an order of magnitude slower than in our experiment. This paper also highlights the fact that there are more twists and turns in the melting point story. One compound (2-butylbiphenyl) was found to have two melting points that can be observed by seeding with different polymorphic crystals.
At this point, our objective of obtaining an actual melting point was replaced with trying to at least mark a reasonably confident upper limit. After leaving the sample at -15 C in a freezer for two days, no solidification was observed - not even an appreciable increase in viscosity. For this reason, all melting point values above -15 C were removed from the calculation of the average and show up in red.
With only the -30 C measurement left, this is now the default value for 4-benzyltoluene - until further experimentation.
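The bookkeeping behind this decision is simple enough to sketch: keep every reported value, flag those above the experimentally established upper limit, and average only the rest. The numbers below are the ones quoted in this post (with the "near 5 C" value approximated as 5.0); the function itself is just an illustration, not our actual curation code:

```python
def curated_average(values, upper_limit):
    """Average only the values at or below the experimentally
    established upper limit; the rest are flagged, never deleted."""
    kept = [v for v in values if v <= upper_limit]
    flagged = [v for v in values if v > upper_limit]
    avg = sum(kept) / len(kept) if kept else None
    return avg, flagged

# Reported melting points for 4-benzyltoluene (deg C);
# the freezer test set an upper limit of -15 C.
reported = [-30.0, 4.58, 5.0, 98.5, 125.0]
avg, flagged = curated_average(reported, upper_limit=-15.0)
print(avg)      # -30.0: the only surviving value
print(flagged)  # [4.58, 5.0, 98.5, 125.0]
```

Keeping the flagged values around (rather than deleting them) is what lets anyone question the exclusion later, as described for the "DONOTUSE" convention below.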
The question then is: why do QSPR models consistently perform significantly worse with regard to melting point? In the Introduction, we proposed three reasons for the failure of QSPR models: problems with the data, the descriptors, or the modeling methods. We find issues with the data unlikely to be the only source of error in Log S, Tm, and Log P predictions. Although the accuracy of the data provides a fundamental limit on the quality of a QSPR model, we attempted to minimize its influence by selecting consistent, high quality data... With regards to the accuracy of Tm and Log P data, both properties are associated with smaller errors than Log S measurement. Moreover, the melting point model performed the worst, yet it is by far the most straightforward property to measure...We suggest that the failure of existing chemoinformatics descriptors adequately to describe interactions in the crystalline solid phase may be a significant cause of error in melting point prediction.
Indeed, I have often heard that melting point prediction is notoriously difficult. This paper attempted to discover why and suggested that it is more likely that the problem is related to a deficiency in available descriptors rather than data quality. The authors seem to argue that taking a melting point is so straightforward that the resulting dataset is almost self-evidently high quality.
Since we have no additional information to go on (no spectral proof of purity, reports of heating rate, observations of melting behavior, etc.) the only way we can validate data points is to look for strong convergence from multiple sources. For example, consider the -130 C value for the melting point of ethanol (as discussed previously in detail). It is clearly an outlier from the very closely clustered values near -114 C.
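When no experimental metadata is available, a value like ethanol's -130 C can be flagged against the tight cluster near -114 C with a simple robust rule: distance from the median compared to the typical spread. A sketch of that idea (the function, thresholds, and exact values are illustrative choices, not part of our actual pipeline):

```python
from statistics import median

def flag_outliers(values, factor=5.0, floor=1.0):
    """Flag values whose distance from the median exceeds `factor`
    times the median absolute deviation (with a minimum spread
    `floor` so a very tight cluster still tolerates rounding noise)."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    cutoff = factor * max(mad, floor)
    return [v for v in values if abs(v - m) > cutoff]

# Ethanol melting points (deg C): a tight cluster near -114
# plus the -130 outlier discussed in the post.
ethanol = [-114.5, -114.1, -114.0, -113.9, -114.3, -130.0]
print(flag_outliers(ethanol))  # [-130.0]
```

Using the median rather than the mean matters here: a single wild value like -130 C drags the mean toward itself but barely moves the median, so the outlier cannot mask its own detection.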
This outlier value is now highlighted in red to indicate that it was explicitly identified to not be used in calculating the average. Andrew Lang has now updated the melting point explorer to allow a convenient way to select or deselect outliers and indicate a reason (service #3). For large separate datasets - such as the Alfa Aesar collection - this can be done right on the melting point explorer interface with a click. For values recorded in the Chemical Information Validation sheet, one has to update the spreadsheet directly.
This is the same strategy that we used for our solubility data - in that case by marking outliers with "DONOTUSE". This way, we never delete data so that anyone can question our decision to exclude data points. Also by not deleting data, meaningful statistical analyses of the quality of currently available chemical information can be performed for a variety of applications.
The donation of the Alfa Aesar dataset to the public domain was instrumental in allowing us to start systematically validating or excluding data points for practical or modeling applications. We have also just received confirmation that the entire EPI (PhysProp) melting point dataset can be used as Open Data. Many thanks to Antony Williams for coordinating this agreement and for approval and advice from Bob Boethling at the EPA and Bill Meylan at SRC.
In the best case scenario, most of the melting point values will quickly converge as in the ethanol case above. However, we have also observed cases where convergence simply doesn't happen.
One has to be careful when determining how many "different" values are in this collection. Identical values are suspicious since they may very well originate from the same ultimate source. Convergence for the ethanol value above is credible because most of the values are very close but not completely identical, suggesting truly independent measurements.
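That suspicion about identical values can also be made mechanical: count how many reported values are exact repeats, since verbatim copies may trace back to a single ultimate source. A rough sketch (the function and the sample numbers are invented for illustration):

```python
from collections import Counter

def independence_check(values):
    """Estimate how many apparently independent measurements a list
    contains: exact repeats are counted once, on the suspicion that
    they were copied from the same ultimate source."""
    counts = Counter(values)
    repeats = {v: n for v, n in counts.items() if n > 1}
    return len(counts), repeats

# Five entries, but two of them are verbatim copies of -114.1:
reported = [-114.5, -114.1, -114.1, -114.0, -113.9]
distinct, repeats = independence_check(reported)
print(distinct)  # 4 plausibly independent values
print(repeats)   # {-114.1: 2}
```

This is only a heuristic, of course: two labs can legitimately report the same rounded value, and conversely two entries copied from one source can differ after unit conversion.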
In this case the values actually diverge into clusters at either +10 C, -10 C, -30 C, or about -45 C. If you want to play the "trusted source" game, do you trust the Sigma-Aldrich value at +10 C more, or the Alfa Aesar value at -43 C?
Let's try looking at the peer-reviewed literature. A search on SciFinder gives the following ranges:
The lowest melting point listed there is the +10C value we already have in our collection but these references are to other databases. The lowest value from a peer-reviewed paper is 37-38 C.
This is strange because I have a bottle of benzylamine in my lab and it is definitely a liquid. Investigating the individual references reveals a variety of errors. In one, benzylamine is listed as a product but from the context of the reaction it should be phenylbenzylamine:
In another example, the melting point of a product is incorrectly associated with the reactant benzylamine: The erroneous melting points range all the way up to 280 C and I suspect that many of these are for salts of benzylamine, as I reported previously for the strychnine melting point results from SciFinder.
With no other obvious recourse from the literature to resolve this issue, Evan attempted to freeze a sample of benzylamine from our lab (UC-EXP265).
Unfortunately, the benzylamine sample proved to be too impure (<85% by NMR) and didn't solidify even down to -78 C. We'll have to try again with a much purer sample. It would be useful to get reports from a few labs that happen to have benzylamine handy and can provide proof of purity by NMR and a picture to demonstrate solidification.
As most organic chemists will attest, amines are notorious for appearing as oils below their melting points in the presence of small amounts of impurities. I wonder if the divergence of melting points in this case is due to this effect. By collecting NMR data from various samples subjected to freezing, it might be possible to quantify the effect of purity on the apparent freezing point. The images of the solidification are also important because some may mistake very high viscosity for actual formation of a solid. At -78 C we observed the sample to exhibit a viscosity similar to that of syrup.
La Science par Cahier de Laboratoire Ouvert à l'Acfas
On May 9, 2011 I presented remotely for the French-Canadian Association for the Advancement of Science (ACFAS). This was the first time I gave a talk about Open Notebook Science in French. In fact the last time I gave a scientific talk in French was probably in 1995, when I was doing a postdoc at the Collège de France in Paris. I remember being teased for my French Canadian accent back then so happily that wasn't an issue this time. Even though I was a bit rusty I think I managed to communicate the key points well enough. (At least I hope I did)
My presentation was a good fit for the theme of the conference: Une autre science est possible : science collaborative, science ouverte, science engagée, contre la marchandisation du savoir. (Another science is possible: collaborative science, open science, engaged science, against the commodification of knowledge.) I would like to thank the organizers (Mélissa Lieutenant-Gosselin and Florence Piron) for inviting me to participate.
I was able to record most of the talk (see below) but very near the end Skype decided to install an update and shut down so the recording ends somewhat abruptly. Given what people use Skype for, that default setting for updates really doesn't make much sense.
Breast Cancer Coalition talk on ONS and Taxol solubility
On May 1, 2011 I presented "Accelerating Discovery by Sharing: a case for Open Notebook Science" at the National Breast Cancer Coalition Annual Advocacy Conference in Arlington, VA. This was the first year where they had a session on an Open Science related theme and the organizers invited me to highlight some of the tools and practices in chemistry which might be applicable to cancer research.
I was really touched by the passion of the audience as well as the other speakers and conference participants I met afterward. For many, their deep connection with the cause was rooted in personal experience, either as breast cancer survivors themselves or through loved ones. Several expressed frustration with the current system of sharing results from scientific studies. They felt that knowledge sharing is much slower than it needs to be and that potentially useful "negative" results are generally not disclosed at all.
The NBCC has ambitiously set 2020 as the deadline to end breast cancer (including a countdown clock). It seems reasonable to me that encouraging transparency in research is a good strategy to accelerate progress. Of course, great care must be exercised wherever patient confidentiality is a factor. But health care researchers are already experienced with following protocols to anonymize datasets for publication. Opting to work more openly would not change that but it might affect when and how results are shared. Also there is a great deal of science related to breast cancer that does not directly involve human subjects.
One initiative that particularly impressed me was The Susan G. Komen for the Cure Tissue Bank, presented by Susan Clare from Indiana University and moderated by Virginia Mason from the Inflammatory Breast Cancer Research Foundation. As a result of this effort, thousands of women have donated healthy breast tissue to create a comprehensive database richly annotated with donor genetics and medical history. The idea of trying to tackle a disease state by first understanding normal functioning in great detail was apparently somewhat of a paradigm shift for the cancer research community and it was challenging to implement. According to Dr. Clare, data from the Tissue Bank have shown that the common practice of using apparently unaffected tissue adjacent to a tumor as a control may not be valid.
This example highlights one of the key principles of Open Science: there is value in everyone knowing more - even if it isn't immediately clear how that knowledge will prove to be useful.
In my experience, this is a fundamental point that distinguishes those who are likely to favor Open Science from those who reject its value. If two researchers are discussing Open Science and only one of them views this philosophy as self-evident, the conversation will likely be about why someone would want (or not want) to share more, and the focus will fall on extrinsic motivators such as academic credit, intellectual property, etc. If both researchers view this philosophy as self-evident, the conversation will probably gravitate towards how and what to share.
I refer to this philosophy as being self-evident because I don't think people can become convinced through argumentation (I've never seen that happen). Within the realm of Open Notebook Science I have been involved in countless discussions about the value of sharing all experimental details - even when errors are discovered. I can think of a few ways in which this is useful - for example telegraphing a research direction to those in the field or providing data for researchers who study how science is actually done (such as Don Pellegrino). But even if I couldn't think of a single application I believe that there is value in sharing all available data.
A good example of this philosophy at work is the Spectral Game. Researchers who uploaded spectral data to ChemSpider as Open Data did not anticipate how their contribution would be used. They didn't do it for extrinsic motives such as traditional academic credit. Assuming that their motivation was similar to our group's, they did it because they believed it was an obviously useful thing to do. It is only much later - after a critical mass of open spectra were collected - that the idea arose to create a game from the dataset.
The first set involves the solubility behavior of biomolecules within the cellular environment. An example would be the observed increased solubility of gamma-tubulin in cancerous cells. The second type of results addresses the difficulty in preparing formulations for cancer drugs due to solubility problems. A good example of this is Taxol (paclitaxel), where existing excipients are not completely satisfactory - in the case of Cremophor EL some patients experience a hypersensitivity. Since our modeling efforts thus far have focused on non-aqueous solubility, there is possibly an opportunity to contribute by exploring the solubility behavior of paclitaxel. By inputting solubility data from a paper (Singla 2002) into our solubility database, Abraham descriptors for paclitaxel are automatically calculated and the solubilities in over 70 solvents are predicted.
Because of the way we expose our results to the web, a Google search for "paclitaxel solubility acetonitrile" now returns the actual value in the Google summary on the first page of results (currently 7th on the first page). The other hits have all 3 keywords somewhere in the document but one has to click on each link then perform a search within the document to find out if the acetonitrile solubility for paclitaxel is actually reported. (Note that clicking on our link ultimately takes you to the peer-reviewed paper with the original measurement.)
To be clear about what we are doing here - we are not claiming to be the first to predict the solubility of paclitaxel in these solvents using Abraham descriptors or any other method. Nor are we claiming that we have directly made a dent in the formulation problem of paclitaxel. We are not even indicating that we have done a thorough search of the literature - that would take a lot more time than we have had given the enormous amount of work on paclitaxel and its derivatives.
All we are doing is fleshing out the natural interface between the knowledge space of the UsefulChem/ONS Challenge projects and that of breast cancer research - AND - we are exposing the results of that intersection through easily discoverable channels. By design, these results are exposed as self-contained "smallest publishable units" and they are shared as quickly (and as automatically) as possible. The traditional publication system does not have a mechanism to disseminate this type of information. (Of course, when enough of these are collected and woven into a narrative that fits the criteria for a traditional paper, they can and should be submitted for peer-reviewed publication.)
Here is a scenario for how this could work in this specific instance. A graduate student (who has never heard of Open Science or UsefulChem, the ONS Challenge, etc.) is asked to look for new formulations for paclitaxel (or other difficult to solubilize anti-cancer agents). They do a search on commercial databases offered by their university for various solubilities of paclitaxel and cannot find a measurement for acetonitrile. They then do a search on Google and find a hit directly answering their query, as I detailed above. This leads them to our prediction services and they start using those numbers in their own models.
That is a good outcome - and that is exactly what has been happening (see the gold nanodot paper and the phenanthrene soil contamination study as examples). But the real paydirt would come from the graduate student recognizing that we've done a lot of work collecting measurements and building models for solubility and melting points, and contacting us about a collaboration. As long as they are comfortable with working openly, we would be happy to work together actively.
I'm using the formulation of paclitaxel as an example but I'm sure that there are many more intersections between solubility and breast cancer research. With a bit of luck I hope we can find a few researchers who are open to this type of collaboration.
As another twist to this story, I will briefly mention here too that Andrew Lang has started to screen our Ugi product virtual library for docking with the site where paclitaxel binds to gamma-tubulin (D-EXP018). This might shed some light on much cheaper alternatives to the extremely expensive paclitaxel and its derivatives. The drug binds through 3 hydrogen bonds, shown below, rendered in 2D and 3D representations (obtained from the PDB ligand viewer). The slides and recording of my talk are embedded below: