Thursday, July 14, 2011

Practical Tips on using Google Apps Scripts for Chemistry Applications

A few weeks ago I described our use of Google Apps Scripts, developed by Rich Apodaca and Andrew Lang, as an intuitive interface to information related to a chemistry laboratory notebook. Since then we have been using these tools to actively plan and record experiments (e.g. UC-EXP269) and we have learned their strengths and weaknesses.

The most problematic aspect of Google Apps Scripts running within Google Spreadsheets turns out to be the way caching and refreshing operate. There does not appear to be an obvious way to refresh a single cell. So if a script times out or fails, Google stores that failed output on its servers and will not run the script again until some time has elapsed (which seems to be on the order of an hour). Typing in a new input for that cell will cause the script to run again, but entering a previously entered input will only retrieve the cached output, even if it is a failed output. For example, if you have a cell calculating the MW from "benzene" entered in another cell and the script fails for any reason, typing in "ethanol" will get it to run again for the new input, but going back to "benzene" will just pull up the cached output of "Failed".
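To make the behaviour concrete, here is a toy model in Python of what seems to be happening. It is only a sketch of the observed behaviour, not Google's actual implementation: the `CachedCell` class and the flaky lookup are invented for illustration, and the roughly one hour expiry is not modelled.

```python
class CachedCell:
    """Toy model of a spreadsheet cell whose script output is cached per input."""

    def __init__(self, script):
        self.script = script   # function called with the cell input
        self.cache = {}        # input -> last output, including failures
        self.calls = 0         # how many times the script really ran

    def evaluate(self, value):
        # A previously seen input is answered from the cache, even if it failed.
        if value in self.cache:
            return self.cache[value]
        self.calls += 1
        try:
            result = self.script(value)
        except Exception:
            result = "Failed"
        self.cache[value] = result
        return result


def make_flaky_mw_lookup():
    """Return a lookup that fails on its first call and works afterwards."""
    weights = {"benzene": 78.11, "ethanol": 46.07}
    state = {"first_call": True}

    def lookup(name):
        if state["first_call"]:
            state["first_call"] = False
            raise TimeoutError("service timed out")
        return weights[name]

    return lookup


cell = CachedCell(make_flaky_mw_lookup())
first = cell.evaluate("benzene")    # script runs and fails
second = cell.evaluate("ethanol")   # new input, so the script runs again
third = cell.evaluate("benzene")    # old input, so the cached failure comes back
print(first, second, third)
```

Even though the service has recovered by the third call, "benzene" still returns the cached failure because the script is never re-run for that input.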

Nevertheless, I did come across some tricks to force a refresh indirectly. If you insert a row or column then re-enter the desired scripts in the new cells, they will run again. You simply need to then delete the old column with failed outputs. This is fine for simple sheets but it can be a headache for sheets that have several calculation dependencies between cells.

To avoid these complications, simply refresh the entire sheet by duplicating it, deleting the old sheet and then renaming the new one to the original name. The problem now is that it will refresh all the cells, not just those that had failed outputs. And if there are a large number of scripts on that sheet the odds are good that at least one will fail on that particular attempt, especially if several are hitting the same web server.

As a result of all these problems, I would not recommend using these services as I had initially hoped, where a researcher would enter data into a template sheet loaded with scripts to automatically generate a series of calculated outputs. There is a way to achieve this end but it requires thinking about the scripts in a slightly different way.

As I mentioned above, there are tricks for refreshing an entire sheet or a column or row. In order to avoid re-running the scripts that already returned desired outputs, we need to lock them in. This can be done by highlighting the completed cells, copying them (either control-c or Edit->Copy) then pasting them as values (from the Edit menu). Now refreshing will only be done on the cells with failed outputs and these can be locked in as well as soon as they complete.

The downside of this approach is that you lose the information about which script was run to generate the output values. And to change an input requires re-selecting the desired script. But in practice it is so convenient to hit a dropdown menu and hit getMW (for example) that this downside is quite minimal, especially when contrasted with the upside of knowing that others will see your information reliably, independent of how the services are running at a particular time.

Over the past few weeks we have found that some services fail more often than others and it would be advantageous to have some redundancies. This has been particularly problematic for the cactus services recently, which we often use for resolving common names. By using ChemSpiderIDs (CSIDs), the cactus services can be bypassed for several of the gONS services. So a good practice for any application is to generate and lock in SMILES and CSIDs right away from the common name. CAS numbers can be used too but the gChem service that Rich has created sometimes yields multiple CAS numbers and these will fail as input for a subsequent script.
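The redundancy idea can be sketched in a few lines of Python. This is an illustration only: the resolver functions below are local stand-ins that make no network calls, and none of the names correspond to the actual gChem, gCDK or gONS scripts.

```python
def resolve_with_fallback(name, resolvers):
    """Try each (label, function) pair in order and return the first success."""
    errors = []
    for label, resolver in resolvers:
        try:
            result = resolver(name)
        except Exception as exc:
            errors.append(f"{label}: {exc}")
            continue
        if result:
            return label, result
        errors.append(f"{label}: empty result")
    raise LookupError(f"no service resolved {name!r} ({'; '.join(errors)})")


# Hypothetical stand-ins for real services.
def unavailable_service(name):
    raise ConnectionError("service unavailable")


def local_table(name):
    return {"benzene": "c1ccccc1", "ethanol": "CCO"}.get(name)


source, smiles = resolve_with_fallback(
    "ethanol",
    [("primary", unavailable_service), ("backup", local_table)],
)

# "Locking in" the value: store the result so later steps never call the service again.
locked_in = {"ethanol": smiles}
print(source, smiles)
```

Once the SMILES (or CSID) is stored as a plain value, downstream calculations no longer depend on the name-resolution service being up.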

We now have a chemistry Google Apps Scripts spreadsheet to keep track of which inputs are allowed for all the available services, along with information about the output, creator and description. We also keep track of requests and plans for new scripts, marked as "pending" under the status field.


Surprisingly, pasting images "as values" within a Google Spreadsheet cell does not ensure that they will appear consistently - often the cells are just blank upon loading. This makes the idea of using an embedded sheet to display reaction schemes within a wiki lab notebook page not practical. However, using the scripts and a template to generate the scheme by just typing the name, SMILES or CSID for the reactants and product is a very efficient way to generate a consistent look for schemes within a notebook. It only requires a final step of taking the image of the screen and cropping using Paint. For example, here is a scheme thus generated for UC-EXP269.


Taking into account all of these factors, the reaction template sheet we provide does not have by default any scripts running within cells (except for the images). However, it is set up to quickly adapt to other reactions for planning amounts of reactants (by weight or volume), calculating concentrations, yields, melting points (experimental and predicted), solubilities, links to ChemSpider, 2D rendering of structures (including full schemes) and links to interactive NMR spectra using ChemDoodle. It simply requires users to hit one of the 3 drop-down menus (gChem, gCDK or gONS) and select the appropriate script for a particular cell.
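For readers who want to check the planning arithmetic outside the spreadsheet, here is a minimal Python sketch of the kind of calculations involved. The function names are my own and are not the names of the scripts in the template; the benzaldehyde values are approximate and only serve as an example.

```python
def mass_g(mmol, mw):
    """Mass in grams for a given amount (mmol) and molecular weight (g/mol)."""
    return mmol * mw / 1000.0


def volume_ml(mmol, mw, density_g_per_ml):
    """Volume in mL of a neat liquid reactant."""
    return mass_g(mmol, mw) / density_g_per_ml


def molarity(mmol, solvent_ml):
    """Concentration in mol/L (mmol per mL is numerically the same)."""
    return mmol / solvent_ml


def percent_yield(product_g, product_mw, limiting_mmol):
    """Isolated yield relative to the limiting reactant."""
    product_mmol = product_g / product_mw * 1000.0
    return 100.0 * product_mmol / limiting_mmol


# Example: 2.0 mmol of benzaldehyde (MW about 106.12 g/mol, density about 1.044 g/mL)
print(round(mass_g(2.0, 106.12), 4), "g")
print(round(volume_ml(2.0, 106.12, 1.044), 4), "mL")
print(molarity(2.0, 4.0), "M")
```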

Even if the user does not want to use this particular reaction template, it still makes sense to make a copy of the template sheet because it is an easy way to copy all of the necessary Google Apps Script code without opening the editor.


Friday, July 01, 2011

Open Notebook Science Talk at HUBbub 2011

On April 6, 2011 I presented at the HUBzero Conference in Indianapolis on "Open Notebook Science: Does Transparency Work?".
This presentation will first describe Open Notebook Science, the practice of making the laboratory notebook and all associated raw data available to the public in real time. Examples of current applications in organic chemistry - solubility and chemical reactions - will be detailed. Key details of the current technical implementation will be described and possible applicability to nanotechnology projects will be explored. Finally, the implications for Intellectual Property protection, claims of priority, subsequent publication in peer reviewed journals and the eventual automation of the scientific process will be explored.
The organizers did a great job in making the recording available as either a video or audio podcast.

I learned a great deal at the conference about how researchers from various fields use the HUBzero software to manage and share their data. As described on their website:
HUBzero® is a platform used to create dynamic web sites for scientific research and educational activities. With HUBzero, you can easily publish your research software and related educational materials on the web.
Although the system is not primarily designed for completely Open sharing, I did get the impression that for some applications there was significant interest in making data and processes more Open. There is certainly an enthusiastic user community around HUBzero - check out the recordings for some of the other talks here.


Tuesday, May 10, 2011

La Science par Cahier de Laboratoire Ouvert à l'Acfas

On May 9, 2011 I presented remotely for the French-Canadian Association for the Advancement of Science (ACFAS). This was the first time I gave a talk about Open Notebook Science in French. In fact the last time I gave a scientific talk in French was probably in 1995, when I was doing a postdoc at the Collège de France in Paris. I remember being teased for my French Canadian accent back then so happily that wasn't an issue this time. Even though I was a bit rusty I think I managed to communicate the key points well enough. (At least I hope I did.)

My presentation was a good fit for the theme of the conference: Une autre science est possible : science collaborative, science ouverte, science engagée, contre la marchandisation du savoir. (Another science is possible: collaborative science, open science, engaged science, against the commercialization of knowledge.) I would like to thank the organizers (Mélissa Lieutenant-Gosselin and Florence Piron) for inviting me to participate.

I was able to record most of the talk (see below) but very near the end Skype decided to install an update and shut down so the recording ends somewhat abruptly. Given what people use Skype for, that default setting for updates really doesn't make much sense.




Sunday, May 08, 2011

Breast Cancer Coalition talk on ONS and Taxol solubility

On May 1, 2011 I presented "Accelerating Discovery by Sharing: a case for Open Notebook Science" at the National Breast Cancer Coalition Annual Advocacy Conference in Arlington, VA. This was the first year where they had a session on an Open Science related theme and the organizers invited me to highlight some of the tools and practices in chemistry which might be applicable to cancer research.

I was really touched by the passion from those in the audience as well as the other speakers and conference participants I met afterward. For many, their deep connection with the cause was strongly rooted in a personal experience as breast cancer survivors themselves or their loved ones. Several expressed a frustration with the current system of sharing results from scientific studies. They felt that knowledge sharing is much slower than it needs to be and that potentially useful "negative" results are generally not disclosed at all.

The NBCC has ambitiously set 2020 as the deadline to end breast cancer (including a countdown clock). It seems reasonable to me that encouraging transparency in research is a good strategy to accelerate progress. Of course, great care must be exercised wherever patient confidentiality is a factor. But health care researchers are already experienced with following protocols to anonymize datasets for publication. Opting to work more openly would not change that but it might affect when and how results are shared. Also there is a great deal of science related to breast cancer that does not directly involve human subjects.

One initiative that particularly impressed me was The Susan G. Komen for the Cure Tissue Bank, presented by Susan Clare from Indiana University and moderated by Virginia Mason from the Inflammatory Breast Cancer Research Foundation. As a result of this effort, thousands of women have donated healthy breast tissue to create a comprehensive database richly annotated with donor genetics and medical history. The idea of trying to tackle a disease state by first understanding normal functioning in great detail was apparently somewhat of a paradigm shift for the cancer research community and it was challenging to implement. According to Dr. Clare, data from the Tissue Bank have shown that the common practice of using apparently unaffected tissue adjacent to a tumor as a control may not be valid.

This example highlights one of the key principles of Open Science: there is value in everyone knowing more - even if it isn't immediately clear how that knowledge will prove to be useful.

In my experience, this is a fundamental point that distinguishes those who are likely to favor Open Science from those who reject its value. If two researchers are discussing Open Science and only one of them views this philosophy as being self-evident the conversation will likely be about why someone would want (or not want) to share more and the focus will fall on extrinsic motivators such as academic credit, intellectual property, etc. If both researchers view this philosophy as self-evident the conversation will probably gravitate towards how and what to share.

I refer to this philosophy as being self-evident because I don't think people can become convinced through argumentation (I've never seen that happen). Within the realm of Open Notebook Science I have been involved in countless discussions about the value of sharing all experimental details - even when errors are discovered. I can think of a few ways in which this is useful - for example telegraphing a research direction to those in the field or providing data for researchers who study how science is actually done (such as Don Pellegrino). But even if I couldn't think of a single application I believe that there is value in sharing all available data.

A good example of this philosophy at work is the Spectral Game. Researchers who uploaded spectral data to ChemSpider as Open Data did not anticipate how their contribution would be used. They didn't do it for extrinsic motives such as traditional academic credit. Assuming that their motivation was similar to our group's, they did it because they believed it was an obviously useful thing to do. It is only much later - after a critical mass of open spectra were collected - that the idea arose to create a game from the dataset.

With this mindset, I explored what contribution we might make to breast cancer research by performing a phrase search strategy. Doing a simple Google search for "breast cancer" solubility generated mainly two types of results.

The first type of result involves the solubility behavior of biomolecules within the cellular environment. An example would be the observed increased solubility of gamma-tubulin in cancerous cells.
The second type of result addresses the difficulty in preparing formulations for cancer drugs due to solubility problems. A good example of this is Taxol (paclitaxel), where existing excipients are not completely satisfactory - in the case of Cremophor EL some patients experience a hypersensitivity reaction.
Since our modeling efforts thus far have focused on non-aqueous solubility, there is possibly an opportunity to contribute by exploring the solubility behavior of paclitaxel. By inputting solubility data from a paper by Singla 2002 into our solubility database, Abraham descriptors for paclitaxel are automatically calculated and the solubilities in over 70 solvents are predicted.
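For context, the general Abraham solvation equation is usually written as log10(S_solvent / S_water) = c + e*E + s*S + a*A + b*B + v*V, where E, S, A, B and V are the solute descriptors and c, e, s, a, b and v are coefficients specific to each solvent. The Python sketch below only shows the forward prediction step under that assumption. All numbers in it are placeholders chosen to make the arithmetic easy to follow; they are not the descriptors for paclitaxel or the coefficients of any real solvent, and this is not the code behind our web services.

```python
def abraham_log_ratio(coeffs, desc):
    """log10(S_solvent / S_water) from solvent coefficients and solute descriptors."""
    return (
        coeffs["c"]
        + coeffs["e"] * desc["E"]
        + coeffs["s"] * desc["S"]
        + coeffs["a"] * desc["A"]
        + coeffs["b"] * desc["B"]
        + coeffs["v"] * desc["V"]
    )


def predict_solubility(water_solubility_molar, coeffs, desc):
    """Predicted molar solubility in the solvent, given the aqueous solubility."""
    return water_solubility_molar * 10 ** abraham_log_ratio(coeffs, desc)


# Placeholder values only (NOT real data for any solvent or solute).
placeholder_coeffs = {"c": 0.2, "e": 0.5, "s": -1.0, "a": -2.0, "b": -4.0, "v": 4.0}
placeholder_desc = {"E": 1.0, "S": 1.0, "A": 0.5, "B": 1.0, "V": 1.5}

log_ratio = abraham_log_ratio(placeholder_coeffs, placeholder_desc)
predicted = predict_solubility(1.0e-3, placeholder_coeffs, placeholder_desc)
print(round(log_ratio, 3), predicted)
```

Going the other way, from a set of measured solubilities in solvents with known coefficients to the solute descriptors, is a linear least-squares fit over the same equation.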

In addition, by simply adding the melting point of paclitaxel, we automatically predict its solubility at any temperature where these solvents are liquids (see for example water).

Because of the way we expose our results to the web, a Google search for "paclitaxel solubility acetonitrile" now returns the actual value in the Google summary on the first page of results (currently 7th on the first page). The other hits have all 3 keywords somewhere in the document but one has to click on each link then perform a search within the document to find out if the acetonitrile solubility for paclitaxel is actually reported. (Note that clicking on our link ultimately takes you to the peer-reviewed paper with the original measurement.)

To be clear about what we are doing here - we are not claiming to be the first to predict the solubility of paclitaxel in these solvents using Abraham descriptors or any other method. Nor are we claiming that we have directly made a dent in the formulation problem of paclitaxel. We are not even indicating that we have done a thorough search of the literature - that would take a lot more time than we have had given the enormous amount of work on paclitaxel and its derivatives.

All we are doing is fleshing out the natural interface between the knowledge space of the UsefulChem/ONS Challenge projects and that of breast cancer research - AND - we are exposing the results of that intersection through easily discoverable channels. By design, these results are exposed as self-contained "smallest publishable units" and they are shared as quickly (and as automatically) as possible. The traditional publication system does not have a mechanism to disseminate this type of information. (Of course, when enough of these are collected and woven into a narrative that fits the criteria for a traditional paper, they can and should be submitted for peer-reviewed publication.)

Here is a scenario for how this could work in this specific instance. A graduate student (who has never heard of Open Science or UsefulChem, the ONS Challenge, etc.) is asked to look for new formulations for paclitaxel (or other difficult to solubilize anti-cancer agents). They do a search on commercial databases offered by their university for various solubilities of paclitaxel and cannot find a measurement for acetonitrile. They then do a search on Google and find a hit directly answering their query, as I detailed above. This leads them to our prediction services and they start using those numbers in their own models.

That is a good outcome - and that is exactly what has been happening (see the gold nanodot paper and the phenanthrene soil contamination study as examples). But the real paydirt would come from the graduate student recognizing that we've done a lot of work collecting measurements and building models for solubility and melting points, and contacting us about a collaboration. As long as they are comfortable with working openly we would be happy to actively work together.

I'm using the formulation of paclitaxel as an example but I'm sure that there are many more intersections between solubility and breast cancer research. With a bit of luck I hope we can find a few researchers who are open to this type of collaboration.

As another twist to this story, I will briefly mention here too that Andrew Lang has started to screen our Ugi product virtual library for docking with the site where paclitaxel binds to gamma-tubulin (D-EXP018). This might shed some light on some much cheaper alternatives to the extremely expensive paclitaxel and derivatives. The drug binds through 3 hydrogen bonds, shown below - rendered in 2D and 3D representations (obtained from the PDB ligand viewer).


The slides and recording of my talk are embedded below:



Collaboration using Open Notebook Science in Academia book chapter

I am very pleased to report that the book chapter that I co-wrote with Andrew Lang, Steve Koch and Cameron Neylon is now available online: Collaboration using Open Notebook Science in Academia. This is the 25th chapter of Collaborative Computational Technologies for Biomedical Research, edited by Sean Ekins, Maggie Hupcey, Antony Williams and Alpheus Bingham.

Our chapter provides some fairly detailed examples of how Open Notebook Science can be used to enhance collaboration between researchers from both similar and distant fields. It also suggests certain paths towards machine/human collaboration in science. Hopefully it will encourage researchers who have an interest in Open Science to experiment with some of the tools and strategies mentioned.

I am also grateful to Wiley for choosing our chapter as the free online sample for the book!
This book discusses the state-of-the-art collaborative and computing techniques for the pharmaceutical industry, the present and future implications and opportunities to advance healthcare research. The book tackles problems thoroughly, from both the human collaborative and the data and informatics side, and is very relevant to the day-to-day activities running a laboratory or a collaborative R&D project. It can be applied to help organizations make critical decisions about managing drug discovery and development partnership. The book follows a “man-methods-machine” format with sections on how to get people to collaborate, collaborative methods, and computational tools for collaboration. This book offers the reader a “getting started guide” or instruction on “how to collaborate” for new laboratories, new companies, and new partnerships, as well as a user manual for how to troubleshoot existing collaborations.



Monday, January 17, 2011

Science Online 2011 Thoughts

On January 15, 2011 I co-moderated a Science Online 2011 session on Open Notebook Science with Antony Williams and Carl Boettiger. The projector failed so we did our best to introduce the topic without relying on visual aids. My main objective was to demonstrate that it is not necessary for researchers (or their machines) to interface with the actual lab notebook to benefit from the information generated from the work. By introducing simple and rapid abstraction steps, both solubility and reaction information can be converted to web services for a variety of uses. As long as a link to the original lab notebook page (including the raw data) is attached, no information is lost and details can be investigated on demand.

One of the most powerful tools to use in this context is the tracking of chemical entities as ChemSpider IDs. This enables direct access to many other web services which Andrew Lang and I have leveraged to generate our own services. Tony spoke a bit more about this in his part and outlined some of the benefits and frustrations with crowdsourcing. Carl spoke eloquently about his experiences with Open Notebook Science as a graduate student for computational projects. The slides from all of us are provided below.

The overall tone of the discussion during our session was quite positive and productive. This was the case with all of the other sessions I attended, as it has been in prior years. The Science Online conference has evolved to attract a large proportion of people advocating Open Science. The presenters and the audience feel that they are among friends and the result is usually a free and easy exchange of ideas. Not all conferences and symposia relating to the online aspects of science share this. I have seen many examples where the "online science" theme is overrun by Closed Science proponents, for example commercial databases or Electronic Laboratory Notebook (ELN) vendors. Hopefully this conference will retain its Open Science focus in the future.

Kaitlin Thaney proved to be a very effective moderator during her session on "The Digital Toolbox: What's Needed?" and she stirred up some insightful discussion. I also enjoyed Steve Koch's session (co-moderated with Kiyomi Deards and Molly Keener) on "Data Discoverability: Institutional Support Strategies". Steve shared a particularly compelling example of the collaborative benefits of Open Notebook Science, where a computational research group came across images and videos from one of his group's notebooks and incorporated these in their paper - with all due credit acknowledged.

I very much appreciated the opportunity to catch up with old friends and meet some new ones. I had never met Carl Boettiger in person before and we had some very interesting discussions about Open Science and Open Education. It was good to meet Mark Hahnel from FigShare and explore possible paths for data sharing. I had some nice chats with Antony Williams, Steve Koch, Steven Bachrach, Heather Piwowar and Ana Nelson.

The Saturday evening banquet proved to be surprisingly entertaining. Despite the sedate title of her talk, "Out on a Limb: Challenges of Training Scientists to Communicate", Meg Lowman pounded the audience with a hilarious performance. Science comedian Brian Malow kicked this up a notch with some very clever material. Later on, using a brilliant comedic judo technique, he repeated some choice derisive comments he received from his performances on YouTube. I hope he comes back next year!


Monday, December 20, 2010

Visualizing Social Networks in Open Notebooks

Increasing the role of automation in the scientific process has long been a fundamental objective of Open Notebook Science. The automatic discovery of new connections in open scientific work is potentially a very important contribution to this end.

Visualizing social networks within and between Open Notebooks is certainly a good first step. Luckily, our Reaction Attempts project has already abstracted the key elements of organic chemical reactions within a collection of Open Notebooks. This means that creating connection maps between people and chemicals can be attempted with reliable and semantically unambiguous database sources.

The Reaction Attempts database records the identity of reactants and products as ChemSpiderIDs for each reaction within a collection of notebooks. Also the name of the researcher, the solvent, the yield (when available) and a few more key identifiers are recorded.

We are very fortunate that Don Pellegrino, an IST student at Drexel, has selected the analysis of networks within Open Notebooks as part of his Ph.D. work. He has started to report his progress on our wiki and is eager to receive any feedback as the work progresses (his FriendFeed account is donpellegrino).

Don's first report is available here. He is using the Open Source software Gephi for visualization and has provided all of the data and code on the associated wiki page. (Also see Tony Hirst's description of mapping ONS work, which provided some very useful insights.) Don has provided a detailed report of his findings but I think the most important one can be seen in the global plot below.
This represents a map connecting people through chemicals. The large top right structure represents the connections within the UsefulChem project and the main circle represents the activity of graduate student Khalid Mirza who was the most active on this project. The crescent structure to the right of the circle represents other students - mainly undergraduates - who worked with the same chemicals as Khalid.

At the top left there are 3 isolated small networks, representing completely separate projects: the sodium hydride (NaH) oxidation study, the work of Dustin Sprouse and the work of Sebastian Petrik. I'll be posting about Sebastian's work in a future post.

Near the bottom middle there is another small network connected to the main group by a single link mediated by 2,2-dimethoxyethylamine.
This represents the overlap between Open Notebooks (Wolfle from Todd group and Mirza from Bradley group) that I mentioned previously.

I think that automatically discovering such connections as they occur could be a really useful outcome of this network analysis work. For example, the researchers could be alerted by email that a new potentially interesting overlap between their projects now exists. This could accelerate new collaborations.
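As a rough illustration of how such overlaps could be detected automatically, here is a small Python sketch. It is not Don's code: the record layout is a simplified guess at what a Reaction Attempts entry contains, and the researcher names and ChemSpider IDs are made-up placeholders.

```python
from collections import defaultdict
from itertools import combinations


def shared_chemicals(reactions):
    """Map each pair of researchers to the set of ChemSpider IDs they both used."""
    by_chemical = defaultdict(set)
    for rxn in reactions:
        for csid in rxn["reactants"] + rxn["products"]:
            by_chemical[csid].add(rxn["researcher"])

    overlaps = defaultdict(set)
    for csid, people in by_chemical.items():
        # Every pair of people sharing this chemical gets an edge labelled with it.
        for pair in combinations(sorted(people), 2):
            overlaps[pair].add(csid)
    return dict(overlaps)


# Placeholder records (names and CSIDs are invented for illustration).
reactions = [
    {"researcher": "alice", "reactants": [101, 102], "products": [201]},
    {"researcher": "bob", "reactants": [102, 103], "products": [202]},
    {"researcher": "carol", "reactants": [104], "products": [203]},
]

overlaps = shared_chemicals(reactions)
for (first, second), csids in overlaps.items():
    print(f"{first} and {second} share CSIDs {sorted(csids)}")
```

Running this after each new notebook entry and comparing against the previous result would give the list of new overlaps that could trigger an email alert.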

A key challenge in Don's work is to figure out the right questions so that the results will be genuinely useful and novel to the researchers involved and the research community. I'm optimistic that he will succeed. As a separate outcome, just learning how researchers collaborate and record their work over time is bound to be interesting.

For a description of Don's planned work over the next several months take a look at his full Thesis Proposal: "Proposal of a System and Methods for Integrating Literature and Data".


Thursday, November 04, 2010

Nanoinformatics 2010 Conference Report

On November 3, 2010 I presented on "The implications of Open Notebook Science and other new forms of scientific communication for Nanoinformatics" at the Nanoinformatics 2010 conference.

The presentation first covers the use of the laboratory knowledge management system SMIRP for nanotechnology applications during the period of 1999-2001 at Drexel University. The exporting of single experiments from SMIRP and publication to the Chemistry Preprint Archive is then described followed by the evolution to Open Notebook Science in 2005. Abstraction of semantic structure from ONS projects in the areas of drug discovery and solubility is then detailed as an efficient mechanism to provide web services and machine readable data feeds.

This was a terrific opportunity to tie together my current ONS projects with my work in nanotechnology about 10 years ago, when the focus was to capture laboratory information in a structured format so that autonomous agents could begin to replace human workflows. I found it really interesting that the most active workflows back then were related to processing reference information. It took a team of students to find, photocopy and scan many of our key papers, with all the problems that come with training and managing new students. Today, obtaining relevant papers and extracting metadata is not so much of a challenge with tools like Mendeley. I ended the talk with a mention of our use of Mendeley tags to share dynamic links of article collections.

Another important development over the course of the past decade is the availability of free and hosted tools to easily communicate research. This includes wikis, blogs, Nature Precedings, institutional repositories, Google Spreadsheets and many others. It also includes some failed attempts like the Chemistry Preprint Archive.

I didn't anticipate in the late 90s just how crucial openness would prove to be for the evolution of the automation of the scientific process. It isn't my impression that there is currently a consensus on this point. Obviously it is possible to leverage automation in very clever ways for private use. But I think that exponential impact requires very low barriers to contribution (human or not) that can only be achieved with openness and transparency.

I have been very impressed with the ideas and projects discussed at this conference. Open sharing of nanotechnology data and integration of resources are clearly high priority items for many in this community.

As we have shown with our Reaction Attempts and ONS Solubility projects, abstracting meaningful semantic structure is necessarily field specific. One of the exciting opportunities to result from this meeting is finding ways to interconnect our solubility dataset with the nanotechnology community resources. I have met with a few people who would like to collaborate on this and I will be sure to report on our progress.



Friday, October 15, 2010

Dynamic links to private tagged Mendeley collections

My close collaborators and I have been using Mendeley as a convenient way to share PDFs of journal articles. Not all of us have access to the same libraries so links are not enough - we need the full documents. We also use Dropbox as a redundancy but Mendeley allows tagging and recording notes, which is very handy for everyone in the group.

Now that Mendeley is providing an API, Andrew Lang has written code that significantly leverages the information in our private ONS collection. We can now create public links that return the most updated results for specific tags, including multiple tags (which I don't think you can do on Mendeley). For example the following link returns all articles in the ONS collection tagged with "science2.0" and "chemistry":
The results include available information from Mendeley, including the title, authors, journal citation, doi, url, tags and the abstract. Because this information is public the PDFs can't be provided but the hyperlinks make it as convenient as possible.


At the end of the report the full list of all available tags for the ONS collection is provided. A more refined or different search can be done immediately simply by checking boxes and hitting the submit button.

Because the tags are controlled by the users of the private collection, these links can be useful when discussing an ongoing project and referring to a very specific topic. For example, we have been collecting examples of articles where a Ugi reaction is carried out and the product precipitates. This link provides an updated report on that very narrow topic:
http://showme.physics.drexel.edu/onsc/mendeley/?tags=Ugi+precipitate
There are still 2 major limitations to this service:

1) The search is very slow (can take a minute or two) because there is no way currently to use the Mendeley API to selectively return results based on tags. Every search requires initially returning all results for the collection (currently a few hundred).

2) Notes are currently not returned. If the API is updated to include these the usefulness would increase dramatically. For example in the results for the above query I took notes of the conditions involved in the Ugi precipitate for each paper. With the current format, one has to read each paper to find the relevant information.
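
The first limitation means that the tag matching has to happen after the whole collection has been downloaded. A rough Python sketch of that kind of client-side filter is shown below. The record layout (a "title" and a list of "tags") and the case-insensitive matching are assumptions made for illustration; this is not the actual code behind the service and it does not call the Mendeley API.

```python
def filter_by_tags(documents, wanted):
    """Keep only the documents that carry every tag in `wanted`.

    Matching is case-insensitive here, which is an assumption rather
    than a documented behavior of the service.
    """
    wanted = {t.lower() for t in wanted}
    return [d for d in documents
            if wanted <= {t.lower() for t in d["tags"]}]

# Hypothetical records standing in for the full collection download
collection = [
    {"title": "Paper A", "tags": ["Ugi", "precipitate"]},
    {"title": "Paper B", "tags": ["Ugi"]},
    {"title": "Paper C", "tags": ["science2.0", "chemistry"]},
]

hits = filter_by_tags(collection, ["ugi", "precipitate"])
print([d["title"] for d in hits])
```

Because every query walks the full list, the cost grows with the size of the collection rather than with the number of matches, which is consistent with the slow searches described above.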

Progress on our Mendeley related services will be posted on the ONSwebservices wiki.


Thursday, August 26, 2010

Open Notebook Science in Drug Discovery at Opal Event

I presented on "Open Notebook Science in Drug Discovery" on August 24, 2010 at a panel on Industry and Academia, part of the Opal Event "Drug Discovery: Easing the Bottleneck".
I only had about 15 minutes to present so I could not go into much detail but I did want to highlight the most recent work Andrew Lang and I (also with Peter Li from ChemTaverna) carried out involving solubility prediction and web services. Most of the attendees were from industry and I appropriately used the recent GSK malaria data sharing to introduce the talk. It is clear that there is a role for Open Science in drug discovery and I think that industry involvement will continue to increase in this area.

My co-panelist Rathindra Bose from Ohio University presented on his group's development of a novel cancer treatment compound based on platinum. He made the point that academic research complements that from industry by being able to explore more speculative hypotheses. The dominant hypothesis for the mechanism of action of platinum based drugs is binding with DNA. By exploring alternative scenarios, his group found an active platinum drug that does not bind with DNA.

During the preceding session on the Emergence of Biologics in Drug Discovery, Albert Giovanella from the University of Pennsylvania School of Medicine gave a particularly enlightening talk about comparing biologics with small molecule drugs. Although biological drugs tend to have less toxicity, the overall cost to bring them to market is still quite high and their cost to the consumer may be so high as to limit their impact. It looks like it will not be generally easy to translate new biomedical knowledge to a widespread impact on human health.


Sunday, July 11, 2010

Secrecy in Astronomy and the Open Science Ratchet

Probably because of the visibility of the GalaxyZoo project, I think several of my colleagues and I have been under the impression that astronomy is a somewhat more open field than chemistry or molecular biology. It was easy to rationalize such a position because patents are not an issue, as they clearly are in fields which rely more on invention than discovery. However, after reading "The Case for Pluto" by Alan Boyle, I am left with a much different impression.

This book does an excellent job of covering the recent debate over Pluto's designation as a true planet. A key trigger for this debate has been the discovery of dwarf planets with sizes very close to that of Pluto. However, these discoveries did not occur without controversy.

The story of the controversy regarding the discovery of Haumea is a particularly good example (starts on p. 108 of the book - a good summary also on Wikipedia). Starting in December 2004 Michael Brown at Caltech discovered a series of new dwarf planets. Instead of immediately reporting his team's discoveries, he worked in secrecy until July 20, 2005 when he posted an online abstract indicating the discoveries would be announced at a conference that September. However, on July 27, 2005 a Spanish team led by José Luis Ortiz Moreno filed a claim with the Minor Planet Center for priority in discovering one of these dwarf planets. This forced Brown's hand in disclosing his team's other discoveries within days - much sooner than he had anticipated.

Apparently this stirred up a great controversy in the community and officially no name was associated with the discovery, although the Spanish team's telescope at Sierra Nevada Observatory was recognized as the location of the discovery. However, Brown was allowed to select the name Haumea for the dwarf planet.

Even though the Minor Planet Center accepted Moreno's submission, most reports seem to side with Brown. The main argument is no less than academic fraud on Moreno's part because he accessed public telescope logs and found some of Brown's data. It was as simple as Googling the identifier that Brown inserted in his public abstract.

If Moreno had hacked into a private computer from Brown's team I can understand fraud. But is it fraud to access public databases? We chemists do that all the time - reading abstracts from upcoming conferences to try to glean what our competitors are up to. That hasn't stopped anyone from submitting a paper or patent.

Secrecy only works if everyone competing follows the same rules. If there is a rule that planet discoveries must be made at conferences or by formal publication then this could not have happened. Moreno's submission to the Minor Planet Center should have been rejected if such a rule existed. If there is a rule that telescope logs should not be accessed then why make them public and indexed on Google?

Now there may exist field specific conventions. I don't know what they are in the case of discoveries such as these but here is an interesting quote from Michael Brown's Wikipedia page:
When asked about this online activity, Ortiz responded with an email to Brown that suggested Brown was at fault for "hiding objects," and said that "the only reason why we are now exchanging e-mail is because you did not report your object."[3] Brown says that this statement by Ortiz contradicts the accepted scientific practice of analyzing one's research until one is satisfied that it is accurate, then submitting it to peer review prior to any public announcement. However, the MPC only needs precise enough orbit determination on the object in order to provide discovery credit, and Ortiz et al. not only provided the orbit, but "precovery" images of the body in 1957 plates.
It seems to me that there is a clash over what the conventions in the field actually are. Certainly the Minor Planet Center did not recognize the convention of peer review before public disclosure. They only required sufficient proof for the discovery.

One way to look at this story is that Moreno acted more openly than Brown by disclosing information before peer review. This action forced Brown to disclose scientific results much more quickly than he had anticipated.

In a sense this is a type of Open Science Ratchet. The actions of scientists that are most open set the pace for everyone else working on that particular project, regardless of their views on how secretive science should be.

Imagine how the scenario would have played out if one of the groups had used an Open Notebook. On December 28, 2004 everyone with a stake in the search for planets would have had the opportunity to know that a very significant find had been made. There were still details to work out - and the Brown group might not be the first to do all the calculations to completely characterize the discovery. Certainly it would affect what other researchers did - even if they were completely opposed to the concept of Open Science.

Essentially secrecy in this context is an all-or-nothing gamble. Everyone is free to not disclose their work until after peer reviewed publication. In some cases the discoverer will get full credit for the discovery and the complete analysis. But in other cases another group working in parallel will publish first and leave nothing to claim.

As scientists become more open, it is likely that their ability to claim sole priority for all aspects of a discovery will be reduced. However, they will retain priority for the observations and calculations that they made first.

The more open the science, the faster it happens. And because of the Open Science Ratchet, a few Open Scientists scattered across various fields could have a larger hand than expected in speeding up science.


Monday, June 07, 2010

IGERT NSF panel on Digital Science

On May 24, 2010 I was part of a panel in Washington for the NSF IGERT annual meeting. As I mentioned previously, it is encouraging to find that funding agencies are paying more attention to the role of new forms of scholarship and dissemination of scientific information.

My co-panelists included Janet Stemwedel, who talked about the role of blogging in an academic career, Moshe Pritzker, who made a case for using video to communicate protocols in life sciences and Chris Impey, who demonstrated applications of clickers and Second Life in the classroom.

We only had 10 minutes each to speak so the presentations were basically highlights of what is possible. Still, it was enough to stimulate a vigorous discussion with the audience. There was a bit of controversy about the examples I used to demonstrate the limitations of peer review in chemistry. People can misinterpret what we are trying to do with ONS - it certainly doesn't include bringing down the peer review system (not that we could anyway). But we have to face the situation that peer review does not validate all the data and statements in a paper. It operates at a much higher level of abstraction. Providing transparency to the raw data should work in a synergistic way with the existing system.

My favorite part of the conference was easily Seth Shulman's talk on the "Telephone Gambit". Ever since reading his book, I have been using the story of how carefully reading Bell's lab notebook has forced us to revise the generally accepted notion of how the telephone was invented. Seth's presentation was truly captivating because he explained not only what was done but also what motives were at work to deceive and obfuscate. This cautionary tale is still very much relevant to science and invention today - and highlights how transparency can guard against this type of outcome.


Tuesday, June 01, 2010

Use of ONS to protect Open Research: the case of the Ugi approach to Praziquantel

As we were collecting reactions from The Synaptic Leap for the Reaction Attempts project, Andrew Lang noticed that there might be a quick synthetic route to praziquantel via a Ugi reaction. I researched it further and found a paper (Kim et al 1998) where Ugi product 1 was indeed converted to racemic praziquantel via the Pictet-Spengler cyclization.


Using Beilstein Crossfire, the only synthesis of 1 I found involves a multi-step amidation strategy. But this compound should be accessible in one step from commercially available starting materials via a Ugi reaction (shown above). Since all the starting materials are liquids we have some flexibility with solvent choice. Khalid first tried it in methanol (EXP258) a few weeks ago but did not get a precipitate. He was going to monitor it by NMR next to see if the problem was high solubility of the Ugi product or with the reaction itself.

It was therefore with great interest that I read Mat Todd's report this morning on The Synaptic Leap that a German patent had been issued on this Ugi strategy to praziquantel. (TSL didn't provide a means of leaving a comment so I edited the page - which made me the author of that post but actually Mat wrote it)

I have often mentioned during my talks that Open Notebook Science could be used not only in a defensive manner to claim academic priority - but also as an offensive tactic to block patent applications. A company attempting to prevent the commercial exploitation of rival inventions has a few options. Where applicable, it can buy up an existing patent pool with the intention of sitting on it. For new inventions, it can do research and try to file patents before their competitors. But this is a costly process and it may make more sense to simply publish the inventions to create disclosed prior art, thereby blocking patent applications of their competitors.

But - as I and many others have discussed - the current publication system is not optimally suited for the purpose of simply disclosing and communicating science. Not only is it generally slow but the traditional article format requires a narrative of some sort - rarely can single experiments be published. This means that much (if not most) of research done by an individual or group will never be disclosed.

For these reasons I think that keeping an easily discoverable Open Notebook for projects designed to block patent submission by competitors makes a lot of sense - both economically and from a workflow perspective. Since researchers already have to keep a lab notebook, making it public doesn't impose the added time that writing an article or patent will require.

In this specific example of praziquantel we were too late. But if we had recorded this experiment a few years ago it might have worked to block Domling's patent. Now, it isn't clear to me that EXP258 would have been enough to do that. The strategy to make praziquantel via a Ugi reaction was clearly stated but the experiment was not conclusive. However, since Domling reported that methanol worked I am sure that we would have had the "reduced to practice" evidence in the notebook shortly.

Above I used a company as an example of a party motivated to disclose inventions to protect their interests. In our case it would not be a company but rather the entire Open Science community. It is in our best interest to keep our scientific territory as unencumbered by patents as possible. Keeping Open Notebooks might be one of the simplest means of ensuring that.

Consider a humanitarian organization that might want to manufacture praziquantel. I haven't researched it but presumably the Domling patent was filed in a number of countries besides Germany. In order to consider using the Ugi strategy, the organization would now have to deal with the patent holder. This might be the factor that makes this route untenable. Patents have proven to be problematic for humanitarian aid - even in the simple case of providing food.

But all is not lost. In addition to offering a simple 2-step synthesis of praziquantel, the Ugi route offers an easy way to make large libraries of analogs. Optimally we would like to work with someone who has experience with docking praziquantel. It might be interesting to screen not only the praziquantel analogs but also the uncyclized Ugi products themselves. When we did this for malarial enoyl reductase inhibitors (D-EXP005) we found that we did not need to cyclize to obtain compounds predicted to bind. This ultimately led to active compounds.


Friday, May 07, 2010

The Scientist Article on Electronic Lab Notebooks

Amber Dance has written an article in The Scientist (2010-05-01) Digital Upgrade: How to choose your lab’s next electronic lab notebook. This is basically a quick overview of different Electronic Lab Notebooks (ELNs) that should be helpful for people researching what is currently available in that space.

There was some coverage of Open Notebook Science and Steve Koch and I were quoted. Ironically my contribution appeared in the "Cons" section :)

Pros

  • The format is unconstrained—you can set up any categories, and as many users and pages, as you want—and fast to set up.
  • Open notebooking attracts collaborators. Koch counts three collaborations that wouldn’t have happened if he weren’t on OpenWetWare. And his students build professional networks well before they author a paper.

Cons

  • Wikis were not designed with scientific data in mind. For example, it’s hard to make a table, Koch says.
  • Open notebook science “does limit where you can send your work,” says Jean-Claude Bradley, a chemist at Drexel University in Philadelphia, who also uses an open wiki notebook. His lab sticks to journals that accept preprints.
  • Posting online voids international patent rights, although US patents are still possible.
In my opinion, one of the biggest "Pros" wasn't listed in that section: it is free (although that was mentioned elsewhere). When you see the costs of some of these other commercial systems, that has to be a factor for many people trying to make a decision.

If privacy is an issue wikis can certainly be made private, although I'm not sure if that is possible on OpenWetWare. It can be done for $5/month on Wikispaces, the wiki we use for lab notebooks - although then it wouldn't be Open Notebook Science.

Concerning Steve's Con of wikis being difficult to use to store data, that is true. However combining the use of a wiki with Google Spreadsheets has completely resolved that issue for us. With our ability to automatically export an archive of the notebook (as HTML) and spreadsheets (as XLS) into an integrated archive, the two platforms operate essentially as if they were a single system.


Sunday, May 02, 2010

The Synaptic Leap Experiments on Reaction Attempts

Andrew Lang and I recently reported on the first edition of the Reaction Attempts book and database. Part of the motivation for this was to structure the experiments from the UsefulChem project in both a machine readable format and one that could be browsed as a physical copy. However, we also had in mind the easy integration of other open experiments, especially those labeled as "failed", since these are unlikely to be found by searching conventional reaction archives.

As a demonstration, we have added a series of experiments from The Synaptic Leap, which Michael Wolfle (working as a post-doc with Mat Todd) has posted. All of these reactions involve intermediates in the synthesis of praziquantel, which is a major focus of the Todd group. One group of these reactions involved the attempted synthesis of praziquanamine via a Pictet-Spengler cyclization. Most of these are failed attempts, with one success.

Adding these experiments to Reaction Attempts was very simple - since the minimum information required is the ChemSpiderIDs (CSIDs) of all the reactants and the product, with a hyperlink to more details. We also added a few more details provided by Michael - such as the solvent, reaction conditions and outcome.

Andy has provided a simple mechanism to pull up all Reaction Attempts for a given reactant with the following url structure:
http://showme.physics.drexel.edu/onsc/databook/ucdatabook.php?reactants=9099925
The number at the end is the CSID for the reactant. Multiple reactants can be pulled from the database by adding more CSIDs separated by commas.
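
As a small illustration of that url structure, the following Python sketch assembles a query link from a list of CSIDs. The function name is hypothetical, and the second CSID in the usage line is just an arbitrary placeholder to show the comma separation.

```python
# Base address taken from the example url above
BASE = "http://showme.physics.drexel.edu/onsc/databook/ucdatabook.php"

def reaction_attempts_url(csids):
    """Build a Reaction Attempts query for one or more reactant CSIDs.

    Hypothetical helper: it only follows the pattern described above,
    with multiple CSIDs separated by commas.
    """
    return BASE + "?reactants=" + ",".join(str(c) for c in csids)

# Single reactant, as in the example url
print(reaction_attempts_url([9099925]))
# Two reactants; 682 is a placeholder CSID used only for illustration
print(reaction_attempts_url([9099925, 682]))
```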

Successful runs in Reaction Attempts are identified with a green check mark:


Again the main idea here is not to exhaustively abstract all pertinent information for an experiment. Rather it is to connect up researchers who are working on similar reactions. Since it requires so little effort to come up with the minimum required information we are hoping to get contributions from other sources.

We will focus next on coming up with more sophisticated ways to retrieve information - such as substructure searching or by reaction type, solvent, etc. We will also periodically publish hard copies of future Reaction Attempts editions.


Wednesday, April 28, 2010

Reaction Attempts Book Edition 1 and UsefulChem Archive

I am pleased to report that Andrew Lang and I have published the first edition of the Reaction Attempts book. It currently contains most of the Ugi reactions from the UsefulChem project and is associated with an April 27, 2010 snapshot archive of the entire UsefulChem project, including NMR spectra, spreadsheets, images and the entire lab notebook from Wikispaces.


At 582 pages the printing cost from LuLu amounts to $26.28. Not meant to replace electronic searches, it should prove to be a handy reference book for the lab to quickly browse through what was attempted for a given reactant, what the outcome was and the researcher involved.

We are hoping to include reaction attempts from other groups in future editions. More details can be found in the preface, reproduced below:

Reaction Attempts First Edition

Data Source: the UsefulChem project

Introduction

Open Notebook Science (ONS) refers to the practice of making the full contents of a laboratory notebook and all associated raw data files available in near real time.[1] This represents an opportunity for everyone to benefit from work in progress in an open research group. However, in order to make use of the information, it must be easily discoverable. A simple strategy to increase discoverability is redundancy over multiple communication platforms.

In another project - the Open Notebook Science Solubility Challenge[2] - we published non-aqueous solubility data in the form of physical and downloadable (PDF) books.[3] Although it is possible to search the solubility database using web query interfaces, exploration of a Google Spreadsheet, an XML feed, etc.[4], having a physical copy in the laboratory has proved to be very convenient in several instances. A similar format for reactions will also be useful.

The UsefulChem Project

UsefulChem started in 2005 as an organic chemistry Open Notebook Science project with a main goal of discovering new anti-malarial agents that can be prepared by simple and cheap syntheses.[5] Most of the reactions on UsefulChem are Ugi reactions, which involve the mixing of an amine, aldehyde, carboxylic acid and isonitrile in a solvent at room temperature generally for a few hours to days.[6] The multicomponent design of the Ugi reaction and the simple reaction conditions make it ideal for exploring large virtual libraries and selecting compounds of interest to make.[7]

Isolation of the Ugi products can be immensely simpler, cheaper and readily scalable if they precipitate in pure form from the reaction mixture. To this end, much of the research in the UsefulChem project focuses on reaction conditions that lead to this outcome.[8] This is in fact the origin of the ONS Solubility Challenge discussed above.[9]

The Reaction Attempts Database

In order to look for patterns in the reaction conditions which led to Ugi product precipitation, the CombiUgiResults Google Spreadsheet was set up.[10] Reactions indexed there can be sorted by precipitation outcome, solvent, reactant, concentration, etc. and links to the laboratory notebook pages can be followed for full details. However, this sheet is designed specifically for Ugi reactions and contains columns specifically for the aldehyde, amine, carboxylic acid and isonitrile.

In order to enable the tracking of other types of reactions, the information in the CombiUgiResults sheet was reformatted into two other sheets: ReactionAttempts[11] (containing reagents and reactants) and RXIDsReactionAttempts[12] (containing reaction conditions and results, such as solvent, concentration of limiting reactant, appearance of a precipitate, yield, etc.). The two sheets are connected via the use of a common ReactionID. This format permits the representation of any type of reaction, with an unlimited number of reactants and products.[13]
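
To make the two-sheet layout concrete, here is a minimal Python sketch of how rows from the two sheets could be recombined through the shared ReactionID. All rows, column names other than ReactionID, and CSID values are invented for illustration and do not come from the actual spreadsheets.

```python
from collections import defaultdict

# Hypothetical rows mimicking the ReactionAttempts sheet (species per reaction)
reaction_attempts = [
    {"ReactionID": 1, "role": "reactant", "CSID": 1001},
    {"ReactionID": 1, "role": "reactant", "CSID": 1002},
    {"ReactionID": 1, "role": "product", "CSID": 2001},
    {"ReactionID": 2, "role": "reactant", "CSID": 1001},
    {"ReactionID": 2, "role": "product", "CSID": 2002},
]

# Hypothetical rows mimicking the RXIDsReactionAttempts sheet (conditions)
rxid_reaction_attempts = [
    {"ReactionID": 1, "solvent": "methanol", "precipitate": True},
    {"ReactionID": 2, "solvent": "ethanol", "precipitate": False},
]

def join_on_reaction_id(species_rows, condition_rows):
    """Group species by ReactionID and attach the matching conditions."""
    species = defaultdict(lambda: {"reactant": [], "product": []})
    for row in species_rows:
        species[row["ReactionID"]][row["role"]].append(row["CSID"])
    joined = {}
    for cond in condition_rows:
        rxid = cond["ReactionID"]
        # Merge the condition columns with the reactant and product lists
        joined[rxid] = {**cond, **species[rxid]}
    return joined

reactions = join_on_reaction_id(reaction_attempts, rxid_reaction_attempts)
print(reactions[1]["solvent"], reactions[1]["reactant"])
```

Because the species sheet has one row per compound, a reaction can carry any number of reactants and products without adding columns.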

By definition, any Open Notebook Science project is a work in progress. The listing of a reaction in this database only means that the researcher attempted or is in the process of attempting it. Whatever the situation, a link to the laboratory notebook page is provided, where the most recent information is available. The philosophy used here is that partial information is always better than no information at all. Thus a researcher investigating the prior use of a particular reactant in a Ugi reaction might find the report that a precipitate was obtained in methanol helpful for designing their own reactions, even if the characterization of the precipitate is still pending. At the very least, knowing that a certain researcher has at least attempted a similar reaction is enough information for initiating a discussion, which may lead to valuable insights.

Reaction Attempts on Chemspider

Although SMILES[14] are provided in the spreadsheets, the primary key to identify compounds is the ChemSpider ID (CSID)[15]. This allows us to render molecule images in the book automatically. In the case of the ONS Solubility Challenge book[3], use of the CSID enables a convenient way to calculate various descriptors for displaying values in the book.

In addition, the compounds in the Reaction Attempts database are indexed on ChemSpider as two Data Sources: ReactantsAttemptedReactions and ProductsAttemptedReactions[13]. In this way a substructure search for either reactants or products will identify indexed molecules. Clicking on the Syntheses tab in the ChemSpider record for a selected molecule will then reveal a list of hyperlinks to the relevant laboratory notebook pages.

Organization of the Book

In keeping with the layout of the ONS Solubility Challenge Book, the reactants are listed in alphabetical order. Each entry displays the list of reactions where the reactant was used. This includes a scheme with all reactants and product as well as key metadata: the researcher, reaction type, solvent, limiting reactant concentration, observation of a precipitate, comments and a reference (links to the laboratory notebook page).

In this edition, only Ugi reactions are included. The reaction schemes are laid out in the following order: carboxylic acid, amine, aldehyde and isonitrile. This should allow for easy comparison between schemes within a given record. Reactions where the Ugi product was isolated and characterized are marked with a green check and the percent yield is noted. Since the Ugi products do not have simple common names, they are not included as separate entries. However, all reactions where the synthesis of a specific Ugi product was attempted can be found by looking up the entries for any of the four reactants.

Although this compilation is not exhaustive, it does cover the vast majority of reactions in the UsefulChem project to date. Future editions will include other reactions from UsefulChem and other sources.

Archive

This edition is linked to the UsefulChem data archive (ZIP)[16], (DVD)[17] and interactive hosted archive format[18], ReactionAttempts (XLS)[19] and RXIDsReactionAttempts(XLS)[20] taken on 2010-04-27.

References

1. Open Notebook Science Wikipedia Entry http://en.wikipedia.org/wiki/Open_Notebook_Science
2. Open Notebook Science Solubility Challenge Wiki http://onschallenge.wikispaces.com
3. Bradley, J.-C. First Edition of ONS Solubility Challenge Book UsefulChem Blog (2009)
http://usefulchem.blogspot.com/2009/12/first-edition-of-ons-solubility.html
4. Open Notebook Science Solubility Challenge List of Experiments page http://onschallenge.wikispaces.com/list+of+experiments
5. UsefulChem Wiki http://usefulchem.wikispaces.com
6. Ugi Reaction Wikipedia Entry http://en.wikipedia.org/wiki/Ugi_reaction
7. Dömling, A., & Ugi, I. (2000). Multicomponent Reactions with Isocyanides. Angewandte Chemie International Edition, 39(18), 3168-3210. http://www3.interscience.wiley.com/journal/73500473/abstract
8. UsefulChem List of Experiments http://usefulchem.wikispaces.com/All+Reactions
9. Bradley, J.-C. Open Notebook Science Challenge UsefulChem Blog (2008)
http://usefulchem.blogspot.com/2008/09/open-notebook-science-challenge.html
10. CombiUgiResults Google Spreadsheet http://spreadsheets.google.com/ccc?key=plwwufp30hfpUERhse9y5Kw
11. ReactionAttempts Google Spreadsheet
http://spreadsheets.google.com/ccc?key=0Ak1R8T6wt4YQdG9NejNLcDNUMkVBVURGM01TR0NxdXc
12. RXIDsReactionAttempts Google Spreadsheet
http://spreadsheets.google.com/ccc?key=0Ak1R8T6wt4YQdGVENVFMWjdzaGd2REJTTnA4RG5vblE
13. Bradley, J.-C. Reaction Attempts on ChemSpider UsefulChem Blog (2010)
http://usefulchem.blogspot.com/2010/03/reaction-attempts-on-chemspider.html
14. SMILES Wikipedia Entry http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification
15. ChemSpider Web Site http://www.chemspider.com/
16. UC archive Drexel server (ZIP) http://showme.physics.drexel.edu/usefulchem/archives/usefulchem2010-04-27.zip
17. UC archive on lulu.com (DVD) http://www.lulu.com/product/dvd/usefulchem-archive/10791847
18. UC interactive hosted format http://showme.physics.drexel.edu/usefulchem/archives/usefulchem2010-04-27/All%20Reactions.html
19. Bradley, J.-C.; Lang, A. Reaction Attempts Reactants and Products. UsefulChem. 2010-04-27.
(Archived by WebCite® at http://www.webcitation.org/5pIsFEbT9)
20. Bradley, J.-C.; Lang, A. Reaction Attempts RXIDs. UsefulChem. 2010-04-27.
(Archived by WebCite® at http://www.webcitation.org/5pIs2eh62)


Thursday, April 08, 2010

Scientists Embrace Openness Article in Science Careers

Chelsea Wald just published an article in Science Careers: Scientists Embrace Openness (April 9, 2010). She interviewed several people in the Open Science movement including Jonathan Eisen, Steve Koch, Anthony Salvagno, Carl Boettiger and myself.

The article covers Open Notebook Science, Open Data and associated themes. I think it presents a view of the most commonly discussed advantages and disadvantages very well.

One section was particularly relevant to an issue I recently posted about - (and discussed on FriendFeed):
Open Notebook Science advocates claim that being open may protect a scientist's ideas rather than exposing them to theft. Newton's decision to conceal his findings within an anagram made it harder for him to prove priority over rival Gottfried Leibniz. Open Notebook scientists say all they need to do is point to their open notebooks to show that they had an idea or found a result first.


Tuesday, April 06, 2010

ONS t-shirts from Zazzle

Inspired by Graham Steel, I just received my t-shirt with an Open Notebook Science Logo and a picture of our crystal on the cover of our ONS Solubility Challenge book.

I was going to set up an ONS store but Zazzle does not permit zero royalties (don't see the logic there). But making up t-shirts on Zazzle is super simple - just grab a logo of your choice from the ONSclaims wiki.

Any other pic is your choice - this is the crystal from UCEXP150C


You can also order all kinds of other personalized things, including coffee cups.


Creative Commons Attribution Share-Alike 2.5 License