Sunday, April 12, 2009

Automatic Back-Up of ONS files: Google Spreadsheets, JCAMP-DX, Flickr

As many of you know, we have been heavily dependent upon publicly editable Google Spreadsheets for storing results and calculations relating to our Open Notebook Science projects. We have recently integrated automated processing of NMR files in JCAMP-DX format to calculate solubility data by using web services called directly from within the Spreadsheets.

That represents a lot of distributed technology that is susceptible to network or server problems. Andy Lang, who wrote the web services that currently calculate the solubility, has enabled the recall of previously calculated values via a quick database look-up. While this substantially reduces server load by avoiding lengthy calculations, it does mean that the final numbers do not exist in the Spreadsheets themselves.

In addition to these concerns, every time I give a talk to a group of librarians the issues of archiving and curation of new forms of scholarship are raised. These are valid concerns and I've been trying to work with several groups to deal with the problem in as automatic a way as possible.

We had initially considered a spidering service that would automatically follow every file linked to the ONS wikis and download the documents on a daily basis. This has turned out to be problematic because many of the links don't terminate directly on files, but rather user interfaces. For example, a typical link to a Google Spreadsheet does not lead to a simple HTML page that can be copied but rather to an interface to add data and set up calculations.

It turns out we can take a semi-automated solution that gets us to where we want to be but requires a bit more manual work. Google Spreadsheets can be exported as Excel spreadsheets, which store the results of web service calculations as simple values and include the link to the web service as a cell comment. All calculations within the Spreadsheet are also retained in this way. The trick is to "publish" the spreadsheet using the advanced option of exporting as an Excel file. This then becomes a simple URL.

Now, the only manual step left in the process is to copy these URLs to another BackUp Google Spreadsheet. Andy has created a little executable that steps through a list of these URLs and creates a backup on any Windows computer under a C:\ONS directory. It is simply then a question of setting up a Windows Scheduler service to run once a day and call the executable. All the files are named with the date as a first part of the name for easy sorting.

Besides Google Spreadsheets backed-up as Excel files, spectral JCAMP-DX files and Flickr images can be processed in the same way. In both these cases the user must specify the JDX or DX or JPG file directly. In Flickr you have to go through a few clicks to the download page for a given image but once you have that it works fine.

Andy has versioned this as V0.1 for good reason. It does do exactly what we want but there are a few caveats:

1) Any errors in specifying a file will abort the rest of the back-up. In future versions there would be tolerance for errors, with appropriate reporting of problems, perhaps by email.

2) Files don't necessarily have the correct extensions. For example, backed up Wikispaces pages have to be renamed with an HTML extension to be viewed in a browser. Note that Wikispaces has its own sophisticated back-up system that will put the entire wiki with all files directly uploaded onto the wiki into a single ZIP file - in either HTML or WIKITEXT format. Of course this will not include files residing outside - like Google Spreadsheets. Still I think there is no harm in including the wiki pages in the the list of files to be backed-up by Andy's system.

Going forward there are two types of collaborations that could help a lot:

1) Librarians who would be willing to archive UsefulChem and ONSChallenge files. Right now these are just a few Megs a day but this will increase as we continue to add to the list. To be reasonable about space I could see a protocol of keeping only one back-up per week or month for dates more than 30 days in the past. This is about what the Internet Archive does I think. It would certainly be unambigous to know for certain what was known at what time with multiple libraries maintaining archives.

2) Someone who knows how Google creates URLs for downloadable XLS exports would be mightly helpful. Similar for Flickr and JPG exports. Even just writing a script to spider all HTML pages linked to the wikis and blogs would save a lot of manual labor. The nice thing is that the results of the spidering code would just have to be dumped into the Back-Up Google Spreadsheet - which already backs itself up conveniently.

Labels: , , , ,


At 8:27 AM, Anonymous will said...

Assuming i am understanding your spidering problem correctly, can you not just append "&output=xls" to the end of every spreadsheet url?

Then your spider would download the XLS format directly.

At 8:32 AM, Blogger Jean-Claude Bradley said...

Bill - the issue is that we normally link to the "edit" mode of the spreadsheets from the lab notebooks, not the published modes, which have different URLs. And then we still have to list them. It would be useful to have a bot extract web links from each wiki page and dump them into our backup spreadsheet - we have not written that code yet. This is what I mean by spidering, if only at the first level of connections initially.

At 12:11 PM, Anonymous Anonymous said...

i think i will have a go at this as im generally interested in this area.

do you have list of blogs/domains you want to include?

At 2:21 PM, Blogger Jean-Claude Bradley said...

That sounds great! I would start with the page listing the experiments. I just noticed that the URL does not change so if you automatically figured out which links are Google Spreadsheets you could just append &output=xls. The only caveat is somebody would have to publish the worksheets, which is not a problem.

At 9:17 AM, Anonymous will said...

ok i have written a program that uses the start page you gave and goes down into the links one level.

It produces an xls output you can see the link at the bottom of this post:

It contains the spreadsheet XLS url in the first column and the wiki page it was found on in the second.


Post a Comment

<< Home

Creative Commons Attribution Share-Alike 2.5 License