Bioinformatics Workflows


Bioinformatics workflows can be approached in several ways. Unix hackers often resort to GNU make, while computer scientists dream of more elegant approaches. Meanwhile, projects like Taverna offer the beginnings of GUI-designed grid workflows.


Tom Oinn leads development on Taverna (recently covered by Hublog, Propeller Twist and Flags and Lollipops). There's also a Taverna page on the Mygrid TWiki.

When Tom told me about Taverna at ISMB2003, I admit I was initially skeptical. My reaction was much like Stew's at F & L:

...while 50% of the components I use during bioinformatics work might be stable objects that I need regularly - to fetch sequences, convert from GFF to FASTA, get some GO terms, etc. - the remaining 50% change frequently, and there's often some non-pluggable piece of software involved. To be able to add it to my workflow I need to wrap it somehow (so it can be used as a component) and have some sort of code glue to convert inputs and outputs into recognizable formats. I've not delved into the Taverna docs deeply enough to know for sure that there's not an easy way to do this, but I suspect Beanshell has to be involved as glue. That's a lot of coding in different languages when I can make both SOAP and system calls in a three line Perl script...

I'm a great believer in three-line Perl scripts. Still, I was young once, and I did believe that workflows would be the way forward. The idea of click'n'draggable Unix pipes is just so cute. (Go Yahoo Pipes!) Interactive fiction has whiled away many a pleasant hour; and those octopus chips were tasty. So I downloaded Taverna onto my OS X box.

Unfortunately, I have to admit I still just don't get it. I love web apps. I love the client-server model. But Stew's rant on distributed computing basically sums up my attitude. Why not just hack something inelegant but funky in Perl instead? Or Python, I don't much care.

OK, more seriously, I like Sun Grid Engine and GNU Make. What's wrong with that? (OK, there is some stuff wrong with that. See Andrew's Makefile Manifesto for some of what's wrong with that... Point is, make gets some stuff right: the declarative structure, the idea of dependencies, the convenient hooks into Unix. See the Bio Make page for a deeper discussion of what's essentially right with make and what's missing.)
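To make that concrete, here is the sort of thing I mean: a minimal, purely hypothetical Makefile (the file names, and the choice of BLAST as the tool, are illustrative only) showing the declarative structure and dependency tracking that make gets right. Note that make requires recipe lines to be indented with a tab character.

```make
# Hypothetical two-step pipeline: search queries against a database,
# then summarize the hits.  All names here are made up for illustration.

all: hits.summary

# Declarative rule: hits.tab depends on the queries and the database.
# make re-runs the search only if either prerequisite is newer than the output.
hits.tab: queries.fa db.fa
	blastall -p blastn -i queries.fa -d db.fa -m 8 > $@

# The summary, in turn, depends only on the tabular hits.
hits.summary: hits.tab
	cut -f1,2 hits.tab | sort -u > $@
```

Type `make` and only the stale steps re-run; that is the whole trick, and it composes with any Unix tool that reads and writes files.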

My aversion to GUIfied, GRIDdled workflows still lives, I'm afraid. It didn't help that when I opened the first example in the Taverna tutorial ("ShowGeneOntologyContext.xml"), a 24-node graph popped onto the screen. Now, a Perl script to scuttle around the Gene Ontology would probably weigh in under 24 lines. I teach basic Unix and Perl as part of my introductory bioinformatics classes, and right up-front I should admit that the Perl definitely feels like dinosaur-speak. I can't honestly pretend it's a language with longevity and broad application; it ain't. But the Unix skills do have those qualities. I find myself wondering, is this kind of workflow really a viable replacement for all that incredibly powerful Unix command-line arcana? It's not as if my students seem to have that much of a problem with passing files around between scripts.
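For comparison, the command-line style I teach is just this kind of thing: a few standard tools glued together over files and pipes. The data below is a toy stand-in, purely for illustration:

```shell
# A toy FASTA file, standing in for real sequence data
cat > toy.fa <<'EOF'
>seq1 example gene
ACGTACGT
>seq2 another gene
GGGTTTAA
EOF

# Classic arcana: count the records...
grep -c '^>' toy.fa

# ...and pull out just the sequence IDs, one per line
grep '^>' toy.fa | cut -d' ' -f1 | tr -d '>'
```

Each stage reads a file or a pipe and writes a file or a pipe; chaining in another tool is a matter of adding one more line.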


Tom is a great software engineer and has the right motives. For what it's worth, Taverna seems to be very well executed, and the general idea of a GUI for Unix pipes and sockets will never stop being cute.

Personally, I think the way forward for clickable bioinformatics is to build good web apps from existing pragmatic frameworks, like GBrowse, together with robust Unix tools that use well-documented file formats like Stockholm Format, Fasta Format or Gff Format. (Stew's Golden rules for bioinformatics web applications is worth a read.) It also seems to me that a workflow that relies on connections to a bunch of other sites on the internet is pretty fragile, compared to the tedious and expensive, but (importantly) stable, solution of setting up your own cluster.

Chris Mungall pointed out that the "I like three-line Perl scripts" argument is not a robust defense against web-enabled workflows. After all, any three-line Perl script should in principle be easy to offer as a Grid-enabled service. However, the point is that I don't think the technology is quite there yet. It took a decade for Yahoo Pipes to offer this kind of thing as a robust, stable, user-friendly platform, and their user interface is still a bit clunky; plus, their site was instantly swamped by demand.

In some ways, Taverna-like systems address a fundamentally different question from the pipeline engineering issue of what server-side infrastructure to use (see Bio Make for a discussion of this). Taverna is all about making the "workflow" transparent so that a Unix-naive user can build their own pipeline. However, I don't think you can design a genome-scale pipeline unless you have either (i) a good practical working knowledge of the server-side resource usage of the components or (ii) an infinitely scalable grid. I suppose that workflow fans assume that (ii) will happen any day now, and maybe this is true; however, user interface design still seems to me like an essentially different problem to pipeline design.

Bio Make

For the "right" way to do pipelines (as opposed to the quick-and-dirty way), I lean towards functional languages, as discussed on the Bio Make page.

In fact, I tend to favor the quick-and-dirty way, i.e. GNU make. It doesn't interface all that well with SGE, which (in my view) is its biggest drawback in practice. However, this is a drawback that may be fixed soon (given the existence of multiple competing efforts in this direction).

Andrew Uzilov has listed some problems with GNU make in his makefile manifesto. I agree with most of these, and would really like to see something more functionally elegant, e.g. using the Erlang language or Termite Scheme.

A disclaimer for all this: I was once a low-level coder who used to write sprite routines in 8-bit assembler for games. I am highly resistant to all forms of technology; e.g. until last year I refused even to touch SQL databases, claiming that my own R-trees would always be faster. So you should take my opinion on fancy new technologies with a grain of salt; unless, of course, you're a Luddite like me, in which case you've already made up your mind, and good luck to you.

-- Ian Holmes - 08 Jan 2006


PS: as I said, my own tendencies would lead me towards the functional solutions described on the Bio Make page, though I actually just tend to use regular old GNU make for command-line analysis. (Addendum on 3/3/2007: But see Andrew's Makefile Manifesto for some drawbacks of the makefile approach...)

PPS -- note on the octopus chips and the interactive fiction: I met Tom Oinn around 1996 when he was a Cambridge undergrad and I was a Sanger grad student. At the time he (re-)introduced me to interactive fiction and some interesting snacks made of pressed octopus.

PPPS: check it out, I think I found the chips.