Bioinformatics workflows can be approached in several ways.
Unix hackers often resort to GNU make,
while computer scientists dream of more elegant approaches.
Meanwhile, projects like Taverna offer the beginnings of GUI-designed grid workflows.
Tom Oinn leads development on Taverna
(recently covered by Hublog and by
Flags and Lollipops).
There's also a Taverna page on the Mygrid TWiki.
When Tom told me about Taverna at ISMB2003, I admit I was initially skeptical.
My reaction was much like Stew's at F & L:
...while 50% of the components I use during bioinformatics work might be stable objects that I need regularly - to fetch sequences, convert from GFF to FASTA, get some GO terms, etc. - the remaining 50% change frequently, and there's often some non-pluggable piece of software involved. To be able to add it to my workflow I need to wrap it somehow (so it can be used as a component) and have some sort of code glue to convert inputs and outputs into recognizable formats. I've not delved into the Taverna docs deeply enough to know for sure that there's not an easy way to do this, but I suspect Beanshell has to be involved as glue. That's a lot of coding in different languages when I can make both SOAP and system calls in a three line Perl script...
I'm a great believer in three-line Perl scripts.
Still, once I was young, and I did believe that workflows would be the way forward.
The idea of click'n'draggable Unix pipes is just so cute.
Interactive fiction has wasted
and those octopus chips
So I downloaded Taverna onto my OS X box.
Unfortunately, I have to admit I still just don't get it.
I love web apps. I love the client-server model.
But Stew's rant on distributed computing
basically sums up my attitude. Why not just hack something inelegant but funky in Perl instead?
Or Python, I don't much care.
OK, more seriously, I like Sun Grid Engine and GNU Make.
What's wrong with that? (OK, there is some stuff wrong with that. See Andrew's Makefile Manifesto
for some of what's wrong with that...
Point is, make gets some stuff right: the declarative structure, the idea of dependencies, the convenient hooks into Unix. See the BioMake
page for a deeper discussion of what's essentially right with make and what's missing.)
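To make "declarative structure" and "dependencies" concrete, here is a minimal Python sketch of what make gets right. It is not how GNU make is implemented; the file names and the `RULES` table are hypothetical, and real make handles many more cases (pattern rules, phony targets, and so on). The idea is just that you state what depends on what, and the tool works out the build order and which targets are out of date.

```python
from typing import Dict, List, Set

# Hypothetical rule table: target -> prerequisites (like "target: prereqs" in a makefile).
RULES: Dict[str, List[str]] = {
    "hits.gff": ["genome.fasta", "model.hmm"],
    "hits.fasta": ["hits.gff", "genome.fasta"],
    "report.txt": ["hits.fasta"],
}


def build_order(target: str, rules: Dict[str, List[str]]) -> List[str]:
    """Return the targets that have rules, prerequisites first (depth-first)."""
    order: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(node: str) -> None:
        if node in done:
            return
        if node in in_progress:
            raise ValueError(f"circular dependency at {node}")
        in_progress.add(node)
        for dep in rules.get(node, []):
            visit(dep)
        in_progress.discard(node)
        done.add(node)
        if node in rules:  # plain source files have no rule and are never "built"
            order.append(node)

    visit(target)
    return order


def is_stale(target: str, rules: Dict[str, List[str]], mtimes: Dict[str, float]) -> bool:
    """A target is stale if it is missing, or any prerequisite is missing or newer."""
    if target not in mtimes:
        return True
    return any(
        mtimes.get(dep, float("inf")) > mtimes[target]
        for dep in rules.get(target, [])
    )


if __name__ == "__main__":
    print(build_order("report.txt", RULES))
```

The point of the sketch is that nothing in `RULES` says *when* to run anything; the ordering falls out of the dependency graph, which is the part of make worth keeping.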
My aversion to GUIfied, GRIDdled workflows still lives, I'm afraid.
It didn't help that when I opened the first example in the Taverna tutorial
a 24-node graph popped onto the screen.
Now, a Perl script to scuttle around the Gene Ontology would probably weigh in under 24 lines.
I teach basic Unix and Perl as part of my introductory bioinformatics classes,
and right up-front I should admit that the Perl definitely feels like dinosaur-speak.
I can't honestly pretend it's a language with longevity and broad application;
it ain't. But the Unix skills do
have those qualities.
I find myself wondering, is this kind of workflow really a viable replacement for all that incredibly-powerful Unix command-line arcana?
It's not as if my students seem to have that
much problem with passing files around.
Tom is a great software engineer and has the right motives.
For what it's worth, Taverna seems to be very well executed, and
the general idea of a GUI for Unix pipes and sockets will never stop being cute.
Personally, I think the way forward for clickable bioinformatics is
to build good web apps from existing pragmatic frameworks, like
together with robust Unix tools that use well-documented file formats
Golden rules for bioinformatics web applications
is worth a read.)
It also seems to me that a workflow that relies on connections to a bunch of other sites on the internet is pretty fragile,
compared to the tedious and expensive, but (importantly) stable, solution of setting up your own cluster.
pointed out that the "I like three-line Perl scripts" argument is not a robust defense against web-enabled workflows.
After all, any three-line Perl script should in principle
be easy to offer as a Grid-enabled service.
However, the point is that I don't think the technology is quite there yet.
It took a decade for YahooPipes
to offer this kind of thing as a robust, stable, user-friendly platform
and their user interface is still
a bit clunky; plus, their site was instantly swamped by demand.
In some ways, Taverna-like systems address a fundamentally different question from
the pipeline engineering issue of what server-side infrastructure to use (see BioMake
for a discussion of this).
Taverna is all about making the "workflow" transparent so that a Unix-naive user can build their own pipeline.
However, I don't think you can design a genome-scale pipeline unless you have either
(i) a good practical working knowledge of the server-side resource usage of the components or
(ii) an infinitely scalable grid.
I suppose that workflow fans assume that (ii) will happen any day now, and maybe this is true;
however, user interface design still seems to me like an essentially different problem to pipeline design.
For the "right" way to do pipelines (as opposed to the quick-and-dirty way), I lean towards functional languages, as discussed on the BioMake page.
In fact, I tend to favor the quick-and-dirty way, i.e. GNU make.
It doesn't interface all that well with SGE, which (in my view) is its biggest drawback in practice.
However, this is a drawback that may be fixed soon (given the existence of multiple competing
efforts in this direction).
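For what it's worth, one crude workaround is to have each makefile recipe submit itself to SGE and block until the job finishes, so that make still sees an exit status. The Python sketch below only builds the command line; it does not submit anything. The helper name `sge_wrap` and the default job name are made up for illustration. The `qsub` options used (`-sync y`, `-b y`, `-cwd`, `-N`) are standard SGE ones, but whether this approach suits a given cluster is an assumption to check locally.

```python
import shlex
from typing import List


def sge_wrap(recipe: str, job_name: str = "make_step") -> List[str]:
    """Build an argv that runs one shell recipe as a blocking SGE job.

    -sync y : qsub waits for the job to finish and returns its exit status
    -b y    : treat the command as a binary rather than a job script
    -cwd    : run the job in the current working directory
    -N      : give the job a name
    """
    return ["qsub", "-sync", "y", "-b", "y", "-cwd", "-N", job_name, "sh", "-c", recipe]


if __name__ == "__main__":
    # Hypothetical recipe; print the line a makefile rule could run instead.
    print(shlex.join(sge_wrap("sort hits.gff > hits.sorted.gff")))
```

This keeps make's dependency logic intact and hands only the execution to the queue, at the cost of one blocked `qsub` process per running rule.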
Andrew has listed some problems with GNU make in his Makefile Manifesto.
I agree with most of these, and would really like to see something more functionally elegant, e.g. using the Erlang language.
A disclaimer for all this: I was once a low-level coder who used to write sprite routines in 8-bit assembler for games.
I am highly resistant to all forms of technology; e.g. until last year I refused even to touch SQL databases,
claiming that my own R-trees
would always be faster.
So you should take my opinion on fancy new technologies with a grain of salt; unless, of course,
you're a Luddite like me, in which case you've already made up your mind, and good luck to you.
- 08 Jan 2006
PS: as I said, my own tendencies would lead me towards the functional solutions described on the BioMake
page, though I actually just tend to use regular old GNU make
for command-line analysis.
(Addendum on 3/3/2007: But see Andrew's MakefileManifesto for some drawbacks of the makefile approach...)
PPS -- note on the octopus chips and the interactive fiction:
I met TomOinn
around 1996 when he was a Cambridge undergrad and I was a Sanger grad student.
At the time he (re-)introduced me to interactive fiction
and some interesting snacks made of pressed octopus.
PPPS: check it out, I think I found the chips