Bioinformatics compute pipeline frameworks
This page lists desiderata and pragmatic options for construction of bioinformatics pipelines (e.g. for genome annotation).
Practical options
(See also Make Comparison.)
Procedural hacks:
- Scripting languages
- Shengqiang Shu's SAPS (Sequence Analysis Pipeline System)
- Perl modules in BDGP CVS: public-CVS/saps
Quick & dirty declarations:
More sophisticated functional/declarative languages:
- Termite Scheme
- possibly insufficiently developed for production use
- implementation is lightweight, though, apparently
- would play very nicely with xrate format
GUI-fied and GRIDdled:
Desiderata
- Genericity
- no domain specific (eg bioinformatics) code
- user can define functions/modules for domain specific apps
- Programmability
- not a static config format
- extensible
- DSL (domain specific language)
- Syntactic support for perspicuous unix command generation
- makefiles get some things right here
- Separation of logic from configuration
- functional style
- allow infix operators - s-exprs too unfriendly
- Layered on top of a (pure?) functional/logical language
- Monads?
- Lazy/strict/eager?
- GUI/IDE optional but not required
- Implicit Parallelisation
- async and sync execution
- separation of concerns: no explicit parallelisation in language
- dependencies figured out automatically
- optimal parallelisation strategy chosen
- allow both threading and compute farm dispatch
- Transparent execution
- Intermediate results can be stored using system of choice
- filesystem (ie make)
- database (relational, BDB, ...)
- Job status also stored using system of choice
- Configurable dependency triggers
- dependency rules encoded in DSL
- defaults:
- by timestamp (a la Makefiles)
- by MD5 hash of contents
- or of some "normal form" representation of the contents
- custom parse of contents to extract date/modified field (eg dependencies on web pages/URLs)
- OS coupling
- Make it easy to plug any command line script in
- bioperl, xslt, awk, unix piping etc for free
- No requirement for webservices
- but make it easy to integrate
- Rule and database system
- builtin
- integrated with functional language
- Interactive
- Interpreted or incrementally compiled
- instant results from command line shell
Options: rule based (cf Makefiles, Prolog) or functional
See also: Erlang Language, Termite Scheme
Perhaps this could be layered on Termite. We would still want to
define a DSL that would compile down to Lisp S-expressions. Termite would
be problematic for launching thousands/millions of jobs. We want
something that will work with PBS/LSF/GridEngine etc. Perhaps this is
still possible with Termite? Call/cc?
For the DSL the main choice is between a rule-based syntax (cf
Makefiles, Skam) or functional syntax.
Here is an example of the latter; this function call gets the top hits
from blasting a collection of sequences against a collection of
databases:
    map(\D ->
          map(\[P S] ->
                filter(\H ->
                         score(H) > 200
                       hits(blast(P mask(S) D)))
              select([P S] program_seq_database(P S D)))
        select(D fastadb(D)))
The select function queries a datalog-style database (here
encoding a table fastadb/1 and a rule program_seq_database/3, which
retrieves the sequences that can be blasted against a particular
database and the blast program that must be used). The hits, blast and
mask functions would be defined elsewhere; the latter two would invoke
blast and repeatmasker via the command line.
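As a rough illustration only, the same query can be written in Python
with the datalog relations replaced by in-memory stand-ins. Everything
here is hypothetical: FASTADB, PROGRAM_SEQ_DATABASE, the Hit type and
the stub mask, blast, hits and score functions are assumptions made
for the sketch, and nothing invokes the real blast or repeatmasker.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    score: int


# Hypothetical stand-in for the fastadb/1 table.
FASTADB = ["nr", "est"]

# Hypothetical stand-in for the program_seq_database/3 rule,
# stored as (program, sequence, database) triples.
PROGRAM_SEQ_DATABASE = [
    ("blastp", "protA", "nr"),
    ("blastn", "dnaB", "est"),
]


def select_fastadb():
    """Equivalent of select(D fastadb(D))."""
    return list(FASTADB)


def select_program_seq(database):
    """Equivalent of select([P S] program_seq_database(P S D)) for a fixed D."""
    return [(p, s) for (p, s, d) in PROGRAM_SEQ_DATABASE if d == database]


def mask(seq):
    # Stub: a real version would run repeatmasker on the command line.
    return seq + ".masked"


def blast(program, seq, database):
    # Stub: a real version would run blast; this returns a fake report.
    return {"query": seq, "rows": [(database + ":1", 150), (database + ":2", 250)]}


def hits(report):
    return [Hit(report["query"], subject, sc) for subject, sc in report["rows"]]


def score(hit):
    return hit.score


def top_hits(threshold=200):
    """Nested comprehensions mirroring the map/map/filter structure."""
    return [
        [
            [h for h in hits(blast(p, mask(s), d)) if score(h) > threshold]
            for (p, s) in select_program_seq(d)
        ]
        for d in select_fastadb()
    ]
```

The result is nested in the same way as the DSL expression: one list
per database, containing one list of filtered hits per (program,
sequence) pair.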
Parallelisation would be implicit: a config could state whether to
execute this synchronously (ie each function call blocks) or
asynchronously. With async, dependencies would be figured out
automatically (eg blast(P mask(S) D) would have to wait on the results
of mask(S), but each application of the lambda function in map could
be launched in parallel). The parallelisation strategy could be
configured; for example, anything launching a command on the OS could
be considered parallelisable.
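A minimal Python sketch of this idea, assuming a thread pool as the
executor (a compute farm dispatcher would slot into the same place).
The names run_one and run_all and the asynchronous flag are
hypothetical, and mask and blast are stubs. The pipeline logic in
run_one is identical in both modes; only the configuration decides
how the map is executed.

```python
from concurrent.futures import ThreadPoolExecutor


def mask(seq):
    # Stub standing in for a repeatmasker invocation.
    return seq.lower()


def blast(program, masked_seq, database):
    # Stub standing in for a blast invocation.
    return f"{program}({masked_seq},{database})"


def run_one(job):
    program, seq, database = job
    # Within one job, blast depends on mask, so the two run in sequence.
    return blast(program, mask(seq), database)


def run_all(jobs, asynchronous=True, workers=4):
    """Apply run_one to every job, blocking or in parallel by configuration."""
    if not asynchronous:
        return [run_one(job) for job in jobs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Separate jobs are independent; pool.map keeps the input order.
        return list(pool.map(run_one, jobs))
```

Here the dependency between mask and blast is written out by hand
inside run_one; the system described above would instead infer it
from the expression.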
With a pure functional language, functions are reentrant. Calling this
a second time would not launch the programs again; the results would
be tabled/memoized. Memoization could happen via a database or the
filesystem; unlike typical memoization, the tabled results are
persistent. They should also be transparent: users will see blast and
repeatmasker files appearing on their filesystem as threads complete.
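Persistent, filesystem-backed memoization could look roughly like the
following Python sketch. The persistent_memo decorator, the JSON
cache files and the stub blast function are all assumptions made for
illustration; arguments and results must be JSON-serialisable for
this particular version to work.

```python
import functools
import hashlib
import json
import os
import tempfile


def persistent_memo(cache_dir):
    """Memoize a function by storing each result as a file in cache_dir."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # The cache key is derived from the function name and arguments.
            raw = json.dumps([fn.__name__, args]).encode()
            key = hashlib.md5(raw).hexdigest()
            path = os.path.join(cache_dir, f"{fn.__name__}-{key}.json")
            if os.path.exists(path):
                # Tabled result found on disk: do not call fn again.
                with open(path) as fh:
                    return json.load(fh)
            result = fn(*args)
            with open(path, "w") as fh:
                json.dump(result, fh)
            return result
        return wrapper
    return decorate


CACHE = tempfile.mkdtemp()   # a real pipeline would use a fixed directory
CALLS = []                   # records how often the stub actually runs


@persistent_memo(CACHE)
def blast(program, seq, database):
    # Stub standing in for a command line blast run.
    CALLS.append((program, seq, database))
    return {"program": program, "query": seq, "database": database}


first = blast("blastp", "protA", "nr")
second = blast("blastp", "protA", "nr")   # served from the cache file
```

Because the table lives on disk, the second call would be answered
from the cache even in a later session, and the cached files remain
visible to the user.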
The dependency system is coupled with the memoization system. For
example, a change in the timestamp or MD5 hash of a fastadb will flag
the results of certain function calls as stale, along with all results
that depend on them. This means persisting the reduction graph.
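One way to picture this, as a sketch in Python: record an MD5 digest
for each input, keep a graph of which result was computed from what,
and propagate staleness through that graph. The DEPENDS_ON graph, its
node names and the stale_results function are hypothetical; a real
system would persist the graph and hash file contents rather than
short strings.

```python
import hashlib


def md5_of(text):
    return hashlib.md5(text.encode()).hexdigest()


# Hypothetical persisted reduction graph: result -> things it was computed from.
DEPENDS_ON = {
    "mask(seqs)": ["seqs.fa"],
    "blast(mask(seqs),nr)": ["mask(seqs)", "nr.fa"],
    "hits(blast)": ["blast(mask(seqs),nr)"],
}


def stale_results(recorded, current, depends_on):
    """Return every result that depends, directly or not, on a changed input."""
    changed = {name for name, digest in current.items()
               if recorded.get(name) != digest}
    stale = set()
    progress = True
    while progress:  # repeat until no new stale result is found
        progress = False
        for result, inputs in depends_on.items():
            if result in stale:
                continue
            if any(i in changed or i in stale for i in inputs):
                stale.add(result)
                progress = True
    return stale


recorded = {"seqs.fa": md5_of("ACGT"), "nr.fa": md5_of("release 1")}
current = {"seqs.fa": md5_of("ACGT"), "nr.fa": md5_of("release 2")}
stale = stale_results(recorded, current, DEPENDS_ON)
```

In this example only the database changed, so the masked sequences
stay valid while the blast results and the hits derived from them
are flagged.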
Allow pattern matching in function definitions, together with a basic
type system. This gives us the power of OO.
Open question: how to handle side-effects. Currently my favoured model
is for a function, eg blast(P S D), to return an atom/data object and to
execute blast as a side effect. The atom/object can be used to look up
the results (eg in the filesystem or in a db).
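A small Python sketch of that model, under stated assumptions: the
ResultRef class, the outdir parameter and the output file naming are
hypothetical, and the stub writes a placeholder report where a real
implementation would run blast.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultRef:
    """Lightweight handle returned by a function; the data itself is on disk."""
    path: str

    def read(self):
        with open(self.path) as fh:
            return fh.read()


def blast(program, seq, database, outdir):
    path = os.path.join(outdir, f"{program}.{seq}.{database}.out")
    # Side effect: a real version would run blast here and write its report.
    with open(path, "w") as fh:
        fh.write(f"# {program} {seq} vs {database}\n")
    # Return value: only a reference that can be used to look up the results.
    return ResultRef(path)


ref = blast("blastp", "protA", "nr", tempfile.mkdtemp())
report = ref.read()
```

The caller only ever handles the small reference object, so it can be
passed between functions or stored in a database without copying the
report itself.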
-- Chris Mungall - 10 Feb 2006
-- Added links -- IanHolmes - 06 Mar 2007