Bio Make

From Biowiki

+ Bioinformatics compute pipeline frameworks

This page lists desiderata and pragmatic options for construction of bioinformatics pipelines (e.g. for genome annotation).



++ Practical options

(See also Make Comparison.)

Procedural hacks:

  • Scripting languages
    • Shengqiang Shu's SAPS (Sequence Analysis Pipeline System)
      • Perl modules in BDGP CVS: public-CVS/saps

Quick & dirty declarations:

More sophisticated functional/declarative languages:

  • Termite Scheme
    • possibly insufficiently developed for production use
    • implementation is lightweight, though, apparently
    • would play very nicely with xrate format

GUI-fied and GRIDdled:

++ Desiderata
  • Declarative
  • Genericity
    • no domain specific (eg bioinformatics) code
    • user can define functions/modules for domain specific apps
  • Programmability
    • not a static config format
    • extensible
    • DSL (domain specific language)
    • Syntactic support for perspicuous unix command generation
      • makefiles get some things right here
    • Separation of logic from configuration
    • functional style
      • allow infix operators - s-exprs too unfriendly
    • Layered on top of a (pure?) functional/logical language
      • Monads?
      • Lazy/strict/eager?
    • GUI/IDE optional but not required
  • Implicit Parallelisation
    • async and sync execution
    • separation of concerns: no explicit parallelisation in language
      • dependencies figured out automatically
      • optimal parallelisation strategy chosen
    • allow both threading and compute farm dispatch
  • Transparent execution
    • Intermediate results can be stored using system of choice
      • filesystem (ie make)
      • database (relational, BDB, ...)
    • Job status also stored using system of choice
  • Configurable dependency triggers
    • dependency rules encoded in DSL
    • defaults:
      • by timestamp (a la Makefiles)
      • by MD5 hash of contents
        • or of some "normal form" representation of the contents
      • custom parse of contents to extract date/modified field (eg dependencies on web pages/URLs)
  • OS coupling
    • Make it easy to plug any command line script in
      • bioperl, xslt, awk, unix piping etc for free
    • No requirement for webservices
      • but make it easy to integrate
  • Rule and database system
    • builtin
    • integrated with functional language
  • Interactive
    • Interpreted or incrementally compiled
    • instant results from command line shell
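
The "configurable dependency triggers" item above could be sketched in Python as interchangeable fingerprint functions. This is only an illustration under stated assumptions: the names `timestamp_trigger`, `md5_trigger`, `normalized_md5_trigger` and `is_stale` are hypothetical, and the "normal form" used here (trailing whitespace and blank lines ignored) is just one possible choice.

```python
import hashlib
import os


def timestamp_trigger(path):
    """Fingerprint a file by its modification time (a la Makefiles)."""
    return str(os.stat(path).st_mtime_ns)


def md5_trigger(path):
    """Fingerprint a file by the MD5 hash of its raw contents."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def normalized_md5_trigger(path):
    """Fingerprint a 'normal form' of the contents.

    Assumed normalisation: strip trailing whitespace, drop blank lines.
    """
    with open(path, "r") as fh:
        lines = [line.rstrip() for line in fh]
    normal = "\n".join(line for line in lines if line)
    return hashlib.md5(normal.encode("utf-8")).hexdigest()


def is_stale(path, recorded_fingerprint, trigger=md5_trigger):
    """A result is stale when its dependency's fingerprint has changed."""
    return trigger(path) != recorded_fingerprint
```

Passing the trigger as a parameter keeps the dependency rule configurable per target, which is the point of the desideratum; a DSL would presumably expose the same choice declaratively.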

Options: rule based (cf GNU make Makefiles, Prolog) or functional

See also: Erlang Language, Termite Scheme

Perhaps this could be layered on Termite. We would still want to define a DSL that would compile down to Lisp S-expressions. Termite would be problematic for launching thousands/millions of jobs. We want something that will work with PBS/LSF/GridEngine etc. Perhaps this is still possible with Termite? Call/cc?

For the DSL the main choice is between a rule-based syntax (cf Makefiles, Skam) or functional syntax.

Here is an example of the latter; this function call gets the top hits from blasting a collection of sequences against a collection of databases:

map(\D ->
  map(\[P S] ->
    filter(\H ->
      score(H) > 200
      hits(blast(P mask(S) D)))
    select([P S] program_seq_database(P S D)))
  select(D fastadb(D)))

The select function queries a datalog-style database (here encoding a table fastadb/1 and a rule program_seq_database/3, which retrieves the sequences that can be blasted against a particular database together with the blast program that must be used). The hits, blast and mask functions would be defined elsewhere; the latter two would invoke blast and repeatmasker via the command line.
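
For readers more used to a mainstream language, the nested map/filter expression can be rendered roughly in Python. This is a sketch, not the proposed DSL: the toy tables `FASTADB` and `PROGRAM_SEQ_DATABASE`, the made-up scores, and the wrapper name `top_hits` are all assumptions, and `mask` and `blast` are stubs standing in for the command-line invocations.

```python
# Toy stand-ins for the datalog-style database (assumed data, illustration only).
FASTADB = ["nr", "swissprot"]
PROGRAM_SEQ_DATABASE = [
    ("blastp", "seqA", "nr"),
    ("blastp", "seqB", "swissprot"),
    ("blastx", "seqC", "nr"),
]


def fastadb():
    """All known FASTA databases, i.e. select(D fastadb(D))."""
    return list(FASTADB)


def program_seq_database(d):
    """(program, sequence) pairs that may be blasted against database d."""
    return [(p, s) for (p, s, db) in PROGRAM_SEQ_DATABASE if db == d]


def mask(s):
    """Stub for repeat masking; a real version would run repeatmasker."""
    return s + ".masked"


def blast(p, s, d):
    """Stub for a BLAST run; returns a fake report with made-up scores."""
    return {"hits": [{"query": s, "db": d, "program": p, "score": sc}
                     for sc in (150, 250)]}


def hits(report):
    return report["hits"]


def score(hit):
    return hit["score"]


def top_hits(threshold=200):
    """Nested map/filter mirroring the DSL expression above."""
    return [
        [
            [h for h in hits(blast(p, mask(s), d)) if score(h) > threshold]
            for (p, s) in program_seq_database(d)
        ]
        for d in fastadb()
    ]
```

The result is one list per database, containing one list of filtered hits per (program, sequence) pair, which matches the shape the nested `map` calls would produce.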

Parallelisation would be implicit: a config could state whether to execute this synchronously (ie each function call blocks) or asynchronously. With async, dependencies would be figured out automatically (eg blast(P mask(S) D) would have to wait on the results of mask(S), but each application of the lambda function in map could be launched in parallel). Parallelisation strategy could be configured; for example, anything launching a command on the OS could be considered parallelisable.
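
A minimal Python sketch of this idea, assuming a thread pool as the parallelisation strategy: the calling code is identical in both modes, and only a configuration flag decides whether applications of the function run one after another or concurrently. The names `pmap` and `blast_masked` are hypothetical, and `mask` and `blast` are stubs rather than real command invocations.

```python
from concurrent.futures import ThreadPoolExecutor


def mask(s):
    """Stub for repeat masking."""
    return s + ".masked"


def blast(p, s, d):
    """Stub for a BLAST run; just echoes its arguments."""
    return (p, s, d)


def blast_masked(job):
    """One application of the lambda: blast must wait for mask."""
    p, s, d = job
    masked = mask(s)
    return blast(p, masked, d)


def pmap(fn, items, asynchronous=True, max_workers=4):
    """Apply fn to each item; results come back in input order either way."""
    if not asynchronous:
        return [fn(item) for item in items]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Independent applications run concurrently in the pool.
        return list(pool.map(fn, items))
```

Dispatch to a compute farm would replace the thread pool with a job submitter, leaving `pmap` callers unchanged; that substitution is not shown here.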

With a pure functional language (FL), functions are reentrant: calling this a second time would not launch the programs again, because the results would be tabled/memoized. Memoization could happen via a database or the filesystem; unlike typical memoization, the tabled results are persistent. They should also be transparent: users will see blast and repeatmasker files appearing on their filesystem as threads complete.
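
Persistent, filesystem-backed tabling might look like the following Python decorator. It is a sketch under assumptions: the name `persistent_memo`, the JSON file format and the MD5-based file naming are illustrative choices, and arguments and results are assumed to be JSON-serialisable.

```python
import functools
import hashlib
import json
import os


def persistent_memo(cache_dir):
    """Decorator: table results of a pure function as JSON files in cache_dir."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # Key the table entry on the function name and its arguments.
            key_src = json.dumps([fn.__name__, args], sort_keys=True)
            key = hashlib.md5(key_src.encode("utf-8")).hexdigest()
            path = os.path.join(cache_dir, f"{fn.__name__}-{key}.json")
            if os.path.exists(path):
                # Already tabled: read the stored result, do not recompute.
                with open(path) as fh:
                    return json.load(fh)
            result = fn(*args)
            os.makedirs(cache_dir, exist_ok=True)
            with open(path, "w") as fh:
                json.dump(result, fh)
            return result
        return wrapper
    return decorate
```

Because the table lives in ordinary files, the results survive between sessions and remain visible to the user, which is the transparency property described above.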

The dependency system is coupled with the memoization system. E.g. a change in the timestamp or MD5 hash of a fastadb will flag the results of certain function calls as stale (along with all dependent function results). This means persisting the reduction graph.
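
Propagating staleness through a persisted graph can be sketched as a simple traversal. The function name `stale_results` and the dict-of-lists graph representation are assumptions, and the sketch presumes the changed inputs have already been detected (eg by timestamp or MD5 hash).

```python
def stale_results(dependents, changed):
    """Return every result that depends, directly or not, on a changed input.

    dependents maps each node to the nodes computed directly from it.
    """
    stale = set()
    stack = list(changed)
    while stack:
        node = stack.pop()
        for child in dependents.get(node, ()):
            if child not in stale:
                stale.add(child)
                stack.append(child)
    return stale
```

For example, with a graph in which `nr.fa` feeds a blast result that in turn feeds a hits result, changing `nr.fa` marks both downstream results stale while leaving the masking result untouched.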

Allow pattern matching in function definitions, together with a basic type system. This gives us the power of OO.

Open question: how to handle side-effects. Currently my favoured model is for a function, eg blast(P S D), to return an atom/data object and to execute blast as a side effect. The atom/object can be used to look up the results (eg in the filesystem or in a database).
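
One way to picture this model in Python: the function returns a small handle immediately, and the program's output file appears as a side effect. Everything named here is hypothetical (`ResultRef`, `fake_run`, the `outdir` parameter and the output file naming), and `fake_run` merely writes a placeholder file instead of invoking a real blast binary.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultRef:
    """Opaque handle returned by a command-running function."""
    path: str

    def lookup(self):
        """Fetch the stored result (here: from the filesystem)."""
        with open(self.path) as fh:
            return fh.read()


def fake_run(p, s, d, path):
    """Stand-in for invoking the real program via the command line."""
    with open(path, "w") as fh:
        fh.write(f"{p} {s} vs {d}\n")


def blast(p, s, d, outdir, run=fake_run):
    """Return a handle; produce the output file as a side effect."""
    path = os.path.join(outdir, f"{p}.{s}.{d}.out")
    if not os.path.exists(path):
        run(p, s, d, path)
    return ResultRef(path)
```

Since the handle depends only on the arguments, repeated calls return equal handles, so the function still behaves as pure from the caller's point of view even though a file is written behind the scenes.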

-- Chris Mungall - 10 Feb 2006

-- Added links -- Ian Holmes - 06 Mar 2007