Bioinformatics Tool Design

From Biowiki

The Laws of Bioinformatics Tool Design

A few of the blindingly obvious principles I've grudgingly been forced to acknowledge after having them hammered into my thick skull over the years...

-- Ian Holmes - 01 Apr 2007

(Name pinched from Raph Koster's Laws of Online World Design)

(See also File Format Design for a much more amusing take on this)

---

Follow the Unix tool philosophy

A tool is a simple program, usually designed for a specific purpose; it is sometimes referred to (at least throughout this document) as a command.

The “Unix tools philosophy” emerged during the creation of the UNIX operating system, after the breakthrough invention of the pipe '|'...

The pipe allowed the output of one program to be sent to the input of another. The tools philosophy was to have small programs to accomplish a particular task instead of trying to develop large monolithic programs to do a large number of tasks. To accomplish more complex tasks, tools would simply be connected together, using pipes.
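As a minimal sketch of this idea (the toy tab-separated feature table is made up for illustration), each small tool below does exactly one job, and the pipe chains them into a larger task:

```shell
# Count how many features fall on each chromosome of a toy feature table
# (columns: chromosome, feature type).
printf 'chr1\tgene\nchr2\tgene\nchr1\texon\n' |
  cut -f1 |   # keep only the chromosome column
  sort |      # group identical values together
  uniq -c     # count each group: 2 chr1, 1 chr2
```

None of these tools knows anything about the others; the pipeline is the program.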

All the core UNIX system tools were designed so that they could operate together. The original text-based editors (and even TeX and LaTeX) use ASCII (the American text encoding standard; an open standard), and you can use tools such as sed, awk, vi, grep, cat, more, tr and various other text-based tools in conjunction with these editors.

Using this philosophy, programmers avoided writing a program (within their larger program) that had already been written by someone else (this could be considered a form of code recycling). For example, command-line spell checkers are used by a number of different applications instead of having each application create its own spell checker.

From http://tldp.org/LDP/GNU-Linux-Tools-Summary/html/the-unix-tools-philosophy.html

Bioinformatics is, first & foremost, a Unix discipline. Avoid GUIs; avoid the zealots who try to get you to do everything in Matlab; avoid menu-driven software. It's got to be scriptable, and it works best when it's modular.

Write filters where possible

In UNIX and UNIX-like operating systems, a filter is a program that gets most of its data from standard input (the main input stream) and writes its main results to standard output (the main output stream). UNIX filters are often used as elements of pipelines. The pipe operator ("|") on a command line signifies that the main output of the command to the left is passed as main input to the command on the right.
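A minimal sketch of a filter in the bioinformatics spirit: an awk one-liner that reads FASTA on standard input and writes it to standard output with sequence lines upper-cased (headers untouched), so it drops straight into any pipeline. The toy record is made up for illustration:

```shell
# Upper-case the sequence lines of a FASTA stream; pass headers through.
printf '>seq1\nacgt\n' |
  awk '/^>/ {print; next} {print toupper($0)}'
```

Because the filter touches only stdin and stdout, it composes freely with everything upstream and downstream of it.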

From http://en.wikipedia.org/wiki/Filter_(Unix)

Use established file formats

Don't ever invent a new file format unless you absolutely need to. If what you're trying to do can be represented as a special case of some more complex format, then use that.

There are so many file formats around that it's very likely one will satisfy your needs. For example: Fasta Format, Newick Format, Gff Format, Stockholm Format, Cigar Format, Genbank Format...

Be consistent with formats

Where you have a choice of formats, try to consistently use one format and stick to it. For example, the tools in our lab mostly use Stockholm Format rather than gapped FASTA format for alignments. We provide Perl scripts (DartPerl:fasta2stockholm.pl, DartPerl:stockholm2fasta.pl) to do the conversions.
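As a rough sketch of the FASTA-to-Stockholm direction (the DartPerl scripts above do this properly; this toy awk version assumes one sequence line per record):

```shell
# Convert a gapped FASTA alignment on stdin to minimal Stockholm on stdout.
# Assumes each sequence occupies exactly one line after its '>' header.
printf '>seq1\nACGT--A\n>seq2\nAC-TTTA\n' |
  awk 'BEGIN {print "# STOCKHOLM 1.0"}
       /^>/  {name = substr($0, 2); next}
             {print name, $0}
       END   {print "//"}'
```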

Think functionally

Functional programming is a programming paradigm that conceives computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast with imperative programming, which emphasizes changes in state.

From Wikipedia:Functional_programming

A lot of good principles flow from functional programming. For example, the idea that certain steps in your pipeline can be broken down into co-ordinate transformations that can be inverted. Another good principle (taken from the Lambda Book) is to make use of invariants where possible (e.g. to identify functions that remain constant throughout a loop or iteration).
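A minimal sketch of an invertible coordinate transformation, with a made-up offset of 1000: shifting contig-local feature coordinates to chromosome-global ones and back. The two awk filters are exact inverses, so the pipeline recovers its input:

```shell
# Forward transform (local -> global), then its inverse (global -> local).
printf '5\t20\n' |
  awk -v o=1000 '{print $1+o "\t" $2+o}' |   # forward: add the contig offset
  awk -v o=1000 '{print $1-o "\t" $2-o}'     # inverse: subtract it again
```

Stating each pipeline step as a pure function of its input makes this kind of round-trip check trivial.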

This leads back to the idea of using well-defined file formats: a function must have precisely defined inputs & outputs.

Fulfil a need

A tool that doesn't meet a well-defined need is an exercise in masturbation. Avoid. (Or at least, don't expect to get paid for it)

Re-use established tools

An almost-obvious extension of the "Use established formats" rule, and a corollary of "Fulfil a need". Don't rewrite a tool that's already been written (unless, of course, you absolutely need to do so because the original sucks beyond human belief).

Use portable standards

Portability almost goes without saying, but it should not override the "Use established file formats" rule.

Many bioinformatics newbies have asserted, full of idealistic fervour, that they won't rest until every FASTA, Newick or Genbank file has been converted to some dialect of XML. Many Bothans died to bring us this useless piece of ideology.

Pay attention to distribution

Do you really want 9 out of 10 potential downloaders to be put off because they couldn't build your package, due to some broken dependency that you took for granted? Do you really want annoying "won't compile" emails from the remaining 10%?

A model for everyone is Guy Slater's Exonerate program, which uses autoconf. Our DART software is not quite as slick, but at least lists its dependencies in an INSTALL file in the root directory (and compiles easily on a GNU system).

RPMs or equivalent (e.g. CPAN) are another way to go, for the dedicated.

Pay attention to documentation

Ideally, your tarball should include a README file, an INSTALL file, a few examples illustrating usage...

A manpage is cool too, if you're really keen (I seldom get around to this, though).

Coarse-grain it

Parallelize by coarse-graining. There is seldom any need to get fancy when distributing bioinformatics tasks across a cluster, since those tasks often involve iterating over (say) a thousand sequences. It is, therefore, easy enough to split the sequence database into a thousand parts and run as a thousand separate jobs. There's seldom any need to think in terms of wave algorithms, or other complicated ideas from distributed algorithm theory.
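A coarse-graining sketch on a toy multi-FASTA: split the database into one chunk per sequence, then run each chunk as an independent job. Here `wc -c` stands in for the real per-chunk analysis, and background shell jobs stand in for cluster submission:

```shell
# Make a toy two-sequence database, then split it one-record-per-file.
printf '>a\nACGT\n>b\nTTTTTTTT\n' > all.fa
awk '/^>/ {n++} {print > ("chunk" n ".fa")}' all.fa

# Run each chunk as a separate job (on a real cluster, submit with qsub etc.).
for f in chunk*.fa; do
  wc -c < "$f" > "$f.out" &
done
wait    # collect all jobs before any downstream step
```

No shared state, no inter-job communication: just split, run, and concatenate the results.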

Automate it

So many bioinformatics tools need to be baby-sat. If I wanted to poke my way through a bunch of braindead menus, I'd call the phone company. Command-line options, people, please...

The ultimate in automation is the use of declarative build tools like make. If you design your program with such tools in mind then you're also thinking functionally, and deserve a gold star. See this page for further discussion of all this.
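A sketch of what "declarative" means here, as a two-rule Makefile: each rule names its output, its inputs, and the command connecting them, and make re-runs only the steps whose inputs have changed. All file and tool names below are hypothetical placeholders:

```make
# Each target is a pure function of its prerequisites.
results.tab: hits.out
	summarize hits.out > results.tab

hits.out: query.fa db.fa
	search query.fa db.fa > hits.out
```

Edit query.fa and both steps re-run; edit nothing and `make` does nothing, which is exactly the functional view of the pipeline.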

---

Comments

  • This doesn't really touch on databases/SQL, to which I have recently become a convert... although a lot of the things here could be ported to database philosophies, e.g. "use established file formats" --> "use established schemas" (a principle that I have regretfully broken). But I'm not sure how things like piping and Makefiles would work, unless you are piping to/from database interface scripts?... - Andrew Uzilov - 01 Apr 2007 19:42:34
  • If I knew the rules for designing bioinformatics database workflows I'd post them. These rules are for Unix programs, and I am hesitant to extrapolate them to RDBMS's, which have their own design principles. That doesn't mean the above rules are invalidated if you're using a database, though: most of your sequence analysis (for example) will still be done in Unix; your tools should still use established file formats (GFF, FASTA, etc.) and this will probably influence your schema. I still advocate grounding yourself by considering dataflow in terms of existing formats and not trying to re-invent the wheel. For more advanced treatments, Chris's Bio Make project is the closest thing I know to an automated build tool that hooks up SQL to Unix commands. - Ian Holmes - 01 Apr 2007 20:05:14
  • (BTW, SQL is a functional language, so "think functionally" -- which is clearly the most important rule here by far -- still applies.) - Ian Holmes - 01 Apr 2007 20:09:35