Make Use Cases

From Biowiki

Use cases for GNU Make in bioinformatics analyses

(I ended up manually editing the Graphviz diagrams on this page, rather than using Makefile Visualization tools.)

Running xrate on Fly Mavid Windows to look for RNA genes

This represents a small part of the xrate pipeline which we are trying to build. The full pipeline does repeatmasking, whole-genome alignment and sliding-window annotation. What follows is the last part of that workflow.

The Makefiles are in the directory /nfs/data/genome/fly, with filenames Makefile, Makefile.scan and Makefile.windows.

General structure (note that the arrows indicate dependencies, so they point in the reverse direction to the workflow):

[Figure not available: Graphviz diagram of the general dependency structure.]
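As a rough illustration of the arrow convention (not taken from the actual Makefiles), the sketch below stores each stage's prerequisites, as a Makefile does, and recovers the workflow order with a topological sort. The stage names are hypothetical labels based on the pipeline description above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage names. Each key depends on (points to) its prerequisites,
# which is the same direction as the arrows in a Makefile dependency graph.
DEPENDS_ON = {
    "annotation": {"windows"},
    "windows": {"alignment"},
    "alignment": {"repeatmasked"},
    "repeatmasked": {"genome"},
    "genome": set(),
}


def workflow_order(depends_on):
    """Return stages in execution order: prerequisites first, final target last."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(workflow_order(DEPENDS_ON)))
```

Reading the dictionary from key to value follows the dependency arrows; reading the sorted list from left to right follows the workflow.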

Here's a more detailed figure, reflecting more of the actual structure of the makefile and showing some of the pseudotargets used to iterate over directories.

[Figure not available: detailed Graphviz diagram of the makefile structure, including pseudotargets.]

The window generation itself is something of an example of a dysfunctional Makefile (sorry Andrew): it is basically just a bunch of isolated makefile stanzas with no dependencies, so you have to remember what order to call them in, and you might as well just write a shell script. It's hard to see what else you could do, though, since true dependency-based rules would require multiple pattern matching, which make can't do (in the above notation, "windows/%/w[xyz].stk", where % is the alignment number and [xyz] is the window number). Note how the "all-batches" and "all-windows" pseudotargets sidestep this by calling make recursively (via Perl one-liners).
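One possible workaround for the single-wildcard limit (not what the Makefiles here do) is to generate explicit rules from a script and include them in the main Makefile. The sketch below writes one rule per (alignment, window) pair for targets of the form "windows/%/w[xyz].stk". The alignment names, the window count, the alignments/ input layout and the `make-window` command are all hypothetical placeholders.

```python
def window_rules(alignments, windows_per_alignment, command="make-window"):
    """Build Makefile text with one explicit rule per (alignment, window) pair.

    `command` is a placeholder for whatever program cuts a window out of an
    alignment; it is not a real tool name.
    """
    lines = []
    targets = []
    for aln in alignments:
        for w in range(1, windows_per_alignment + 1):
            target = f"windows/{aln}/w{w}.stk"
            source = f"alignments/{aln}.stk"  # assumed input layout
            targets.append(target)
            lines.append(f"{target}: {source}")
            # Recipe lines must start with a tab.
            # $< is the prerequisite, $@ the target, $(@D) the target's directory.
            lines.append(f"\tmkdir -p $(@D) && {command} $< {w} > $@")
            lines.append("")
    # A phony aggregate target that depends on every generated window file.
    header = [".PHONY: all-windows", "all-windows: " + " ".join(targets), ""]
    return "\n".join(header + lines)


if __name__ == "__main__":
    print(window_rules(["1", "2"], 3))
```

The output could be saved to a file (say, windows.mk) and pulled in with make's `include` directive, so that each window file becomes an ordinary target with a real dependency on its alignment. The cost is an extra generation step and a large generated rule file.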

It would certainly be cleaner to stick to one file per MAVID alignment (i.e. not to keep the intermediate window files). In this case, this makes sense for several other reasons too (e.g. saving disk space by not generating thousands of intermediate files). More generally, however, this is a kludge that often comes up with Makefiles: lack of support for multiple-wildcard pattern matching in rules forces you into designing pipelines where there is only one level of granularity for coarse-grained parallel processing (in this case, the alignment).
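To make the one-file-per-alignment idea concrete, here is a minimal sketch that enumerates windows in memory instead of writing each one to disk. It assumes windows are fixed-width column ranges advanced by a fixed step, which may not match how the actual pipeline defines them; the function name and parameters are illustrative only.

```python
def sliding_windows(n_columns, width, step):
    """Yield (start, end) half-open column ranges covering an alignment.

    The last window is truncated at the end of the alignment.
    """
    start = 0
    while start < n_columns:
        yield start, min(start + width, n_columns)
        if start + width >= n_columns:
            break  # this window already reaches the last column
        start += step


if __name__ == "__main__":
    for start, end in sliding_windows(n_columns=10, width=4, step=3):
        print(start, end)
```

With this arrangement the only files make needs to track are the per-alignment input and output, so a single "%" wildcard is enough.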

Measuring loop & stem substitution rates in Rfam

This work is part of an effort to develop useful phylogrammars for detailed annotation of curated RNA gene family alignments, such as those in Rfam.

From /nfs/data/db/rfam/Makefile

General structure:

[Figure not available: Graphviz diagram of the general dependency structure.]

Below is a more detailed figure. Only a subset of the full rule space is shown, illustrating a couple of the analyses: the lambda-mu scatterplot (stem-rate comparisons) and the variable-speed stem-and-loop annotations.

The diagram is slightly vague/misleading in that the Makefile (as it stands) can't generate the entire analysis with just one "make" command; not quite all of the dependencies work perfectly (so several "make" statements are required in practice).

Broken dependencies, which GNU make can't really handle elegantly (or at least I haven't found a way), are indicated by dotted lines (not to be confused with pattern-matching rules, which are shown by dashed lines).

Using a Makefile makes it easier to develop your analysis incrementally. Starting with the general structure above, I was able to easily elaborate & refine the intermediate steps, e.g. introducing a guided training scheme for the trained pfold grammar pfold-param.eg (actually I tried several different training schemes, eventually settling on the one shown here).

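[Figure not available: detailed Graphviz diagram of the Rfam Makefile, showing the lambda-mu scatterplot and variable-speed stem-and-loop annotation rules.]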

-- Ian Holmes - 21 Mar 2007