Makefile Manifesto

From Biowiki

---

The cons of using GNU make for job control

Like others, our lab uses Makefiles for bioinformatics pipeline development.

After all, a pipeline is basically an organized way of converting data from one form to another, right? Is this not what Makefiles were designed to do - except for the specific purpose of converting source code to object code? Can we not hack the make system to do our bidding?

Why focus on the cons?

  • Because they are technical and specific, whereas the pros are more abstract and philosophical, therefore the cons must be enumerated more explicitly.
  • Because the pros are somewhat obvious, whereas the cons don't manifest themselves until you really start digging in the guts of the make system.
  • Because it gives us potential problems to spot when searching for a Better System.

This is too long and no one will read it, so it's useless

Maybe. However, it was originally longer: the very act of writing this and trying to verify these problems helped me solve some of them, so this helped at least one person. Also, maybe someone will g**gle for a solution and come across this.

However, this needs to be boiled down to something more abstract and philosophical and hand-waving. Isn't the fine art of abstraction essential in any computing field? Maybe it will come to me in a dream.

At least two other people have read it already

(writes IH) and possibly more. So that's something.

As for "this needs to be boiled down to something more abstract and philosophical and hand-waving", take a look at Chris' Bio Make page. He's since suggested to me that the Erlang Language is conceptually similar to the Prolog Language-based functional language he was developing.

I think Chris has stopped working on Bio Make for the moment, having moved away from genome pipelines as he focuses on Bio Ontologies, but there are some good ideas there.

You know, it can get worse than make, as well as better. I really don't get the hype around Bioinformatics Workflows and "Grids", but it's there, in all its frightening XML-heavy GUI-ness.

But what about the pros?

A very general summary:

  • Simplifies re-running stages of the pipeline (by automating the identification of what data has changed and which stages need to be re-run).
    • Without an automated build system, every time you re-run something you incur a small probability of error. These probabilities accumulate, making an error almost inevitable, as anyone who has tried to develop a big program without using "make" can attest.
  • Saves a lot of typing.
  • Self-documenting.
    • ...although I suppose Perl/Python code to run a pipeline is also "self-documenting", but I think the declarative Makefile form more intuitively reflects the actual steps in a pipeline process.
    • Cutting & pasting from the command line to a README file is not self-documenting.
  • Provides for easy reproducibility of the computational experiments.
    • Maybe you can even distribute it as Supplementary Material for a paper?
      • IH: I don't know about distributing Makefiles as suppl info, but reproducibility is important; so much so that I'd say if it isn't easily and automatically reproducible, then it just isn't reproducible. Period. So I think you need something like make, even for throwaway analysis -- and definitely for a pipeline.
  • Available on many systems:
    • more ubiquitous than Apache Ant (right?);
    • no custom installation of anything required.
  • (Added by IH 3/3/07)
    • make does sit quite well with a command-line approach to life. You can play around with a Unix tool and then cut & paste straight into the makefile.
    • The dependency tracking is lightweight, (mostly) intuitive, and expressed in a more-or-less declarative programming language.
    • There are some nice built-in options for debugging like "make -n" (show but don't exec commands), "make -t" (touch instead of updating), "make -d" (print dependencies & other stuff), "make -p" (print the make database of rules and variables after first expansion), etc.
    • A (theoretical) pro is qmake which hooks up with Sun grid engine (theoretical because we haven't got it working as of 3/3/07)
      • distmake is another option
      • omake is yet another. omake has MD5-based dependency analysis and is a full functional programming language. It also supports multiple parallel remote job execution, though apparently without queueing.
    • The fact that it's pre-installed is a pretty slim advantage at best. I don't think it'd be that hard to install Ant (for example)

Disclaimers

I am willing to admit that half of this is caused by my misuse/misunderstanding of make. I would love if anyone would correct me. Everything I know about make comes from the GNU make manual and I think it tends to not focus on some important things. Maybe they expect me to dive into the code?

This was done with GNU make v3.80 (sometimes on v3.79 or v3.81).

---

Specific complaints

These examples are slightly tailored toward our purposes (e.g. the .stk file extensions for Stockholm format files).

TYPO ALERT: I did not test some of this stuff explicitly, as I am recalling it from memory. Please correct me if it's wrong.

---

only one stem

This is arguably the number one fundamental problem with make. Why, oh why, do you only get one stem for targets in pattern rules, e.g.:

%.stk:
		  commands

but not this:

%.%.stk:
		  commands

We've all seen crazy file names like analysis3.null6.dataset1.filterCutoff_15.tab or some other madness that tries to encode several bioinformatics pipeline settings into the file name. Clearly it would be quite advantageous to easily decompose it in a make target, e.g.:

analysis%.null%.dataset%.filterCutoff_%.tab:
		  commands

Think about how sweet it would be to have each stem match an automatic variable you can use in the command body or the dependency list!

Having arbitrary regular expressions may be too much to ask for, but multiple stems should be doable by a build system.

Another useful case would be if I want to run something like:

make dir/subdir/file.stk

and have it match the rule:

dir/%/%.stk:
		  commands

With multiple stems, I can easily decompose it into a subdir and a file. Ah, but you say there are automatic variables like $(@D) and $(@F) for that? Well, they don't help if I have a subdir that is more than one level deep (e.g. dir/subdir/file.stk) and I want to pull it out, as in the case above. Now yes, there are string transformation functions to get out the subdir, but those only work in the commands - not in the dependency list! See my next point.

Potential workaround

You can stick stuff into environment variables and use them in the targets, e.g.:

analysis%.null${nullNum}.dataset${datasetNum}.filterCutoff_${cutoff}.tab:

This is actually a pretty elegant solution and works well in practice, especially if you add stuff to your makefile to check that the variables are set and exit with an $(error ...) to the user if they're not.
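
A minimal sketch of that check, assuming the three variable names from the example above (the guard line and the echo command are illustrative, not taken from a real pipeline):

```make
# Abort at parse time if any required variable is empty or unset.
required_vars := nullNum datasetNum cutoff
$(foreach v,$(required_vars),$(if $($(v)),,$(error $(v) is not set; try: make $(v)=VALUE target)))

analysis%.null$(nullNum).dataset$(datasetNum).filterCutoff_$(cutoff).tab:
	@echo "would build $@ (analysis number $*)"
```

Run it as e.g. make nullNum=6 datasetNum=1 cutoff=15 analysis3.null6.dataset1.filterCutoff_15.tab; leaving out any of the three stops make with the error message while the makefile is still being read.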

---

automatic variables don't work in the prerequisite list

Why, oh why, can't I do this:

dir/%.stk: $(*D)
		  commands

so that when I run:

make dir/subdir/file.stk

I can use subdir as a prereq (or construct another prereq using subdir in the name) to ensure that it actually exists before running the command. If it doesn't exist, I could have the phony target $(*D) create it. But make doesn't allow you to have these neat directory/file splitting variables in prereq lists, so you have to resort to either changing your workflow to something more complicated, or using something like this:

OK, maybe the above is a stupid example. A more realistic use case would be using string functions like subst to process a target name in the prerequisite list as a workaround for lack of multiple stems. But any function in the prereq list gets expanded before a target name is identified (i.e. before the stem is computed), so the stem % gets treated as the raw symbol % while you are doing the text substitution, instead of being filled in with the value.

Ah, but you say I can use secondary expansion! Oh, I tried, and I failed miserably. It's such a twisted solution to such a simple problem and also doesn't work as intuitively as you might expect, e.g.:

.SECONDEXPANSION:

foo-%: bar-$$@
		  commands

Hmm, so if I do:

make foo-blah

it should use bar-foo-blah as a prereq, right? Nope. And I'm still not entirely clear why. If anyone can help, please let me know!

Also, why would I want such a crazy thing as above? Why not just use:

foo-%: bar-foo-%
		  commands

Because it doesn't work for multiple targets:

foo1-% foo2-% foo3-%: # gee, what do I put here?
		  commands

I would need to write the rules out manually:

define commands
# put some commands here, saving them into a var so we don't have to retype them for each target
endef
foo1-%: bar-foo1-%
		  ${commands}
foo2-%: bar-foo2-%
		  ${commands}
foo3-%: bar-foo3-%
		  ${commands}

Yes, I can think of cases where I might want this. Perhaps this example doesn't capture the essence of the problem (just the technical details), so it needs to be reworked. And anyway, automatic variables should work everywhere as a matter of principle.
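
On the secondary expansion puzzle: the .SECONDEXPANSION special target only exists in GNU make 3.81 and later (in older versions the line has no special meaning), which may explain the failure described above. A minimal sketch, assuming 3.81 or later; the bar-% rule is a placeholder added only so that the prerequisite can be made:

```make
.SECONDEXPANSION:

# $$@ is expanded a second time, after the target name is known,
# so for "make foo-blah" the prerequisite becomes bar-foo-blah.
foo-%: bar-$$@
	@echo "making $@ from $<"

# Placeholder rule so that bar-foo-blah can be built.
bar-%:
	@echo "making $@"
```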

---

weird error messages for pattern rules

If I have a rule:

foo-%.stk: nonexistent_file
#		 this will never fire because the prereq doesn't exist

and I type in:

make foo-bar.stk

Obviously we can't run the command since the prereq isn't there, but what kind of error message do we get? Something about a missing prereq? No! We get this:

make: *** No rule to make target `foo-bar.stk'.  Stop.

But... the rule exists! The rule that we want should be identifiable by make. It's the prereq that's missing.

Funny, if we change it to a non-pattern rule and run the same exact thing:

foo-bar.stk: nonexistent_file
#		 this will never fire because the prereq doesn't exist

make foo-bar.stk

we get the expected message:

make: *** No rule to make target `nonexistent_file', needed by `foo-bar.stk'.  Stop.

Why the difference?

---

make doesn't understand more than one way to address a path in the target

We know that all of the following ls commands will return the same stuff, because all the paths refer to the same dir:

cd /home

ls /tmp
ls /tmp/
ls ../tmp
ls ..///home/..//tmp/

But make doesn't do that in target names, because it uses string matching. This can lead to situations like this:

DIR := /foo/bar/

${DIR}/%/blah.stk:
	@echo will it work?

now run:

make /foo/bar/baz/blah.stk

and note the "no rule to make target" error. However, if we run (note the double slashes):

make /foo/bar//baz/blah.stk

it works! That's half an hour of debugging I want back.

Ah, but I shouldn't be using trailing slashes in my directory names, right?... Well, I can think of situations where I might want to. Regardless, if make is expecting a path, it should treat it semantically (like ls does for example), not as string matching.
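
One defensive habit, sketched here under the assumption that DIR may or may not arrive with a trailing slash: strip it with patsubst before using the variable in a target, so that the pattern never contains a double slash.

```make
# Remove one trailing slash, if present: "/foo/bar/" becomes "/foo/bar".
DIR := $(patsubst %/,%,/foo/bar/)

$(DIR)/%/blah.stk:
	@echo "matched $@ with stem $*"
```

With this, make /foo/bar/baz/blah.stk matches the rule. It does nothing about the other spellings of the same path (../ components, repeated slashes in the middle), which are still compared as strings.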

---

stupidity with trailing slashes

For the same reason as the previous complaint (make compares paths as plain strings), a rule like this:

foo/%:
	@echo $(*D)

will produce the following differences in what is semantically the same input:

make foo/bar

outputs:

.

while:

make foo/bar/

outputs:

bar

Great. So if I hit "tab" for shell autocompletion and get a trailing slash, I get the second variant. But if the directory name is coming from some "properly" declared var with the trailing slash removed, I get the first. Now I have to watch out for the difference between the two variants and check for it explicitly.

How many Linux geeks have been slain by frustrations with something as idiotic as a trailing slash problem? How many years have been lost over this? Maybe we should have variable typing for paths (dir versus file or something).

---

must give pattern rule targets an explicit slash to make it understand we want a path

More fun with slashes. In make's defense, they do explain this feature here.

Consider:

prefix_%.stk:
		  @echo the target is $@

if we run:

make prefix_foo/bar.stk

we get the "no rule to make target" error.

But if we do this:

prefix_foo/%.stk:
		  @echo the target is $@

and run the same thing:

make prefix_foo/bar.stk

it'll work as anticipated, except with the disadvantage that the _foo suffix is hardcoded into the directory name, instead of being variable like intended.

Why can't a stem surrounded by explicit text expand to something containing slashes? Oh if only we had regular expressions for targets instead... (see #1).

We are, oddly, allowed this:

%.stk:
		  @echo the target is $@

which works for both this:

make prefix_foo/bar.stk

and this:

make bar.stk

except that now the rule is too general, instead of being confined to paths or files that start with prefix.

---

hard to debug

See #4 for hints why. I won't elaborate on the many other reasons.

---

no ability to define targets that always execute before or after any target

Let's say I have a set of shell commands that I want executed before or after any rule. The former case is useful for setting up a directory structure for your project. The latter case is for cleaning up temp files or doing some sort of logging. It would be nice to have an explicit syntax for this.

See below for some workarounds.

---

no wildcards allowed in PHONY targets

You can't have something like this:

.PHONY: %.done

You have to explicitly define an expansion for every single stem, which is impossible sometimes, but here is an example:

.PHONY: $(allStems:%=%.done)
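
A common idiom for getting always-run behaviour out of a pattern rule, sketched here with the %.done pattern from above, is to depend on a target that has no commands, no prerequisites, and no file of the same name (FORCE is just a conventional name, not a keyword):

```make
# FORCE is never up to date, so anything depending on it is always remade.
%.done: FORCE
	@echo "running $@ regardless of timestamps"

FORCE:
```

This avoids having to enumerate the stems, at the cost of one extra prerequisite on the rule.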

---

Incorrectly regarded as cons (fixes, workarounds, and tips)

When this page was originally created, I was new to makefiles. Much of the complaining has been due to my inexperience. The next sections are dedicated to fixing some previous complaints that actually have a very legitimate resolution.

General tips

A lot of problems can be solved using functions. You can use call, for example, to write your own function.
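
A minimal sketch of a user-defined function (stk_in and the data directory are made-up names for illustration):

```make
# $(1) is a directory, $(2) a list of base names; returns dir/name.stk for each.
stk_in = $(addprefix $(1)/,$(addsuffix .stk,$(2)))

files := $(call stk_in,data,a b c)

# "make show" prints: data/a.stk data/b.stk data/c.stk
.PHONY: show
show:
	@echo $(files)
```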

Debugging tips

There are some nice built-in options for debugging:

  • make -n
    • show but don't exec commands; always run this (e.g. make -n target_name) before running make for real, to see what it will do without actually doing it
  • make -t
    • touch instead of updating
  • make -p
    • print the make database of rules and variables after first expansion
  • make -d
    • print dependencies & other stuff (although the output is very confusing)

It may help to make the first command of every single target in your file something like:

some_target: dep1 ... depN
		  @echo "Making $@ from dependencies $^"

At least this way you can tell which rule is firing, which is often not trivial to figure out by looking at make -n output.

Fixes for old complaints

Thanks to Malcolm Cook for enlightening me to many of these.

No ability to define targets that always execute before or after any target

You can do something like:

.PHONY: begin end
begin:
		  commands
end:
		  commands

Then, for each target in your makefile do:

some_target: begin ... end
		  commands

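where ... are your other dependencies, if any (can be empty). Beware that make builds prerequisites before running the target's own commands, so end still fires before the commands of some_target, not after them; this only really solves the "before" half.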

Alternatively, for a set of commands that you want to execute before any target, you can just put a raw statement like this anywhere in your makefile:

$(shell ...)

where ... is any set of commands you want to run. When the makefile is parsed, the function is expanded and the shell commands execute as a side effect. (Their output should be empty or redirected, since make tries to parse whatever the function expands to.) So, these commands run once per invocation of make, before the commands of any target. Unfortunately, there is no equivalent of this for commands to execute after every target.
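
For instance, a sketch that makes sure some output directories exist before any rule runs (results and tmp are placeholder names):

```make
# Expanded once while the makefile is read; mkdir -p prints nothing,
# so the line expands to nothing and only the side effect remains.
$(shell mkdir -p results tmp)

results/%.txt:
	echo "built $@" > $@
```

Note that this runs on every invocation of make, including make -n.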

Can't treat dependency list as an array

So let me see, if I want to get stuff from the prerequisite list, I have these (and only these) automatic variables:

$< $? $^ $+ $|

documented here. But what if I want to get the Nth prereq? Why can't I treat the prereq list as an array to index into?

Solution: use $(word n,$^) (see GNU make manual). For example, $(word 1,$^) and $(word 2,$^) give you the first and second prereqs, etc.
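
A short sketch of indexing into the prerequisite list (the file names are hypothetical):

```make
merged.tab: left.tab right.tab
	@echo "first prereq: $(word 1,$^), second prereq: $(word 2,$^)"
	cat $(word 1,$^) $(word 2,$^) > $@
```

Keep in mind that $^ drops duplicate prerequisites, so the positions refer to the de-duplicated list; use $+ if duplicates matter.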

There is no trivial way to get the path to your Makefile

Personally, I like to put my Makefile in the root of whatever project I am working on. As I hop around the project subdirs, it is occasionally useful to run the central Makefile from elsewhere up the ladder, e.g.:

make -f ../Makefile ...

The problem is that if there are any rules that depend on you being in the same dir as the Makefile (as they very often do), they now break.

Of course, you could use absolute paths, but then they break if you move your project dir - also not a solution. Likewise, you could hardcode the absolute path to your project root into a Makefile variable, but then you need to update it whenever the project moves - once again, not a solution. This stuff should be dynamic.

Why can't I get a special variable that tells me where the Makefile I'm running is located?

To resolve this, I wrote this hack, which by the way only works for make v3.80 or later, since it relies on MAKEFILE_LIST (so for Jaguar users, no dice):

# save your Makefile's absolute path into $prefixdir

ifeq ($(firstword $(MAKEFILE_LIST)),Makefile)
  prefixdir := $(shell pwd)
else
  prefixdir := $(shell cd $(subst Makefile,,$(firstword $(MAKEFILE_LIST))); pwd)
endif

Yeah, that's intuitive.

Except the above doesn't work if:

  • your Makefile isn't named "Makefile"
  • you are including other Makefiles

Another possible fix

MWD=$(dir $(word 1, ${MAKEFILE_LIST}))

now, ${MWD} is the directory holding your makefile (possibly relative).

Yet another fix

The following is more elegant and shorter for getting the absolute path:

MAKEFILE_PATH := $(shell cd $(dir $(word 1, $(MAKEFILE_LIST))) ; pwd)
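
On GNU make 3.81 or later, the lastword and abspath functions should give the same thing without a shell call. A sketch, assuming the line appears before any include directive, so that the last entry of MAKEFILE_LIST is still the file being read (MAKEFILE_DIR is a made-up variable name):

```make
# Absolute directory of the makefile currently being parsed.
# Must come before any "include", which appends to MAKEFILE_LIST.
MAKEFILE_DIR := $(abspath $(dir $(lastword $(MAKEFILE_LIST))))

.PHONY: where
where:
	@echo "makefile lives in $(MAKEFILE_DIR)"
```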

rules can't take arguments

How do you pass an argument to a rule from the command line? There are many situations where you may want to do this, but can't.

Workarounds:

  • put the arg in an environment variable
  • encode the arg into the target name

Both of these are lame and take 5 times longer (since shell expansion doesn't work for the latter, so I can't hit "tab" to save typing) than if we could just pass args to rule.
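
The first workaround, sketched with a hypothetical variable named seg. Variables given on the make command line override assignments in the makefile, so the value does not have to live in your shell environment:

```make
# Default used when nothing is passed in.
seg ?= 311

.PHONY: analysis
analysis:
	@for s in $(seg); do echo "analysing segment $$s"; done
```

Invoked as make analysis seg="311 510 818".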

Hmmmmm

What exactly do you mean by "rules can't take arguments"? What would you like to be able to do? (wonders malcolm_cook@stowers-institute.org)

Here is an example of a pattern I sometimes use:

MYPROG_OPT1=  -seq 'type=gene;seqid=4;start=1;end=90000' -infer subftype=intron -gff subftype=all 
# ...for producing GFF output for just a fragment of the chromosome 4
MYPROG_OPT2=  -seq 'type=gene;seqid=4' -infer subftype=all -gff subftype=all 
# ... for inferring all subfeatures on chromosome 4
MYPROG_OPT=OPT1
# ... which will be the default unless overridden from call to make
MYPROG=./path/to/some/executable/named/myprog  --o1 v1 --o2 --v2  ${${MYPROG_OPT}} 
# ... which is the command line with hardwired common options and OPT1 or OPT2 other options (but without input/output options)

#then, in some rule

myprog/${MYPROG_OPT}_% : %
# PURPOSE: run myprog using chosen option set, putting results in dir named after the program and the option set
	${MYPROG} $< > $@

we could use arbitrary args for many, many things

Andrew Uzilov replies:

Your example shows wrapping up different types of analysis behind short-and-sweet rule names, which is nice. I'm concerned, however, with applying it to select parts of the data.

So let's say, for example, I have 4000 subdirectories with data. My concrete use case is 4000+ multiple genome alignment segments, one segment in each subdir. I want to run some sort of analysis on just a handful of segments - let's say segments 311, 510, and 818. I could of course do something like:

make segment-analysis-311
make segment-analysis-510
make segment-analysis-818

but that takes 3 times the typing as if I could say:

fantasy-make segment-analysis segmentdir/311/ segmentdir/510/ segmentdir/818/

where fantasy-make is a system that I wish could take params after target names. Note that this has the bonus that I can hit "tab" to get autocompletion in the shell.

Now, I could set those values in shell variables, but that's more typing. I could also hackily encode them in the target name, e.g.:

make segment-analysis-311-510-818

and hackily extract them on the other end, but that's even worse.

AH! But you say, an important point of make is automation and workflow - I'm supposed to write rules that can figure out which segments need the analysis automatically and then do it, saving the results. I shouldn't specify which dirs I want on the command line - my makefile should figure out which dirs haven't been analysed yet. It's more systematic, robust, and reproducible.

True. However, many times I find myself piping together a few commands to answer simple, one-time questions whose result is not worth saving. For example, what is the size of the data before I submit a job to SGE? (so I can estimate how long it will take)

It is also useful to write these kinds of rules to do visual spot-checks or to aid the debugging or troubleshooting of your pipeline. I argue that you need the ability to apply your rule to only a subset of the data at a whim. Especially since, in bioinformatics, re-doing a pipeline segment could take days - what if I want to just try the segment on a single alignment, to make sure I've debugged it? Or re-run a small piece of the data with more robust logging, for the same purposes - but this kind of logging would choke the host if I ran it on ALL the data.

And finally, here's something I thought of just now... let's say you want to compare two datasets, or run some analysis that compares/correlates/whatever two arbitrary pieces of data. It would be nice to do something like:

fantasy-make compare model1 model2

As far as I can figure out, you can't have make take two or more things as input and do some sort of joint analysis. You can specify multiple targets, but what we really want here are multiple prereqs.

---

Sure you can. Viz this makefile, named foo.mk

(replies Malcolm Cook)

PREREQS = blat bazz

foo: ${PREREQS}
	echo "$@ was made with these PREREQS: $^" > $@

${PREREQS}:
	touch $@

Now when I call `make -f foo.mk foo`, file foo winds up with contents of `foo was made with these PREREQS: blat bazz`

However, when I call `make PREREQS='segmentdir/510/ segmentdir/818/' -f foo.mk foo`, file foo ends up with contents of `foo was made with these PREREQS: segmentdir/510/ segmentdir/818/`

...which is just what you want, no?

I don't really get your other requirements. But take a look at #Call-Function in the make manual - I think it might get you further with your esoteric uses of make for biopipes (which I like).

I enjoy this thread but not in wiki form. Where better to have it where results are open and saved?

Your directory location is not persistent

This is actually a feature. Regardless, since it's easy to make a novice mistake here, I'm writing it up.

The nice novice mistake is: what's the difference between

whichdir1:
	cd /tmp/ ; \
	pwd

and

whichdir2:
	cd /tmp/
	pwd

The first one prints /tmp, the second prints - guess what - your current directory! That's because each line of a command sequence is run in its own shell, which starts in the original executing directory.
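
If you do want the cd to carry over, keep everything in one shell invocation. A sketch (whichdir3 is a made-up target name):

```make
# "&&" keeps both commands in one shell and stops if the cd fails.
whichdir3:
	cd /tmp/ && pwd
```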

Baffling errors on targets with no commands

Let's say you have something like:

subdirs := foo bar baz

.PHONY: all

# Do everything.
all: $(subdirs:%=%/analysis.done)
%/analysis.done: %/processed_data.tab %/figure.png

%/processed_data.tab: ...other_deps...
		  commands

%/figure.png: ...other_deps...
		  commands

The idea is: you have some data in subdirectories that you want to process/analyze and make a figure, placing the results back into that subdir. The analysis done in each subdir is exactly the same, it's only the underlying data that is different. We loop over the subdirs and carry out the analysis in each.

Now, what do you think happens when you type make all and nothing has been made so far? I would expect that make figures out that, for each subdir, processed_data.tab and figure.png need to be made, then makes them. Nope, instead we get:

make: *** No rule to make target `foo/analysis.done', needed by `all'.  Stop.

The fix for this is to add a command, any command, e.g.:

%/analysis.done: %/processed_data.tab %/figure.png
		  @echo > /dev/null

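In our original example, the command body was empty, so it failed. Going by the GNU make manual, a pattern rule with no commands does not define a rule at all: it cancels any implicit rule with that target and those prerequisites, which is why make then finds no rule.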

My other idea was that the proper fix is to make phony targets (which %/analysis.done is) explicitly phony:

.PHONY: all $(subdirs:%=%/analysis.done)

This doesn't work. Instead, we get:

make: Nothing to be done for `all'.

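You would think that the target all would invoke its dependencies as subroutines, as described here, and those would in turn make the .png and .tab files that we want. But it doesn't happen, apparently because make skips the implicit (pattern) rule search for phony targets, so the %/analysis.done rule is never consulted.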
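
One way around both problems, going by the GNU make manual, is a static pattern rule: its targets are listed explicitly, so it needs no implicit rule search and may legitimately have no commands. A sketch (the done variable is a made-up name and the touch commands are stand-ins for the real analysis):

```make
subdirs := foo bar baz
done := $(subdirs:%=%/analysis.done)

.PHONY: all $(done)
all: $(done)

# Static pattern rule: applies only to the targets listed in $(done).
$(done): %/analysis.done: %/processed_data.tab %/figure.png

%/processed_data.tab:
	mkdir -p $(@D) && touch $@

%/figure.png: %/processed_data.tab
	touch $@
```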

---

-- Created by: Andrew Uzilov - 01 Mar 2007