Known Issues With DART

From Biowiki

This page is for frequently asked questions, bug reports, anecdotal usage information, help, community wisdom, complaints, whatever; as long as it's about DART. See also DART bug reporting for guidelines on the cool way to report a bug.

Numerical instability of highly singular matrix

Sometimes, if a rate matrix is extremely singular or otherwise weird, the numerical routines in EXPOKIT, EISPACK and Newmat (the matrix libraries cannibalized by dart), or in other mysterious code procured from various shady vendors of logic (not looking at Gerton Lunter), can have certain problems with it. For example, the MatrixExpEigenPrepare function in DartSrc:newmat/ appears to compute the wrong eigenvectors for the matrix R=\left(\begin{array}{rrr}0 & 0 & 0\\1 & -1 & 0\\0 & 0 & 0\end{array}\right)

An example grammar file using this rate matrix R (or actually 0.999*R) is attached to this page.

To reproduce the bug on this grammar file, run xrate exactly as shown, and look for the final call to MatrixExpEigenPrepare:

echo | xrate -g -log RATE_EM -log RATE_EM_NEWMAT

Note that xrate transforms the matrix before diagonalizing it, so the matrix that the program is trying (and failing) to diagonalize is actually \left(\begin{array}{rrr}0 & 0 & 0\\0.706617 & -0.999307 & 0\\0 & 0 & 0\end{array}\right)

The correct eigenvalues and eigenvectors can be verified independently, e.g. with any computer algebra system.

Unfortunately, at the moment there is no explicit test for such matrices; the routines just return numerically incorrect results :-( Fortunately, "most" matrices conventionally encountered in statistical molecular phylogenetics do not appear to be as badly behaved as this.
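For reference, the correct eigensystem of the transformed matrix is easy to check by hand, since the matrix is lower triangular (its eigenvalues are just the diagonal entries). The following pure-Python sketch (not part of DART) verifies it:

```python
# The matrix that xrate fails to diagonalize (from the example above).
# It is lower triangular, so its eigenvalues are just the diagonal
# entries: 0, -0.999307 and 0.
a, b = 0.706617, -0.999307

def matvec(v):
    """v -> Q v for Q = [[0,0,0],[a,b,0],[0,0,0]]."""
    return [0.0, a * v[0] + b * v[1], 0.0]

# Eigenvector for the eigenvalue b: (0, 1, 0).
v1 = [0.0, 1.0, 0.0]
assert matvec(v1) == [b * x for x in v1]

# Two independent eigenvectors for the repeated eigenvalue 0:
# (0, 0, 1) and (1, -a/b, 0).
v2 = [0.0, 0.0, 1.0]
v3 = [1.0, -a / b, 0.0]
assert matvec(v2) == [0.0, 0.0, 0.0]
assert all(abs(x) < 1e-12 for x in matvec(v3))
```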

Precision warnings ("Warning: log-likelihood dropped..." and "Failed EM improvement threshold...")

Could you give some clarification about warnings like this in XRATE?
- "Warning: log-likelihood dropped from -1.82377e+06 to -1.82413e+06 bits during EM"
- "Failed EM improvement threshold for the 3th time; stopping"

Those warnings are due to numerical precision limitations within xrate. Several aspects of the way xrate's EM is implemented introduce small errors: most significantly, the transformation to the eigenvector basis by matrix diagonalization, and the use of lookup tables to accelerate the function PSum(A,B)=log(exp(A)+exp(B)), which introduces small rounding errors. As a result, xrate's log-likelihoods tend to be accurate to only ~3 decimal places.
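The rounding behaviour of a PSum-style lookup table can be illustrated with a toy sketch (this is not xrate's actual table; the step size and table length here are made up for illustration):

```python
import math

def psum_exact(a, b):
    """log(exp(a) + exp(b)), computed stably."""
    m, d = max(a, b), abs(a - b)
    return m + math.log1p(math.exp(-d))

# A coarse lookup table for f(d) = log(1 + exp(-d)).
STEP = 0.01
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(2000)]

def psum_table(a, b):
    m, d = max(a, b), abs(a - b)
    if d >= STEP * len(TABLE):
        return m                     # exp(-d) is negligible here
    return m + TABLE[int(d / STEP)]  # rounds d down to the table grid

# The table answer is close but not exact; these per-call rounding
# errors accumulate over the many PSum calls in a DP pass.
print(psum_exact(-10.0, -10.005), psum_table(-10.0, -10.005))
```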

These precision errors don't matter much at the start of EM, but when you get near the maximum of the likelihood curve, they can cause the following problems:

  1. The EM algorithm is supposed to always increase the likelihood; however, due to the aforementioned precision issues, the likelihood can occasionally decrease by a small amount (this explains your first warning);
  2. If EM is working "perfectly", a common criterion for stopping the EM algorithm is that the log-likelihood stops increasing (or only increases by <X% per iteration, where X is a low number). However, if there is a bit of noise in the EM (e.g. due to these precision issues), it can sometimes occur that the log-likelihood dips transiently, but then continues to rise. As a workaround for this, xrate has a "forgive" parameter that permits it to tolerate a few "bad" consecutive iterations of EM (a "bad" iteration being one where the log-likelihood decreases, or more generally, fails to increase by X% over the previous iteration). By default, this "forgive" parameter is set (somewhat arbitrarily) to 3. So when the likelihood fails to meet the required increase for three iterations in a row, EM is stopped. This explains your second warning.
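The stopping rule described above can be sketched as follows. This is a toy driver, not xrate's actual code; the parameter names "mininc" and "forgive" come from the discussion here, but the exact comparison semantics in xrate may differ:

```python
def em_driver(lls, mininc=0.001, forgive=3):
    """Toy sketch of xrate's EM stopping rule. `lls` is an iterator
    yielding the log-likelihood after each EM step; an iteration is
    "bad" if it fails to improve the log-likelihood by at least a
    fraction `mininc` of its magnitude."""
    ll = next(lls)
    bad = 0
    for new_ll in lls:
        if new_ll - ll < mininc * abs(ll):
            bad += 1            # a "bad" iteration
            if bad >= forgive:  # "Failed EM improvement threshold
                break           #  for the 3th time; stopping"
        else:
            bad = 0             # a good iteration resets the count
        ll = new_ll
    return ll

# Three consecutive below-threshold improvements trigger a stop:
print(em_driver(iter([-100.0, -99.0, -98.95, -98.94, -98.93, -50.0])))
# A transient dip or slowdown is forgiven if EM recovers:
print(em_driver(iter([-100.0, -99.99, -98.0, -90.0])))
```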

To make the earlier statement that "xrate's log-likelihoods tend to be accurate to only ~3 decimal places" more precise: xrate's log-likelihoods can only be expected to reliably increase while the increases are more than 0.1% of the total log-likelihood. Thus the default value for X (in point 2 above) is 0.1; this is the "mininc" command-line parameter to xrate (the parameter value is actually 0.001, since it is specified as a fraction, not a percentage).

The above appears to be a worst-case scenario; empirically, the "mininc" parameter can often be set much lower with apparently perfectly fine results. For example, when benchmarking against PAML, we set it to 1e-6 in the hope of getting better precision, and it worked. That was for this paper: Heger et al.: Accurate estimation of gene evolutionary rates using XRATE, with an application to transmembrane proteins. Mol. Biol. Evol. 2009;26:1715-21.

My hunch is that the worst problems occur when the rate matrix is "ill-behaved" (i.e. hard to diagonalize - e.g. because it is sparse, or strongly irreversible, etc) and that most of the problems are to do with this.

EM convergence

(See also tips for codon matrices and precision warnings)

The DART program xrate uses a version of the EM algorithm to estimate evolutionary rates (Holmes & Rubin: An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 2002;317:753-64.) and/or tree branch lengths (which are reciprocal to rates).

The EM algorithm is notoriously prone to getting stuck in local optima. For rate-EM, this becomes a particular problem when you "seed" the EM algorithm with a grammar containing fast rates (or long branches).

Our hypothesis is that this is because longer branch lengths generate more "noise". On a long branch, the most likely scenario is that there were lots of substitutions on it, even if the ancestor and descendant are in the same state. However, on a short branch, the expected number of substitutions that occurred is more tightly coupled to the observed ancestral and descendant states.

Another way to visualize this is to plot the Jukes-Cantor log-likelihood for an alignment with X matches and Y mismatches (with X>0 and Y>0); see Durbin et al., Section 8.3, for a calculation of the likelihood of two sequences under the Jukes-Cantor model. The slope is very steep for t<tmax and very shallow for t>tmax, where tmax is the maximum-likelihood value of t. So for a greedy algorithm that gives up when the slope gets too shallow, you get more accurate results by starting with smaller t. Here's an example with X=7 and Y=3:
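In case the plot doesn't render here, the curve and its slopes can be reproduced with a short sketch (assuming the Jukes-Cantor parameterization with t in expected substitutions per site; not DART code):

```python
import math

def jc_loglike(t, X=7, Y=3):
    """Log-likelihood (in nats, up to an additive constant) of observing
    X matching and Y mismatching sites at Jukes-Cantor distance t."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff_each = (0.25 - 0.25 * e) / 3.0  # one specific mismatched base
    return X * math.log(p_same) + Y * math.log(p_diff_each)

# Analytic maximum: t_max = -(3/4) ln(1 - (4/3) * Y/(X+Y))
t_max = -0.75 * math.log(1.0 - (4.0 / 3.0) * 3.0 / 10.0)

def slope(t, h=1e-6):
    """Numerical derivative of the log-likelihood curve."""
    return (jc_loglike(t + h) - jc_loglike(t - h)) / (2.0 * h)

print(round(t_max, 3))       # location of the maximum
print(round(slope(0.1), 1))  # steep positive slope below t_max
print(round(slope(2.0), 1))  # shallow negative slope above t_max
```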


This may be an oversimplification since EM is not strictly the same as gradient ascent.

Generally, and this is an empirical observation, the EM algorithm converges much better if you start with short branch lengths and low rates and allow the EM algorithm to increase them, rather than starting with high rates and hoping the algorithm will reduce them.

Tree macro warnings ("Warning: more than one alignment in database...")

I wonder whether it is possible to train a grammar on several datasets, i.e. several Stockholm files, each one containing a tree ("#=GF NH" line) and an alignment.

The trick is that I want to be able to use unrelated datasets: the sequences are not the same in the different Stockholm files. If I use the simple syntax:

xrate -g start.gra -t trained.gra -log 5 <file1.sto> <file2.sto> <file3.sto> ......

then the log says that each of the alignments is read but then: "Warning: more than one alignment in database. Using tree from first alignment for tree-related macros."

Is there a way to train a grammar simultaneously on several different datasets, inducing several unrelated trees?

Those warnings are a bit misleading, and should probably be removed or clarified. They shouldn't be a problem in your case. Basically, there are some advanced "macro" features of the xrate grammar syntax that allow you to do things like use a different rate matrix for every branch of the tree (or for one branch, or some subset of branches, etc).

In that case, the grammar file needs to "know" about the tree somewhat, and this is implemented via macros such as "&foreach-node" that are expanded when the tree is loaded.

If you use such macros, you really do have a problem when you try to train that grammar on more than one alignment, because the macros will expand differently for each tree, so you effectively have a different grammar for each tree. Thus, attempting to train one grammar on several different alignments (each with its own tree) would not make sense. That's what the warning is for.

Clearly, it's a bit of a distraction in your case, where you are not using these macros. Training a single grammar (which doesn't have tree macros) on a file full of multiple alignments is perfectly fine, and a common "use case" for xrate.

By the way, in case you are interested, the macros are documented here: Xrate Macros

The macros constitute a dialect of the Scheme programming language (it's also possible to embed Scheme code directly within the grammar, if you're so inclined, and if you compile xrate against the GNU "guile" library).

Inconsistent trees

In order to estimate rates and/or annotate alignments, xrate first needs to estimate a tree. To do this it needs a substitution model. If a grammar is loaded from a file, then xrate attempts to find an appropriate substitution model in that grammar. If, however, xrate is invoked with a "preset" grammar, then a special logic kicks in (specifically, xrate attempts to do a quick fit of an HKY85 substitution model, then uses this to estimate branch lengths for neighbor-joining, etc). So if you estimate a tree with xrate, then save the grammar to a file, re-load it, and re-estimate the tree, you might get different results (since the latter tree will not be based on the HKY85 model, but rather whatever model was in the grammar). This logic is counter-intuitive and might one day be fixed.

Weird numbers in my tree

"After running xrate -b, all the branch lengths in my tree were 0.02!!!"

This is a known bug, but not as serious as you might think. 0.02 is the minimum value to which xrate will round a branch length. (It should be 0.01, but there's a bug somewhere that rounds it up.)

If you see branch lengths of 0.02, like the tree below, it usually means that the EM algorithm would have liked to set those branch lengths closer to zero (due to highly similar sequences), but xrate wouldn't let it.

#=GF NH ((((((DroMel:0.02,DroSim:0.02):0.02,DroYak:0.02):0.02,DroEre:0.02):0.02,
				DroAna:0.02):0.02,DroPse:0.02):0.02,(DroMoj:0.02,DroVir:0.02):0.02);

Clearly the artificial limit of 0.02 needs to be smaller, or at least user-specifiable.... it's on the to-do list. ;-)

Update, May 2006: this was in fact due to a bug in Dart's Brent optimization routine in DartSrc:util/maximise.h . It is now fixed - Ian Holmes

Logging is slooooooow!!

Yes, due to the way logging is implemented (with fancy wildcards and such), if you specify any log tags on the command line, the code executes a (probably unnecessary) regular expression every time it passes a logging checkpoint. So, you probably don't want to be using log tags for your high-throughput jobs.

However, setting a numerical log level does not execute these regexps. So using a numerical log level (like -log 5) as opposed to a tag (like -log TREE_EM) should be much faster.

Memory allocation error

When working with SCFGs, xrate allows you to specify the fold envelope via the --length option, which is the maximum length of the internal subsequences (i.e. subseqs [i,j] of seq [1,L] where i != 1 and j != L) considered by the Inside, Outside, and CYK algorithms. This constrains the size of their dynamic programming matrices, so you get a memory/runtime improvement, at the expense of a possibly lower-quality parse. Setting --length to -1 (the default setting) removes the constraint.

If the fold envelope is too large, you may encounter a memory allocation error such as:

terminate called after throwing an instance of 'std::bad_alloc'

If you turn on logging at a sufficiently low numerical level (e.g. -log 3), you will see this in the log file:

Initialising fold envelope: N columns, max subseq length = -1

Here N is the number of alignment columns. The fold envelope takes O(N^3) space. More specifically, the memory usage is O(NS^2) where S is the maximum subsequence length considered by the SCFG parsing algorithms. In this case, "max subseq length = -1" means that there is no subseq length limit, so S = N and the memory usage is O(N^3).

You will frequently encounter memory limitations if you're working with data of appreciable size. The moral is that it's good to explicitly set the maximum subseq length with the --length option. This is a principled approach: it doesn't make sense to consider subsequences longer than the longest biological feature you're searching for. For example, if you're doing a genome scan for ncRNAs, you would set --length to the length of the scan window, which corresponds to the longest ncRNA the program can successfully search for.
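To get a feel for how --length shrinks the dynamic programming problem, one can count the subsequences the envelope admits. This is a rough sketch only: each subsequence may carry additional per-state storage (hence the O(NS^2) figure above), so it is a proxy for, not an exact account of, xrate's memory usage:

```python
def n_subseqs(N, S):
    """Number of alignment subsequences [i, j] with length at most S
    (S = -1 means no limit, as with the default --length -1)."""
    if S < 0:
        S = N
    return sum(N - length + 1 for length in range(1, S + 1))

print(n_subseqs(1000, -1))   # ~N^2/2 subsequences with no limit
print(n_subseqs(1000, 100))  # far fewer with --length 100
```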

Community wisdom

How much training data do I need?

Follow the link for a back-of-envelope calculation.

Do "guided" training

As has been noted, EM is not the most reliable of algorithms, and can get stuck in local optima. MCMC is of course the right solution to this, for all sorts of reasons (accurate estimates of error being chief amongst them). However, if you're desperately in need of a quick maximum likelihood answer, there are some tricks you can use to avoid local maxima.

Guided training, as described in Holmes & Rubin 2002 (An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 2002;317:753-64), is one such trick. The basic idea is to start by training a restricted parameterisation of the model you are interested in. When that is trained, you incrementally add more parameters to the model, then re-train.

See guided training page for more detail.

Start with a low seed

As noted elsewhere on this page, the rate-EM algorithm empirically works better if you start with a seed matrix that has low rates (or a seed tree that has short branches), and allow the EM algorithm to increase them as necessary, rather than starting with fast rates (or long branches) and hoping that EM will trim them down. (Generally, it won't, because fast rates generate less accurate estimates for the number of events that have occurred, and these estimates are the basis for the EM algorithm.)

Inspect likelihoods

When trying to adjudicate between two alleged maximum likelihood estimates, the first thing to check is the likelihood. Most DART programs print out likelihoods, or can be forced to do so using log messages.

Use logging

DART programs have a wild variety of log messages that can be turned on or off using the "-log" option. Use "-help" for a list of permissible log tags, or scan the source code. You can also pass a numerical (integer) argument to "-log": lower numbers print more information. The range is from 9 down to about -3. The default (9) is silent; you start getting useful info around level 6.

Most debugging messages that have ever been put in there are still accessible via some log tag or other. Some of them, such as the posterior expectation counts for EM, are really really useful. (For reference these are -log RATE_EM for estimating rates and -log TREE_EM for estimating branch lengths of trees.)

If you really get into the logging, there is a neat perl script, dart/perl/, that renders logfiles in mind-bending ANSI terminal color, including date/time info and source file links. Ah, logging.

Check the last line of the logfile

In the event of error, check the last line of the logfile. Assuming you had verbose enough logging turned on, this often gives a clue as to the last thing the program was trying to do before it croaked, and can be useful in diagnosing out-of-memory errors and the like.