click on the Biowiki logo to go to homepage
Edit Raw Print
Links Diffs RSS
About Stats Recent
Research Teaching | Blog
Main | JBrowse | TWiki
Biowiki > Main > Xgram Software > KnownIssuesWithDART

Search

Advanced search...

Topics

PageRank Checker

Known issues with DART

This page is for bug reports, anecdotal usage information, help, community wisdom, complaints, whatever; as long as it's about DART. See also DART bug reporting for guidelines on the cool way to report a bug.

EM convergence

(See also tips for codon matrices)

The DART programs xrate, xfold, xgram, etc, all use a version of the EM algorithm to estimate evolutionary rates (Holmes I, Rubin GM.  An expectation maximization algorithm for training hidden substitution models.  J Mol Biol. 2002 Apr 12;317(5):753-64.) and/or tree branch lengths (which are reciprocal to rates).

The EM algorithm is notoriously prone to getting stuck in local minima. For rate-EM, this becomes a particular problem when you "seed" the EM algorithm with a grammar containing fast rates (or long branches).

Our hypothesis is that this is because longer branch lengths generate more "noise". On a long branch, the most likely scenario is that there were lots of substitutions on it, even if the ancestor and descendant are in the same state. However, on a short branch, the expected number of substitutions that occurred is more tightly coupled to the observed ancestral and descendant states.

Another way to visualize this is to plot the Jukes-Cantor log-likelihood for an alignment with X matches and Y mismatches (with X>0 and Y>0) (see Durbin et. al. Section 8.3 for a calculation of the likelihood of two sequences under the Jukes-Cantor model). The slope is very steep for t<tmax, and very shallow for t>tmax. So for a greedy algorithm that gives up when the slope gets too shallow, you get more accurate results by starting with smaller t. Here's an example with X=7 and Y=3:

xgraph.png

This may be an oversimplification since EM is not strictly the same as gradient ascent.

Generally, and this is an empirical observation, the EM algorithm converges much better if you start with short branch lengths and low rates and allow the EM algorithm to increase them, rather than starting with high rates and hoping the algorithm will reduce them.

Inconsistent trees

In order to estimate rates and/or annotate alignments, xgram (+xfold+xprot) first needs to estimate a tree. To do this it needs a substitution model. If a grammar is loaded from a file, then xgram attempts to find an appropriate substitution model in that grammar. If, however, xfold is invoked with a "preset" grammar, then a special logic kicks in (specifically, xfold attempts to do a quick fit of an HKY85 substitution model, then uses this to estimate branch lengths for neighbor-joining, etc). So if you estimate a tree with xfold, then save the grammar to a file, re-load it, and re-estimate the tree, you might get different results (since the latter tree will not be based on the HKY85 model, but rather whatever model was in the grammar). This logic is counter-intuitive and will soon be fixed.

Weird numbers in my tree

"After running xfold -b, all the branch lengths in my tree were 0.02!!!"

This is a known bug, but not as serious as you might think. 0.02 is the lowest value that xfold will round a branch length to. (It should be 0.01, but there's a bug somewhere that rounds it up.)

If you see branch lengths of 0.02, like the tree below, it usually means that the EM algorithm would have liked to set those branch lengths closer to zero (due to highly similar sequences), but xfold wouldn't let it.

#=GF NH ((((((DroMel:0.02,DroSim:0.02):0.02,DroYak:0.02):0.02,DroEre:0.02):0.02,
            DroAna:0.02):0.02,DroPse:0.02):0.02,(DroMoj:0.02,DroVir:0.02):0.02);

Clearly the artificial limit of 0.02 needs to be smaller, or at least user-specifiable.... it's on the to-do list. wink

Update, May 2006: this was in fact due to a bug in Dart's Brent optimization routine in DartSrc:util/maximise.h . It is now fixed - Ian Holmes

Logging is slooooooow

Yes, due to the way logging is implemented (with fancy wildcards and such), if you specify any log tags on the command line, the code executes a (probably unnecessary) regular expression every time it passes a logging checkpoint. So, you probably don't want to be using log tags for your high-throughput jobs.

However, setting a numerical log level does not execute these regexps. So using a numerical log level (like -log 5) as opposed to a tag (like -log TREE_EM) should be much faster.

Memory allocation error

When working with SCFGs, xrate (xgram, xfold, etc.) allows you to specify the fold envelope via the --length option, which is the maximum length of non-suffix subsequences (i.e. subseqs [i,j] of seq [1,L] where i != 1 and j !=L) considered by the Inside, Outside, and CYK algorithms. This constrains the size of their dynamic programming matrices. So, you will get a memory/runtime improvement, at the expense of a possibly lower quality parse. Setting --length to -1 (default setting) will remove the constraint.

If the fold envelope is too large, you may encounter a memory allocation error such as:

terminate called after throwing an instance of 'std::bad_alloc'

If you turn on logging to a sufficiently low level, (eg, -log 3), you will see this in the log file:

Initialising fold envelope: N columns, max subseq length = -1

Here N is the number of alignment columns. The fold envelope takes O(N^3) space. More specifically, the memory usage is O(NS^2) where S is the maximum subsequence length considered by the SCFG parsing algorithms. In this case, "max subseq length = -1" means that there is no subseq length limit, so S = N and the memory usage is O(N^3).

You will frequently encounter memory limitations if you're working with data of appreciable size. The moral is that it's good to explicitly set the maximum subseq length with the --length option. This is a principled approach: it doesn't make sense to consider subsequences longer than the longest biological feature you're searching for. For example, if you're doing a genome scan for ncRNAs, you would set --length to the length of the scan window, which corresponds to the longest ncRNA the program can successfully search for.

Community wisdom

How much training data do I need?

Follow the link for a back-of-envelope calculation.

Do "guided" training

As has been noted, EM is not the most reliable of algorithms, and can get stuck in local optima. MCMC is of course the right solution to this, for all sorts of reasons (accurate estimates of error being chief amongst them). However, if you're desperately in need of a quick maximum likelihood answer, there are some tricks you can use to avoid local maxima.

Guided training, as described in our 2002 EM paper, is one such trick. The basic idea is to start by training a restricted parameterisation of the model you are interested in. When that is trained, you incrementally add more parameters to the model, then re-train.

See guided training page for more detail.

Start with a low seed

As noted elsewhere on this page, the rate-EM algorithm empirically works better if you start with a seed matrix that has low rates (or a seed tree that has short branches), and allow the EM algorithm to increase them as necessary, rather than starting with fast rates (or long branches) and hoping that EM will trim them down. (Generally, it won't, because fast rates generate less accurate estimates for the number of events that have occurred, and these estimates are the basis for the EM algorithm.)

Inspect likelihoods

When trying to adjudicate between two alleged maximum likelihood estimates the first thing to check is the likelihood. Most DART programs print out likelihoods, or can be forced to do so using log messages.

Use logging

DART programs have a wild variety of log messages that can be turned on or off using the "-log" option. Use "-help" for a list of permissible log tags, or scan the source code. You can also pass a numerical (integer) argument to "-log": lower numbers print more information. The range is from 9 down to about -3. The default (9) is silent; you start getting useful info around level 6.

Most debugging messages that have ever been put in there are still accessible via some log tag or other. Some of them, such as the posterior expectation counts for EM, are really really useful. (For reference these are -log RATE_EM for estimating rates and -log TREE_EM for estimating branch lengths of trees.)

If you really get into the logging, there is a neat perl script, dart/perl/dartlog.pl, that renders logfiles in mind-bending ANSI terminal color, including date/time info and source file links. Ah, logging.

Check the last line of the logfile

In the event of error, check the last line of the logfile. Assuming you had verbose enough logging turned on, this often gives a clue as to the last thing the program was trying to do before it croaked, and can be useful in diagnosing out-of-memory errors and the like.

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...