xrate is an open-source package for phylogenetic comparative genomics.
Its capabilities include maximum likelihood phylogeny, ancestral sequence reconstruction, alignment annotation and model estimation.
Technically, xrate is an interpreter for phylo-grammars, offering
The programs and related utilities are distributed in the DART package.
See below for details.
xrate's design was inspired by several other tools:
the generic dynamic programming engine Dynamite,
the phylogenetic hypothesis testing package HyPhy
and various specialized stochastic grammars including
A reasonably self-contained description of most of the theory behind phylo-grammars
can be found in Ian Holmes' graduate class lecture notes
(see e.g. chapter 3, section entitled "Evolutionary Hidden Markov models").
A working definition can be found on the phylo grammars page.
Briefly, xrate brings together in one package a lot of the capabilities of several other
software tools for molecular evolutionary analysis,
such as the "phylo-HMMs" and "phylo-grammars" of Siepel, Pedersen, Haussler et al,
the RNA folding grammars of Knudsen and Hein,
the protein secondary structure HMMs of Goldman, Thorne and Jones,
the 3-state Churchill-Felsenstein phylo-HMM
and the various models of HyPhy or PAML.
xrate allows a wide family of such grammars to be specified using XrateFormat, a simple format
based on LispSExpressions.
A grammar can be "trained" on a database of multiple alignments for measurement of evolutionary rates.
Alternatively (or additionally), the grammar can be used to annotate alignments with predicted features (genes, conserved elements etc.)
The program features an easy-to-use Unix command-line interface, and a detailed logging system.
Training (or, more precisely, maximum likelihood parameter estimation)
uses the ExpectationMaximization algorithm for phylogenetic grammars.
For speed, the implementation uses some tricks, such as eigenvector decomposition of the rate matrices.
Of course, ML (in general) and EM (in particular) are not perfect, and the training algorithm can get stuck at suboptimal solutions.
Such problems can often be avoided by careful choice of initial seed parameters and pseudocounts.
See known issues with DART for more info on this and other issues.
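The eigenvector trick mentioned above can be illustrated on the Jukes-Cantor model, whose decomposition is known in closed form. The following is a minimal Python sketch, not xrate's actual implementation; the function names and the parameter `alpha` (the substitution rate between any two distinct nucleotides) are assumptions made for illustration. It builds the transition matrix exp(Qt) from the two distinct eigenvalues of Q and compares the result with a truncated power series.

```python
import math


def jc_rate_matrix(alpha):
    """Jukes-Cantor rate matrix: rate alpha off the diagonal, -3*alpha on it."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def jc_transition_matrix(alpha, t):
    """P(t) = exp(Q t) via the spectral decomposition of Q.

    Q has eigenvalue 0 (eigenvector: all ones) and eigenvalue -4*alpha
    (multiplicity 3), so P(t) = J/4 + exp(-4*alpha*t) * (I - J/4),
    where J is the all-ones matrix and I the identity.
    """
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay   # probability of ending in the starting state
    diff = 0.25 - 0.25 * decay   # probability of ending in a given other state
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def expm_taylor(q, t, terms=60):
    """Reference value: truncated power series sum_k (Q t)^k / k!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[sum(term[i][m] * qt[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


if __name__ == "__main__":
    p = jc_transition_matrix(1.0, 0.5)
    ref = expm_taylor(jc_rate_matrix(1.0), 0.5)
    print(max(abs(p[i][j] - ref[i][j]) for i in range(4) for j in range(4)))
```

Once the eigenvalues and projectors are known, each new branch length costs only one scalar exponential, which is why the decomposition pays off when many branches share one rate matrix.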
- fast EM-based parameterisation of evolutionary grammars for annotation of multiple sequence alignments.
- maximum-likelihood analysis of multiple alignments and phylogenies using phylo-grammars.
- maximum a posteriori reconstruction of ancestral sequences, using point substitution models or phylo-grammars.
- estimation of expected transition counts and dwell times for continuous-time Markov chains on trees.
- utilities for working with & displaying annotated alignments & trees.
xrate is distributed as part of the DART package.
To install it, do the following:
- Download the DART package (see downloading dart for instructions on how to do this)
- Type cd dart; ./configure; make xrate (see building dart for more info)
- The xrate executable is created in the
- The example grammars are in the
xrate can auto-estimate trees if they're missing from the dataset. It is recommended to split the training into two steps for purposes of reproducibility/debugging:
- estimate trees for the training alignments and save to an intermediate file;
- do the actual training.
Step 1 requires a grammar containing a rate matrix over individual residues (e.g. DartGrammar:nullprot.eg for amino acids, DartGrammar:jukescantor.eg or DartGrammar:hky85.eg for nucleotides).
The Known Issues with DART page has lots of useful heuristic tips and guidelines.
One frequently asked question is "How much training data do I need?" Follow the link for a back-of-envelope calculation.
AJAX web application
XREI is a (no longer maintained) web interface to xrate.
It offers visualization of xrate models using BubblePlots and GraphViz state diagrams.
The general usage is:
xrate [ options ] ALIGNMENT_FILE
The alignment, which should be in Stockholm format, is annotated and printed to standard output.
The program can also be used as a filter:
cat ALIGNMENT_FILE | xrate [ options ]
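Both invocation styles can be driven from a script. The following Python sketch is an illustration only and is not part of DART: the helper names `build_xrate_command` and `run_xrate_filter` are hypothetical, and the only flags used (-g, -e, -t, -noa, -log) are the ones that appear in the examples on this page. It assumes an xrate executable is on the PATH.

```python
import subprocess


def build_xrate_command(grammar, alignment_file=None, train_to=None,
                        tree_grammar=None, annotate=True, log_level=None):
    """Assemble an xrate argument list from the flags shown on this page."""
    cmd = ["xrate", "-g", grammar]
    if tree_grammar is not None:
        cmd += ["-e", tree_grammar]       # grammar used for tree estimation
    if train_to is not None:
        cmd += ["-t", train_to]           # file that receives the trained grammar
    if not annotate:
        cmd.append("-noa")                # skip the annotation step
    if log_level is not None:
        cmd += ["-log", str(log_level)]
    if alignment_file is not None:
        cmd.append(alignment_file)        # omit to read the alignment from stdin
    return cmd


def run_xrate_filter(stockholm_text, grammar, **options):
    """Use xrate as a filter: Stockholm text in on stdin, annotated text out."""
    cmd = build_xrate_command(grammar, **options)
    done = subprocess.run(cmd, input=stockholm_text, capture_output=True,
                          text=True, check=True)
    return done.stdout


if __name__ == "__main__":
    # Only prints the command line; nothing is executed here.
    print(" ".join(build_xrate_command("grammar.eg", "align.stk")))
```

Keeping command construction separate from execution makes it possible to log or inspect the exact command line before a long training run is started.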
Some of the most commonly-used command-line options are:
| | print long help message including all command-line options
| -g | load xrate format grammar from specified file
| -t | train grammar (i.e. estimate parameters from data) and output to file
| -e | optimize tree using null model from specified grammar file
| | annotate the alignment using the most likely parse tree (actually, this is turned on by default)
| -noa | turn off the annotation step (saves a small amount of time if annotation is not required, e.g. when training the model)
| | report confidence levels (i.e. posterior probabilities) for the maximum-likelihood annotation
| | ancestral reconstruction: estimate most probable sequences at missing nodes
| -log N | print diagnostic log messages down to numeric level N (e.g. -log 5). Lower log levels are more verbose; 9 is the default (almost silent). See DartLogging
| -log TAG | print log messages with tag TAG (e.g. -log RATE_EM). This option may significantly slow things down, since it pulls in a lot of regexp code. See DartLogging
Note that many of these options have longer (possibly easier-to-remember) synonyms, e.g. --grammar instead of -g, or --train instead of -t.
For a complete list of options & their synonyms, type:
Useful logging directives
The following log tags are quite useful to monitor miscellaneous aspects of program execution,
such as long-running jobs, memory-intensive jobs or grammar preprocessing:
|| What it does
| Prints a log message before trying to allocate memory for dynamic programming matrices
| Displays progress during a dynamic programming matrix fill
| Displays progress during EM training
| Prints a log message whenever a file is included by the preprocessor
| Prints a log message for every element visited by the preprocessor in an iteration macro
Most log tags are undocumented; try grepping the source code for CLOG if you're keen.
Sequence of operations
The order of operations that occurs when the program runs is as follows:
- The alignment database is loaded into memory
- If a separate tree-estimation grammar was specified, it is used to fit phylogenetic trees:
  - First, missing trees are estimated by neighbor joining
  - Next, branch lengths of all trees are optimized using the EM algorithm
- Any macros in the grammar file are expanded
- If the "training" option was specified then the Inside-Outside (or Forward-Backward) and EM algorithms are used to train the grammar, which is then saved to a file
  - The expected counts calculated during the last round of EM are also saved to the trained grammar file
- For each alignment, the following annotation steps are executed:
  - If the "annotation" option was specified then the CYK (or Viterbi) algorithm is used to find the maximum-likelihood parse for each alignment in the database
    - The "annotation" option is actually the default; to turn off this behavior, you need the --noannotate option (-noa for short)
  - The ML parse for each alignment is used to annotate predicted features in the alignment
  - If the "confidence" option was specified, then the posterior probabilities of the maximum-likelihood feature annotations are calculated using the Inside-Outside (or Forward-Backward) algorithm
  - If the "posterior probability" option was specified, then the posterior probability of every possible feature annotation is calculated using the Inside-Outside (or Forward-Backward) algorithm
  - If the "sum score" option was specified, then the likelihood of each alignment (summed over parses) is calculated using the Inside (or Forward) algorithm
- Finally, the alignment is printed to standard output, together with any scores, feature annotations or posterior probabilities that were calculated
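For the HMM case, the "sum score" and "confidence" steps in the list above correspond to the Forward and Forward-Backward recursions. The following Python sketch shows them on an ordinary HMM and is not xrate's code. As a simplifying assumption, the likelihood of each alignment column under each state is passed in as a plain number (`emit_lik`); in a phylo-HMM those numbers would come from evaluating the column on the tree. No rescaling is done, so the sketch is only suitable for short inputs.

```python
def forward_backward(init, trans, emit_lik):
    """Return (total likelihood, per-column posterior state probabilities).

    init[s]        : probability of starting in state s
    trans[r][s]    : probability of moving from state r to state s
    emit_lik[t][s] : likelihood of column t given state s (assumed precomputed)
    """
    n_states = len(init)
    length = len(emit_lik)
    fwd = [[0.0] * n_states for _ in range(length)]
    bwd = [[0.0] * n_states for _ in range(length)]

    # Forward pass: fwd[t][s] = P(columns 0..t, state at t is s).
    for s in range(n_states):
        fwd[0][s] = init[s] * emit_lik[0][s]
    for t in range(1, length):
        for s in range(n_states):
            incoming = sum(fwd[t - 1][r] * trans[r][s] for r in range(n_states))
            fwd[t][s] = emit_lik[t][s] * incoming

    # Backward pass: bwd[t][r] = P(columns t+1.. | state at t is r).
    for s in range(n_states):
        bwd[length - 1][s] = 1.0
    for t in range(length - 2, -1, -1):
        for r in range(n_states):
            bwd[t][r] = sum(trans[r][s] * emit_lik[t + 1][s] * bwd[t + 1][s]
                            for s in range(n_states))

    total = sum(fwd[length - 1])  # likelihood summed over all state paths
    posterior = [[fwd[t][s] * bwd[t][s] / total for s in range(n_states)]
                 for t in range(length)]
    return total, posterior


if __name__ == "__main__":
    total, posterior = forward_backward(
        [0.5, 0.5],
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.3, 0.1], [0.2, 0.4], [0.05, 0.5]])
    print(total)
    for row in posterior:
        print(row)
```

The first return value plays the role of the summed likelihood, and each row of the second gives the posterior probability of every state at one column.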
Simple annotation example
xrate -g grammar.eg align.stk
This does the following:
- Load alignment from StockholmFormat file “align.stk”.
  - This file is assumed to include a phylogenetic tree. If the tree is not present and needs to be estimated on-the-fly, a point substitution model should be specified using the -e option (see example below).
- Load grammar from XrateFormat file “grammar.eg”.
- Estimate tree by neighbor-joining.
- Do CYK algorithm (or Viterbi, depending on whether grammar is an HMM or SCFG) to find most likely parse.
- Annotate output alignment using most likely parse.
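The "most likely parse" step can be sketched for the HMM case with the Viterbi algorithm. This is a generic log-space version in Python, not xrate's implementation; it makes the simplifying assumption that the likelihood of each alignment column under each state is already available in `emit_lik`. The CYK algorithm used for SCFGs follows the same idea over subsequences rather than prefixes and is not shown.

```python
import math

NEG_INF = float("-inf")


def safe_log(x):
    """Natural log that maps zero probability to minus infinity."""
    return math.log(x) if x > 0.0 else NEG_INF


def viterbi(init, trans, emit_lik):
    """Return (most likely state path, its log-probability)."""
    n_states = len(init)
    length = len(emit_lik)
    score = [[NEG_INF] * n_states for _ in range(length)]
    back = [[0] * n_states for _ in range(length)]

    for s in range(n_states):
        score[0][s] = safe_log(init[s]) + safe_log(emit_lik[0][s])

    for t in range(1, length):
        for s in range(n_states):
            # Best predecessor state for being in state s at column t.
            best_prev = max(range(n_states),
                            key=lambda r: score[t - 1][r] + safe_log(trans[r][s]))
            back[t][s] = best_prev
            score[t][s] = (score[t - 1][best_prev]
                           + safe_log(trans[best_prev][s])
                           + safe_log(emit_lik[t][s]))

    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[length - 1][s])
    path = [last]
    for t in range(length - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[length - 1][last]


if __name__ == "__main__":
    path, log_prob = viterbi(
        [0.5, 0.5],
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.3, 0.1], [0.2, 0.4], [0.05, 0.5]])
    print(path, log_prob)
```

The returned path assigns one state to each column, which is the information an annotation line in the output alignment would carry.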
Simple training example
xrate -g grammar2.eg -e nullmodel.eg -t trained.eg -noa -log 5 align2.stk
This does the following:
- Load alignment from file “align2.stk”.
- Load grammar from file “grammar2.eg”.
- Estimate phylogenetic tree using substitution model in grammar file "nullmodel.eg", if there isn’t a tree annotated to the alignment already (first by neighbor-joining, then EM on the branch lengths).
- Train the grammar by EM, using the Inside-Outside algorithm (if grammar is an SCFG) or Forward-Backward (if it's an HMM).
- Save trained grammar to file “trained.eg”.
- The -log 5 option means that log messages of level 5 and higher will be displayed.
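The neighbor-joining step that produces the initial tree can be sketched as follows. This is a textbook version of the algorithm in Python, under the assumption that pairwise distances between the sequences have already been computed; it is not taken from the DART source, and the function name, internal node labels and edge-list output are hypothetical. The subsequent EM optimization of branch lengths is not shown.

```python
def neighbor_joining(names, matrix):
    """Return the unrooted NJ tree as a list of (node, neighbor, length) edges.

    names  : list of at least two taxon names
    matrix : symmetric distance matrix, matrix[i][j] for names[i], names[j]
    """
    # Mutable dict-of-dicts copy of the distances between active nodes.
    d = {a: {b: matrix[i][j] for j, b in enumerate(names) if b != a}
         for i, a in enumerate(names)}
    edges = []
    next_id = 0
    while len(d) > 2:
        n = len(d)
        r = {a: sum(d[a].values()) for a in d}   # row sums
        nodes = list(d)
        # Choose the pair minimising Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        dij = d[i][j]
        length_i = 0.5 * dij + (r[i] - r[j]) / (2.0 * (n - 2))
        length_j = dij - length_i
        u = "node%d" % next_id                   # label for the new internal node
        next_id += 1
        edges.append((i, u, length_i))
        edges.append((j, u, length_j))
        # Distances from the new node to every remaining node.
        new_row = {k: 0.5 * (d[i][k] + d[j][k] - dij) for k in d if k not in (i, j)}
        del d[i]
        del d[j]
        for k in d:
            del d[k][i]
            del d[k][j]
            d[k][u] = new_row[k]
        d[u] = new_row
    a, b = list(d)
    edges.append((a, b, d[a][b]))                # join the last two nodes
    return edges


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    distances = [[0, 5, 9, 9, 8],
                 [5, 0, 10, 10, 9],
                 [9, 10, 0, 8, 7],
                 [9, 10, 8, 0, 3],
                 [8, 9, 7, 3, 0]]
    for edge in neighbor_joining(taxa, distances):
        print(edge)
```

The distance matrix in the demo is additive, so the path lengths in the resulting tree reproduce the input distances exactly; with real alignment-derived distances the tree is only an approximation, which is one reason to refine the branch lengths afterwards.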
Estimating an amino acid matrix
The above examples are rather abstract. Here's something a bit more concrete: how to estimate an amino acid matrix.
You first need to get your alignment data into a Stockholm format file, e.g. my_protein_alignment.stock (see Stockholm tools for file format conversion utilities).
Then do something like this:
% cd dart
% xrate my_protein_alignment.stock -e grammars/nullprot.eg -g grammars/nullprot.eg -t my_amino_acid_matrix.eg -log 6
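Before running a command like the one above, it can be useful to check what the Stockholm file actually contains, in particular whether it already carries a tree. The following Python sketch is a minimal reader written for illustration; it is not one of the DART utilities, and the function name is hypothetical. It assumes that the tree, if present, is stored in Newick format on `#=GF NH` lines, and it ignores all other markup lines.

```python
def read_stockholm(text):
    """Parse the first alignment in a Stockholm text.

    Returns (sequences, gf): sequences maps name -> aligned residues,
    gf maps each #=GF tag to the list of values seen for it.
    """
    sequences = {}
    gf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":                      # end of this alignment
            break
        if line.startswith("#=GF"):
            parts = line.split(None, 2)       # "#=GF", tag, value
            if len(parts) == 3:
                gf.setdefault(parts[1], []).append(parts[2])
        elif line.startswith("#"):
            continue                          # other markup is ignored here
        else:
            fields = line.split()
            if len(fields) == 2:
                name, residues = fields
                # Interleaved files repeat names; concatenate the blocks.
                sequences[name] = sequences.get(name, "") + residues
    return sequences, gf


if __name__ == "__main__":
    example = """# STOCKHOLM 1.0
#=GF NH (human:0.1,mouse:0.2);
human ACGT
mouse AC-T
//
"""
    seqs, gf = read_stockholm(example)
    print(sorted(seqs), "tree present:", "NH" in gf)
```

If no `NH` entry is found, the tree would have to be estimated, which is the situation the -e option in the command above is meant for.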
Complex ncRNA gene prediction example
See XratePipeline (documentation is pretty rough right now).
Interfaces to scripting languages
A set of (barely documented) Perl modules for interfacing to dart programs exists in the
The PhyloGram Perl module can be used to construct complex grammars.
(Of course, what's really needed are some Lisp constructs to do this...! The XrateMacros are a start.)
AndreasHeger has written a set of Python modules in the
File format specifications
xrate grammar files
Alignments, phylogenies & annotations
Background info on phylo-grammars
Several animations of phylo-grammars can be found on the PhyloFilm page.
The xrate program itself is described in the following paper:
Other relevant papers:
The simgram program generates sample alignments given an xrate format phylo-grammar file
and a Newick format phylogenetic tree.
To build simgram, download DART and then type
cd dart; make simgram
or
cd dart; make all
The compiled binary lands in
For a list of command-line options type
Not quite all phylo-grammars that can be described by an xrate-format file can be simulated by simgram.
In particular, macro expansion is limited (alignment- and tree-dependent macros are not expanded),
and neither indels nor hidden classes in substitution models are supported.
The visualizeRates.pl script renders BubblePlots from xrate format grammar files.
The dartlog.pl script uses ANSI terminal color to visualize the information in dart logfiles.
The colorstock.pl script uses ANSI terminal color to visualize basepairing patterns
in Stockholm alignments of RNA (or DNA) sequences that have been annotated with secondary structures.
copyparams.pl is a short script that copies selected parameter values from one xrate model file to another.
It is useful for modularized training of complex models.
Other Perl scripts
See the following pages for more utilities relating to xrate and its various file formats:
Bug reports and feature requests
Feature requests belong in the xgram wish list.
Unless, that is, they relate solely to the file format, in which case they belong in the xgram format wish list.
Bug reports are welcomed!
The following links may be helpful:
This is an evolving description.
Please feel free to add questions/comments (or email them to IanHolmes),
or to create your own tutorials as separate pages on this wiki.