Xrate Software
Latest revision as of 22:43, 1 January 2017
xrate is an open-source package for phylogenetic comparative genomics. Its capabilities include maximum likelihood phylogeny, ancestral sequence reconstruction, alignment annotation and model estimation.
Technically, xrate is an interpreter for phylo-grammars, offering
- fast EM-based parameterisation of evolutionary grammars for annotation of multiple sequence alignments.
- maximum-likelihood analysis of multiple alignments and phylogenies using phylo-grammars.
- maximum a posteriori reconstruction of ancestral sequences, using point substitution models or phylo-grammars.
- estimation of expected transition counts and dwell times for continuous-time Markov chains on trees.
- utilities for working with & displaying annotated alignments & trees.
The programs and related utilities are distributed in the DART package. See below for details.
xrate's design was inspired by several other tools: the generic dynamic programming engine Dynamite, the phylogenetic hypothesis testing package HyPhy and various specialized stochastic grammars including PFold, HMMer, ExoniPhy, PhastCons, & RNA-Decoder.
A reasonably self-contained description of most of the theory behind phylo-grammars can be found in Ian Holmes' graduate class lecture notes (see e.g. chapter 3, section entitled "Evolutionary Hidden Markov models"). A working definition can be found on the phylo grammars page.
Briefly, xrate brings together in one package many of the capabilities of several other software tools for molecular evolutionary analysis, such as the "phylo-HMMs" and "phylo-grammars" of Siepel, Pedersen, Haussler et al., the RNA folding grammars of Knudsen and Hein, the protein secondary structure HMMs of Goldman, Thorne and Jones, the 3-state HMM of Felsenstein & Churchill (A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 1996;13:93-104), and the various models of HyPhy or PAML.
xrate allows a wide family of such grammars to be specified using Xrate Format, a simple format based on Lisp S-expressions. A grammar can be "trained" on a database of multiple alignments for measurement of evolutionary rates. Alternatively (or additionally), the grammar can be used to annotate alignments with predicted features (genes, conserved elements, etc.). The program features an easy-to-use Unix command-line interface and a detailed logging system.
Training (or, more precisely, maximum likelihood parameter estimation) uses the Expectation Maximization algorithm for phylogenetic grammars. For speed, the implementation uses some tricks, such as eigenvector decomposition of the rate matrices.
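For reference, the standard linear-algebra identity behind this kind of trick can be written as follows. This is a general fact about diagonalizable matrices, not a description of xrate's internal code; the symbols R (rate matrix), U (matrix of eigenvectors), λ_k (eigenvalues), P(t) (transition probability matrix) and t (branch length) are notation introduced here for illustration.

```latex
R = U \,\mathrm{diag}(\lambda_1,\dots,\lambda_n)\, U^{-1}
\quad\Longrightarrow\quad
P(t) = e^{Rt} = U \,\mathrm{diag}\!\left(e^{\lambda_1 t},\dots,e^{\lambda_n t}\right) U^{-1}
```

Once U and the λ_k have been computed for a given rate matrix, P(t) for each new branch length t needs only n scalar exponentials and a matrix product, rather than a fresh matrix exponential.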
Of course, ML (in general) and EM (in particular) are not perfect, and the training algorithm can get stuck at suboptimal solutions. Such problems can often be avoided by careful choice of initial seed parameters and pseudocounts. See known issues with DART for more info on this and other issues.
xrate is distributed as part of the DART package. To install it, do the following:
- Download the DART package (see downloading dart for instructions on how to do this)
- Type cd dart; ./configure; make xrate (see building dart for more info)
- The xrate executable is created in the dart/bin subdirectory.
- The example grammars are in the dart/grammars subdirectory.
xrate can auto-estimate trees if they're missing from the dataset. It is recommended to split the training into two steps for purposes of reproducibility/debugging:
- estimate trees for the training alignments and save to an intermediate file;
- do the actual training.
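The two-step split above might be scripted along the following lines. This is only a sketch: it uses the -g, -e, -t and -noa options described on this page, and it assumes that the alignment printed by the first step carries the estimated trees, so that redirecting standard output gives a usable intermediate file. The helper names and the file name with_trees.stk are hypothetical.

```python
import shlex

def two_step_commands(alignments, tree_grammar, grammar, trained_out,
                      with_trees="with_trees.stk"):
    """Build the two xrate invocations as (argv, stdout_path) pairs."""
    # Step 1: estimate missing trees with the -e grammar, skip annotation
    # (-noa), and capture the alignment printed on standard output.
    # Assumption: that output includes the estimated trees.
    step1 = (["xrate", "-g", grammar, "-e", tree_grammar, "-noa", alignments],
             with_trees)
    # Step 2: train (-t) on the saved alignments, which now carry trees.
    step2 = (["xrate", "-g", grammar, "-t", trained_out, "-noa", with_trees],
             None)
    return [step1, step2]

def as_shell(argv, stdout_path=None):
    """Render an argv list as a copy-pastable shell command line."""
    line = " ".join(shlex.quote(a) for a in argv)
    if stdout_path is not None:
        line += " > " + shlex.quote(stdout_path)
    return line

# Print the two command lines for a hypothetical set of input files.
for argv, out in two_step_commands("align2.stk", "nullmodel.eg",
                                   "grammar2.eg", "trained.eg"):
    print(as_shell(argv, out))
```

Keeping the intermediate file around means the training step can be re-run (or debugged) without re-estimating the trees each time.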
The Known Issues with DART page has lots of useful heuristic tips and guidelines.
One frequently asked question is "How much training data do I need?" Follow the link for a back-of-envelope calculation.
AJAX web application
Command-line usage
The general usage is:
xrate [ options ] ALIGNMENT_FILE
The alignment, which should be in Stockholm format, is annotated and printed to standard output.
The program can also be used as a filter:
cat ALIGNMENT_FILE | xrate [ options ]
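Since both input and output are Stockholm files, a small reader is handy for post-processing. The following is a minimal sketch that handles only the basics of the format (the "# STOCKHOLM 1.0" header, "#=GF" per-file lines such as the "NH" tree line, sequence rows, and the "//" terminator); it is not DART's own parser, and the function name and toy alignment are made up for illustration.

```python
import io

def read_stockholm(handle):
    """Parse one alignment; return (seqs, gf).

    seqs maps sequence name -> aligned row; gf maps each '#=GF' tag
    (e.g. 'NH' for a tree) to the list of values seen for that tag.
    """
    seqs, gf = {}, {}
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("# STOCKHOLM"):
            continue                      # blank line or format header
        if line.strip() == "//":
            break                         # end-of-alignment marker
        if line.startswith("#=GF"):
            _, tag, value = line.split(None, 2)
            gf.setdefault(tag, []).append(value)
        elif line.startswith("#"):
            continue                      # other markup (#=GC, #=GS, #=GR) is ignored here
        else:
            name, row = line.split(None, 1)
            # Rows for one sequence may be split across several blocks.
            seqs[name] = seqs.get(name, "") + row.strip()
    return seqs, gf

example = """# STOCKHOLM 1.0
#=GF NH (human:0.1,mouse:0.3);
human  ACGU-A
mouse  ACGUCA
//
"""
seqs, gf = read_stockholm(io.StringIO(example))
print(seqs)
print(gf["NH"])
```

To read xrate's output in filter mode, the same function could be given sys.stdin instead of the StringIO object.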
Some of the most commonly-used command-line options are:
|-h||print long help message including all command-line options|
|-g FILE||load xrate format grammar from specified file|
|-t FILE||train grammar (i.e. estimate parameters from data) and output to file|
|-e FILE||optimize tree using null model from specified grammar file|
|-a||annotate the alignment using the most likely parse tree (actually, this is turned on by default)|
|-noa||turn off the annotation step (saves a small amount of time if annotation is not required, e.g. when training the model)|
|-c||report confidence levels (i.e. posterior probabilities) for the maximum-likelihood annotation|
|-ar||ancestral reconstruction: estimate most probable sequences at missing nodes|
|-log N||print diagnostic log messages down to numeric level N (e.g. -log 5). Lower log levels are more verbose; 9 is the default (almost silent). See Dart Logging|
|-log TAG||print log messages with tag TAG (e.g. -log RATE_EM). This option may significantly slow things down, since it pulls in a lot of regexp code. See Dart Logging|
Note that many of these options have longer (possibly easier-to-remember) synonyms. For example, --grammar instead of -g, or --train instead of -t.
For a complete list of options & their synonyms, type:
xrate -h
Useful logging directives
The following log tags are quite useful to monitor miscellaneous aspects of program execution, such as long-running jobs, memory-intensive jobs or grammar preprocessing:
|Option||What it does|
|-log ALLOC||Prints a log message before trying to allocate memory for dynamic programming matrices|
|-log ECFGDP||Displays progress during a dynamic programming matrix fill|
|-log ECFG_EM||Displays progress during EM training|
|-log SEXPR_INCLUDE||Prints a log message whenever a file is included by the preprocessor|
|-log SEXPR_EXPAND||Prints a log message for every element visited by the preprocessor in an iteration macro|
Most log tags are undocumented... try grepping the source code for CTAG or CLOG if you're keen :-)
Sequence of operations
When the program runs, operations occur in the following order:
- The alignment database is loaded into memory
- If a separate tree-estimation grammar was specified, it is used to fit phylogenetic trees:
- First, missing trees are estimated by neighbor joining
- Next, branch lengths of all trees are optimized using the EM algorithm
- Any macros in the grammar file are expanded
- If the "training" option was specified then the Inside-Outside (or Forward-Backward) and EM algorithms are used to train the grammar, which is then saved to a file
- The expected counts calculated during the last round of EM are also saved to the trained grammar file
- For each alignment, the following annotation steps are executed:
- If the "annotation" option was specified then the CYK (or Viterbi) algorithm is used to find the maximum-likelihood parse for each alignment in the database
- The "annotation" option is actually the default; to turn off this behavior, you need the --noannotate option (-noa for short)
- The ML parse for each alignment is used to annotate predicted features in the alignment
- If the "confidence" option was specified, then the posterior probabilities of the maximum-likelihood feature annotations are calculated using the Inside-Outside (or Forward-Backward) algorithm
- If the "posterior probability" option was specified, then the posterior probability of every possible feature annotation is calculated using the Inside-Outside (or Forward-Backward) algorithm
- If the "sum score" option was specified, then the likelihood of each alignment (summed over parses) is calculated using the Inside (or Forward) algorithm
- Finally, the alignment is printed to standard output, together with any scores, feature annotations or posterior probabilities that were calculated
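To illustrate the neighbor-joining step mentioned in the tree-fitting stage above, here is a sketch of the textbook algorithm. It is not xrate's implementation: it takes a ready-made distance matrix as input (how distances are derived from an alignment is not shown), and the function name, the internal node labels and the toy distances are all made up for illustration.

```python
def neighbor_joining(names, dist):
    """Textbook neighbor joining.

    names: list of leaf labels; dist: dict of dicts of pairwise distances.
    Returns the unrooted tree as a list of (node, neighbor, length) edges.
    """
    # Working copy of the distances between currently active nodes.
    d = {a: {b: dist[a][b] for b in names if b != a} for a in names}
    edges = []
    next_id = 0
    while len(d) > 2:
        n = len(d)
        ids = list(d)
        r = {a: sum(d[a].values()) for a in ids}   # row sums
        # Join the pair minimising Q(a,b) = (n-2) d(a,b) - r(a) - r(b).
        _, a, b = min(
            ((n - 2) * d[x][y] - r[x] - r[y], x, y)
            for i, x in enumerate(ids) for y in ids[i + 1:]
        )
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"node{next_id}"     # hypothetical label for the new internal node
        next_id += 1
        edges += [(a, new, len_a), (b, new, len_b)]
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in ids:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                del d[c][a], d[c][b]
        del d[a], d[b]
    a, b = d                       # two nodes left: connect them directly
    edges.append((a, b, d[a][b]))
    return edges

names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
dist = {a: dict(zip(names, row)) for a, row in zip(names, rows)}
for edge in neighbor_joining(names, dist):
    print(edge)
```

In the sequence above, the topology and branch lengths from neighbor joining are only a starting point; the branch lengths are then refined by EM.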
Simple annotation example
xrate -g grammar.eg align.stk
This does the following:
- Load alignment from Stockholm Format file “align.stk”.
- This file is assumed to include a phylogenetic tree. If the tree is not present and needs to be estimated on-the-fly, a point substitution model should be specified using the -e option (see example below).
- Load grammar from Xrate Format file “grammar.eg”.
- Estimate tree by neighbor-joining.
- Run the CYK algorithm (or Viterbi, depending on whether the grammar is an SCFG or an HMM) to find the most likely parse.
- Annotate output alignment using most likely parse.
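For the HMM case, the "most likely parse" step above is the Viterbi algorithm. The sketch below is a generic log-space Viterbi, not xrate's code. It assumes the per-column log-likelihoods under each hidden state have already been computed (in a phylo-HMM these would come from the substitution model and tree, which is not shown), and the two states and all numbers are invented for illustration.

```python
import math

def viterbi(col_loglik, log_init, log_trans):
    """Most likely state path through an HMM, computed in log space.

    col_loglik[t][s]: log-likelihood of alignment column t under state s.
    log_init[s]: log initial probability; log_trans[r][s]: log P(r -> s).
    Returns (path, log probability of the best path).
    """
    states = range(len(log_init))
    score = [log_init[s] + col_loglik[0][s] for s in states]
    back = []                                  # back-pointers, one list per column
    for t in range(1, len(col_loglik)):
        prev, score, ptr = score, [], []
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + col_loglik[t][s])
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1], max(score)

lg = math.log
log_init = [lg(0.5), lg(0.5)]
log_trans = [[lg(0.9), lg(0.1)],               # state 0 mostly stays in state 0
             [lg(0.1), lg(0.9)]]               # state 1 mostly stays in state 1
# Column 3 weakly favours state 1, the other columns strongly favour state 0.
cols = [[lg(0.9), lg(0.1)], [lg(0.9), lg(0.1)], [lg(0.4), lg(0.6)],
        [lg(0.9), lg(0.1)], [lg(0.9), lg(0.1)]]
path, logp = viterbi(cols, log_init, log_trans)
print(path, logp)
```

In this toy input the single column that weakly favours state 1 is not enough to pay for two state switches, so the best path stays in state 0 throughout; this smoothing is what turns column-by-column evidence into contiguous annotated features.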
Simple training example
xrate -g grammar2.eg -e nullmodel.eg -t trained.eg -noa -log 5 align2.stk
This does the following:
- Load alignment from file “align2.stk”.
- Load grammar from file “grammar2.eg”.
- Estimate phylogenetic tree using substitution model in grammar file "nullmodel.eg", if there isn’t a tree annotated to the alignment already (first by neighbor-joining, then EM on the branch lengths).
- Train the grammar by EM, using the Inside-Outside algorithm (if grammar is an SCFG) or Forward-Backward (if it's an HMM).
- Save trained grammar to file “trained.eg”.
The -log 5 option implies that log messages of level 5 and higher will be displayed.
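For an HMM grammar, the Forward-Backward computation used in training (and for posterior probabilities and summed likelihoods) can be sketched as below. This is a generic illustration, not xrate's code: it works with plain probabilities, so it is only suitable for short inputs (real implementations guard against underflow), it has no explicit end state, and the function name and toy numbers are assumptions made for the example.

```python
def forward_backward(col_lik, init, trans):
    """Forward-Backward for an HMM over alignment columns.

    col_lik[t][s]: likelihood of column t under state s.
    Returns (total, post): the likelihood summed over all state paths,
    and post[t][s], the posterior probability of state s at column t.
    """
    T, S = len(col_lik), len(init)
    # Forward pass: fwd[t][s] = P(columns 0..t, state at t is s).
    fwd = [[init[s] * col_lik[0][s] for s in range(S)]]
    for t in range(1, T):
        fwd.append([col_lik[t][s] * sum(fwd[-1][r] * trans[r][s] for r in range(S))
                    for s in range(S)])
    # Backward pass: bwd[t][s] = P(columns t+1.. | state at t is s).
    bwd = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        bwd[t] = [sum(trans[s][r] * col_lik[t + 1][r] * bwd[t + 1][r] for r in range(S))
                  for s in range(S)]
    total = sum(fwd[-1])
    post = [[fwd[t][s] * bwd[t][s] / total for s in range(S)] for t in range(T)]
    return total, post

init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
cols = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
total, post = forward_backward(cols, init, trans)
print(total)
for row in post:
    print(row)
```

In EM, posteriors of this kind are what get accumulated into expected counts during the E-step; for SCFG grammars the Inside-Outside algorithm plays the same role.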
Estimating an amino acid matrix
The above examples are rather abstract. Here's something a bit more concrete: how to estimate an amino acid matrix.
Given a protein alignment in Stockholm format (my_protein_alignment.stock in this example), do something like this:
cd dart
xrate my_protein_alignment.stock -e grammars/nullprot.eg -g grammars/nullprot.eg -t my_amino_acid_matrix.eg -log 6
Complex ncRNA gene prediction example
See Xrate Pipeline (documentation is pretty rough right now).
Interfaces to scripting languages
A set of (barely documented) Perl modules for interfacing to dart programs exists in the dart/perl directory.
Andreas Heger has written a set of Python modules in the dart/python directory.
File format specifications
xrate grammar files
- Grammar file format reference: xrate format
Alignments, phylogenies & annotations
- Alignment files: Stockholm format
Background info on phylo-grammars
- BioE241 notes (UC Berkeley Bioengineering graduate class)
- Biowiki page on phylo grammars
- Siepel & Haussler paper on phylo-HMMs
- UCSC-June7-2006.mov: presentation given at UCSC's CBSE, 6/7/2006 (19 MB Quicktime movie)
- xgram.ppt: xgram presentation from lab meeting on 3/10/2006 (3 MB powerpoint presentation)
Several animations of phylo-grammars can be found on the PhyloFilm page.
The xrate program itself is described in the following paper:
- Klosterman et al.: XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics 2006;7:428.
- Supplementary material. Includes description of EM algorithm for irreversible substitution models
Other relevant papers:
- Holmes & Rubin: An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 2002;317:753-64.
To build simgram, download DART and then type cd dart; make simgram (or just cd dart; make all). The compiled binary lands in dart/bin.
For a list of command-line options type simgram --help.
Not quite all phylo-grammars that can be described by an xrate-format file can be simulated by simgram. In particular, macro expansion is limited (alignment- and tree-dependent macros are not expanded) and there are no indels, nor hidden classes in substitution models.
The dartlog.pl script uses ANSI terminal color to visualize the information in dart logfiles.
The colorstock.pl script uses ANSI terminal color to visualize basepairing patterns in Stockholm alignments of RNA (or DNA) sequences that have been annotated with secondary structures.
copyparams.pl is a short script that copies selected parameter values from one xrate model file to another. It is useful for modularized training of complex models.
Other Perl scripts
See the following pages for more utilities relating to xrate and its various file formats:
Bug reports and feature requests
Feature requests belong in the xgram wish list.
Unless, that is, they relate solely to the file format, in which case they belong in the xgram format wish list.
Bug reports are welcomed! The following links may be helpful:
- Known issues with DART -- contains some useful tips & tricks
- DART bug reporting -- please read this before submitting a bug report!
This is an evolving description. Please feel free to add questions/comments (or email them to Ian Holmes), or to create your own tutorials as separate pages on this wiki.