HandAlign is a tool for Markov Chain Monte Carlo (MCMC) co-sampling of
multiple alignment, phylogenetic tree, ancestral sequences, and evolutionary parameters.
The underlying model of evolution allows for indel events and any model of point substitution events.
It can be represented using a time-dependent FiniteStateTransducer
a state machine that transforms an input
sequence into an output
The actual model is shown on this page: HandAlign transducer
It may be viewed as a short-branch approximation to the LongIndelModel
Several of the MCMC moves sample directly from the model
(via Gibbs sampling and related MCMC techniques)
using dynamic programming algorithms that follow from StringTransducers
HandAlign is distributed as part of the DART
package of bioinformatics sequence analysis tools.
Together with companion utility programs (including PhyloComposer
and several other tools),
HandAlign is a part of the HandelPackage
, which is a subset of DART
Instructions on how to download and build the DART
package can be found on the pages
and Building DART
A few caveats are worth noting if you are specifically interested in using the HandAlign program:
- You will almost certainly want to install Gerton Lunter's hmmoc package before building DART. This package significantly accelerates the dynamic programming steps that align sequences to hidden Markov models. Without it you may find the program rather slow
- GNU Guile is currently not used by handalign, although it may be at some point in the future
Both these prereqs are optional, but the first in particular will make handalign go a lot faster:
is a parser-generator that compiles finite-state automata (hidden Markov models)
into C++ code for the corresponding inference algorithms.
Its use with handalign is highly recommended.
is a library for building Scheme-based extension languages.
It is used to extend several C++ applications in DART
It is highly recommended if you are building any of DART
The program accepts FASTA format
(for unaligned input sequences),
(to supply an initial guess for the alignment) or
Stockholm Newick format
, i.e. Stockholm format with an embedded phylogenetic tree in Newick format
(to supply an initial guess both for the alignment and for the tree).
The program outputs a single alignment (by default this is the highest-scoring alignment found, but the
option directs the program to report the final alignment sampled).
Additionally, the program can record the entire MCMC sampling trace (a series of alignments) to a file.
In all cases, output alignments are in Stockholm Newick format
, i.e. Stockholm format
with an embedded phylogenetic tree in Newick format
The substitution model may be specified using a subset of xrate format
constructions for specifying (respectively)
the residue alphabet and the continuous-time Markov chain substitution model.
Examples of valid substitution model files:
A full list of options with brief descriptions is available via
, i.e. type
The following are some of the most commonly-used options.
| List available options with brief descriptions
| Suppress log messages
| Seed random number generator. See HandAlign#Randomization
| Salt random number generator using PID and clock time. See HandAlign#Randomization
| Specifies the average length of the root sequence. See HandAlignTransducer#Root_model
| Specifies the deletion rate . See HandAlign transducer
| Specifies the average length of a gap. See HandAlignTransducer#Mean_gap_length
| Specifies the substitution model using a subset of xrate format; see Parameters
| Uses an empirical protein substitution model found in DartData:handalign/prot.hsm
| Uses a Jukes-Cantor nucleotide substitution model found in DartData:handalign/jc.hsm
| Specify number of samples (per tree node) to run the MCMC sampler for
| Requests that every
refine-period sampling steps, a locally optimal alignment be sought
| Specifies the number of MCMC sampling steps between each
| Dumps each alignment to a file in Stockholm format
| Specifies the relative frequency of branch-sampling MCMC moves. See HandAlign#MCMC_moves
| Specifies the relative frequency of node-sampling MCMC moves. See HandAlign#MCMC_moves
| Specifies the relative frequency of node-flipping MCMC moves. See HandAlign#MCMC_moves
| Specifies the relative frequency of node-sliding MCMC moves. See HandAlign#MCMC_moves
| Specifies the relative frequency of branch length-sampling MCMC moves. See HandAlign#MCMC_moves
| Specifies the relative frequency of indel parameter-sampling MCMC moves. See HandAlign#MCMC_moves
| Specifies the relative frequency of substitution parameter-sampling MCMC moves. See HandAlign#MCMC_moves
| Unshift a node onto the front of the MCMC sampling queue. Useful to target a particular branch for resampling
| Turn off the RedelingsSuchardKernel. This increases the complexity of individual sampling steps, though it may result in better mixing per step. It is generally not advised, although the Redelings-Suchard kernel is incompatible with some other performance optimizations in handalign, so this option is sometimes necessary (a warning or error message will be printed in such situations)
| Specify a separate executable for calculating phylo-alignment likelihoods for the purposes of importance sampling
| Gives more help on the
| Multiply the likelihood of the
--exec-like program by an indel likelihood, i.e. treat the
-exec-like as a likelihood for substitutions only, conditioned on a particular indel path
| Manually specify the location of hmmoc
Dart offers a number of progress and logging messages that can be suppressed or activated from the command line.
By default, the logging level is set to 5 (intermediate); use the
option to suppress all log messages.
for further details.
By default, the programs in the DART
package operate in
deterministic mode: the random number generator is seeded with the same initial value each run,
for reproducibility. This facilitates testing and debugging.
For the MCMC programs within DART
, such as handalign
especially if you are not testing/debugging the software,
it is likely that you will want it to follow less predictable, more random behavior.
For example, a common practice in MCMC is to run a chain several times from
the same starting point with different random seeds each time.
option allows you to change the initial random seed.
option sets the seed to a random value generated from
a combination of the current process ID and the current clock time in microseconds.
sequence annealing step (c.f. AMAP
followed by a neighbor joining
is the number of sequences and
is the sequence length),
is required to estimate an initial alignment and phylogeny (respectively) for given input sequences
unless initial estimates of the alignment (and optionally the phylogeny) are provided in StockholmFormat
The relative proportions of the various sampling moves can be set on the command line.
All are variants of Gibbs-sampling moves (in that they all hold constant all but one or two variables,
and resample those conditional on the fixed variables).
Some are full Gibbs moves (perfectly mixing the sampled variables at every step);
others utilize Metropolis-within-Gibbs or other variants on Gibbs sampling
such as "Importance-within-Gibbs" where the current point is included in the weighted list of points to be sampled.
The moves vary in the dependence of their complexities on the input sequence length,
- Branch Alignment: A pairwise ancestor-descendant alignment is sampled using DP (dynamic programming) and stochastic traceback. Complexity is .
- Branch Length: The length of an ancestor-descendant branch is sampled by Importance-within-Gibbs, changing the total tree length but maintaining tree topology. Complexity is .
- Node Alignment: A 4-way alignment (3 nodes related by a common central node) is sampled by DP. Complexity is for the full Gibbs step, for Metropolis-within-Gibbs (using the RedelingsSuchardKernel).
- Node Flip: The tree is sampled by exchanging a node with its niece, locally altering the tree topology. The original and flipped topologies are sampled at random (using DP to sum over alignments), and the alignment is then resampled conditional on the topology. Complexity is for full Gibbs step, for Metropolis-within-Gibbs (using the RedelingsSuchardKernel).
- Node Slide: A node is ``slid'' along a branch, preserving the tree's topology and total branch length, and sampling the new node attachment point via Importance-within-Gibbs. Complexity is .
- Indel Parameters: Insertion/deletion parameters are sampled via an Metropolis-within-Gibbs move. Complexity is .
- Substitution Parameters: Free parameters declared in the subsitution model are sampled using the same Metropolis-within-Gibbs method as indel parameters. Complexity is .
In the "Node Alignment" move, which involves aligning three siblings (x,y,z) on a tree and estimating their common ancestor w,
the optional two-stage Redelings Suchard kernel
uses a Metropolis-within-Gibbs approach
to sample the common ancestor w and alignment wxyz
, rather than the
required by the full Gibbs step.
The kernel consists of two pairwise alignment-sampling stages:
first a proposed ancestor w and alignment wxy are sampled conditional on x and y (ignoring z),
then the alignment wz is sampled conditional on w and z.
The final ancestor w and alignment wxyz are then accepted or rejected
using a Hastings ratio.
A similar strategy (using three stages rather than two) is used to propose four-way alignments in the "Node Flip" move.
See the RedelingsSuchardKernel
page for a citation describing this technique.
Analysis of MCMC sampling traces
scripts and modules described in this section may be found in
Summarization of MCMC traces
finds a consensus alignment that summarizes an MCMC run.
can be used to extract the NewickFormat
trees from the StockholmNewickFormat
is not a part of DART
, but rather is found in the PHYLIP
It finds a consensus tree that summarizes the trees visited during an MCMC run.
Visualization of MCMC traces
uses ANSI terminal color to render an animated impression of an MCMC sampling run.
uses ANSI terminal color to highlight alignment columns by posterior probability.
Further analysis of MCMC traces
Perl modules in
allow further extraction of individually sampled trees, alignments or model parameters,
so as to perform additional tests or analysis on the MCMC trace.
These modules are self-documented using POD format
, which may be read with
A longer list of tools for working with Stockholm format is available on the StockholmTools
A brief tutorial is available on this wiki: