ProtPal is a software tool for multiple sequence alignment, ancestral reconstruction, and measurement of indel rates on a phylogenetic tree. The mathematical details of ProtPal's algorithm are described here.
Download
ProtPal is distributed as part of the dart package. This version of dart can be downloaded as a zipfile or tarball from github.com.
Once downloaded, unzip/unpack this archive file and follow the
installation instructions below.
As an alternative to downloading a tarball, developers who have the 'git' tool installed on their systems can clone the git repository like so:
git clone git://github.com/ihh/dart.git
Installation
ProtPal is currently distributed as part of the dart package. To install it, do the following:
- Download and unpack the dart package from the links above
- Type cd dart; ./configure; make protpal (see building dart for more info)
- The protpal executable is created in the dart/bin subdirectory.
Usage
Basic usage is like this:
protpal -fa SEQUENCE_FILE
SEQUENCE_FILE is assumed to contain unaligned sequences in FASTA format.
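As a minimal sketch of preparing input for the basic invocation above, the following Python snippet writes a few unaligned sequences to a FASTA file and assembles the corresponding command line. The toy sequences, the file name toy_unaligned.fa, and the helper names write_fasta and basic_command are hypothetical and chosen only for illustration. The snippet builds the command but does not run it.

```python
from pathlib import Path


def write_fasta(records, path):
    """Write (name, sequence) pairs as unaligned FASTA records."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")  # header line
        lines.append(seq)         # sequence on a single line, no gap characters
    out = Path(path)
    out.write_text("\n".join(lines) + "\n")
    return out


def basic_command(fasta_path, protpal="protpal"):
    """Return the argv list for the basic usage: protpal -fa SEQUENCE_FILE."""
    return [protpal, "-fa", str(fasta_path)]


# Hypothetical toy sequences, for illustration only.
records = [("seqA", "MKVLAAGIV"), ("seqB", "MKVAGIV"), ("seqC", "MKILAAGV")]
fasta = write_fasta(records, "toy_unaligned.fa")
cmd = basic_command(fasta)
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would launch ProtPal, assuming the protpal executable is on your PATH.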
Options
Some of the most commonly used command-line options are:
Option | Meaning
-a | Only display leaf alignment (no ancestral sequences); "alignment mode"
-arpp | Report posterior probabilities of alternate reconstructions, conditional on ML indel reconstruction
-b FILE | Load handalign point substitution model (e.g. rate matrix) from specified file
-d FLOAT | Deletion rate (default .0025)
-e N | Maximum allowed distance between aligned leaf characters (default 300)
-ep | Estimate parameters for branch transducer (not yet implemented; coming soon)
-g FILE | Load xrate-format chain from specified file, for use in final character reconstruction
-h | Print long help message including all command-line options
-i FLOAT | Insertion rate (default .0025)
-m N | Maximum number of delete states in sampled DAG (default 1000)
-marp | Minimum probability to report for -arpp option (default 0.01)
-n N | Number of alignment paths to sample at each node (default 10)
-rl INT | Instead of sampling from the root transducer, force the root sequence to have this length
-s | Instead of aligning, simulate a set of unaligned sequences according to the specified models (default False)
-sa | Show alignment sampled at each node (default False)
-stk FILE | Load Stockholm format sequences from specified file. If this file has a #=GF NH line, this tree will be used for alignment
-t STRING | Newick tree string, in double quotes
-tf FILE | Load Newick tree string from file
-x FLOAT | Gap extend probability (default .9)
It is a good idea, but not essential, for your chain file (handalign-format Markov chain substitution model) and grammar file (xrate-format grammar file, containing a single-character chain) to be consistent with each other.
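To make the options above concrete, here is a small sketch that assembles a ProtPal command line from keyword arguments. It uses only flags listed above that take an explicit argument (-fa, -t, -i, -d, -x, -n). The function name protpal_command, its parameter names, and the example values are hypothetical. Options left as None are omitted so that ProtPal's own defaults apply.

```python
def protpal_command(fasta, tree=None, insertion_rate=None, deletion_rate=None,
                    gap_extend=None, n_samples=None, protpal="protpal"):
    """Build an argv list for ProtPal; None means 'use the program default'."""
    cmd = [protpal, "-fa", str(fasta)]
    if tree is not None:
        # -t takes a Newick string. As a single argv element it needs no extra
        # shell quoting; on a shell command line it would go in double quotes.
        cmd += ["-t", tree]
    if insertion_rate is not None:
        cmd += ["-i", str(insertion_rate)]
    if deletion_rate is not None:
        cmd += ["-d", str(deletion_rate)]
    if gap_extend is not None:
        cmd += ["-x", str(gap_extend)]
    if n_samples is not None:
        cmd += ["-n", str(n_samples)]
    return cmd


# Hypothetical example values, for illustration only.
example = protpal_command(
    "toy_unaligned.fa",
    tree="(seqA:0.1,(seqB:0.1,seqC:0.1):0.05);",
    insertion_rate=0.005,
    deletion_rate=0.005,
    n_samples=5,
)
print(example)
```

Keeping the command as a list rather than a single string avoids quoting problems with the parentheses and semicolon in the Newick string.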
Example
For a quick look at ProtPal in action, try the following command from the DART directory:
./bin/protpal -stk src/protpal/testing/testSeqs.stk -sa true -n 5 -g data/handalign/prot1.hsm
This runs ProtPal on a small alignment, printing out 5 sampled alignments at each internal node, using the specified chain.
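The example above reads its input with -stk, and a Stockholm file may carry its tree on a #=GF NH line. As a rough illustration of what such a file can look like, the sketch below writes a minimal Stockholm file with an embedded Newick tree and reads the tree back. The sequences, the tree, the file name toy.stk, and the helper names are hypothetical. This is not the content of testSeqs.stk.

```python
from pathlib import Path


def write_stockholm(rows, newick, path):
    """Write (name, sequence) rows plus a '#=GF NH' tree line in Stockholm format."""
    width = max(len(name) for name, _ in rows)
    lines = ["# STOCKHOLM 1.0", f"#=GF NH {newick}"]
    lines += [f"{name.ljust(width)} {seq}" for name, seq in rows]
    lines.append("//")  # Stockholm end-of-alignment marker
    out = Path(path)
    out.write_text("\n".join(lines) + "\n")
    return out


def embedded_tree(path):
    """Return the Newick string from the first '#=GF NH' line, or None if absent."""
    for line in Path(path).read_text().splitlines():
        if line.startswith("#=GF NH"):
            # Split into '#=GF', 'NH', and the remainder of the line.
            return line.split(None, 2)[2]
    return None


# Hypothetical toy data, for illustration only.
rows = [("seqA", "MKVLAAGIV"), ("seqB", "MKVLSAGIV"), ("seqC", "MKILAAGIV")]
newick = "(seqA:0.1,(seqB:0.1,seqC:0.1):0.05);"
stk = write_stockholm(rows, newick, "toy.stk")
print(embedded_tree(stk))
```

A file written this way could then be passed as the argument to -stk, so that no separate tree option is needed.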
Notes for beta testers, power users, developers
Downloading via git is highly preferred. This way, you'll have access to updates, bug fixes, etc., as soon as we make them.
You can also browse the source code on GitHub.
New feature requests
Authors
Developed by Oscar Westesson and Ian Holmes.
Practical matters, simulation results
We have applied ProtPal to sequences up to 10kb in length (e.g. small viral genomes) and trees of up to ~600 nodes with reasonable success. One important caveat is that ProtPal assumes a known tree and strictly adheres to the tree structure when creating an alignment. If the tree is wrong, the resulting alignment may appear "unreasonable". For this reason, caution should be used when attempting to align many sequences (especially if they are distantly related).
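Because the alignment follows the supplied tree so closely, one cheap sanity check before a run is to confirm that the tree's leaf names match the sequence names. The sketch below does this for simple Newick strings with unquoted labels. Whether ProtPal itself reports such mismatches is not stated here, and the helper names and toy data are hypothetical.

```python
import re


def newick_leaf_names(newick):
    """Return leaf labels from a simple Newick string (unquoted labels only)."""
    # A leaf label directly follows "(" or ","; internal-node labels follow ")"
    # and are therefore not matched.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def compare_names(newick, sequence_names):
    """Return (leaves with no sequence, sequences with no leaf), each sorted."""
    leaves = set(newick_leaf_names(newick))
    names = set(sequence_names)
    return sorted(leaves - names), sorted(names - leaves)


# Hypothetical toy data, for illustration only.
tree = "((seqA:0.1,seqB:0.2):0.05,seqC:0.3);"
leaves_without_sequence, sequences_without_leaf = compare_names(
    tree, ["seqA", "seqB", "seqD"]
)
print(leaves_without_sequence, sequences_without_leaf)
```

This only catches naming mismatches. It says nothing about whether the tree topology or branch lengths are correct, which is the harder problem described above.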
We have written a paper describing results on simulated data (submitted). Essentially, ancestral alignments (assigning sequences to all tree nodes, not just leaves) were simulated, and the unaligned leaf sequences were fed to various alignment programs (ProtPal, PRANK, MUSCLE, ProbCons, FSA, CLUSTALW). ProtPal and PRANK are capable of ancestral reconstruction, whereas parsimony (partially randomized in the case of ties) was used to augment the remaining programs' alignments to ancestral alignments. Indel counts were tabulated for each program and used to estimate its insertion and deletion rates. The lambda.pl script packaged with the simulation program DAWG was also used to estimate rates from MUSCLE and FSA alignments. Though it is the only other program (besides ProtPal) that claims to estimate indel rates, it is the least accurate method.
This was repeated for 5 indel rates (0.005, 0.01, 0.02, 0.04, and 0.08) and 3 substitution rates (0.5, 1.0, 2.0), with 100 replicates for each pair of rates. The true rate and the inferred rate are then compared: below we show the ratio of the inferred to true rates, aggregated across all rate categories. The "True simulated history" shows the ideal distribution, tightly clustered around inferred/true = 1 (red dashed line). The further the distribution stretches from 1, the worse the rate estimation. The root-mean-squared error (RMSE) is a convenient measure of how far from 1 the distribution tends to stray; lower numbers indicate a more accurate set of rate estimates. The true simulated history, ProtPal, and ProbCons are the top three by this statistic.
Values greater than one indicate overestimates of the rate, and values less than one indicate underestimates.
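The benchmark design and the RMSE statistic described above can be sketched in a few lines. The sketch assumes the RMSE is taken over the inferred/true ratios, measured against the ideal value of 1. The paper may aggregate differently. The rate grid comes from the text, while the inferred and true values in the example are made-up numbers, not benchmark results.

```python
import math
from itertools import product

# Simulation grid as described in the text.
INDEL_RATES = (0.005, 0.01, 0.02, 0.04, 0.08)
SUBSTITUTION_RATES = (0.5, 1.0, 2.0)
REPLICATES = 100

conditions = list(product(INDEL_RATES, SUBSTITUTION_RATES))
total_runs = len(conditions) * REPLICATES  # 15 rate pairs x 100 replicates


def ratio_rmse(inferred, true):
    """RMSE of inferred/true ratios around 1 (assumed definition; see lead-in)."""
    ratios = [i / t for i, t in zip(inferred, true)]
    return math.sqrt(sum((r - 1.0) ** 2 for r in ratios) / len(ratios))


# Made-up numbers, for illustration only (not results from the benchmark).
true_rates = [0.01, 0.01, 0.02, 0.04]
inferred_rates = [0.011, 0.009, 0.02, 0.05]
rmse = ratio_rmse(inferred_rates, true_rates)
print(total_runs, round(rmse, 4))
```

Working with the ratio rather than the raw difference puts all rate categories on the same scale, which is what allows them to be aggregated into a single distribution.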
ProtPal and PRANK (and, surprisingly, ProbCons) are the most accurate aligners. Many traditional progressive aligners (e.g. MUSCLE, CLUSTALW) overestimate deletions and underestimate insertions, a predictable consequence of not modeling indels as phylogenetic events.
Links to the papers describing the algorithm behind ProtPal and the simulation benchmark results will appear here as soon as they become available.