ProtPal

ProtPal is a software tool for multiple sequence alignment, ancestral reconstruction, and measurement of indel rates on a phylogenetic tree. The mathematical details of ProtPal's algorithm are described here.

Download

ProtPal is distributed as part of the dart package.

This version of dart can be downloaded as a zipfile or tarball from github.com. Once downloaded, unpack the archive and follow the installation instructions below.

As an alternative to downloading a tarball, developers who have the 'git' tool installed on their systems can clone the git repository like so:

git clone git://github.com/ihh/dart.git

Installation

ProtPal is currently distributed as part of the dart package. To install it, do the following:

  1. Download and unpack the dart package from the links above
  2. Type cd dart; ./configure; make protpal (see building dart for more info)
  3. The protpal executable is created in the dart/bin subdirectory.

Usage

Basic usage is like this:

protpal -fa SEQUENCE_FILE

SEQUENCE_FILE is assumed to be unaligned sequences in FastaFormat.
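
As a minimal sketch of preparing such an input, the Python snippet below writes a few unaligned sequences to a FASTA file and assembles the basic command shown above. The sequence names, the residues, and the helper names write_fasta and protpal_command are hypothetical illustrations; only the -fa flag comes from this page.

```python
from pathlib import Path


def write_fasta(records, path):
    """Write (name, sequence) pairs as unaligned FASTA: a '>' header line, then the residues."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.append(seq)
    Path(path).write_text("\n".join(lines) + "\n")


def protpal_command(fasta_path, executable="protpal"):
    """Return the argument list for the basic invocation; nothing is executed here."""
    return [executable, "-fa", str(fasta_path)]


# Toy, made-up protein fragments of different lengths (unaligned, no gap characters).
toy_records = [
    ("seqA", "MKTAYIAKQR"),
    ("seqB", "MKTAYAKQR"),
    ("seqC", "MKAYIAKQRQ"),
]
```

Once protpal has been built, the returned list could be handed to subprocess.run, for example with executable="./bin/protpal" when working from the dart directory.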

Options

Some of the most commonly-used command-line options are:

Option      Meaning
-h          Print long help message, including all command-line options
-stk FILE   Load Stockholm-format sequences from the specified file. If this file has a #=GF NH line, that tree will be used for alignment
-t STRING   Newick tree string, in double quotes
-tf FILE    Load Newick tree string from the specified file
-b FILE     Load handalign point substitution model (e.g. rate matrix) from the specified file
-a          Only display leaf alignment (no ancestral sequences); "alignment mode"
-e N        Maximum allowed distance between aligned leaf characters (default 300)
-d FLOAT    Deletion rate (default 0.0025)
-i FLOAT    Insertion rate (default 0.0025)
-x FLOAT    Gap extend probability (default 0.9)
-g FILE     Load xrate-format chain from the specified file, for use in final character reconstruction
-m N        Maximum number of delete states in sampled DAG (default 1000)
-n N        Number of alignment paths to sample at each node (default 10)
-sa         Show alignment sampled at each node (default False)
-s          Instead of aligning, simulate a set of unaligned sequences according to the specified models (default False)
-rl INT     Instead of sampling from the root transducer, force the root sequence to have this length
-arpp       Report posterior probabilities of alternate reconstructions, conditional on ML indel reconstruction
-marp       Minimum probability to report for the -arpp option (default 0.01)
-ep         Estimate parameters for branch transducer (not yet implemented; coming soon)
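
Since the -stk option uses a tree only when the Stockholm file carries a #=GF NH line, it can be useful to check for one before running. The following is a rough Python sketch, not part of dart: the function name stockholm_tree and the toy file contents are hypothetical, and joining several NH lines into one string is an assumption about how a multi-line tree would be laid out.

```python
def stockholm_tree(stockholm_text):
    """Return the Newick string found on '#=GF NH' lines, or None if there is none."""
    parts = []
    for line in stockholm_text.splitlines():
        fields = line.split(None, 2)  # tag, feature name, remainder of the line
        if len(fields) == 3 and fields[0] == "#=GF" and fields[1] == "NH":
            parts.append(fields[2].strip())
    # Assumption: a tree split over several NH lines is simply concatenated.
    return "".join(parts) if parts else None


# Toy Stockholm text with made-up sequences and tree.
example = """# STOCKHOLM 1.0
#=GF NH ((seqA:0.1,seqB:0.2):0.05,seqC:0.3);
seqA MKTAYIAKQR
seqB MKTAY-AKQR
seqC MK-AYIAKQR
//
"""

print(stockholm_tree(example))
```

If the function returns None, a tree would have to be supplied by other means, for example with the -t or -tf options.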

It is a good idea, but not essential, for your chain file (a Handalign-format Markov chain substitution model) and your grammar file (an XrateFormat grammar file containing a single-character chain) to be consistent with each other.

Example

For a quick look at ProtPal in action, try the following command from the DART directory:

./bin/protpal -stk src/protpal/testing/testSeqs.stk -sa true -n 5 -g data/handalign/prot1.hsm

This runs ProtPal on a small alignment, printing out 5 sampled alignments at each internal node, using the specified chain.

Notes for beta testers, power users, developers

Downloading via git is highly preferred. This way, you'll have access to updates, bug fixes, etc., as soon as we make them.

You can also browse the source code at github.

New feature requests

Authors

Developed by OscarWestesson and IanHolmes.

Practical matters, simulation results

We have applied ProtPal to sequences up to 10kb in length (e.g. small viral genomes) and trees of up to ~600 nodes with reasonable success. One important caveat is that ProtPal assumes a known tree and strictly adheres to the tree structure when creating an alignment. If the tree is wrong, the resulting alignment may appear "unreasonable". For this reason, caution should be used when attempting to align many sequences, especially if they are distantly related.
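
Because the alignment follows the supplied tree so closely, a quick sanity check on the tree can save a wasted run. The Python sketch below compares the leaf labels of a Newick string with a list of sequence names. It assumes that leaf labels are meant to match the sequence names, which this page does not state explicitly; the function names and toy data are hypothetical. It detects naming mismatches only and says nothing about whether the topology is correct.

```python
import re


def newick_leaf_names(newick):
    """Extract leaf labels, i.e. names that directly follow '(' or ','.

    Internal-node labels follow ')' and are therefore skipped.
    Quoted labels and bracketed comments are not handled by this simple pattern.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def check_tree_matches_sequences(newick, sequence_names):
    """Return (names missing from the tree, leaves without a sequence), both sorted."""
    leaves = set(newick_leaf_names(newick))
    names = set(sequence_names)
    return sorted(names - leaves), sorted(leaves - names)


# Made-up tree and sequence names for illustration.
tree = "((seqA:0.1,seqB:0.2):0.05,seqC:0.3);"
missing, extra = check_tree_matches_sequences(tree, ["seqA", "seqB", "seqD"])
print("sequences not in tree:", missing)
print("tree leaves without a sequence:", extra)
```

Two empty lists indicate that every sequence has a leaf and every leaf has a sequence.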

We have written a paper describing results on simulated data (submitted). Essentially, ancestral alignments (assigning sequences to all tree nodes, not just leaves) were simulated, and the unaligned leaf sequences were fed to various alignment programs (ProtPal, PRANK, MUSCLE, ProbCons, FSA, CLUSTALW). ProtPal and PRANK are capable of ancestral reconstruction; for the remaining programs, parsimony (partially randomized in the case of 'ties') was used to augment their alignments to ancestral alignments. Indel counts were tabulated for each program and used to estimate its insertion and deletion rates. The lambda.pl script packaged with the simulation program DAWG was also used to estimate rates from MUSCLE and FSA alignments. Though lambda.pl is the only other program (besides ProtPal) that claims to estimate indel rates, it is the least accurate method.

This was repeated for 5 indel rates (0.005, 0.01, 0.02, 0.04, and 0.08) and 3 substitution rates (0.5, 1.0, 2.0), with 100 replicates for each pair of rates. The true and inferred rates were then compared; below we show the ratio of inferred to true rates, aggregated across all rate categories. The "True simulated history" shows the ideal distribution, tightly clustered around inferred/true = 1 (red dashed line). The further the distribution stretches from 1, the worse the rate estimation. The root-mean-squared error (RMSE) is a convenient measure of how far from 1 the distribution tends to stray; lower numbers indicate a more accurate set of rate estimates. The true simulated history, ProtPal, and ProbCons are the top three by this statistic.
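
To make the statistic concrete, here is a small Python sketch of an RMSE computed on inferred/true ratios around the ideal value of 1. It is one plausible reading of the description above and not necessarily the exact calculation used in the benchmark. The true rates are the five indel rates listed in the text; the inferred rates are made-up numbers, and the function names are hypothetical.

```python
import math


def rate_ratios(inferred, true):
    """Ratio of inferred to true rate for each replicate (1.0 means a perfect estimate)."""
    return [est / tru for est, tru in zip(inferred, true)]


def rmse_from_one(ratios):
    """Root-mean-squared deviation of the ratios from the ideal value 1."""
    return math.sqrt(sum((r - 1.0) ** 2 for r in ratios) / len(ratios))


true_rates = [0.005, 0.01, 0.02, 0.04, 0.08]        # indel rates from the text
inferred_rates = [0.006, 0.009, 0.02, 0.05, 0.06]   # made-up estimates, for illustration only

ratios = rate_ratios(inferred_rates, true_rates)
print("ratios:", [round(r, 3) for r in ratios])
print("RMSE around 1:", round(rmse_from_one(ratios), 3))
```

Ratios above 1 correspond to overestimated rates and ratios below 1 to underestimated rates; a set of perfect estimates gives an RMSE of 0.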

Values greater than one indicate overestimates of the rate, and values less than one indicate underestimates. ProtPal and PRANK (and, surprisingly, ProbCons) are the most accurate aligners. Many traditional progressive aligners (e.g. MUSCLE, CLUSTALW) overestimate the presence of deletions and underestimate insertions, a predictable consequence of not modeling indels as phylogenetic events.

Links to the papers describing the algorithm behind ProtPal and the simulation benchmark results will appear here as soon as they become available.

Figure: Distribution of indel rate estimation errors

Topic attachments
Attachment: estimates_biasescolor.png (917.1 K, 2012-02-01 23:28, OscarWestesson): Distribution of indel rate estimation errors
Topic revision: r23 - 2012-02-02 - OscarWestesson