Felsenstein Wildcards

From Biowiki

When using phylogenetic models to reconstruct ancient sequences, it is often useful to separate the task into two steps:

  1. Imputation of the alignment and indel structure, i.e. which residues of the ancestral sequence are aligned to which present-day residues;
  2. Imputation of the residues themselves, i.e. the actual sequence (conditioned on the alignment imputed in step 1).

This separation is always possible if the underlying indel model is independent of the substitution model (as, e.g., in the TKF model or the Long Indel model).
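
As a sketch of why the two steps decouple, the joint likelihood can be written as a product of an indel term and per-column substitution terms. The notation is introduced here for illustration and is not taken from the original: S is the set of observed sequences, H the alignment and indel history, T the tree, and θ_indel, θ_subst the parameters of the two models. The sketch assumes that alignment columns evolve independently under the substitution model; S_c denotes the observed residues in column c and a_c the unobserved ancestral residues in that column.

```latex
P(S, H \mid T, \theta)
  = P(H \mid T, \theta_{\mathrm{indel}})
    \prod_{c \in \mathrm{columns}(H)}
    \sum_{a_c} P(S_c, a_c \mid H_c, T, \theta_{\mathrm{subst}})
```

The inner sum over a_c is where the ancestral residues are summed out, so step 1 never needs to commit to particular values for them.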

During step 1, which is often the more computationally challenging of the two, one is effectively considering all possible ancestral residues, so the ancestral genotypes can be thought of as sequences of "wildcards" at this stage. When calculating the likelihood of a particular alignment, one sums over the actual values of these residues using Felsenstein's pruning algorithm. In step 2, posterior probability distributions over the residues themselves can be found using Elston-Stewart peeling (also known as the sum-product algorithm).
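
The summation over wildcards for a single alignment column can be sketched as follows. This is a minimal illustration, not code from any of the programs mentioned on this page. It assumes a Jukes-Cantor substitution model with a uniform prior at the root, and a hypothetical tree encoding in which a leaf is an observed residue and an internal node (a wildcard) is a list of (child, branch length) pairs. All function names are invented for the example. Only the root posterior is computed; posteriors at other internal nodes would need an additional downward pass.

```python
import math

ALPHABET = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that residue a becomes b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def prune(node):
    """Return {x: P(observed leaves below node | node has residue x)}."""
    if isinstance(node, str):
        # Leaf: the residue is observed, so only that value has nonzero likelihood.
        return {x: 1.0 if x == node else 0.0 for x in ALPHABET}
    # Internal node: a wildcard. Sum over each child's residue, one branch at a time.
    like = {x: 1.0 for x in ALPHABET}
    for child, t in node:
        child_like = prune(child)
        for x in ALPHABET:
            like[x] *= sum(jc_prob(x, y, t) * child_like[y] for y in ALPHABET)
    return like

def column_likelihood(root):
    """Step 1 quantity: column likelihood with all ancestral residues summed out."""
    like = prune(root)
    return sum(0.25 * like[x] for x in ALPHABET)

def root_posterior(root):
    """Step 2 quantity (root only): posterior distribution over the root residue."""
    like = prune(root)
    total = sum(0.25 * like[x] for x in ALPHABET)
    return {x: 0.25 * like[x] / total for x in ALPHABET}

# Hypothetical column on the tree ((A:0.1, A:0.1):0.2, C:0.3).
tree = [([("A", 0.1), ("A", 0.1)], 0.2), ("C", 0.3)]
print(column_likelihood(tree))
print(root_posterior(tree))
```

The alignment likelihood in step 1 would multiply such column likelihoods together with the indel model's probability of the alignment; the residue posteriors are only needed once an alignment has been chosen.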

Conventionally, such summed-out residues are often represented as asterisks. The term "Felsenstein wildcard" was introduced in the following paper:

A Google search for the phrase turns up several programs for, and papers on, paleogenomics and statistical alignment.

Note: this approach can only be used if the indel model is independent of the actual sequence, so that inference of the sequence itself can be postponed. For example, a lexicalized transducer that modeled microsatellite expansion and contraction would not allow for the use of Felsenstein wildcards during alignment, since the indel rates in such a model depend on neighboring sequence.

-- Ian Holmes - 01 Aug 2007