# Phylo Film

## Contents

**Phylo Film**: The Phylogenetic Film Show

This page links to QuickTime and AVI versions of movies of evolving sequences and phylogenetic inference algorithms, which I made using Perl, Roger Sayle's RasMol, and the Berkeley MPEG Encoder. (The Perl scripts are available below.)

Literature citations for the underlying stochastic process analysis and inference algorithms can be found at this page: Phylogenetic Alignment Reader

# TKF movies

These movies illustrate the evolution of one-dimensional sequences under a very simple model that includes rate parameters for point substitutions, single-nucleotide indels, and tree branching. This simple model is enough to generate all collinear alignments (although the statistics are a bit unrealistic, in that the gaps look "linear" rather than "affine" & there is no selection at the level of genes or other features).

The underlying indel model in some of these animations is the Thorne, Kishino & Felsenstein (1991) "links model" (also called "TKF" or "TKF91"), as described in their article in the Journal of Molecular Evolution. The substitution model is Kimura's two-parameter model. I have used RasMol's Carbon, Nitrogen, Oxygen & Fluorine atoms to represent A, C, G, T residues, because I have no shame. I realise VRML would be better for this, but my coding (like DNA evolution) is a random walk on a stochastic landscape, so sue me.
**(see Kenny Duong's Vrml Mod for a prototype VRML version)**
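For readers who want to play with the dynamics without RasMol, here is a minimal sketch of how such a simulation might look. It is not the `tkf.pl` script: the function names, parameter names and the use of a Gillespie-style event loop are all assumptions made for illustration. It assumes the standard TKF91 setup (every link, including the immortal link at the left end, spawns a new residue at rate `lam`; every mortal link dies at rate `mu`) and Kimura's two-parameter substitutions (transitions at rate `alpha`, each of the two transversions at rate `beta`).

```python
import random

BASES = "ACGT"
# Kimura two-parameter model: A<->G and C<->T are transitions.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def substitute(base, alpha, beta, rng):
    """Choose a replacement base: transition at rate alpha, transversion at rate beta each."""
    if rng.random() < alpha / (alpha + 2 * beta):
        return TRANSITION[base]
    transversions = [b for b in BASES if b != base and b != TRANSITION[base]]
    return rng.choice(transversions)


def simulate_tkf(seq, t_max, lam, mu, alpha, beta, rng):
    """Gillespie-style simulation of a TKF91-like process for time t_max (hypothetical helper)."""
    seq = list(seq)
    t = 0.0
    while True:
        n = len(seq)
        r_ins = lam * (n + 1)           # n mortal links plus the immortal link
        r_del = mu * n                  # only mortal links can die
        r_sub = (alpha + 2 * beta) * n  # total substitution rate
        total = r_ins + r_del + r_sub
        if total == 0:
            return "".join(seq)         # nothing can ever happen
        t += rng.expovariate(total)     # waiting time to the next event
        if t > t_max:
            return "".join(seq)
        u = rng.random() * total
        if n == 0 or u < r_ins:
            # Insert to the right of a random link; position 0 is the immortal link.
            # New residues are uniform, the equilibrium of Kimura's model.
            seq.insert(rng.randrange(n + 1), rng.choice(BASES))
        elif u < r_ins + r_del:
            del seq[rng.randrange(n)]
        else:
            i = rng.randrange(n)
            seq[i] = substitute(seq[i], alpha, beta, rng)
```

Calling `simulate_tkf("ACGTACGT", 1.0, 0.9, 1.0, 0.5, 0.25, random.Random(0))` returns one possible descendant sequence. Keeping `lam < mu` matches the usual TKF91 requirement that sequence length has an equilibrium distribution.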

The point of all this is Phylo Alignment (aka Statistical Alignment), the systematic derivation of sequence analysis algorithms from molecular evolutionary hypotheses. These hypotheses are formulated as a continuous-time Markov chain over sequence space (and possibly over structure space, e.g. Hein & Pedersen's models of gene structure evolution, various recent models of RNA secondary structure evolution).

Phylo-alignment, and in particular the theory of String Transducers, provides a framework that allows us to think coherently about indels on trees in the same way that early work of Jukes-Cantor, Kimura, Felsenstein *et al* provided a systematic framework for thinking about substitutions on trees.

Joe Felsenstein's 2004 book, "Inferring Phylogenies", has a review of statistical alignment. There's a comprehensive bibliography here on this website.

One reason statistical alignment is cool is that you can measure, analyse and visualise the underlying evolutionary process. That's what these movies are about.

## Simulations of the TKF model

These movies look best in Apple's Quicktime viewer. The AVI conversions have suffered minor loss of quality.

One hundred seconds of the TKF model in all its glory. TKF sequence (Quicktime), TKF sequence (AVI)

An example "trajectory" through sequence space. TKF trajectory (Quicktime), TKF trajectory (AVI)

The TKF model with a "splitting" event, generating a tree. Illustrates how multiple alignment & phylogeny are aspects of the same graphical model inference problem, the dynamic programming solution to which was presented by Hein (PSB, 2001). TKF tree (Quicktime), TKF tree (AVI)

Here's a pretty grainy YouTube of the TKF tree: http://www.youtube.com/v/fmIh-OSt7YU

## Phylogenetically composed transducers

Actually, although I've said these movies are about "the TKF model", they only illustrate the sequence dynamics. They don't show one of the most important parts of the TKF 1991 paper, which is the way Thorne et al show how to infer evolutionary histories under this model. The alignment can be considered as being emitted by a Pair HMM (see e.g. the book Biological Sequence Analysis by Durbin et al (1998) for a review of Pair HMMs). This is also true - or approximately true - for the more realistic evolutionary models described below. Analysis of multiple sequences involves dynamic programming to an Evolutionary HMM (or EHMM), which is a big state machine made by combining String Transducers (a sort of Pair HMM) on branches of a tree (Holmes, ISMB, 2003; see also Hein, 2001; Paten *et al*; Redelings & Suchard; Lunter; *et al*).
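To make the Pair HMM idea concrete, here is a minimal sketch of the forward algorithm for the generic three-state pair HMM (match, insert, delete) reviewed in Durbin et al. It is not the TKF91 Pair HMM itself, whose transition probabilities are derived from the birth and death rates and the branch length. The parameter names (`delta` for gap opening, `epsilon` for gap extension, `tau` for termination) and the function name are illustrative assumptions.

```python
def pair_hmm_forward(x, y, delta, epsilon, tau, p_match, q):
    """Sum the probability of all alignments of x and y under a three-state pair HMM.

    p_match[(a, b)] is the joint emission probability of an aligned pair,
    q[a] is the emission probability of a residue aligned to a gap.
    """
    n, m = len(x), len(y)
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]  # match state
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]  # residue of x against a gap
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]  # residue of y against a gap
    f_m[0][0] = 1.0  # the start state behaves like the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                f_m[i][j] = p_match[(x[i - 1], y[j - 1])] * (
                    (1 - 2 * delta - tau) * f_m[i - 1][j - 1]
                    + (1 - epsilon - tau) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                f_x[i][j] = q[x[i - 1]] * (
                    delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j]
                )
            if j > 0:
                f_y[i][j] = q[y[j - 1]] * (
                    delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1]
                )
    # Every state moves to the end state with probability tau.
    return tau * (f_m[n][m] + f_x[n][m] + f_y[n][m])
```

The three tables are filled in one pass over an (n+1) by (m+1) grid, which is the same quadratic cost as ordinary pairwise alignment. Replacing the sums with maxima gives the Viterbi (best single alignment) version.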

Here's a cartoon of an EHMM in action: Evolutionary HMM (Quicktime), Evolutionary HMM (AVI), Evolutionary HMM (MPG), and here is a legend for the EHMM movies (PDF).

Here's a YouTube version. It's harder to follow what's going on here, as the YouTube upload seems to have messed it up a bit. Maybe I should try Google Video or Ifilm or something.

http://www.youtube.com/v/EcLj5MSDPyM

The big complicated-looking machines in these cartoons are phylogenetic arrays of string transducers. Essentially, a String Transducer is a finite state machine with an input tape and an output tape. The output tape from one transducer can be fed into the input of the next, offering systematic ways of chaining transducers together. This gives you a systematic way of designing scoring schemes for multi-sequence HMMs, or (equivalently) for keeping track of (possibly overlapping) indel events on trees.
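The chaining idea can be sketched by sampling. The toy below treats each branch as a simple stochastic transducer that reads a parent sequence (input tape) and writes a child sequence (output tape), then feeds each output into the next branch. This only samples from chained transducers; it does not build the composite state machine that an EHMM uses for dynamic programming. The function names and the per-symbol probabilities `p_ins`, `p_del` and `p_sub` are hypothetical, chosen for illustration.

```python
import random


def branch_transducer(parent, p_ins, p_del, p_sub, rng, alphabet="ACGT"):
    """Sample a child sequence from a parent sequence (p_ins must be below 1)."""
    child = []

    def insertions():
        # Insert state: emit output symbols without consuming any input.
        while rng.random() < p_ins:
            child.append(rng.choice(alphabet))

    insertions()  # insertions before the first input symbol
    for symbol in parent:
        if rng.random() < p_del:
            pass  # delete state: consume input, emit nothing
        elif rng.random() < p_sub:
            # match state with a substitution
            child.append(rng.choice([a for a in alphabet if a != symbol]))
        else:
            child.append(symbol)  # match state, symbol copied unchanged
        insertions()
    return "".join(child)


def evolve_down_path(root, branches, rng):
    """Chain transducers: the output tape of each branch is the input of the next."""
    seqs = [root]
    for params in branches:
        seqs.append(branch_transducer(seqs[-1], rng=rng, **params))
    return seqs
```

For example, `evolve_down_path("ACGTACGT", [dict(p_ins=0.1, p_del=0.1, p_sub=0.05)] * 3, random.Random(0))` returns the root followed by three successive descendants. On a tree rather than a path, the same parent output would be fed to one transducer per child branch.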

Here's that legend of the state types for phylogenetically composed transducer/EHMM as an inline PNG image:

## Simulations of other substitution/indel models

The TKF model with an accept/reject step controlled by an order-1 Markov chain. This gives dynamics similar to a time-dependent "Ising model" with a strong energetic bonus for parallel adjacent spins. Except of course that the spins have four orthogonal directions, and can be inserted and deleted as well. Confused? Just watch the movie. (Evolutionary models of this kind have been recently analysed by Lunter & Hein (ISMB 2004), and independently by Siepel & Haussler, using complementary methods.) Ising sequence (Quicktime), Ising sequence (AVI)

A more realistic long indel version of the TKF model, allowing multiple-residue indels (and hence affine gap penalties in the dynamic programming recursions), has been independently developed by Knudsen and Miyamoto (JMB 2003) and by Miklos, Lunter & Holmes (MBE 2004). See Phylogenetic Alignment Reader for a long list of refs. Some other pragmatic/approximate models, like Mitchison & Durbin's Tree HMMs, and Thorne et al's fragment models, can model affine gaps to some extent.

Simulation movies of the long indel model **will** eventually be forthcoming on this page...

## Simulations of MCMC statistical alignment algorithms

These MCMC sampling steps are used in my Handel software, described in the article by Holmes & Bruno, Bioinformatics, 2001. A more general algorithm has been independently developed by Hein and Jensen. Important optimizations come from Redelings and Suchard. Historically, these algorithms are influenced by Gibbs sampling (introduced to bioinformatics by Lawrence and Liu), and by the probabilistic molecular evolutionary alignment ideas of Bishop and Thompson (more).

These sampling steps, which are no more complex than Pair HMM alignment, are sufficient to eventually explore all alignment space (for a given tree), and so may be helpful in using more sophisticated models for statistical alignment. (Indeed, Handel can be used to sample multiple alignments from any probability distribution, using importance sampling.)
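The importance-sampling idea can be illustrated generically. The sketch below is not Handel's code. It shows one common variant, sampling-importance-resampling: items drawn from a tractable proposal distribution are reweighted by the ratio of target to proposal probability and then resampled. The function name and arguments are hypothetical, and in the alignment setting each "sample" would be a multiple alignment drawn by the MCMC moves.

```python
import random


def importance_resample(samples, target_weight, proposal_weight, k, rng):
    """Resample k items so they approximately follow the target distribution.

    samples:         items drawn from the proposal distribution
    target_weight:   function giving the (unnormalized) target probability
    proposal_weight: function giving the (unnormalized) proposal probability
    """
    weights = [target_weight(s) / proposal_weight(s) for s in samples]
    total = sum(weights)
    normalized = [w / total for w in weights]  # importance weights, summing to 1
    resampled = rng.choices(samples, weights=normalized, k=k)
    return resampled, normalized
```

The approximation is only as good as the overlap between proposal and target: if the proposal rarely visits regions the target favours, a few samples end up carrying almost all of the weight.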

*Branch-sampling* involves fixing all of the tree, except for one branch. Branch-sampling (Quicktime), Branch-sampling (AVI)

*Node-sampling* involves adding and deleting residues to inferred ancestral nodes. Node-sampling (Quicktime), Node-sampling (AVI)

*Parent-sampling* is a sort of stochastic version of progressive alignment. Parent-sampling (Quicktime), Parent-sampling (AVI)

## Measurement of evolutionary parameters

The TKF model can be parameterised fast using EM, using the xrate and tkfidem programs in my DART package (checkout anonymous CVS). A reference for xrate is Holmes & Rubin (JMB 2002).

Of course, generic MCMC and gradient-ascent methods can be used to parameterise statistical alignment, since it's based on likelihood models. Wouldn't it be fun to have species- and even gene family-specific evolutionary simulation movies parameterised, and modelled, as accurately as possible using available published genome sequence.... hmmmm

# Animations of phylo-grammars

These are animations of phylo-grammars as modeled by the xrate software.

A phylo-grammar is a model incorporating correlations between sequences (via substitution models on phylogenetic trees) and within sequences (via grammatically structured features representing protein-coding genes, ncRNAs, etc).

- cds.mpg (MPEG)
- collapsed-ehmm.mpg (MPEG)
- complex-cds.mpg (MPEG)
- ehmm.mpg (MPEG)
- ghmm.mpg (MPEG)
- protein.mpg (MPEG)
- pseudoknot.mpg (MPEG)
- rna.mpg (MPEG)

# Animation codes

## Ian Holmes

The bead-on-a-string TKF sequence animations were made using a hacked copy of RasMol and a Perl script tkf.pl.

The phylogenetically-composed transducer animations were made using a Perl script ehmm.pl and the Berkeley MPEG Encoder.

A more recent version of this script, distributed with DART, takes any Stockholm format alignment as input.
This script is called **phylodirector**.

The phylo-grammar animations were made with another Perl script treealign.pl and the Berkeley MPEG Encoder.

## Kenny Duong

- Vrml Mod is a VRML reworking of tkf.pl

# Authors

Page put together by Ian Holmes.

Code by Ian Holmes, Kenny Duong and others.

This page was relocated to biowiki from the following page on 11/16/2006: