This page links to Quicktime and AVI versions of movies of evolving sequences, and of sequence inference algorithms, that I made using Perl and Roger Sayle's RASMOL. The Perl scripts are available on request from ihh at berkeley dot edu.
The movies look best in Apple's Quicktime viewer. The AVI conversions have suffered some loss of detail.
The underlying model in most cases is Thorne, Kishino & Felsenstein's 1991 "links model", as described in their article in the Journal of Molecular Evolution. The substitution model is Kimura's two-parameter model. I have used RASMOL's Carbon, Nitrogen, Oxygen & Fluorine atoms to represent A,C,G,T residues, because I have no shame. I realise VRML would be better for this, but my coding (like DNA evolution) is a random walk on a stochastic landscape, so sue me.
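For readers who want the substitution model spelled out, here is a minimal sketch of the closed-form substitution probabilities under Kimura's two-parameter model. It is in Python rather than the Perl used for the movies. The function name and the parameter names `ti_rate` (transition rate) and `tv_rate` (rate of each of the two possible transversions) are my own labels for illustration, not taken from the scripts behind this page.

```python
import math

# Each residue's transition partner (purine <-> purine, pyrimidine <-> pyrimidine).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def k2p_prob(x, y, t, ti_rate, tv_rate):
    """P(residue y at time t | residue x at time 0) under Kimura's model.

    ti_rate: rate of the single transition away from x (hypothetical name).
    tv_rate: rate of EACH of the two transversions away from x (hypothetical name).
    """
    e_tv = math.exp(-4.0 * tv_rate * t)
    e_mix = math.exp(-2.0 * (ti_rate + tv_rate) * t)
    if x == y:
        return 0.25 + 0.25 * e_tv + 0.5 * e_mix
    if TRANSITION_PARTNER[x] == y:
        return 0.25 + 0.25 * e_tv - 0.5 * e_mix
    return 0.25 - 0.25 * e_tv  # either of the two transversions


if __name__ == "__main__":
    for y in "ACGT":
        print("A ->", y, k2p_prob("A", y, 0.5, ti_rate=0.3, tv_rate=0.1))
```

Each row of the resulting matrix sums to one, and every entry tends to 1/4 at long times, since the model's equilibrium distribution is uniform.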
The point of all this is statistical alignment. This is a term introduced by Jotun Hein to describe the systematic derivation of sequence analysis algorithms from molecular evolutionary hypotheses. These hypotheses are formulated as a continuous-time Markov chain over sequence space (and possibly over structure space, e.g. Hein & Pedersen's models of gene structure evolution, various recent models of RNA secondary structure evolution). See for example Felsenstein's book Inferring Phylogenies (2004) for a review.
One reason statistical alignment is cool is that you can measure, analyse and visualise the underlying evolutionary process. That's what these movies are about.
One hundred seconds of the TKF model in all its glory. TKF sequence (Quicktime), TKF sequence (AVI)
An example "trajectory" through sequence space. TKF trajectory (Quicktime), TKF trajectory (AVI)
The TKF model with a "splitting" event, generating a tree. Illustrates how multiple alignment & phylogeny are aspects of the same graphical model inference problem, the dynamic programming solution to which was presented by Hein (PSB, 2001). TKF tree (Quicktime), TKF tree (AVI)
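The sequence dynamics in the TKF movies above can be sketched as a Gillespie-style simulation. This is a rough illustration under stated assumptions, not the Perl code that generated the movies: the function and parameter names are hypothetical, `lam` and `mu` stand for the per-link insertion and deletion rates of the links model, and substitutions follow Kimura's two-parameter model with a transition rate `ti_rate` and a per-transversion rate `tv_rate`.

```python
import random

TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSION_TARGETS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def simulate_tkf91(seq, t_end, lam, mu, ti_rate, tv_rate, rng):
    """Return a list of (time, sequence) snapshots, one per event.

    A sequence of n residues has n mortal links plus one immortal link at
    the left end. Every link inserts a residue to its right at rate lam;
    every mortal link is deleted at rate mu. Assumes lam > 0.
    """
    seq = list(seq)
    t = 0.0
    trajectory = [(t, "".join(seq))]
    sub_per_site = ti_rate + 2.0 * tv_rate
    while True:
        n = len(seq)
        ins_rate = lam * (n + 1)
        del_rate = mu * n
        total = ins_rate + del_rate + sub_per_site * n
        t += rng.expovariate(total)  # waiting time to the next event
        if t >= t_end:
            return trajectory
        u = rng.random() * total
        if n == 0 or u < ins_rate:
            # Link k (0 = immortal link) puts its child at list index k.
            seq.insert(rng.randrange(n + 1), rng.choice("ACGT"))
        elif u < ins_rate + del_rate:
            del seq[rng.randrange(n)]
        else:
            i = rng.randrange(n)
            if rng.random() * sub_per_site < ti_rate:
                seq[i] = TRANSITION_PARTNER[seq[i]]
            else:
                seq[i] = rng.choice(TRANSVERSION_TARGETS[seq[i]])
        trajectory.append((t, "".join(seq)))


if __name__ == "__main__":
    rng = random.Random(2004)
    for time, s in simulate_tkf91("ACGTACGT", 2.0, 0.9, 1.0, 0.3, 0.1, rng):
        print(f"{time:6.3f}  {s}")
```

Keeping `lam` below `mu` stops the sequence from growing without bound; in that regime the sequence length has a geometric equilibrium distribution.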
Actually, although I've said these movies are about "the TKF model", they only illustrate the sequence dynamics. They don't show one of the most important parts of the TKF 1991 paper: how Thorne et al infer evolutionary histories under this model. The alignment can be considered as being emitted by a Pair HMM (see e.g. the book Biological Sequence Analysis by Durbin et al (1998) for a review of Pair HMMs). This is also true - or approximately true - for the more realistic evolutionary models described below. Analysis of multiple sequences involves applying dynamic programming to an Evolutionary HMM (or EHMM), which is a big state machine made by combining Pair HMMs on branches of a tree (Holmes, ISMB, 2003). Here's a cartoon of an EHMM in action: Evolutionary HMM (Quicktime), Evolutionary HMM (AVI), Evolutionary HMM (MPG). And here is a legend for the EHMM movies (PDF).
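To make the Pair HMM view of the TKF model a bit more concrete, here is a sketch of the transition probabilities between Start (S), Match (M), Insert (I), Delete (D) and End (E) states for a branch of length `t`. The parameterisation is the one I believe is standard for the links model, in terms of three time-dependent quantities that the code calls `alpha`, `beta` and `gamma`. The function name, state labels and dictionary layout are illustrative assumptions only, and emissions (the substitution model) are left out.

```python
import math


def tkf91_pair_hmm(lam, mu, t):
    """Transition probabilities of a TKF91-style Pair HMM (sketch).

    lam, mu: insertion and deletion rates, with 0 < lam < mu.
    t: branch length, t > 0.
    Returns {source_state: {destination_state: probability}}.
    """
    if not (0.0 < lam < mu) or t <= 0.0:
        raise ValueError("need 0 < lam < mu and t > 0")
    alpha = math.exp(-mu * t)                      # ancestral residue survives
    growth = math.exp((lam - mu) * t)
    beta = lam * (1.0 - growth) / (mu - lam * growth)   # another insertion follows
    gamma = 1.0 - mu * beta / (lam * (1.0 - alpha))     # insertion after a deletion
    kappa = lam / mu                               # another ancestral residue exists

    def row(p_insert):
        rest = 1.0 - p_insert
        return {
            "I": p_insert,
            "M": rest * kappa * alpha,
            "D": rest * kappa * (1.0 - alpha),
            "E": rest * (1.0 - kappa),
        }

    # Start, Match and Insert share one row; Delete differs only in its
    # probability of being followed by an insertion.
    return {"S": row(beta), "M": row(beta), "I": row(beta), "D": row(gamma)}


if __name__ == "__main__":
    for src, dests in tkf91_pair_hmm(lam=0.9, mu=1.0, t=1.0).items():
        print(src, {k: round(v, 4) for k, v in dests.items()})
```

Each row sums to one by construction, which is a useful sanity check before plugging the numbers into a Forward or Viterbi recursion.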
The TKF model with an accept/reject step controlled by an order-1 Markov chain. This gives dynamics similar to a time-dependent "Ising model" with a strong energetic bonus for parallel adjacent spins. Except of course that the spins have four orthogonal directions, and can be inserted and deleted as well. Confused? Just watch the movie. (Evolutionary models of this kind have been recently analysed by Lunter & Hein (ISMB 2004), and independently by Siepel & Haussler, using complementary methods.) Ising sequence (Quicktime), Ising sequence (AVI)
More realistic long indel versions of the TKF model, allowing multiple-residue indels (and hence affine gap penalties in the dynamic programming recursions), have been independently developed by Knudsen and Miyamoto (JMB 2003) and by Miklos, Lunter & Holmes (MBE 2004). (There are also some slightly less-purist models, like Mitchison & Durbin's Tree HMMs, and Thorne et al's fragment models, that can model affine gaps to some extent.) Simulation movies of the long indel model will eventually be forthcoming on this page...
The MCMC sampling steps shown in the movies below are used in my Handel software, described in the article by Holmes & Bruno, Bioinformatics, 2001. Similar algorithms have been independently developed by Hein and Jensen. Historically, these procedures are similar to Gibbs sampling, introduced to bioinformatics by Lawrence and Liu.
These sampling steps, which are no more complex than Pair HMM alignment, are sufficient to eventually explore all alignment space (for a given tree), and so may be helpful in using more sophisticated models for statistical alignment. (Indeed, Handel can be used to sample multiple alignments from any probability distribution, using importance sampling.)
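The importance sampling idea can be sketched generically: draw alignments under a tractable model, weight each one by how much more (or less) probable it is under the model you actually care about, then resample. This is only one way to read the remark above, not a description of Handel's code. The function name and the two scoring callbacks are hypothetical.

```python
import math
import random


def importance_resample(samples, log_proposal, log_target, k, rng):
    """Resample k items from `samples` in proportion to target/proposal.

    samples: items drawn from the proposal distribution (e.g. alignments
        sampled by MCMC under a simple model).
    log_proposal, log_target: hypothetical callbacks returning the log
        probability (up to an additive constant) of one item.
    """
    log_w = [log_target(x) - log_proposal(x) for x in samples]
    biggest = max(log_w)
    # Subtract the largest log-weight before exponentiating, for stability.
    weights = [math.exp(lw - biggest) for lw in log_w]
    return rng.choices(samples, weights=weights, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [rng.choice(["AC-GT", "ACG-T", "A-CGT"]) for _ in range(1000)]
    # Toy target that prefers the first alignment ten to one.
    target = lambda a: math.log(10.0) if a == "AC-GT" else 0.0
    out = importance_resample(draws, lambda a: 0.0, target, 1000, rng)
    print(out.count("AC-GT") / len(out))
```

The resampled set is only as good as the overlap between the two distributions: if the target favours alignments the proposal almost never visits, a handful of samples will carry nearly all the weight.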
Branch-sampling involves fixing all of the tree, except for one branch. Branch-sampling (Quicktime), Branch-sampling (AVI)
Node-sampling involves adding and deleting residues to inferred ancestral nodes. Node-sampling (Quicktime), Node-sampling (AVI)
Parent-sampling is a sort of stochastic version of progressive alignment. Parent-sampling (Quicktime), Parent-sampling (AVI)
The TKF model can be parameterised quickly by EM, using the xrate and tkfidem programs in my DART package (check out via anonymous CVS). A reference for xrate is Holmes & Rubin (JMB 2002).
Of course, generic MCMC and gradient-ascent methods can be used to parameterise statistical alignment, since it's based on likelihood models. Wouldn't it be fun to have species- and even gene family-specific evolutionary simulation movies parameterised, and modelled, as accurately as possible using available published genome sequence.... hmmmm
Ian Holmes, ihh at berkeley dot edu