Introduction

The following sketch of my group's interests in "rate grammars" is not intended to be complete, or even 1% complete. There is much more to rate grammars than is presented below. For example, this page currently does not mention the use of homomorphisms in grammar analysis; nor does it describe the many TKF variants known to our group. I have not gone into detail about the various MCMC frameworks that we have devised to sample trajectories/histories/alignments, nor our application of the EM algorithm to the TKF model. I could bang on about connections to complexity theory and spin glasses and so forth... but this is not meant to be a full paper; just a taster.

I am currently writing an introductory article on rate grammars that will touch on many of these aspects. In writing this webpage, it is my intention to give a flavour of the variety of models and algorithms that are "common lore" in several groups I have worked in, particularly Jotun Hein's Bioinformatics Group at Oxford. In return I ask that, if you use these ideas in a publication, presentation or software package, you mention biowiki.org (check back here first to see if there is anything in print yet!).

This is a very exciting time to be working on statistical alignment and rate grammars: there is a good critical mass in several places, including Berkeley and Oxford, and researchers are attacking many of the deep, fundamental problems. At the same time, we maintain an active interest in more "applied" biological problems such as RNA structure, regulatory networks and molecular ecology, making this an excellent environment for theorist and practitioner alike. Applications for pre- and post-doctoral positions are welcomed.

Ian Holmes < ihh at berkeley dot edu >, June 2002


Rate grammars and statistical alignment

In molecular evolution and bioinformatics, sequences are often assumed to evolve by a series of independent mutation events. At any given instant, the rate of a given mutation Sold-->Snew depends only on the sequences (Sold,Snew) and not on the past or future state of the system. Statistically, this is a continuous-time stationary Markov model, where the state space is the set of all DNA (or protein) strings {S}, nicknamed "sequence space". An evolutionary process, or "history", is a path of random jumps through sequence space, and we can think of multiple alignment and phylogeny as "join-the-dots" exercises given a set of points in sequence space. More properly, we seek to calculate the likelihood of a given alignment (or tree) by summing over all consistent histories, and to find the alignment (or tree) that optimises this likelihood for a given dataset.
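To make the jump-process picture concrete, here is a minimal sketch (in Python, with an illustrative substitution rate, and substitutions only, so not a full SID model) of how one might simulate such a history as a path of exponentially-timed jumps through sequence space:

    import random

    # A toy simulation of a "history": a path of random jumps through
    # sequence space under a point-substitution process. SUB_RATE is an
    # illustrative value, not a fitted parameter; each site mutates
    # independently at this rate, jumping to a uniformly chosen other base.

    ALPHABET = "ACGT"
    SUB_RATE = 0.1   # total substitution rate per site, per unit time

    def evolve(seq, total_time, seed=0):
        """Gillespie simulation: return the list of (time, sequence) jumps."""
        rng = random.Random(seed)
        t, history = 0.0, [(0.0, seq)]
        while True:
            total_rate = SUB_RATE * len(seq)   # total mutation rate of the string
            t += rng.expovariate(total_rate)   # exponential waiting time to next event
            if t > total_time:
                return history
            i = rng.randrange(len(seq))        # pick a site uniformly
            new_base = rng.choice([b for b in ALPHABET if b != seq[i]])
            seq = seq[:i] + new_base + seq[i + 1:]
            history.append((t, seq))

    for t, s in evolve("ACGTACGT", 5.0):
        print(f"t={t:.3f}  {s}")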

Sometimes "the state of the system" is taken to include hidden variables that are not explicitly encoded in the sequence, such as the reading frame of genes, or the boundaries separating protein domains, or even the rate of substitution at each site. This corresponds to an augmentation of the state space (sequence space) so that there are dimensions to it we cannot see with an ABI sequencing machine. These hidden dimensions must be "summed out" somehow to calculate likelihoods for actual sequence data. Adding in such variables may compromise the "purity" of the model (although the variables can represent experimentally observable things like the reading frame), but they can add some degree of realism while keeping the model tractable.

For sequence evolution models that do not use hidden variables (and for some that do), a useful way to specify the model completely is to use a rate grammar. This is akin to a stochastic grammar---a set of probabilistic rules for transforming strings---except now, instead of probabilities for each rule, we have rates. Unlike probabilities, rates do not have to sum to one; instead we require that the total mutation rate of a string is finite. Using the + symbol for string concatenation and |S| for the length of a string, we can then (for example) write

Rate( Sleft+Sdel+Sright --> Sleft+Sright ) = exp( -const * |Sdel| )

to indicate that the rate of deletion of any subsequence Sdel decays exponentially with the length of that subsequence, independently of the flanking sequence. Where there are degenerate rule applications we conventionally sum all applicable rates so that, in the above example, the total mutation rate from AAAG to AAG is three times the total mutation rate from ACCG to CCG, because in the former case there is a threefold degeneracy but in the latter case the deleted residue is unambiguously identified.
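The degeneracy convention is easy to state in code. The following sketch (with an arbitrary decay constant, chosen purely for illustration) sums the rate exp( -const * |Sdel| ) over every distinct way the deletion rule can take one string to another:

    import math

    # Total rate from s_old to s_new under the deletion rule above, summing
    # over all degenerate rule applications. CONST is an arbitrary
    # illustrative decay constant, not a fitted parameter.

    CONST = 1.0

    def total_deletion_rate(s_old, s_new):
        """Sum rates over all (start, length) deletions taking s_old to s_new."""
        d = len(s_old) - len(s_new)
        if d <= 0:
            return 0.0
        total = 0.0
        for start in range(len(s_new) + 1):
            if s_old[:start] + s_old[start + d:] == s_new:
                total += math.exp(-CONST * d)
        return total

    # Threefold degeneracy: any of the three A's can be the deleted residue.
    print(total_deletion_rate("AAAG", "AAG"))   # 3 * exp(-1)
    print(total_deletion_rate("ACCG", "CCG"))   # 1 * exp(-1)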

Rate grammars are extensible. We can introduce new rules for things like insertions and substitutions, and indeed for duplications, inversions, transpositions, microsatellite expansions and other complex mutations. We can also tweak the rates, e.g. to account for occasional rejection of deleterious mutations. Without a means to relate the models to data, all this is somewhat navel-gazing, but there is plenty of data out there; the hard thing is writing code to analyse it.

One useful class of rules is the local mutations,

Rate( Sleft+Scut+Sright --> Sleft+Spaste+Sright ) = f(Sleft,Sright,Scut,Spaste)

including point substitutions (|Scut|=|Spaste|=1), insertions (|Scut|=0,|Spaste|>0) and deletions (|Scut|>0,|Spaste|=0). Models consisting solely of these three types of mutation are called SID (substitution/insertion/deletion) models. If the function f is independent of Sleft and Sright then the model is context-insensitive.
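In code, the SID classification is just a case analysis on the lengths of Scut and Spaste (the representation below is illustrative, not from any published tool):

    # Label a local rule (Scut --> Spaste) by the lengths of the cut and
    # pasted strings, following the SID definitions above.

    def classify_local_rule(s_cut, s_paste):
        if len(s_cut) == 1 and len(s_paste) == 1:
            return "substitution"
        if len(s_cut) == 0 and len(s_paste) > 0:
            return "insertion"
        if len(s_cut) > 0 and len(s_paste) == 0:
            return "deletion"
        return "other local mutation"   # e.g. a multi-residue replacement

    print(classify_local_rule("A", "G"))   # substitution
    print(classify_local_rule("", "TT"))   # insertion
    print(classify_local_rule("CG", ""))   # deletion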

A set of rules with rates like this describes an evolutionary process in sequence space. Suppose an equilibrium probability distribution over sequences P(S) exists. A typical condition of useful "neutral" models is reversibility, implied by the detailed balance condition P(Sold)*Rate(Sold-->Snew) = P(Snew)*Rate(Snew-->Sold), i.e. the expected rates of opposing mutations must match at equilibrium. Writing the model as a rate grammar can reveal the symmetry implications of detailed balance and help us see, for example, that in reversible context-insensitive SID models the equilibrium sequence length distribution must always be geometric, and that the ratio of insertion to deletion rates must likewise be geometric in the length of the inserted or deleted subsequence (proof left as an exercise for the reader).
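As a numerical illustration of the geometric claim (an illustration, not the proof), take a context-insensitive single-residue SID model with TKF-style rates: insertion at rate lambda at each of the n+1 possible positions of a length-n sequence, and deletion at rate mu per residue. The geometric distribution then balances the probability flows between neighbouring lengths:

    # Detailed balance check on the sequence-length chain. LAMBDA and MU
    # are illustrative values with LAMBDA < MU, so an equilibrium exists.

    LAMBDA, MU = 0.2, 0.5

    def p_len(n):
        """Geometric equilibrium length distribution, P(n) = (1-r) * r^n."""
        r = LAMBDA / MU
        return (1.0 - r) * r ** n

    for n in range(6):
        ins_flow = p_len(n) * (n + 1) * LAMBDA   # n -> n+1: n+1 insertion positions
        del_flow = p_len(n + 1) * (n + 1) * MU   # n+1 -> n: n+1 deletable residues
        print(f"n={n}: insert flow {ins_flow:.6f} == delete flow {del_flow:.6f}")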

The simplest realistic (or at least ergodic) rate grammar is the Thorne-Kishino-Felsenstein (TKF) 1991 "links model", which is equivalent to a reversible context-insensitive SID model subject to two constraints: (i) only one residue can be inserted or deleted per event; (ii) the rate of deletion events is independent of the residue being deleted (i.e. of Scut). This model is analytically tractable (using independent birth-death processes for each site), leading to statistical dynamic programming algorithms for pairwise and multiple alignment of biological sequences (Thorne, Kishino & Felsenstein, Journal of Molecular Evolution, 1991; Hein, Pacific Symposium on Biocomputing, 2001; Holmes & Bruno, Bioinformatics, 2001). Recent advances such as the rate EM (Expectation Maximisation) algorithm (Holmes & Rubin, Journal of Molecular Biology, 2002) offer the possibility of parameterising multiple alignment programs quickly & optimally from data, bypassing the baroque lore of gap penalties and mismatch costs common to sequence alignment tools.
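To give a flavour of the dynamic programming involved, here is a hedged sketch of the forward algorithm for a toy three-state pair HMM (Match/Insert/Delete), summing the joint probability of two sequences over all alignments. The transition and emission numbers are illustrative placeholders, not the TKF91 parameterisation (there they would be specific functions of the insertion rate, deletion rate and divergence time), and a real implementation would also add an explicit end state and work in log space:

    from collections import defaultdict

    ALPHA = "ACGT"
    # Joint match emission (diagonal-heavy) and single-residue emission;
    # these numbers are placeholders chosen only to sum to one.
    P_MATCH = {(a, b): (0.22 if a == b else 0.01) for a in ALPHA for b in ALPHA}
    P_EMIT = {a: 0.25 for a in ALPHA}
    TRANS = {   # illustrative transition probabilities between M, I, D
        "M": {"M": 0.8, "I": 0.1, "D": 0.1},
        "I": {"M": 0.6, "I": 0.3, "D": 0.1},
        "D": {"M": 0.6, "I": 0.1, "D": 0.3},
    }

    def forward(x, y):
        """Return P(x, y) summed over all pairwise alignments."""
        F = defaultdict(float)
        F[(0, 0, "M")] = 1.0   # start "as if" in M, by convention
        for i in range(len(x) + 1):
            for j in range(len(y) + 1):
                for s in "MID":
                    f = F[(i, j, s)]
                    if f == 0.0:
                        continue
                    if i < len(x) and j < len(y):  # M emits one residue of each
                        F[(i + 1, j + 1, "M")] += f * TRANS[s]["M"] * P_MATCH[(x[i], y[j])]
                    if j < len(y):                 # I emits a residue of y only
                        F[(i, j + 1, "I")] += f * TRANS[s]["I"] * P_EMIT[y[j]]
                    if i < len(x):                 # D emits a residue of x only
                        F[(i + 1, j, "D")] += f * TRANS[s]["D"] * P_EMIT[x[i]]
        return sum(F[(len(x), len(y), s)] for s in "MID")

    print(forward("ACGT", "AGT"))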

The TKF model can be extended in tractable ways to allow evolutionary heterogeneity, e.g. by stacking "fast" and "slow" domains end-to-end and allowing insertion or deletion of whole domains. Effectively, this corresponds to adding hidden variables in the form of domain labels and boundaries (links). An open project in the group is to use such composite models to build an accurate, extensible multiple alignment/profiling tool that can be statistically trained from data. This work will be informed by benchmark results for my Handel suite of programs based on the TKF model. Related project directions include the construction of databases of mutation rates for peptide domains of interest, the investigation of "microdomain" models representing fragments of helices, sheets or turns, the design of functional classifiers based on mutational profiles and the use of heterogeneous models to predict medically informative SNP sites.

Unfortunately (or interestingly, depending on how you look at it), most halfway-realistic rate grammars are much harder to solve than the TKF model, and we must resort to more expensive MCMC methods, such as sampling from the posterior distribution of evolutionary histories. Another project in the group focuses on the development of such methods, with the aim of directly measuring the rates of a wide range of genomic mutations (big indels, context-sensitive substitutions, duplications, inversions) that are not well-handled by existing techniques. In particular, we are keen to explore systematic MCMC movesets, such as trajectory edits that are "local" in some sense (e.g. permutations, splits and merges of mutations that are adjacent on the sequence or adjacent in time), and the use of stochastic grammars (HMMs, SCFGs) as proposal distributions for MCMC movement through alignment-space.
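Schematically, such a sampler is an ordinary Metropolis-Hastings loop over histories; everything model-specific lives in the posterior and the proposal. In the sketch below, log_posterior() and propose_local() are hypothetical placeholders standing in for that model-specific code:

    import math, random

    # Skeleton of a Metropolis-Hastings sampler over evolutionary histories.
    # propose_local() would implement a "local" trajectory edit (e.g. permute,
    # split or merge adjacent mutations) and return the proposed history
    # together with the log Hastings ratio of the proposal densities.

    def metropolis_sampler(init_history, log_posterior, propose_local,
                           n_steps=10000, seed=0):
        rng = random.Random(seed)
        history = init_history
        log_p = log_posterior(history)
        samples = []
        for _ in range(n_steps):
            new_history, log_hastings = propose_local(history, rng)
            new_log_p = log_posterior(new_history)
            # accept with the usual Metropolis-Hastings probability
            if math.log(rng.random()) < new_log_p - log_p + log_hastings:
                history, log_p = new_history, new_log_p
            samples.append(history)
        return samples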

Our work on evolutionary rate grammars will be closely tied to analysis of real biological data. Close collaborations are planned with Chris Ponting's group in the nearby MRC Functional Genetics unit, investigating the evolution and organisation of protein domains. Mutations in noncoding genome sequence are another area of study. Our general philosophy is to demonstrate the superiority of evolutionary alignment methods by applying statistical algorithms to interesting biological problems. In particular, all members of the group will be encouraged to develop a test/benchmark set based on a "pet domain" including multiple alignments, literature references, crystallographic structures (where available) and other relevant data.