Hand Align Transducer

From Biowiki
Jump to: navigation, search

HandAlign Transducer

The String Transducer used by the Hand Align program for Statistical Alignment.

See Transducer Legend for explanations of symbols in this diagram. The transition weights (a, b, r, etc.) are defined below.

This is a graph with borders and nodes that may contain hyperlinks.
About this image

Here a = exp(-\lambda t), b = exp(-\mu t), \mu is the deletion rate, \lambda \leq \mu is the insertion rate, r is the indel extension probability, and t is the evolutionary time (branch length) separating input and output sequences.

Mean gap length

Note that the mean length of an indel is \ell=1/(1-r). In practice, handalign requires the user to specify this length and the parameter r is then recovered as r=1-1/\ell.

Emissions

The transducer is a Moore machine, so emissions may be thought of as occurring within states (M, D, I). The absorption/emission probabilities (not shown in the diagram) are related to an underlying substitution model. Specifically, let R denote an instantaneous point substitution rate matrix with equilibrium probability vector \pi. Denoting the input symbol by x and the output symbol by y, the I-state emits symbols (y) with probability \pi_y, the M-state absorbs/emits symbols (x,y) with probability (conditioned on x) of the matrix exponential \exp(Rt)_{xy}, and the D-state absorbs any symbol (x) with probability 1.

Root model

The transducer shown above models the evolution along a branch, i.e. the probability of a child node given its parent. What about the original sequence - the uber-parent at the root?

The ur-ancestral (root) sequence is modeled as an IID sequence with geometrically distributed length, each character distributed according to \pi. (This may be seen as a simple state machine - e.g. see Singlet Transducer.) The parameter of the length distribution is g =1 - \lambda/\mu, so the mean length is L = (1-g)/g = \lambda / (\mu - \lambda). In practice, handalign requires the user to specify (\mu,L) instead of (\lambda,\mu). The insertion rate \lambda is then recovered as \lambda = \mu L / (L+1).

-- Ian Holmes - 14 Dec 2011