Hand Align Transducer
The String Transducer used by the Hand Align program for Statistical Alignment.
See Transducer Legend for explanations of symbols in this diagram. The transition weights (a, b, r, etc.) are defined below.
Here $a = \ldots$, $b = \ldots$, $\mu$ is the deletion rate, $\lambda$ is the insertion rate, $r$ is the indel extension probability, and $t$ is the evolutionary time (branch length) separating input and output sequences.
Mean gap length
Note that the mean length of an indel is $1/(1-r)$, where $r$ is the indel extension probability. In practice, handalign requires the user to specify this mean length, $L$, and the parameter $r$ is then recovered as $r = 1 - 1/L$.
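The conversion between mean gap length and extension probability can be sketched in a few lines of Python. This is a minimal illustration, not handalign's own code: it assumes an indel has length at least 1 and is extended one residue at a time with probability r, so that its length is geometric with mean 1/(1-r). The function names are hypothetical.

```python
def extension_prob_from_mean_length(mean_len: float) -> float:
    """Recover the indel extension probability r from a mean gap length.

    Assumes gap lengths are geometric on {1, 2, ...} with mean 1/(1-r),
    so r = 1 - 1/mean_len. (Hypothetical helper, not part of handalign.)
    """
    if mean_len < 1.0:
        raise ValueError("mean gap length must be at least 1")
    return 1.0 - 1.0 / mean_len


def mean_length_from_extension_prob(r: float) -> float:
    """Inverse mapping: mean gap length 1/(1-r) for 0 <= r < 1."""
    if not 0.0 <= r < 1.0:
        raise ValueError("extension probability must satisfy 0 <= r < 1")
    return 1.0 / (1.0 - r)


if __name__ == "__main__":
    r = extension_prob_from_mean_length(4.0)
    print(r)                                   # 0.75
    print(mean_length_from_extension_prob(r))  # 4.0
```

A mean gap length of exactly 1 gives r = 0, i.e. every indel is a single residue.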
The transducer is a Moore machine, so emissions may be thought of as occurring within states (M, D, I). The absorption/emission probabilities (not shown in the diagram) are related to an underlying substitution model. Specifically, let $Q$ denote an instantaneous point substitution rate matrix with equilibrium probability vector $\pi$. Denoting the input symbol by x and the output symbol by y, the I-state emits symbols (y) with probability $\pi_y$, the M-state absorbs/emits symbols (x,y) with probability (conditioned on x) given by entry $(x,y)$ of the matrix exponential $\exp(Qt)$, where $t$ is the branch length, and the D-state absorbs any symbol (x) with probability 1.
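The three emission tables described above can be computed as in the following sketch. It is illustrative only: the Jukes-Cantor rate matrix, the DNA alphabet, and all function names are assumptions chosen for the example, and the matrix exponential is a simple scaling-and-squaring Taylor approximation rather than whatever handalign uses internally.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed alphabet for this example


def jukes_cantor_rate_matrix(rate: float = 1.0) -> np.ndarray:
    """Illustrative rate matrix: every substitution equally likely.

    Each row sums to zero; 'rate' is the total rate of leaving a symbol.
    """
    n = len(ALPHABET)
    Q = np.full((n, n), rate / (n - 1))
    np.fill_diagonal(Q, -rate)
    return Q


def matrix_exponential(A: np.ndarray, squarings: int = 10, terms: int = 20) -> np.ndarray:
    """Approximate exp(A) as (exp(A / 2^s))^(2^s) with a truncated Taylor series."""
    scaled = A / (2 ** squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ scaled / k      # term is now scaled^k / k!
        result = result + term
    for _ in range(squarings):
        result = result @ result      # undo the scaling by repeated squaring
    return result


def emission_tables(Q: np.ndarray, pi: np.ndarray, t: float):
    """Return (match, insert, delete) emission tables for branch length t.

    match[x, y] = P(y | x) = exp(Q t)[x, y]   (M-state)
    insert[y]   = pi[y]                        (I-state)
    delete[x]   = 1                            (D-state absorbs any x)
    """
    match = matrix_exponential(Q * t)
    insert = np.asarray(pi, dtype=float)
    delete = np.ones(len(insert))
    return match, insert, delete


if __name__ == "__main__":
    Q = jukes_cantor_rate_matrix()
    pi = np.full(len(ALPHABET), 0.25)  # Jukes-Cantor equilibrium is uniform
    match, insert, delete = emission_tables(Q, pi, t=0.5)
    print(match.round(4))
    print(match.sum(axis=1))  # each row of conditional probabilities sums to 1
```

Because the M-state probabilities are conditioned on the input symbol x, each row of the match table is a probability distribution over output symbols y.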
The transducer shown above models the evolution along a branch, i.e. the probability of a child node given its parent. What about the original sequence - the uber-parent at the root?
The ur-ancestral (root) sequence is modeled as an IID sequence with geometrically distributed length, each character distributed according to the equilibrium distribution $\pi$. (This may be seen as a simple state machine; e.g. see Singlet Transducer.) The parameter of the length distribution is […], so the mean length is […]. In practice, handalign requires the user to specify […] instead of […]. The insertion rate is then recovered as […].
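The root model can be read as a one-state generator: keep emitting characters until a stop event occurs. The sketch below samples from such a model. It is an illustration under stated assumptions, not handalign's implementation: the parameter name `p_continue`, the uniform DNA equilibrium distribution, and the convention that a sequence of length n has probability p_continue^n (1 - p_continue) (so that empty sequences are allowed and the mean length is p_continue / (1 - p_continue)) are all choices made for this example.

```python
import random


def sample_root_sequence(pi: dict, p_continue: float, rng: random.Random) -> str:
    """Sample an IID sequence with geometrically distributed length.

    pi         -- mapping from symbol to equilibrium probability
    p_continue -- probability of emitting another character (assumed
                  parameterization; length n has probability
                  p_continue**n * (1 - p_continue))
    rng        -- random number generator, passed in for reproducibility
    """
    if not 0.0 <= p_continue < 1.0:
        raise ValueError("p_continue must satisfy 0 <= p_continue < 1")
    symbols = list(pi.keys())
    weights = list(pi.values())
    chars = []
    # Each loop iteration is one visit to the emitting state of the machine.
    while rng.random() < p_continue:
        chars.append(rng.choices(symbols, weights=weights)[0])
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(2011)
    pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed equilibrium
    for _ in range(3):
        print(repr(sample_root_sequence(pi, p_continue=0.9, rng=rng)))
```

Under this convention, `p_continue = 0.9` gives root sequences of mean length 9; a different geometric convention (for example, lengths starting at 1) would shift that mean.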
-- Ian Holmes - 14 Dec 2011