HandAlign Transducer
The String Transducer used by the HandAlign program for Statistical Alignment.
See Transducer Legend for explanations of symbols in this diagram. The transition weights (a, b, r, etc.) are defined below.
Here
,
,
$\mu$ is the deletion rate,
$\lambda$ is the insertion rate,
$r$ is the indel extension probability,
and $t$ is the evolutionary time (branch length) separating input and output sequences.
Mean gap length
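Note that the mean length of an indel is $1/(1-r)$, since each indel is extended by one further position with probability $r$.
In practice, handalign requires the user to specify this length, $\ell$, and the parameter $r$ is then recovered as $r = 1 - 1/\ell$.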
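The conversion between mean indel length and extension probability can be sketched as follows. This is a minimal illustration, not handalign's own code: it assumes indel lengths are geometric with P(k) = (1 - r) r^(k-1) for k >= 1, and the function names are hypothetical.

```python
def mean_length_from_extension_prob(r: float) -> float:
    """Mean of the assumed geometric length law P(k) = (1 - r) * r**(k - 1), k >= 1."""
    if not 0.0 <= r < 1.0:
        raise ValueError("extension probability must lie in [0, 1)")
    return 1.0 / (1.0 - r)


def extension_prob_from_mean_length(mean_length: float) -> float:
    """Invert mean = 1 / (1 - r) to recover the extension probability r."""
    if mean_length < 1.0:
        raise ValueError("an indel of at least one residue has mean length >= 1")
    return 1.0 - 1.0 / mean_length
```

For example, a user-specified mean indel length of 4 corresponds to an extension probability of 0.75 under this assumption.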
Emissions
The transducer is a Moore machine, so emissions may be thought of as occurring within states (M, D, I).
The absorption/emission probabilities (not shown in the diagram) are related to an underlying substitution model.
Specifically, let $Q$ denote an instantaneous point substitution rate matrix with equilibrium probability vector $\pi$.
Denoting the input symbol by $x$ and the output symbol by $y$,
the I-state emits symbols ($y$) with probability $\pi_y$,
the M-state absorbs/emits symbols ($x$, $y$) with probability (conditioned on $x$) given by the matrix exponential entry $\exp(Qt)_{xy}$,
and the D-state absorbs any symbol ($x$) with probability 1.
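The three emission rules can be illustrated with a small pure-Python sketch. It is not taken from handalign: the rate matrix (called `Q` here) is a Jukes-Cantor matrix chosen only as an example, the helper names are hypothetical, and the matrix exponential uses a simple scaling-and-squaring Taylor series rather than a production routine.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(q * t) by scaling, truncated Taylor series, and repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes m**k / k!
        term = [[v / k for v in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Example model (an assumption for illustration): Jukes-Cantor over A, C, G, T.
ALPHABET = "ACGT"
PI = [0.25, 0.25, 0.25, 0.25]
Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


def insert_emission(pi, y):
    """I-state: emit output symbol y with its equilibrium probability."""
    return pi[y]


def match_emission(q, t, x, y):
    """M-state: probability of output y given input x after time t."""
    return mat_exp(q, t)[x][y]


def delete_absorption(x):
    """D-state: absorb any input symbol with probability 1."""
    return 1.0
```

Each row of `mat_exp(Q, t)` sums to 1, which is what makes the M-state probability a proper conditional distribution over `y` for a given `x`.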
Root model
The transducer shown above models the evolution along a branch, i.e. the probability of a child node given its parent. What about the original sequence - the uber-parent at the root?
The ur-ancestral (root) sequence is modeled as an IID sequence with geometrically distributed length,
each character distributed according to the equilibrium distribution $\pi$.
(This may be seen as a simple state machine - e.g. see Singlet Transducer.)
The parameter of the length distribution is
,
so the mean length is
.
In practice, handalign requires the user to specify
instead of
.
The insertion rate
is then recovered as
.
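As a rough illustration of the root model, the sketch below samples an IID sequence with geometrically distributed length and scores a sequence under the same model. It rests on assumptions not stated above: the length law is taken to be P(k) = (1 - p) p^k for k >= 0 (so the empty sequence is allowed), `continue_prob` is a hypothetical stand-in for the length-distribution parameter, and the function names are invented for this example.

```python
import math
import random


def sample_root_sequence(pi, continue_prob, rng):
    """Sample a root sequence.

    pi: dict mapping each symbol to its equilibrium probability.
    continue_prob: assumed probability of adding one more character.
    rng: a random.Random instance, passed in for reproducibility.
    """
    symbols = list(pi)
    weights = [pi[s] for s in symbols]
    seq = []
    while rng.random() < continue_prob:
        seq.append(rng.choices(symbols, weights=weights)[0])
    return "".join(seq)


def root_log_likelihood(seq, pi, continue_prob):
    """Log-probability of seq: one 'continue' per character, then one 'stop'."""
    logp = math.log(1.0 - continue_prob)
    for symbol in seq:
        logp += math.log(continue_prob) + math.log(pi[symbol])
    return logp
```

Under this convention the mean sampled length is `continue_prob / (1 - continue_prob)`; a convention that forbids empty sequences would shift that mean by one.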
-- Ian Holmes - 14 Dec 2011