Gotoh Pair HMM

From Biowiki

Serial composition of Singlet Transducer and Gotoh Transducer:

[Figure: state graph of the composed Pair HMM]

Notationally, we can write this composition as $\stackrel{\infty}{\rightarrow} a \stackrel{\Delta T}{\rightarrow} b$.

Notes:

  1. SI and II are identical states;
  2. WW and WX are identical null states preceding EE, and both are trivially eliminated.

If the S and M states in the Gotoh Transducer have identical outgoing transition weights, then the same is true of SS and MM in the above Pair HMM.

There are several options for I/D symmetry. For example:

  1. Separable, identical distributions of gap lengths:
    • p = 0 (no ID->II, keeps distributions separable);
    • v = g * r (II->II and ID->ID);
    • d = f/(e + f) (IM->II and IM->ID);
    • d = x/(w + x) (IM->II and II->ID).
  2. Perfect exchangeability between I and D states:
    • d = g * f (IM->II and IM->ID);
    • v = g * r (II->II and ID->ID);
    • p = g * x (ID->II and II->ID).
  3. Exchangeability between I and D states when order of I's and D's is summed out:
    • d * w = g * f * q (IM->II->IM and IM->ID->IM);
    • v = g * r (II->II and ID->ID).

Note that these three options carry successively weaker assumptions about the order of insertions & deletions. In all cases, the joint distribution over the total number of I's and D's is symmetric, so that e.g. P(IDD)+P(DID)+P(DDI)=P(DII)+P(IDI)+P(IID). Option #2 implies that individual terms in this equation will cancel, e.g. P(IDD)=P(DII) and P(IDI)=P(DID). Option #1 implies that gaps can only appear in one rigid order (I's before D's), so that P(IDD)=P(IID) and all other terms are zero.
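
The symmetry claim for option #3 can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not a definitive model: it takes the transition weights implied by the parenthetical labels in the list above (IM->II = d, IM->ID = g*f, II->II = v, ID->ID = g*r, II->ID = g*x, ID->II = p, II->IM = w, ID->IM = q), ignores any factor common to both exits back to IM, and uses arbitrary parameter values. The function names are hypothetical.

```python
from itertools import permutations


def option3_weights(g, r, x, p, f):
    """Assumed gap-state transition weights under option #3.

    Imposes v = g*r and d*w = g*f*q, with w and q fixed by normalization.
    """
    v = g * r            # II->II matches ID->ID
    w = 1 - v - x        # from v + w + x = 1
    q = 1 - p - r        # from p + q + r = 1
    d = g * f * q / w    # from d*w = g*f*q
    return {
        ("M", "I"): d, ("M", "D"): g * f,
        ("I", "I"): v, ("I", "D"): g * x, ("I", "M"): w,
        ("D", "I"): p, ("D", "D"): g * r, ("D", "M"): q,
    }


def run_weight(run, t):
    """Weight of the path IM -> run -> IM, where run is a sequence of 'I'/'D'."""
    weight = t[("M", run[0])]
    for prev, nxt in zip(run, run[1:]):
        weight *= t[(prev, nxt)]
    return weight * t[(run[-1], "M")]


def total_weight(n_ins, n_del, t):
    """Sum of run weights over every distinct ordering of the I's and D's."""
    orderings = set(permutations("I" * n_ins + "D" * n_del))
    return sum(run_weight(run, t) for run in orderings)


t = option3_weights(g=0.9, r=0.4, x=0.1, p=0.1, f=0.05)
# P(IDD)+P(DID)+P(DDI) versus P(DII)+P(IDI)+P(IID): these agree
print(total_weight(1, 2, t), total_weight(2, 1, t))
# Individual orderings need not agree under option #3
print(run_weight("IDD", t), run_weight("DII", t))
```

With these weights the two totals agree to floating-point precision, while P(IDD) and P(DII) differ, which is the distinction between options #3 and #2.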

In addition to the above constraints, there are the constraints inherent to the Gotoh Transducer:

  • Probabilistic normalization:
    • a + b + c = 1,
    • d + e + f = 1,
    • p + q + r = 1,
    • v + w + x = 1.
  • S/M symmetry:
    • a = d;
    • b = e;
    • c = f.

This makes 14 parameters and either 11 constraints (for separable gap lengths), 10 constraints (for perfect I/D exchangeability) or 9 constraints (for order-independent exchangeability).

This leaves 3 free parameters for separable gaps (for example g, d & v), 4 parameters for perfect I/D exchangeability (for example g, d, v & p) and 5 parameters for order-independent exchangeability (for example g, d, v, p & x).
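
The counts can be tallied mechanically. The sketch below assumes the breakdown suggested by the bullet lists above (4 normalization constraints, 3 S/M symmetry constraints, and 4, 3 or 2 I/D symmetry constraints depending on the option); the variable names are illustrative.

```python
TOTAL_PARAMETERS = 14  # count stated in the text

# Constraints shared by all three options: normalization + S/M symmetry
SHARED_CONSTRAINTS = 4 + 3

# Number of I/D symmetry constraints listed for each option
ID_SYMMETRY_CONSTRAINTS = {
    "separable gap lengths": 4,
    "perfect I/D exchangeability": 3,
    "order-independent exchangeability": 2,
}

free_parameters = {
    name: TOTAL_PARAMETERS - (SHARED_CONSTRAINTS + n)
    for name, n in ID_SYMMETRY_CONSTRAINTS.items()
}

for name, count in free_parameters.items():
    print(f"{name}: {count} free parameters")
```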

The I/D symmetry constraints amount to a form of detailed balance. The condition of being initially at equilibrium imposes a further constraint. This can be seen e.g. by eliminating the X-tape (i.e. the ID state) and comparing the resultant marginalized Y-emitter with the Singlet Transducer.

For example, assuming separable gap lengths:

[Figure: state graph of the marginalized Y-emitter]

Here

  • a' = a + gcp/(1-gr)
  • b' = b + gcq/(1-gr)
  • c' = c + gc(q+r)/(1-gr)
  • v' = v + gxp/(1-gr)
  • w' = w + gxq/(1-gr)
  • x' = x + gx(q+r)/(1-gr)
  • d' = d + gfp/(1-gr)
  • e' = e + gfq/(1-gr)
  • f' = f + gf(q+r)/(1-gr)
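
Each correction term has the form of a geometric series over the self-loop of the eliminated ID state. As a minimal sketch, assuming that g*c is the weight of entering ID from SS, g*r is the ID self-loop and p is the exit to II (a reading of the a' formula, not something stated explicitly above), the closed form can be compared with an explicit sum over the number of self-loop steps. The function names and parameter values are illustrative.

```python
def eliminated_weight(enter, loop, leave):
    """Closed form: enter a null state, loop any number of times, then leave."""
    return enter * leave / (1 - loop)


def eliminated_weight_by_summation(enter, loop, leave, max_loops=200):
    """Same quantity as an explicit (truncated) sum over the number of self-loops."""
    return sum(enter * loop**k * leave for k in range(max_loops))


g, a, c, p, r = 0.9, 0.1, 0.05, 0.1, 0.4

# a' = a + g*c*p / (1 - g*r)
a_prime = a + eliminated_weight(g * c, g * r, p)
a_prime_check = a + eliminated_weight_by_summation(g * c, g * r, p)
print(a_prime, a_prime_check)
```

The two values agree to floating-point precision because the truncated tail of the series is negligible for a self-loop weight below 1.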

Assuming S/M symmetry, this Y-emitter is guaranteed to generate the same geometric sequence length distribution as the Singlet Transducer if the IM->EE and II->EE transitions both have probability h. (Are these the ONLY conditions under which it's a geometric distribution?) That is, if:

  • e + f' = 1 (IM->EE);
  • w + x' = 1 (II->EE).

Substituting & rearranging, these become:

  • d = gfq/(1-gr)
  • v = gfx/(1-gr)

Consider the reversible separable-gap length model, which has three free parameters (g, d, v). These two constraints would seem to leave only one free parameter, which is paradoxical. (I think you would expect three: e.g. the equilibrium sequence length distribution parameter (g), the gap opening probability (d) and the gap length distribution parameter (v).) So it appears that it isn't possible to have a transducer of this form that keeps insertions separate from deletions, is reversible and starts at equilibrium.

Now consider perfect I/D exchangeability (which mingles I's with D's, but is still reversible). You start with four free parameters (g, d, v, p). The two constraints reduce the parameter set to two free parameters (which I think can be g & d). This seems almost plausible, but it's still a little weird that you don't have a third parameter for the gap length distribution.

The weakest set of constraints is found when you have exchangeability between I and D states when the order of I's and D's is summed out. The five free parameters (g, d, v, p, x) are reduced to three (g, d, v) by the initial-equilibrium constraints. This seems to be the only scheme that allows reversibility, initial equilibrium & a free choice of gap open, gap extension and equilibrium length parameters.

(An assumption in the above reasoning is that the ONLY way the marginalized Y-emitter generates a geometric distribution is when IM->EE and II->EE both have probability h. I'm pretty sure this is true, in that anything else will be a nontrivial sum of geometric distributions.)

-- Ian Holmes - 25 Feb 2007