Transducer Composition Errata

From Biowiki
Jump to: navigation, search

Corrections to "Using guide trees to construct multiple-sequence evolutionary HMMs" (Holmes, 2003)

This page will be used to document corrections to the algorithms published in the following paper:

Currently there is only one correction.

Additional documentation of transducer jargon can be found on the phylocomposer page.

Collapsed EHMM transition probability

The correction applies to the following portion of the manuscript as published:

  • Final paragraph, Section "THE COLLAPSED EHMM" (second column, page i154)
    • "For the collapsed transition probability t(\phi,\phi') to be nonzero, it is required [...] or back up (to an ancestor)."

The corrected version, shown below, has been implemented in the DART source code for some time, but has been undocumented save for a few obscure comments in one source file (DartSrc:handel/inline_ehmm.h, constructor for class CollapsedETrans).

The collapsed transition probability t(\phi,\phi') where \phi is the source state and \phi' the destination state can be written as a product of factors:

t(\phi,\phi') = WXYZ

The meaning of these factors is as follows.

---

First, recall the following definitions from the original paper (Holmes, 2003):

  • Tree nodes are numbered 1,2,3...N and are sorted in preorder
  • The "root branch" from node 1 to node 2 must exist, and must be the only branch from node 1
  • Failed to parse (syntax error): {\bf D}(n) is the set of all descendants of node n
  • Branches are numbered by the node they lead to, i.e. 2,3...N
  • T_n is the "branch length" leading to node n, so e.g. T_2 is the root branch length
  • A "branch transducer" is associated with every branch
  • \tau(\psi,\psi';T) is the transition probability from \psi to \psi' for a branch transducer with branch length T
  • \Psi is the state space of the branch transducer (and is assumed to be independent of branch length)
  • \chi(\psi) is the state type of \psi for \psi \in \Psi, and takes values in Failed to parse (unknown function "\tt"): \left\{ {\tt S}, {\tt E}, {\tt W}, \stackrel{xy}{{\tt M}}, \stackrel{x}{{\tt D}}, \stackrel{y}{{\tt I}} \right\}

Note that "branch length" can be a continuous real variable, but it doesn't have to be; in fact it can be any label associated with the branch.

Also recall, for any Collapsed EHMM state \phi, that:

  • \psi_n is the state of the transducer on the branch leading to node n
  • \chi_n \equiv \chi(\psi_n) is the state type of \psi_n
  • \beta(\phi) is the emitter node of \phi, i.e. the highest-numbered Insert node:

Failed to parse (unknown function "\tt"): \displaystyle \beta(\phi) = \max \left\{ 2,\ \ n : \chi_n = \stackrel{y}{\tt I} \right\}

We introduce some new definitions:

  • \gamma(\phi) is the first node that is "eclipsed" by the emitter in state \phi, or N+1 if none are:

Failed to parse (syntax error): \gamma(\phi) = \min \{ N+1,\ \ n : n > \beta(\phi), n \notin {\bf D}(\beta(\phi)) \}

  • \upsilon(\psi,\psi';T) is the effective transition probability via intermediate wait states:

Failed to parse (unknown function "\tt"): \displaystyle \upsilon(\psi,\psi';T) = \tau(\psi,\psi';T) + \sum_{\psi'' \in \Psi:\chi(\psi'') = {\tt W}} \tau(\psi,\psi'';T) \tau(\psi'',\psi';T)

  • m = \beta(\phi') is a shorthand for the destination emitter.

---

The algorithm for deciding whether a transition is allowed is as follows. The composite transducer state is first partitioned into subtrees as in the following table. For every node n within each subtree, the appropriate condition must be satisfied if the transition is to be allowed:

Subtree Condition Term in t(\phi,\phi')
2 \leq n < \beta(\phi') \psi_n = \psi'_n W
n = \beta(\phi') \tau(\psi_n,\psi'_n;T_n) > 0 X
\beta(\phi') < n < \gamma(\phi') Failed to parse (unknown function "\tt"): \upsilon(\psi_n,\psi'_n;T_n) > 0\ \vee\ \left(\psi_n = \psi'_n\ \wedge\ \chi_n={\tt W}\right) Y
\gamma(\phi') \leq n < \gamma(\phi) Failed to parse (unknown function "\tt"): \tau(\psi_n,\psi'_n;T_n) > 0\ \vee\ \left(\psi_n = \psi'_n\ \wedge\ \chi_n={\tt W}\right) Z
\max\left(\gamma(\phi),\gamma(\phi')\right) \leq n \leq N \psi_n = \psi'_n W

A description of factors W,X,Y,Z follows.

---

Factor W prevents transducers from changing state if they are prior to the destination emitter, or eclipsed by both source and destination emitters:

\displaystyle W = \left( \prod_{n=2}^{m-1} \delta(\psi_n = \psi'_n) \right) \times \left( \prod_{n=\max\left(\gamma(\phi),\gamma'(\phi)\right)}^N \delta(\psi_n = \psi'_n) \right)

---

Factor X corresponds to the change of state of the transducer at the destination emitter:

X = \tau (\psi_m, \psi'_m ;T_m)

---

Factor Y collects contributions from descendants of the destination emitter:

\displaystyle Y = \prod_{n = m + 1}^{\gamma(\phi') - 1} y_n

where

Failed to parse (lexing error): y_n = \left\{ \begin{array}{rl} \upsilon(\psi_n,\psi'_n;T_n) & \mbox{if $\chi_n \neq {\tt W}$ or $\chi'_n \neq {\tt W}$} } \\ \delta(\psi_n = \psi'_n) & \mbox{if $\chi_n = \chi'_n = {\tt W}$} \end{array} \right.

These transitions can use intermediate Wait states, which is why we use \upsilon rather than \tau.

---

Factor Z collects contributions from newly eclipsed transducers:

\displaystyle Z = \prod_{n = \gamma(\phi')}^{\gamma(\phi) - 1} z_n

where

Failed to parse (lexing error): z_n = \left\{ \begin{array}{rl} \tau(\psi_n,\psi'_n;T_n) & \mbox{if $\chi_n \neq {\tt W}$} } \\ \delta(\psi_n = \psi'_n) & \mbox{if $\chi_n = {\tt W}$} \end{array} \right.

It is trivial to generalize the above formalism so that the state space \Psi of the branch transducer can depend on the branch label T_n, as in the phylocomposer program.

Ian Holmes - 07 Jun 2007