Transducer Composition Errata
Corrections to "Using guide trees to construct multiple-sequence evolutionary HMMs" (Holmes, 2003)
This page will be used to document corrections to the algorithms published in the following paper:
- Holmes &: Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics 2003;19 Suppl 1:i147-57.
Currently there is only one correction.
Additional documentation of transducer jargon can be found on the phylocomposer page.
Collapsed EHMM transition probability
The correction applies to the following portion of the manuscript as published:
- Final paragraph, Section "THE COLLAPSED EHMM" (second column, page i154)
- "For the collapsed transition probability to be nonzero, it is required [...] or back up (to an ancestor)."
The corrected version, shown below, has been implemented in the DART source code for some time, but has been undocumented save for a few obscure comments in one source file (DartSrc:handel/inline_ehmm.h, constructor for class CollapsedETrans).
The collapsed transition probability where is the source state and the destination state can be written as a product of factors:
The meaning of these factors is as follows.
---
First, recall the following definitions from the original paper (Holmes, 2003):
- Tree nodes are numbered 1,2,3...N and are sorted in preorder
- The "root branch" from node 1 to node 2 must exist, and must be the only branch from node 1
- Failed to parse (syntax error): {\bf D}(n) is the set of all descendants of node
- Branches are numbered by the node they lead to, i.e. 2,3...N
- is the "branch length" leading to node , so e.g. is the root branch length
- A "branch transducer" is associated with every branch
- is the transition probability from to for a branch transducer with branch length
- is the state space of the branch transducer (and is assumed to be independent of branch length)
- is the state type of for , and takes values in Failed to parse (unknown function "\tt"): \left\{ {\tt S}, {\tt E}, {\tt W}, \stackrel{xy}{{\tt M}}, \stackrel{x}{{\tt D}}, \stackrel{y}{{\tt I}} \right\}
Note that "branch length" can be a continuous real variable, but it doesn't have to be; in fact it can be any label associated with the branch.
Also recall, for any Collapsed EHMM state , that:
- is the state of the transducer on the branch leading to node
- is the state type of
- is the emitter node of , i.e. the highest-numbered Insert node:
Failed to parse (unknown function "\tt"): \displaystyle \beta(\phi) = \max \left\{ 2,\ \ n : \chi_n = \stackrel{y}{\tt I} \right\}
We introduce some new definitions:
- is the first node that is "eclipsed" by the emitter in state , or N+1 if none are:
Failed to parse (syntax error): \gamma(\phi) = \min \{ N+1,\ \ n : n > \beta(\phi), n \notin {\bf D}(\beta(\phi)) \}
- is the effective transition probability via intermediate wait states:
Failed to parse (unknown function "\tt"): \displaystyle \upsilon(\psi,\psi';T) = \tau(\psi,\psi';T) + \sum_{\psi'' \in \Psi:\chi(\psi'') = {\tt W}} \tau(\psi,\psi'';T) \tau(\psi'',\psi';T)
- is a shorthand for the destination emitter.
---
The algorithm for deciding whether a transition is allowed is as follows. The composite transducer state is first partitioned into subtrees as in the following table. For every node within each subtree, the appropriate condition must be satisfied if the transition is to be allowed:
Subtree | Condition | Term in |
Failed to parse (unknown function "\tt"): \upsilon(\psi_n,\psi'_n;T_n) > 0\ \vee\ \left(\psi_n = \psi'_n\ \wedge\ \chi_n={\tt W}\right) | ||
Failed to parse (unknown function "\tt"): \tau(\psi_n,\psi'_n;T_n) > 0\ \vee\ \left(\psi_n = \psi'_n\ \wedge\ \chi_n={\tt W}\right) | ||
A description of factors follows.
---
Factor prevents transducers from changing state if they are prior to the destination emitter, or eclipsed by both source and destination emitters:
---
Factor corresponds to the change of state of the transducer at the destination emitter:
---
Factor collects contributions from descendants of the destination emitter:
where
Failed to parse (lexing error): y_n = \left\{ \begin{array}{rl} \upsilon(\psi_n,\psi'_n;T_n) & \mbox{if $\chi_n \neq {\tt W}$ or $\chi'_n \neq {\tt W}$} } \\ \delta(\psi_n = \psi'_n) & \mbox{if $\chi_n = \chi'_n = {\tt W}$} \end{array} \right.
These transitions can use intermediate Wait states, which is why we use rather than .
---
Factor collects contributions from newly eclipsed transducers:
where
Failed to parse (lexing error): z_n = \left\{ \begin{array}{rl} \tau(\psi_n,\psi'_n;T_n) & \mbox{if $\chi_n \neq {\tt W}$} } \\ \delta(\psi_n = \psi'_n) & \mbox{if $\chi_n = {\tt W}$} \end{array} \right.
It is trivial to generalize the above formalism so that the state space of the branch transducer can depend on the branch label , as in the phylocomposer program.
Ian Holmes - 07 Jun 2007