String transducers---a.k.a. sequence transducers---are finite-state machines (that is to say, automata
) with an input tape
and an output tape
(in the case of a two-tape transducer), or an input tape
and multiple output tapes
(in the general, 1-input N-output case).
A two-tape transducer, with one input and one output, is similar to a Pair HMM
the major difference is that the probabilistic weights are normalized conditional on the input sequence.
That is, the transition & emission parameters are such that the Forward algorithm, given
, computes the conditional probability
In contrast, the Forward algorithm for a Pair HMM computes the joint probability
is the probability that the transducer converts input
A typical way to guarantee that a transducer meets this criterion for all possible
is to normalize every emission (and transition) conditional on the input symbol (and input sequence length).
As with Pair HMMs, string transducers are decent models for substitutions, indels and similar "local" mutations.
However, they can also be systematically combined to yield probabilistic models for multiple, phylogenetically-related sequences.
This algorithm, TransducerComposition
, makes them great tools for statistical alignment
, for analyzing sequence evolution on trees
and for phylogenomic modeling & reconstruction.
Other names for transducers
Alternate names for string transducers include:
- Input/Output Automata, Alignment Automata, Stochastic Automata;
- Moore Machines, Mealy Machines;
- "Conditional HMMs", meaning conditionally-normalized Pair HMMs, Multiple HMMs, Evolutionary HMMs or Phylo-HMMs.
We use these terms more-or-less interchangeably, although there are precise and subtly different definitions in common usage for all of them
(with well-defined relationships between the various definitions;
e.g. we can assert that
"the general Phylo-HMM implemented by xrate is just a special-case Evolutionary HMM without indels: a singleton transducer composed with a phylogenetic composition of branch transducers, none of which have insert or delete states"
and while this may sound cryptic, it is a well-formed and easily provable statement in transducer theory).
Spritually similar models
A number of models and inference algorithms in the molecular evolution algorithms literature are representable as transducers.
Some are almost exactly the same thing, e.g.
Others are slightly more restrictive; like the TKF91 and TKF92 models, which do not allow indels to overlap
Substitution-only models are also (trivially) representable as transducers, of course;
as, in the same way, are the various models that use a finite-state continuous time Markov chain over N alphabet symbols and an extra "gap" symbol,
sometimes with corrections that approximate transducer models.
The transducers we use here are finite in their state space, unlike for example the LongIndelModel
which introduces realistic gap length distributions to the TKF model, at the cost of increasing the alignment complexity from that of a finite-state pair HMM (or transducer) to that of a generalized pair HMM with arbitrary distributions over emission lengths.
Generalizations of transducers
is useful for modeling context-sensitive substitutions, indels and even local micro-duplications and micro-satellite dynamics.
can be used for modeling the evolution of "parse trees" (e.g. noncoding RNA gene structures), as well as strings.
It is also possible to generalize the idea to a transducer with multiple inputs
as well as outputs
, to model recombination events and other exchanges of genetic material.
Pages on biowiki
There are several pages on this wiki about transducers applied to bioinformatics:
Pages on Wikipedia
Wikipedia has some good background on finite-state transducers as used in computer science:
- Computational linguistics
- People interested in and/or developing this stuff (please add yourself)...