Zip Transducer

From Biowiki
Jump to: navigation, search

The zip transducer

The "zip transducer" is my nickname for a class of lexicalized String Transducers capable of making finite-length local duplications (or inversions).

(The nickname "zip" is because it re-inserts recent substrings in a manner reminiscent of the Lempel Ziv compression algorithm. In fact if you marginalize out the input tape of the zip transducer described below and couple it to an Arithmetic Coding algorithm, you basically end up with LZ77 compression.)

The transducer is "lexicalized" meaning that it carries a memory of the last N symbols on the input tape, and pointers into this memory.

Let \Omega be the terminal alphabet and let M be the number of terminal symbols in \Omega. Each "lexicalized state" is labeled with a string S \in \Omega^\ast of length L \leq N, where S_n is the n'th most recently absorbed symbol.

There are thus K = M^{N+1}-1 versions of every lexicalized state, one for every possible string S.

Transitions between these sets of K lexicalized states are constrained by the definition of S as "the most recent N input symbols":

  • Any transition between lexicalized states A_S \to {}^x A'_{S'} where A' absorbs a symbol x is only allowed if L'=\mbox{min}(N,L+1) (where L' is the length of S'), S'_1=x and S'_{n+1}=S_n.
  • Any transition between lexicalized states A_S \to A'_{S'} where no symbol is absorbed is only allowed if S=S'.

Certain "copying states" are additionally labeled with an length-distance pair (\ell,\delta) with 1 \leq \ell \leq \delta \leq N. The terminology, \ell a "length" and \delta a "distance", is directly borrowed from the LZ77 algorithm. Every copying state is a lexicalized state, so there are KH versions of every copying state where H = \frac{1}{2}N(N+1).

Consider a string transducer with the following states:

  • \{W_S\}, a set of K lexicalized wait states;
  • \{{}^x_x M_S\}, a set of KM lexicalized match states, each absorbing a symbol x and re-emitting it;
  • \{{}_x I^{\ell,\delta}_S\}, a set of KH lexicalized copying insert states, each emitting a symbol x = S_\delta.

Transitions between the copying insert states...

{}_x I^{\ell,\delta}_S \to {}_{x'} I^{\ell',\delta'}_S

...are only allowed if \ell' = \ell - 1 and \delta' = \delta - 1. This is in addition to the usual constraints on lexicalized transitions.

Here is a table of transitions and their properties. The parameter p is the probability of inserting a recent substring at any position.

Source Destination Absorbs Emits Probability Multiplicity Notes Constraints
START W_\epsilon 1 1 No symbols emitted yet Dest. state is labeled with empty string \epsilon
W_S {}^x_x M_{S'} x x 1 KM Copy a symbol from input to output Lexicalized transition constraints on S,S'
{}^x_x M_S W_S 1-p KM Wait for next input symbol, or...
{}^x_x M_S {}_{x'} I^{\ell,\delta}_S x'=S_\delta p/H KMH ...start inserting a recent substring of the input
{}_x I^{\ell+1,\delta+1}_S {}_{x'} I^{\ell,\delta}_S x'=S_\delta 1 K(H-N) Continue the insertion
{}_x I^{1,\delta}_S W_S 1 KN Terminate the insertion
W_S END 1 K End of input

Strictly, this is under-normalized: for the first N-1 symbols of the input tape, there are fewer than H possible outgoing transitions from each match state, so the outgoing transitions from {}^x_x M_S sum to less than one. A correction to the {}^x_x M_S \to W_S transition can account for this missing probability.

We would like to develop variants of the zip transducer that

  • introduce local mutations (i.e. substitutions and indels), both when matching & inserting a recent substring;
  • generate inversions;
  • generate a string of multiple insertions of the same recent substring;
  • have more flexible structure for the probability distribution over length and distance;
  • (possibly) maintain recent context in the output tape as well as the input.

We would also like to develop

  • efficient DP algorithms for the zip transducer (e.g. using the fact that we always know what the last N input symbols were at any given cell in the DP matrix, so we can avoid the penalty factor K);
  • analyses of zip transducer compositions.

Most of all, we would like to start fitting zip transducers to genome sequences.