The zip transducer
The "zip transducer" is my nickname for a class of lexicalized String Transducers capable of making finite-length local duplications (or inversions).
(The nickname "zip" is because it re-inserts recent substrings in a manner reminiscent of the Lempel Ziv compression algorithm. In fact if you marginalize out the input tape of the zip transducer described below and couple it to an Arithmetic Coding algorithm, you basically end up with LZ77 compression.)
The transducer is "lexicalized" meaning that it carries a memory of the last symbols on the input tape, and pointers into this memory.
Let be the terminal alphabet and let be the number of terminal symbols in . Each "lexicalized state" is labeled with a string of length , where is the 'th most recently absorbed symbol.
There are thus versions of every lexicalized state, one for every possible string .
Transitions between these sets of lexicalized states are constrained by the definition of as "the most recent input symbols":
- Any transition between lexicalized states where absorbs a symbol is only allowed if (where is the length of ), and .
- Any transition between lexicalized states where no symbol is absorbed is only allowed if .
Certain "copying states" are additionally labeled with an length-distance pair with . The terminology, a "length" and a "distance", is directly borrowed from the LZ77 algorithm. Every copying state is a lexicalized state, so there are versions of every copying state where .
Consider a string transducer with the following states:
- , a set of lexicalized wait states;
- , a set of lexicalized match states, each absorbing a symbol and re-emitting it;
- , a set of lexicalized copying insert states, each emitting a symbol .
Transitions between the copying insert states...
...are only allowed if and . This is in addition to the usual constraints on lexicalized transitions.
Here is a table of transitions and their properties. The parameter is the probability of inserting a recent substring at any position.
|START||1||1||No symbols emitted yet||Dest. state is labeled with empty string|
|1||KM||Copy a symbol from input to output||Lexicalized transition constraints on|
|KM||Wait for next input symbol, or...|
|KMH||...start inserting a recent substring of the input|
|1||K(H-N)||Continue the insertion|
|1||KN||Terminate the insertion|
|END||1||K||End of input|
Strictly, this is under-normalized: for the first symbols of the input tape, there are fewer than possible outgoing transitions from each match state, so the outgoing transitions from sum to less than one. A correction to the transition can account for this missing probability.
We would like to develop variants of the zip transducer that
- introduce local mutations (i.e. substitutions and indels), both when matching & inserting a recent substring;
- generate inversions;
- generate a string of multiple insertions of the same recent substring;
- have more flexible structure for the probability distribution over length and distance;
- (possibly) maintain recent context in the output tape as well as the input.
We would also like to develop
- efficient DP algorithms for the zip transducer (e.g. using the fact that we always know what the last input symbols were at any given cell in the DP matrix, so we can avoid the penalty factor );
- analyses of zip transducer compositions.
Most of all, we would like to start fitting zip transducers to genome sequences.