Indie Gram

From Biowiki
Jump to: navigation, search

Indie Gram

TreeTransducers, finite-state machines which can mutate parse trees, provide a framework for modeling the evolution of structured RNAs along a phylogenetic tree.

Our software Indiegram implements a tree-transducer-based model of the evolution of three structured RNAs. The TKF Structure Tree model is provided with the software, but users can create their own models using the provided Perl code-generation scripts. Given a model, the program Indiegram, distributed as part of the DART software package, perform Maximum-Likelihood inference.

The Perl scripts found in the src/indiegram/perl directory implement our model-construction algorithm for evolutionary models of structured RNAs: Given a singlet transducer and a branch transducer, they generate the corresponding evolutionary model representing three sequences related to an unknown common ancestor by a star phylogeny.

For examples of Indiegram run on real RNA, see here: Indiegram Examples

Model construction

The Perl script fourWay.pl is a driver script for the packages implementing the transducer-composition algorithms on a three-branch tree. It can be run from the dart/src/indiegram directory as

	./perl/fourWay.pl --graph tt/simpleTest.singlet.tt tt/simpleTest.branch.tt

where the --graph option tells it to generate the transition/bifurcation graph of the composite model. A full list of options is available via the --help option.

The program Indiegram uses code generated using the related driver script tripletSCFG.pl

	./perl/tripletSCFG.pl --write tt/tkfst.singlet.tt tt/tkfst.branch.tt

where the --write command tells it to generate C++ code for constructing the model using the dart API. (See files DartSrc:indiegram/tkfst.h & DartSrc:indiegram/tkfst.cc for example usage in indiegram.)

The following section describes the .tt file format used to specify singlet and branch transducers.

Transducer file formats

Here is a detailed description of file formats understood by the Perl code-generation packages run by tripletSCFG.pl.

Several example files are provided:

  1. simpleTest.singlet.tt and simpleTest.branch.tt: Left-emitting HMM-like structure; bifurcations initiated from the root.
  2. bifurcTest.singlet.tt and bifurcTest.branch.tt: Left-emitting HMM-like structure; bifurcations initiated from either the root or leaves.
  3. tkfst.singlet.tt and tkfst.branch.tt: The full TKF Structure Tree model.

We give excerpts from the simpleTest model below.

State typing

States are typed as follows:

State type symbol
Start s
Insert i
Insert (Bifurcation) bi
Match m
Match (Bifurcation) bm
End e

simpleTest.singlet.tt:

	>singlet-stateTyping
	s = s
	il = i
	Bi[s s] = bi
	e = e

simpleTest.branch.tt:

	>branch-stateTyping
	s = s
	il = i
	ml = m
	dl = m
	w = w
	Bm[s s] = bm
	e = e

State absorptions

Match states (only allowed for branch transducers) can absorb as well as emit symbols.

State absorptions can be:

Absorption profile symbol
Left terminal l
Right terminal r
Left and right terminals lr
Nonterminal nonterminal name

Bifurcations correspond to emission of nonterminals in our model-construction algorithm. In the above example the state Bm[s s] absorbs a nonterminal named Bi[s s] (which is emitted by the singlet transducer).

simpleTest.branch.tt:

	>branch-absorbProfiling
	ml = l
	dl = l
	Bm[s s] = Bi[s s]

State emissions

States can have emission profiles:

Emission profile symbol
Left terminal l
Right terminal r
Left and right terminals lr
Nonterminal nonterminal name

Bifurcations correspond to emission of nonterminals in our model-construction algorithm. In the above example the state Bi[s s] emits a nonterminal named Bi[s s].

simpleTest.singlet.tt:

	>singlet-emitProfiling
	il = l
	Bi[s s] = Bi[s s]

simpleTest.branch.tt:

	>branch-emitProfiling
	il = l
	ml = l
	dl = 
	Bm[s s] = Bm[s s]

Note that the "Delete" state dl absorbs symbols but has an empty emission.

The probability distributions of emissions can be specified as well.

simpleTest.singlet.tt:

	>singlet-emitDist
	il = p

State il emits with probability p. It is assumed that the emission probability is a function p of the emitted character(s).

simpleTest.branch.tt:

	>branch-emitDist
	il = p
	ml = m_t

The distribution m_t is interpreted to mean that the function m depends on a (time) parameter t, ie m(t).

State transitions

The transition matrix and bifurcations are specified separately.

simpleTest.singlet.tt:

	>singlet-tm
	s -> il = 1-ep(), e = ep()
	il -> il = ilp(), Bi[s s] = blp(), e = 1-ilp()-blp()
	e ->

simpleTest.branch.tt:

	>branch-tm
	s -> il = ilp(t), w = 1-ilp(t)
	il -> il = ilp(t), w = 1-ilp(t)
	ml -> il = ilp(t), w = 1-ilp(t)
	dl -> il = ilp(t), w = 1-ilp(t)
	w -> ml = mlp(t), dl = 1-mlp(t), Bm[s s] = 1, e = 1
	e -> 

The children of bifurcations are specified as (left, center, right):

simpleTest.singlet.tt:

	>singlet-bifurc
	Bi[s s] -> (s, s, e)

simpleTest.branch.tt:

	>branch-bifurc
	Bm[s s] -> (s, s, e)

Pretty display

The user can specify the order in which to sort states. This is very useful for looking at the output of the composition algorithm.

simpleTest.singlet.tt:

	>singlet-stateSorting
	s = 0
	il = 1
	Bi[s s] = 5
	e = 10

simpleTest.branch.tt:

	>branch-stateSorting
	s = 0
	il = 1
	ml = 2
	dl = 2.5
	w = 3
	Bm[s s] = 5
	e = 10

-- Robert Bradley - 07 Aug 2008