Telegraph is an (unfinished) project to implement (in ANSI C and bison) a compiler to generate dynamic programming algorithms for multi-tape Stochastic Context Free Grammars (SCFGs), also implemented in ANSI C.
Phylogenetic alignment is a powerful tool used for inferring phlogenetic trees from multiple sequence alignments using probabilistic methods. In particular, this type of alignment can identify the parts of the sequences that have been evolutionarily conserved, and mark these as common between species. An important application of phylogenetic alignment is the alignment of viruses, HIV for example. AIDS is caused by two viruses HIV-1, the cause of the dangerous pandemic, and HIV-2, the less widespread and damaging of the two. Moreover, the Simian Immunode ciency Virus (SIV), a related virus has been detected in monkeys and apes. Phylogenetic align- ment can be used to identify the \important" parts of the virus that can then be used in vaccine developement.
RNA secondary structure prediction is another tool that can be used to study biological function. DNA is stable enough to study, although it doesn't do much other than store information really well. In contrast, protein has the interesting quality of being directly related to function, but the added complexity of a twenty amino acid alphabet in proteins makes it di cult to work with on an information processing level. RNA, however, is both stable and interesting with its four bases and folding properties. In particular, the base-pairings defining the secondary structure are commonly more conserved between RNA than their sequences (Durbin & Eddy).
An ideal method of modelling RNA secondary structure makes use of context free grammars. RNA secondary structure is often palindromic with lengths of nested base pairings. It often comprises short base-paried stems as RNA is usually a single strand that folds onto itself. For the sequence above, we can infer this RNA stem loop structure, shown in two-dimensional form:
To generate this palindromic structure, we can create a grammar that assigns high probabilities to rules that emit these canonical A-U, G-C base pairings. This grammar captures the nesting tendencies of RNA secondary structure.
Telegraph is language designed to specify rules for a stochastic context-free grammar--grammars that allow us to probabilistically model RNA secondary structure. We are building a compiler that will ultimately be able to compile a file comprising a set of alphabets and a set of context free production rules associated with a probability. Using these production rules, the compiler generates a set of parse trees, each of which is given a cumulative probability determined by the order of production rules yielding the particular parse.
While the rules may be written in any form, such as Chomsky normal form, the compiler will convert them into RNA Normal Form, which is intended to favor emissions over bifurcation. RNA Normal Form follows the format
- L-> R
- L-> RT
- L-> x1 y1 R x2 y2
where (1) is a transition from nonterminal L to nonterminal R, (2) is a bifurcation from nonterminal L to nonterminals R and T, and (3) is a transition from nonterminal L to a nonterminal R flanked by at most one terminal from each alphabet, x and y in this example. In the parse tree, the root nodes are nonterminals, and the leaf node correspond to terminals of the alphabets. The tree with the highest score would thus represent the most probable parse of the sequence.
Telegraph source code is bundled with the DART library.
- tele.pdf: Telegraph: What has been done and what still needs to be finished
--- Last update: June 16, 2005 Sharon Chao