Clade-specific structure grammar
Motivation
- Detecting species-specific structured elements, particularly in viruses
- Estimating rate/frequency of these events, as well as their 'location' on species tree
Relevant Papers
- Steil BP, Barton DJ. Cis-active RNA elements (CREs) and picornavirus RNA replication. Virus Res. 2009 Feb;139(2):240-52. doi: 10.1016/j.virusres.2008.07.027. Epub 2008 Sep 20.
- Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res. 2004 Sep 24;32(16):4925-36. Print 2004.
- Geller R, Vignuzzi M, Andino R, Frydman J. Evolutionary constraints on chaperone-mediated folding provide an antiviral approach refractory to development of drug resistance. Genes Dev. 2007 Jan 15;21(2):195-205.
- Vignuzzi M, Wendt E, Andino R. Engineering attenuated virus vaccines by controlling replication fidelity. Nat Med. 2008 Feb;14(2):154-61. doi: 10.1038/nm1726. Epub 2008 Feb 3.
- Watts JM, Dang KK, Gorelick RJ, Leonard CW, Bess JW Jr, Swanstrom R, Burch CL, Weeks KM. Architecture and secondary structure of an entire HIV-1 RNA genome. Nature. 2009 Aug 6;460(7256):711-6. doi: 10.1038/nature08237.
Examples
S_* are structural states, whereas N_* are non-structural. This example leads me to believe that the hybrid-chains are not working correctly, since the mutations in A,B
ideally shouldn't affect the S_2 state. However, I've only just set up the grammar and not trained it or finely-set the parameters...
Example 1: "structure" conserved in C and D
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_cladeStructure -121.775
A AAAAAACGAACTCTTTTTT
B AGAAAACGAAGCATATTTT
C AAAAAACGATCCTTTTTTT
D AAAAAACGATCCTTTTTTT
#=GC N_AB ...................
#=GC N_CD ...................
#=GC N_root ...................
#=GC S_AB ...................
#=GC S_CD --<<<<--<---->->>>>
#=GC S_root ...................
#=GC S_CD: ideal <<<<<<------>>>>>>
//
Using pfold grammar:
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_pfold -90.3301
A AAAAAACGAACTCTTTTTT
B AGAAAACGAAGCATATTTT
C AAAAAACGATCCTTTTTTT
D AAAAAACGATCCTTTTTTT
#=GC PFOLD ...................
//
After macro-ifying, now works better than before (?). There was some small parameter fiddling I suppose. Also a different alignment.
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -122.65
#=GC CS .................
#=GC BP .................
A ATTAAACGAACTCTGGT
B AGAACACGAAGCATATT
C AAAAAACGATCCTTTTT
D AAAAAACGATCCTTTTT
#=GC N_AB .................
#=GC N_CD .................
#=GC N_root .................
#=GC S_AB .................
#=GC S_CD <<<<<<<--->>>>>>>
#=GC S_root .................
Example 2: structure conserved in all leaves. Should annotate to S_root, but doesn't:
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_cladeStructure -60.1976
A AAAAAACGAACCTTTTTTT
B AAAAAACGAACCTTTTTTT
C AAAAAACGATCCTTTTTTT
D AAAAAACGATCCTTTTTTT
#=GC N_AB ...................
#=GC N_CD ...................
#=GC N_root ...................
#=GC S_AB -------------------
#=GC S_CD ...................
#=GC S_root ...................
//
[oscar@sheridan testXrateGrammars]$ xrate -g pfold.eg allStem.stk
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_pfold -62.0713
A AAAAAACGAACCTTTTTTT
B AAAAAACGAACCTTTTTTT
C AAAAAACGATCCTTTTTTT
D AAAAAACGATCCTTTTTTT
#=GC PFOLD <<<<<<......>>>>>>.
//
Example 3: partially-forced state path. Alignments are annotated with binary variables indicating whether or not a structure is emitted from given state.
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -119.883
A ATTAAACGAACTCTGGT
B AGAACACGAAGCATATT
C AAAAAACGATCCTTTTT
D AAAAAACGATCCTTTTT
#=GC N_AB .................
#=GC N_CD .................
#=GC N_root .................
#=GC S_AB .................
#=GC S_AB_structBin 00000000000000000
#=GC S_CD <<<<<<<--->>>>>>>
#=GC S_CD_structBin 11111111111111111
#=GC S_root .................
#=GC S_root_structBin 00000000000000000
//
For example, if we annotate the input alignment with having a structure in the AB clade (which is purposely unreasonable), we get the following:
[oscar@sheridan testXrateGrammars]$ cat practiceTrain.stk
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -119.883
A ATTAAACGAACTCTGGT
B AGAACACGAAGCATATT
C AAAAAACGATCCTTTTT
D AAAAAACGATCCTTTTT
#=GC S_AB_structBin 11111111111111111
//
[oscar@sheridan testXrateGrammars]$ xrate -g CS_ian.eg practiceTrain.stk
# STOCKHOLM 1.0
#=GF SC_max_CladeStructure -119.883
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -138.287
A ATTAAACGAACTCTGGT
B AGAACACGAAGCATATT
C AAAAAACGATCCTTTTT
D AAAAAACGATCCTTTTT
#=GC N_AB .................
#=GC N_CD .................
#=GC N_root .................
#=GC S_AB <<<<<<<--->>>>>>>
#=GC S_AB_structBin 11111111111111111
#=GC S_CD .................
#=GC S_CD_structBin 00000000000000000
#=GC S_root .................
#=GC S_root_structBin 00000000000000000
//
Automatic naming of internal nodes. By specifying a list of groups, each group's MRCA is named and later given it's own rate-shifted state. Here this is done for rhinoviruses A,B,C and poliovirus outgroup:
Problems with partially-annotated alignments.
- Can't handle transitions between S and N states. This alignment has likelihood -inf:
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A ATTAAACGAACTCTGGT
B AGAACACGAAGCATATT
C AAAAAACGATCCTTTTT
D AAAAAACGATCCTTTTT
#=GC S_AB_structBin 00000000111111111
#=GC N_root 11111111000000000
//
- HOwever, transitions between N states are ok:
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A ATTAAACGAACTCTGGT
B AGAACACGAAGCATATT
C AAAAAACGATCCTTTTT
D AAAAAACGATCCTTTTT
#=GC N_AB 00000000000111111
#=GC N_root 11111111111000000
#=GC N_CD 00000000000000000
#=GC S_AB_structBin 00000000000000000
#=GC S_CD_structBin 00000000000000000
#=GC S_root_structBin 00000000000000000
//
Solution to previous example
- The rule i_N* -> j_S k_N forces each structural region to be flanked by a nonstructural region(see examples below). This really ought to be changed in the production rules...
This alignment has nonzero likelihood:
[oscar@garibaldi cladeStructure]$ xrate -g grammars/cladeStructure.eg alignments/practice2.stk
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -149
A ATTAAACGAACTCTGGTG
B AGAACACGAAGCATATTG
C AAAAAACGATCCTTTTTG
D AAAAAACGATCCTTTTTG
#=GC N_AB 111111111110000001
#=GC N_CD 000000000000000000
#=GC N_root 000000000000000000
#=GC S_AB ..................
#=GC S_AB_structBin 000000000000000000
#=GC S_CD ...........<<-->>.
#=GC S_CD_structBin 000000000001111110
#=GC S_root ..................
#=GC S_root_structBin 000000000000000000
//
Whereas this one evaluates to -inf:
[oscar@garibaldi cladeStructure]$ cat alignments/practice3.stk
# STOCKHOLM 1.0
#=GF NH ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A ATTAAACGAACTCTGGTG
B AGAACACGAAGCATATTG
C AAAAAACGATCCTTTTTG
D AAAAAACGATCCTTTTTG
#=GC N_AB 111111111110000000
#=GC N_root 000000000000000000
#=GC N_CD 000000000000000000
#=GC S_AB_structBin 000000000000000000
#=GC S_CD_structBin 000000000001111111
#=GC S_root_structBin 000000000000000000
- Possible workaround:
- "top level" nonterminal which spits out structured and nonstructured nonterminals. Do away with transformations between S and N states
START -> Top
Top -> i_N_Top_bif -> i_N Top
Top -> i_S_Top_bif -> i_S Top
Top -> END
--
OscarWestesson - 01 Oct 2009