Clade-specific structure grammar

Motivation

  • Detecting species-specific structured elements, particularly in viruses
  • Estimating rate/frequency of these events, as well as their 'location' on species tree

Relevant Papers

Examples

S_* are structural states, whereas N_* are non-structural. This example leads me to believe that the hybrid-chains are not working correctly, since the mutations in A,B ideally shouldn't affect the S_2 state. However, I've only just set up the grammar and not trained it or finely-set the parameters...

Example 1: "structure" conserved in C and D

# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_cladeStructure -121.775
A                          AAAAAACGAACTCTTTTTT
B                          AGAAAACGAAGCATATTTT
C                          AAAAAACGATCCTTTTTTT
D                          AAAAAACGATCCTTTTTTT
#=GC N_AB                  ...................
#=GC N_CD                  ...................
#=GC N_root                ...................
#=GC S_AB                  ...................
#=GC S_CD                  --<<<<--<---->->>>>
#=GC S_root                ...................
#=GC S_CD: ideal           <<<<<<------>>>>>>  
//

Using pfold grammar:

# STOCKHOLM 1.0
#=GF NH           ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_pfold -90.3301
A                 AAAAAACGAACTCTTTTTT
B                 AGAAAACGAAGCATATTTT
C                 AAAAAACGATCCTTTTTTT
D                 AAAAAACGATCCTTTTTTT
#=GC PFOLD        ...................
//


After macro-ifying, now works better than before (?). There was some small parameter fiddling I suppose. Also a different alignment.

# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -122.65
#=GC CS      .................
#=GC BP      .................
A            ATTAAACGAACTCTGGT
B            AGAACACGAAGCATATT
C            AAAAAACGATCCTTTTT
D            AAAAAACGATCCTTTTT
#=GC N_AB    .................
#=GC N_CD    .................
#=GC N_root  .................
#=GC S_AB    .................
#=GC S_CD    <<<<<<<--->>>>>>>
#=GC S_root  .................


Example 2: structure conserved in all leaves. Should annotate to S_root, but doesn't:

# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_cladeStructure -60.1976
A                          AAAAAACGAACCTTTTTTT
B                          AAAAAACGAACCTTTTTTT
C                          AAAAAACGATCCTTTTTTT
D                          AAAAAACGATCCTTTTTTT
#=GC N_AB                  ...................
#=GC N_CD                  ...................
#=GC N_root                ...................
#=GC S_AB                  -------------------
#=GC S_CD                  ...................
#=GC S_root                ...................
//
[oscar@sheridan testXrateGrammars]$ xrate -g pfold.eg allStem.stk
# STOCKHOLM 1.0
#=GF NH           ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_pfold -62.0713
A                 AAAAAACGAACCTTTTTTT
B                 AAAAAACGAACCTTTTTTT
C                 AAAAAACGATCCTTTTTTT
D                 AAAAAACGATCCTTTTTTT
#=GC PFOLD        <<<<<<......>>>>>>.
//

Example 3: partially-forced state path. Alignments are annotated with binary variables indicating whether or not a structure is emitted from given state.

# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -119.883
A                          ATTAAACGAACTCTGGT
B                          AGAACACGAAGCATATT
C                          AAAAAACGATCCTTTTT
D                          AAAAAACGATCCTTTTT
#=GC N_AB                  .................
#=GC N_CD                  .................
#=GC N_root                .................
#=GC S_AB                  .................
#=GC S_AB_structBin        00000000000000000
#=GC S_CD                  <<<<<<<--->>>>>>>
#=GC S_CD_structBin        11111111111111111
#=GC S_root                .................
#=GC S_root_structBin      00000000000000000
//

For example, if we annotate the input alignment with having a structure in the AB clade (which is purposely unreasonable), we get the following:


[oscar@sheridan testXrateGrammars]$ cat practiceTrain.stk
# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -119.883
A                          ATTAAACGAACTCTGGT
B                          AGAACACGAAGCATATT
C                          AAAAAACGATCCTTTTT
D                          AAAAAACGATCCTTTTT
#=GC S_AB_structBin        11111111111111111
//
[oscar@sheridan testXrateGrammars]$ xrate -g CS_ian.eg practiceTrain.stk
# STOCKHOLM 1.0
#=GF SC_max_CladeStructure -119.883
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -138.287
A                          ATTAAACGAACTCTGGT
B                          AGAACACGAAGCATATT
C                          AAAAAACGATCCTTTTT
D                          AAAAAACGATCCTTTTT
#=GC N_AB                  .................
#=GC N_CD                  .................
#=GC N_root                .................
#=GC S_AB                  <<<<<<<--->>>>>>>
#=GC S_AB_structBin        11111111111111111
#=GC S_CD                  .................
#=GC S_CD_structBin        00000000000000000
#=GC S_root                .................
#=GC S_root_structBin      00000000000000000
//

Automatic naming of internal nodes. By specifying a list of groups, each group's MRCA is named and later given it's own rate-shifted state. Here this is done for rhinoviruses A,B,C and poliovirus outgroup:

Picture7.png

Problems with partially-annotated alignments.

  • Can't handle transitions between S and N states. This alignment has likelihood -inf:

# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A                          ATTAAACGAACTCTGGT
B                          AGAACACGAAGCATATT
C                          AAAAAACGATCCTTTTT
D                          AAAAAACGATCCTTTTT
#=GC S_AB_structBin        00000000111111111
#=GC N_root               11111111000000000
//

  • HOwever, transitions between N states are ok:

# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A                          ATTAAACGAACTCTGGT
B                          AGAACACGAAGCATATT
C                          AAAAAACGATCCTTTTT
D                          AAAAAACGATCCTTTTT
#=GC N_AB              00000000000111111
#=GC N_root               11111111111000000
#=GC N_CD                  00000000000000000
#=GC S_AB_structBin        00000000000000000
#=GC S_CD_structBin        00000000000000000
#=GC S_root_structBin      00000000000000000


//

Solution to previous example

  • The rule i_N* -> j_S k_N forces each structural region to be flanked by a nonstructural region(see examples below). This really ought to be changed in the production rules...

This alignment has nonzero likelihood:

[oscar@garibaldi cladeStructure]$ xrate -g grammars/cladeStructure.eg alignments/practice2.stk
# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -149
A                          ATTAAACGAACTCTGGTG
B                          AGAACACGAAGCATATTG
C                          AAAAAACGATCCTTTTTG
D                          AAAAAACGATCCTTTTTG
#=GC N_AB                  111111111110000001
#=GC N_CD                  000000000000000000
#=GC N_root                000000000000000000
#=GC S_AB                  ..................
#=GC S_AB_structBin        000000000000000000
#=GC S_CD                  ...........<<-->>.
#=GC S_CD_structBin        000000000001111110
#=GC S_root                ..................
#=GC S_root_structBin      000000000000000000
//

Whereas this one evaluates to -inf:


[oscar@garibaldi cladeStructure]$ cat alignments/practice3.stk
# STOCKHOLM 1.0
#=GF NH                    ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A                          ATTAAACGAACTCTGGTG
B                          AGAACACGAAGCATATTG
C                          AAAAAACGATCCTTTTTG
D                          AAAAAACGATCCTTTTTG
#=GC N_AB              111111111110000000
#=GC N_root               000000000000000000
#=GC N_CD                  000000000000000000
#=GC S_AB_structBin        000000000000000000
#=GC S_CD_structBin        000000000001111111
#=GC S_root_structBin      000000000000000000

  • Possible workaround:
    • "top level" nonterminal which spits out structured and nonstructured nonterminals. Do away with transformations between S and N states
      START -> Top
      Top -> i_N_Top_bif -> i_N Top
      Top -> i_S_Top_bif -> i_S Top
      Top -> END
-- OscarWestesson - 01 Oct 2009

Topic revision: r16 - 2009-10-23 - OscarWestesson
 

This site is powered by the TWiki collaboration platformCopyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
TWiki Appliance - Powered by TurnKey Linux