Rna Model Discussion

RNA models

See Rna Models.

Suggestions of improvement

From former ticket #71 ("Estimate basepair substitution rates from high-quality ncRNA alignments") by Ian, which is along the same lines:

retrain the PFOLD rates on a subset of RFAM and possibly other ncRNA alignments... not ALL the published structures (as we did before), but the ones that have high-quality basepairs... e.g. rRNA, tRNA (NB not all rRNA is in RFAM)

These were written for the ncRnaDualStrand model, but they apply in general.

TODO: organize! Some of these have been implemented - write up descriptions for them below.

New list from Twelve Fly Screen:

  • Force one hit per window
    • We will wind up with as many hits as we have windows... how do we pick the "winners"?
  • Raise probability of intergenic->RNA transition state
  • Length-normalize the CYK score
    • Unrealistically small ncRNAs clearly have the highest log P(CYK), preventing us from using it as an effective metric of hit confidence
  • Use Inside algorithm score
    • Sensitivity shoots up to almost max, but specificity is poor... but perhaps the intergenic->RNA transition probability can be lowered, although it is ridiculously low already
  • Use chain/rates trained on Rfam (from dart/grammars/rfam.eg)
    • Seems to raise sensitivity for all ncRNAs except tRNA
  • Simplify the model
    • Reduce number of parameters (e.g. reduce base pair rate matrix to 4 rates: base pair conservation, loss, gain, and no base pair to no base pair), retrain
    • Take existing Pfold rate matrix, add scaling factors corresponding to reduced parameters, retrain
  • Retrain the model to better fit fly data
    • Re-train on Rfam, but only on seqs whose evolutionary distance is similar to that of the flies
    • Parameterize on known fly ncRNAs as they appear in the whole genome alignment (leaving a fraction out for sensitivity estimates)
      • Perhaps use previous Rfam-trained model to get pseudocounts? There might be a sparse data problem with only a few hundred closely-related ncRNAs in flies to train on.
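The length-normalization idea above can be sketched quickly. This is a minimal illustration (the function name is hypothetical, not part of xrate): dividing the log-probability by alignment length keeps tiny windows from automatically winning the ranking.

```python
def length_normalized_bits(log_prob_bits, n_columns):
    """Divide a log-probability score by alignment length so short
    and long windows are comparable (hypothetical helper, not xrate)."""
    return log_prob_bits / max(n_columns, 1)

# Example: a short hit with a higher raw score vs. a longer hit.
short_hit = length_normalized_bits(-40.0, 20)    # -2.0 bits/column
long_hit = length_normalized_bits(-300.0, 200)   # -1.5 bits/column
assert long_hit > short_hit  # the longer hit now ranks higher
```

The raw scores here are made up; the point is only that per-column scores reverse the ranking that raw log P(CYK) would give.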

Old list from this page:

  • Use Inside score instead of CYK
    • No problem from the P(structure) term in P(evol)P(structure)
    • Add a column to all tables with Inside results vs CYK
  • Add a table for the Rfam-trained model from Klosterman et al (cf Pfold model, should do better)
  • Add the following metrics to tests on knowns (for correlating with how well we do):
    • Degree of conservation: identity
    • Number of sequences (having at least 1nt and >25% non-gappy) in alignment
  • Alignment quality: does an Infernal alignment/structure result in a higher xrate score?
    • Now we are using xrate as a scoring method alone instead of structure prediction/scoring
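The two metrics proposed above (degree of conservation and number of usable sequences) can be computed directly from an alignment. A minimal sketch, assuming sequences are given as equal-length strings with '-' or '.' for gaps (both function names are invented for illustration):

```python
from collections import Counter

def column_identity(alignment):
    """Mean fraction of the most common non-gap residue per column;
    a crude percent-identity measure (illustrative only)."""
    total, used = 0.0, 0
    for i in range(len(alignment[0])):
        residues = [seq[i] for seq in alignment if seq[i] not in '-.']
        if not residues:
            continue
        total += Counter(residues).most_common(1)[0][1] / len(residues)
        used += 1
    return total / used if used else 0.0

def usable_seqs(alignment):
    """Count sequences with at least 1 nt and >25% non-gap columns,
    per the criterion above."""
    n = 0
    for seq in alignment:
        gaps = sum(c in '-.' for c in seq)
        if gaps < len(seq) and (len(seq) - gaps) / len(seq) > 0.25:
            n += 1
    return n

aln = ["ACGU", "ACGA", "A--A"]
print(round(column_identity(aln), 3))  # prints 0.917
print(usable_seqs(aln))                # prints 3
```

These would make it easy to correlate model misses with alignment conservation and depth, as the list suggests.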

ncRnaDualStrand

Part of dart (see grammars/ncRnaDualStrand.eg).

ncRnaDualStrand: Description

  • Picks out 0 or more structurally conserved RNAs in the input alignment.
  • Has an intergenic "spacer" state (n).
  • Models both strands, so there is no need to scan the reverse complement.
    • The downside is that transcripts from complementary strands won't both be annotated, but such overlaps rarely occur.

ncRnaDualStrand: Training Procedure

See Twelve Fly Scan. (?)

ncRnaDualStrand: Tests

Get the data/etc. via:

  • Cluster NFS
    • /mnt/nfs/users/avu/12fly-analysis/model-debugging/
  • CVS
    • cvs -d:ext:cvs.biowiki.org:/usr/local/cvs co 12fly-analysis
    • cd 12fly-analysis/model-debugging/

Application to Twelve Fly Screen

  • Extremely low sensitivity (around 10%, depending on how you count).
    • The model rarely switches to the structural RNA state, preferring to annotate everything as intergenic.
  • Specificity not yet estimated, except in crude tests, but seems to be alright.

Tests on known ncRNAs extracted from Mavid/Mercator alignment

The Question: How will the model perform when given the exact multiple alignment?

Structural RNAs from FlyBase were extracted from the CAF1 Mavid/Mercator alignment to determine model sensitivity. The really long ones (over 500 columns) were dropped for tractability.

A lot of those alignments look crappy (see 12fly-analysis/model-debugging/*.stk), which accounts for some fraction of the misses. TODO: quantify this!

However, I think that the low sensitivity is primarily the result of transition probabilities to the RNA states being too low. Relevant excerpts from ncRnaDualStrand.eg:

 ;; state 0: START
 (transform (from (START)) (to ()) (prob 0.0001))
 (transform (from (START)) (to (FWD_RNA_GENE)) (prob 0.00005))
 (transform (from (START)) (to (REV_RNA_GENE)) (prob 0.00005))
 (transform (from (START)) (to (INTERGENIC)) (prob 0.9998))

and

 ;; state 4: INTERGENIC
 (transform (from (INTERGENIC)) (to (X INTERGENIC*))
				(annotate (row SS_cons) (column X) (label n)))
 (transform (from (INTERGENIC*)) (to ()) (prob 0.000262754))
 (transform (from (INTERGENIC*)) (to (FWD_RNA_GENE)) (prob 0.0136401))
 (transform (from (INTERGENIC*)) (to (REV_RNA_GENE)) (prob 0.0136401))
 (transform (from (INTERGENIC*)) (to (INTERGENIC)) (prob 0.972655))
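A quick sanity check on the excerpts above: each state's outgoing probabilities should sum to ~1, and the INTERGENIC self-loop probability implies a geometric distribution over intergenic run lengths. A small sketch with the numbers copied from the grammar:

```python
# Outgoing transition probabilities copied from the grammar excerpts above.
start = {'end': 0.0001, 'fwd': 0.00005, 'rev': 0.00005, 'intergenic': 0.9998}
intergenic = {'end': 0.000262754, 'fwd': 0.0136401,
              'rev': 0.0136401, 'intergenic': 0.972655}

# Each state's outgoing probabilities should sum to ~1
# (the INTERGENIC* row is off by ~2e-4, presumably rounding).
assert abs(sum(start.values()) - 1.0) < 1e-3
assert abs(sum(intergenic.values()) - 1.0) < 1e-3

# Expected length of an intergenic run is geometric:
# 1 / P(leaving the INTERGENIC self-loop).
expected_run = 1.0 / (1.0 - intergenic['intergenic'])
print(round(expected_run, 1))  # prints 36.6
```

Note the START -> RNA probabilities (5e-5 each) are far smaller than the INTERGENIC* -> RNA ones (~1.4e-2), which is where the retraining suggestions above are aimed.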

Procedure

  • Alignments were scored with the original model, which combines intergenic and structural RNA states.
  • Then, the original model was split into two:
    • structural RNA states only
    • intergenic state only
  • Each sub-model was used to score the alignments.
    • The log-odds score (lg P(all-structural-RNA) - lg P(all-intergenic)) was taken to see whether the RNA model dominated the intergenic (null) model, and by how much.
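The two-model procedure above amounts to a simple log-odds classifier. A minimal sketch (function names are hypothetical; scores would come from the xrate runs described below):

```python
def classify(logp_structural, logp_intergenic):
    """Positive call if the all-structural-RNA model dominates the
    all-intergenic (null) model, i.e. log-odds > 0."""
    return (logp_structural - logp_intergenic) > 0

def sensitivity(scores):
    """scores: (logp_structural, logp_intergenic) pairs for known
    ncRNAs, so every negative call is a false negative."""
    tp = sum(1 for s, n in scores if classify(s, n))
    fn = len(scores) - tp
    pct = 100.0 * tp / len(scores) if scores else 0.0
    return tp, fn, pct

# Toy log-probabilities, made up for illustration:
assert sensitivity([(-100.0, -110.0), (-95.0, -90.0)]) == (1, 1, 50.0)
```

Specificity on the assumed-all-negative windows works the same way with the positive/negative labels flipped.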

Results

Original ncRnaDualStrand model

The 1-model approach.

  • TP = true positive = has at least 1 non-intergenic col
  • FN = false negative = all cols annotated intergenic
 FlyBase category | TPs | FNs | % sensitivity
 miRNA            |   0 |  66 |   0.0
 ncRNA            |   1 |  49 |   2.0
 rRNA             |   0 |  96 |   0.0
 snoRNA           |   6 |  52 |  10.3
 snRNA            |   2 |  40 |   4.8
 tRNA             |   3 | 273 |   1.1

Now let's try it using the log-odds of the Inside score under ncRnaDualStrand versus the Inside score (which should be the same as the CYK score) under the null model. We also do naive specificity tests.

  • based on real ncRNA data:
    • TP = true positive = log-odds > 0
    • FN = false negative = log-odds <= 0
 FlyBase category | TPs (Inside) | FNs (Inside) | % sensitivity (Inside)
 miRNA            |           66 |            0 |                  100.0
 ncRNA            |           46 |            4 |                   92.0
 rRNA             |            6 |           90 |                    6.3
 snoRNA           |           51 |            7 |                   87.9
 snRNA            |           37 |            5 |                   88.1
 tRNA             |          253 |           23 |                   91.7
  • based on segment 1 of M/M alignment (assumed to be all negatives)
    • FP = false positive = log-odds > 0
    • TN = true negative = log-odds <= 0
 FPs (Inside) | TNs (Inside) | % specificity (Inside)
          387 |          228 |                   37.1
Log-odds of all-structural versus all-intergenic models
  • positive = log-odds > 0
  • negative = log-odds <= 0

Specificity tests were done by scoring genome screen windows from segment 1 of the Mavid/Mercator alignment and assuming that no windows contained ncRNA (so every hit is a false positive). This is a somewhat naive test, but it gives a ballpark specificity figure.

structural part of ncRnaDualStrand w/original chain vs null (intergenic) model

 FlyBase category | TPs (CYK) | FNs (CYK) | % sensitivity (CYK) | TPs (Inside) | FNs (Inside) | % sensitivity (Inside)
 miRNA            |        63 |         3 |                95.5 |              |              |
 ncRNA            |        26 |        24 |                52.0 |              |              |
 rRNA             |         0 |        96 |                 0.0 |              |              |
 snoRNA           |         8 |        50 |                13.8 |              |              |
 snRNA            |         1 |        41 |                 2.4 |              |              |
 tRNA             |       176 |       100 |                63.8 |              |              |

Data above obtained using:

FILL ME
 FPs (CYK) | TNs (CYK) | % specificity (CYK) | FPs (Inside) | TNs (Inside) | % specificity (Inside)
         1 |       614 |                99.8 |          123 |          492 |                  80.0

Data above obtained using:

cd /nfs/projects/12fly-analysis/model-debugging/
G=ncRnaDualStrand-ncRNA_only.eg make windows-mavid-seg1.xrate
G=ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.xrate
M=windows-mavid-seg1.xrate__ncRnaDualStrand-ncRNA_only.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.CYK.summary
M=windows-mavid-seg1.xrate__ncRnaDualStrand-ncRNA_only.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.Inside.summary
cat windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.CYK.summary
cat windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.Inside.summary

structural part of ncRnaDualStrand w/chain from dart/grammars/rfam.eg (i.e. rates trained on Rfam) vs null (intergenic) model

 FlyBase category | TPs (CYK) | FNs (CYK) | % sensitivity (CYK) | TPs (Inside) | FNs (Inside) | % sensitivity (Inside)
 miRNA            |        65 |         1 |                98.5 |           66 |            0 |                 100.0
 ncRNA            |        33 |        17 |                66.0 |           48 |            2 |                  96.0
 rRNA             |         0 |        96 |                 0.0 |           95 |            1 |                  99.0
 snoRNA           |        12 |        46 |                20.7 |           47 |           11 |                  81.0
 snRNA            |        16 |        26 |                38.1 |           39 |            3 |                  92.9
 tRNA             |        87 |       189 |                31.5 |          271 |            5 |                  98.2

Data above obtained using:

FILL ME
 FPs (CYK) | TNs (CYK) | % specificity (CYK) | FPs (Inside) | TNs (Inside) | % specificity (Inside)
        25 |       590 |                95.9 |          446 |          169 |                  27.5

Data above obtained using:

cd /nfs/projects/12fly-analysis/model-debugging/
G=rfam.eg make windows-mavid-seg1.xrate
G=ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.xrate
M=windows-mavid-seg1.xrate__rfam.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.CYK.summary
M=windows-mavid-seg1.xrate__rfam.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.Inside.summary
cat windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.CYK.summary
cat windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.Inside.summary

Details of Rfam training procedure

For the RNA part of the v13 grammar, I used the transition probabilities and rate matrices that were estimated for our 2006 paper on xrate. These are the same as the "rfam.eg" grammar distributed with DART (which is also the grammar Rob and Gabriel used for the massive 12-fly paper). So, the complete details of the training procedure, such as the seed grammar, are in the supplementary material for that paper (specifically Section 2.3).

However, here is the executive summary of the training procedure:

- Started with Rfam v7 seed alignments, only those annotated PUBLISHED on #=GF SS (there were 151 total).

- Partitioned those alignments into 13 "superfamilies" according to the Rfam classification (e.g. IRES, riboswitch, miRNA, ribozyme, rRNA, guide snRNA, splicing snRNA, etc.). tRNA and Group I intron were dropped, since each had only one alignment in its superfamily. Another alignment (Rfam ID RF00061) was removed because it made Pfold crash. This leaves 148 alignments in 13 superfamilies, with >=2 alignments per superfamily.

- Alignments in each superfamily were randomly split into two more-or-less equal groups. One group went into the training set (71 alignments), the other into the testing set (77 alignments).

- Pseudoknotted helices (i.e. those annotated with AAA...aaa, BBB...bbb, CCC...ccc base pairs as opposed to the primary <<<>>>) were removed prior to parameterizing the model since the SCFG can't handle them.

When I say "split" I'm not talking about splitting any single alignment (that would be silly), but rather partitioning the set of alignments in a superfamily into two subsets. The training and testing sets are therefore somewhat similar (in the sense that e.g. both contain members of the riboswitch "superfamily"), but no family (i.e. Rfam alignment with an RF ID) appears in both sets. For example, 5S rRNA will be only in the training set or only in the testing set (I don't remember where that one actually landed), but not both, although you'll have ribosomal RNA of some kind in both sets.

So, an even more terse summary: the training set is a SUBSET of about half (71 out of 151) of the Rfam v7 "published" seed alignments. It is a subset because the remaining alignments were used to test model accuracy for Klosterman 2006. Pseudoknotted helices were removed prior to training. Despite only half of the Rfam "published" alignments being in the training set, each Rfam "superfamily" is fairly represented, because about half of each "superfamily" is included; the exceptions are tRNA and Group I introns, which are completely excluded since their "superfamilies" cannot be split in two.
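The superfamily-wise split described above can be sketched in a few lines. This is an illustration of the partitioning logic, not the actual training script; the superfamily names and IDs in the demo are hypothetical:

```python
import random

def split_by_superfamily(superfamilies, seed=0):
    """Partition each superfamily's alignments into roughly equal
    train/test halves, so no single family (RF ID) appears in both
    sets. `superfamilies` maps a superfamily name to a list of IDs."""
    rng = random.Random(seed)
    train, test = [], []
    for name, families in superfamilies.items():
        if len(families) < 2:
            continue  # e.g. tRNA and Group I intron: cannot be split
        families = families[:]
        rng.shuffle(families)
        half = len(families) // 2
        train.extend(families[:half])
        test.extend(families[half:])
    return train, test

demo = {'riboswitch': ['RF1', 'RF2', 'RF3'], 'tRNA': ['RF4']}
train, test = split_by_superfamily(demo)
assert not set(train) & set(test)   # no family lands in both sets
assert 'RF4' not in train + test    # singleton superfamily dropped
```

Note the split is per superfamily, so both output sets contain members of every splittable superfamily, matching the description above.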

BTW, I put "superfamilies" in quotes because it's a term I invented; I don't know what Rfam actually calls them, if anything. If you want more clarification on that, please let me know. Here is the exact list of the alignments that were used to train the model: /nfs/projects/xgram-paper/ncrna/training-set-files.list (each directory path is basically a "superfamily"; some are more specific than others, e.g. the "cis-reg/frameshift/" superfamily vs. the "cis-reg/" superfamily, with the latter basically being "cis-reg RNAs that didn't fit into any more specific sub-category")

Note that all of the above is for the RNA part of the grammar, i.e. not the null model, which was done separately.

If you want to see the grammar itself, it's this one: /nfs/projects/caf1screen/grammars/ncRnaDualStrand_v13.eg. The comments in its header contain some more details that you might find useful.

---

-- Created by: Andrew Uzilov on 11 May 2007