|
|
RNA models
See Rna Models.
Suggestions of improvement
From former ticket #71 ("Estimate basepair substitution rates from high-quality ncRNA alignments") by Ian, which is along the same lines:
retrain the PFOLD rates on a subset of RFAM and possibly other ncRNA alignments...
not ALL the published structures (as we did before), but the ones that have high-quality basepairs...
e.g. rRNA, tRNA (NB not all rRNA is in RFAM)
Written for the ncRnaDualStrand model, but apply in general.
TODO: organize! Some of these have been implemented - write up descriptions for them below.
New list from Twelve Fly Screen:
- Force one hit per window
- We will wind up with as many hits as we have windows... how do we pick the "winners"?
- Raise probability of intergenic->RNA transition state
- Length-normalize the CYK score
- Unrealistically small ncRNAs clearly have the highest log P(CYK), preventing us from using it as an effective metric of hit confidence
- Use Inside algorithm score
- Sensitivity shoots up to almost max, but specificity is poor... but perhaps the intergenic->RNA state can be lowered, although it is ridiculously low already
- Use chain/rates trained on Rfam (from dart/grammars/rfam.eg)
- Seems to raise sensitivity for all ncRNAs except tRNA
- Simplify the model
- Reduce number of parameters (e.g. reduce base pair rate matrix to 4 rates: base pair conservation, loss, gain, and no base pair to no base pair), retrain
- Take existing Pfold rate matrix, add scaling factors corresponding to reduced parameters, retrain
- Retrain the model to better fit fly data
- Re-train on Rfam, but only on seqs whose evolutionary distance is similar to that of the flies
- Parameterize on known fly ncRNAs as they appear in the whole genome alignment (leaving a fraction out for sensitivity estimates)
- Perhaps use previous Rfam-trained model to get pseudocounts? There might be a sparse data problem with only a few hundred closely-related ncRNAs in flies to train on.
Old list from this page:
- Use Inside score instead of CYK
- No problem from the P(structure) term in P(evol)P(structure)
- Add a column to all tables with Inside results vs CYK
- Add a table for the Rfam-trained model from Klosterman et al (cf Pfold model, should do better)
- Add the following metrics to tests on knowns (for correlating with how well we do):
- Degree of conservation: identity
- Number of sequences (having at least 1nt and >25% non-gappy) in alignment
- Alignment quality: does an Infernal alignment/structure result in a higher xrate score?
- Now we are using xrate as a scoring method alone instead of structure prediction/scoring
ncRnaDualStrand
Part of dart (see grammars/ncRnaDualStrand.eg).
ncRnaDualStrand: Description
- Picks out 0 or more structurally conserved RNA in the input alignment.
- Has an intergenic "spacer" state (n).
- Does both strands, no need for rev comp.
- Down side is that transcripts from complementary strands won't both be annotated, but they rarely occur.
ncRnaDualStrand: Training Procedure
See Twelve Fly Scan. (?)
ncRnaDualStrand: Tests
Get the data/etc. via:
- Cluster NFS
- /mnt/nfs/users/avu/12fly-analysis/model-debugging/
- CVS
-
cvs -d:ext:cvs.biowiki.org:/usr/local/cvs co 12fly-analysis
- cd 12fly-analysis/model-debugging/
- Extremely low sensitivity (around 10%, depending on how you count).
- Model does not like switching to the structural RNA state, prefers to annotate everything intergenic.
- Specificity not yet estimated, except in crude tests, but seems to be alright.
Tests on known ncRNAs extracted from Mavid/Mercator alignment
The Question: How will the model perform when given the exact multiple alignment?
Structural RNAs from Fly Base were extracted from the CAF1 Mavid/Mercator alignment to determining model sensitivity.
The really long ones (over 500 columns) were dropped for tractability.
A lot of those alignments look crappy (see 12fly-analysis/model-debugging/*.stk), which accounts for some fraction of the misses.
TODO: quantify this!
However, I think that the low sensitivity is primarily the result of transition probabilities to the RNA states being too low.
Relevant excerpts from ncRnaDualStrand.eg:
;; state 0: START
(transform (from (START)) (to ()) (prob 0.0001))
(transform (from (START)) (to (FWD_RNA_GENE)) (prob 0.00005))
(transform (from (START)) (to (REV_RNA_GENE)) (prob 0.00005))
(transform (from (START)) (to (INTERGENIC)) (prob 0.9998))
and
;; state 4: INTERGENIC
(transform (from (INTERGENIC)) (to (X INTERGENIC*))
(annotate (row SS_cons) (column X) (label n)))
(transform (from (INTERGENIC*)) (to ()) (prob 0.000262754))
(transform (from (INTERGENIC*)) (to (FWD_RNA_GENE)) (prob 0.0136401))
(transform (from (INTERGENIC*)) (to (REV_RNA_GENE)) (prob 0.0136401))
(transform (from (INTERGENIC*)) (to (INTERGENIC)) (prob 0.972655))
Procedure
- Alignments scored with the original model which combines intergenic and structural RNA states.
- Then, original model split into two:
- structural RNA states only
- intergenic state only
- Each sub-model used to score the alignments.
- Log-odds (
lg P(all-structural-RNA)/lg P(all-intergenic)) taken to see if the RNA model dominated the intergenic (null) model, and by how much.
Results
Original ncRnaDualStrand model
The 1-model approach.
- TP = true positive = has at least 1 non-intergenic col
- FN = false negative = all cols annotated intergenic
Now let's try it using log-odds of Inside score under ncRnaDualStrand vs Inside score (which should be same as CYK score) under null model.
We also do naive specificity tests.
- based on real ncRNA data:
- TP = true positive = log-odds > 0
- FN = false negative = log-odds <= 0
- based on segment 1 of M/M alignment (assumed to be all negatives)
- FP = false positive = log-odds > 0
- TN = true negative = log-odds <= 0
Log-odds of all-structural versus all-intergenic models
- positive = log-odds > 0
- negative = log-odds <= 0
Specificity tests were done by scoring genome screen windows of segment 1 of the Mavid/Mercator and making the assumption that no windows contained ncRNA (so, every hit is a false positive).
This is a somewhat naive test, but gives us a ballpark specificity figure.
structural part of ncRnaDualStrand w/original chain vs null (intergenic) model
Data above obtained using:
FILL ME
Data above obtained using:
cd /nfs/projects/12fly-analysis/model-debugging/
G=ncRnaDualStrand-ncRNA_only.eg make windows-mavid-seg1.xrate
G=ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.xrate
M=windows-mavid-seg1.xrate__ncRnaDualStrand-ncRNA_only.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.CYK.summary
M=windows-mavid-seg1.xrate__ncRnaDualStrand-ncRNA_only.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.Inside.summary
cat windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.CYK.summary
cat windows-mavid-seg1.ncRnaDualStrand-ncRNA_only-VS-ncRnaDualStrand-intergenic_only.Inside.summary
structural part of ncRnaDualStrand w/chain from dart/grammars/rfam.eg (i.e. rates trained on Rfam) vs null (intergenic) model
Data above obtained using:
FILL ME
Data above obtained using:
cd /nfs/projects/12fly-analysis/model-debugging/
G=rfam.eg make windows-mavid-seg1.xrate
G=ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.xrate
M=windows-mavid-seg1.xrate__rfam.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.CYK.summary
M=windows-mavid-seg1.xrate__rfam.eg N=windows-mavid-seg1.xrate__ncRnaDualStrand-intergenic_only.eg make windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.Inside.summary
cat windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.CYK.summary
cat windows-mavid-seg1.rfam-VS-ncRnaDualStrand-intergenic_only.Inside.summary
Details of Rfam training procedure
For the RNA part of the v13 grammar, I used the transition
probabilities and rate matrices that were estimated for our 2006 paper
on xrate. These are actually the same as the "rfam.eg" grammar
distributed with DART (which is also the same grammar Rob and Gabriel
used to measure for the massive 12-fly paper). So, the complete
details of the training procedure, such as seed grammar, etc., are in
the supplementary material for that paper (specifically Section 2.3).
However, here is the executive summary of the training procedure:
- Started with Rfam v7 seed alignments, only those annotated
PUBLISHED on #=GF SS (there were 151 total).
- Partitioned those alignments into 13 "superfamilies" according to
the Rfam classification (e.g. IRES, riboswitch, miRNA, ribozyme, rRNA,
guide snRNA, splicing snRNA, etc.). tRNA and Group I intron dropped
out, since they had only one alignment per superfamily. Another
alignment (Rfam ID RF00061) was removed because it made Pfold crash.
Now you have 148 alignments in 13 superfamilies, each with >=2
alignment per superfamily.
- Alignments in each superfamily were randomly split into two
more-or-less equal groups. One group went into the training set (71
alignments), the other into the testing set (77 alignments).
- Pseudoknotted helices (i.e. those annotated with AAA...aaa,
BBB...bbb, CCC...ccc base pairs as opposed to the primary <<<>>>) were
removed prior to parameterizing the model since the SCFG can't handle
them.
When I say "split" I'm not talking about splitting any single
alignment (that would be silly), but rather partitioning a set of
alignments in a superfamily into two subsets. So the training and
testing sets are somewhat similar (in the sense that e.g. both contain
members of the riboswitch "superfamily"), but there is no family (i.e.
Rfam alignment with an RF ID) that is in both sets. So for example 5S
rRNA will be only in the training or only in the testing set (don't
remember where that one actually landed), but of course not both,
although you'll have ribosomal RNA of some kind in both sets.
So, an even more terse summary:
The training set is a SUBSET of about half (71 out of 151) of Rfam v7
"published" seed alignments. It is a subset because the remaining
alignments were used for testing the model accuracy for Klosterman
2006. Pseudoknotted helices were removed prior to training. Despite
only half on Rfam "published" alignments being in the training set,
each Rfam "superfamily" is fairly represented because about half of
each "superfamily" is included, except for tRNA and Group I introns,
which are completely excluded since their "superfamilies" can't be
split in two.
BTW I put "superfamilies" in quotes because it's a term I invented
since I don't know what Rfam actually calls them, if anything. If you
want more clarification on that, please let me know. Here is the
exact list of the alignments that were used to train the model:
/nfs/projects/xgram-paper/ncrna/training-set-files.list
(each dir path is a "superfamily," basically; some are more specific
than others, e.g. the "cis-reg/frameshift/" superfamily vs. "cis-reg/"
superfamily, with the latter basically being "cis-reg RNAs that didn't
fit into any more specific sub-category")
Note that all of the above is for the RNA part of the grammar, i.e.
not the null model, which was done separately.
If you want to see the grammar itself, it's this one:
/nfs/projects/caf1screen/grammars/ncRnaDualStrand_v13.eg
The comments in its header contain some more details that you might find useful.
-- Created by: AndrewUzilov on 11 May 2007 |