Back to
TwelveFlyScreen for a top-level overview of this project.
Results
See
TwelveFlyScreen for descriptions of these screens.
dmel results
| caf1screen |
# of unique hits |
# of unique known ncRNAs found |
# of hits in Affy set |
# of "good hits" |
| v01 |
17,144 |
92 |
9,496 |
25 |
| v02 |
17,793 |
128 |
9,594 |
30 |
| v02_unfiltered |
32,601 |
140 |
13,697 |
74 |
| v03 |
3,080,801 |
need to filter down to "hits above cutoff" before can fill these, otherwise we have no prediction set (every window has a hit!) |
| v08 |
49,284 |
100 |
28,929 |
53 |
| v12 |
1,592,920 |
need to filter down to "hits above cutoff" before can fill these, otherwise we have no prediction set (every window has a hit!) TODO: pick cutoff |
| v13 |
1,592,920 |
need to filter down to "hits above cutoff" before can fill these, otherwise we have no prediction set (every window has a hit!) TODO: pick cutoff |
known ncRNAs found - note that more than one hit might overlap a single ncRNA, so we have to filter the table
isect_hitsGenomic_MODELNAME_dmel_VS_known_ncRNA it down to
unique ncRNAs found like this:
SELECT DISTINCT t2_start,t2_end,t2_strand,t2_note FROM isect_hitsGenomic_...
hits in Affy set = unique hits of any kind that are 100% in some Affy transfrag (selected using
SELECT DISTINCT refseq,t1_start,t1_end,t1_strand FROM isect_hitsGenomic_MODELNAME_VS_known_affy WHERE percent_t1_in_t2 = 100;)
good hits = unique hits that are 100% in some Affy transfrag, but don't overlap a known (i.e. novel discoveries), and also have >20nt in dmel and >=5 base pairs (based on the fact that smallest # of bp in a "Published" Rfam v7 structure is 5 bp); they are in table
goodhits3_...
Note that "known ncRNA" does not take strand into account when doing the comparison (so strands
don't have to match when calling it "overlap").
Known FlyBase ncRNAs are further filtered to <500nt, because there are some very long, odd entries... I am not sure what the source of these aberrations is, but they cannot be correct. (TODO: write FlyBase about this) (UPDATE: we may as well treat them as real, at least for filtering purposes:
http://biowiki.org/FlyBase#The_unrealistically_long_ncRNA_a)
There are 621 known ncRNAs in the "knowns" set (see
caf1screen.known_flybase_ncRNA_dmel).
v03 does not use filtering, and there are about 200x genomic hits cf with filtering, so it would be intractable to join that many w/the affy transfrags... a more efficient method is needed (we could merge affy transfrags and make them timepoint-independent, but it's only a 12x decrease... still need another order of magnitude). But
more importantly, it wouldn't tell you much - you can't report
all the hits when 1 hit/window is used, you need to filter down to a small "upper tier" and then intersect
that.
False-positive rate estimation
Given a particular score cutoff, we want to estimate our FPR. The top 5% cutoff (lgOdds > 16.397) which we chose for
the experimental verification list gives 13 hits from gotoh3, 4 hits from gotohid20 and 3 hits from gotohid21. Gotoh3 has 2466 hits,
so the FPR is approximately 0.005.
Simulation models
- gotoh3 - gotoh model with lexicalization size 3
- gotohid20 - gotoh model with two insertion/deletion states with lexicalization size 0
- gotohid21 - gotoh model with two insertion/deletion states with lexicalization size 1
We can use this FPR to estimate how many of our top 5% hits are false positives. Consider
hits which overlap mRNAs: 62.1% (67,163 of 108,168) of our top 5% of hits overlap a mRNA
(numbers from
http://biowiki.org/TwelveFlyScreen#Selecting_hits_for_experimental).
A query on filterGenomic with no lgOdds cutoff shows that 466,444 hits in total
overlap an mRNA. We therefore expect that 466,444 * FPR = 466,444 * 0.005 = 2460
of our top 5% of hits which overlap mRNAs are false positives.
The statistics needed to estimate FPRs on hits overlapping with annotations in
FlyBase
are then:
- mRNA: 466444 hits
- cDNA or EST: 560704 hits
- Affy: 216647 hits
- pseudogene: 565 hits
- ncRNA: 2426 hits
- 3' UTR: 24623 hits
- 5' UTR: 18569 hits
These numbers were obtained with queries on filterGenomic with no lgOdds cutoff.
Figures
ROC curves
And the same plot, with various hit percentage cutoffs labeled. The orange line is for 5% of hits. The grey lines represent 2% intervals.
Hits overlapping with known ncRNAs, score vs. percentage overlap, for caf1screen_v12
Hits against knowns where the known is longer than 900 bases are in faint red, the rest are in black.
For reference, the 5% cutoff was at lgOdds 16.397 (green line).
Note that the Y-axis (lgOdds) is different in each ncRNA type.
The "ncRNA" plot includes just the knowns that were labeled with the "ncRNA" SO term; it's disjoint with the others.
and a comparison of percentage of known in hit vs percentage of hit in known:
Comparing caf1screen_v12 and caf1screen_v13
hits detecting knowns vs rank cutoff
- caf1screen_v12=blue
- caf1screen_v13=red
The data points are
hit intersections with known annotation, NOT known annotations themselves.
So, because an annotation can get hit multiple times (very likely) or a hit can cross multiple annotations (much much much less likely but theoretically possible), there will be
more intersections than annotations.
There are 621 SS RNA annotations in the
FlyBase 5.1 dataset used to make this plot.
lgModelScore vs length
- sources:
- caf1screen_v12.hitsAlign_ncRnaDualStrand_v12 (data from 9/12/07 1:47pm)
- caf1screen_v13.hitsAlign_ncRnaDualStrand_v13
lgOdds vs length
scores of hits detecting known ncRNAs, overlaid on "lgOdds vs length" plot
histograms of lgModelScore, binned by length
- sources:
- caf1screen_12.hitsAlign_ncRnaDualStrand_v12, data from 9/12/07 ~3pm
- caf1screen_13.hitsAlign_ncRnaDualStrand_v13
histograms of lgOdds, binned by length
- sources:
- caf1screen_12.hitsAlign_ncRnaDualStrand_v12, data from 9/12/07 ~3pm
- caf1screen_13.hitsAlign_ncRnaDualStrand_v13
Preliminary Conclusions/Discussion
TODO: move to
TwelveFlyScreenDiscussion?
Pecan-based screens detect more known ncRNAs and find more "good hits" than Mavid-based screens, even accounting for the fact that Pecan produces slightly more hits total than Mavid.
(based on comparing v01 versus v02)
The RNAz-based 25% gap threshold is far, far too conservative and results in too many windows with known ncRNA being dropped due to reference sequence gappyness.
(based on comparing v02 versus v02_unfiltered)
Forcing 1 hit per window detects 85% of known ncRNAs.
This indicates that if a window contains a ncRNA, and you force xrate to annotate one hit in a window, it is likely that the hit will overlap with the known ncRNA (i.e., the model can pick up the location of a known ncRNA within the window, if forced to do so).
Ian and Rob don't believe this, as the probability of a forced annotation overlapping a known by chance is too high [AVU]. However, it has not been checked what scores these hits have and whether they would pass a post-filtering step based on a score cutoff.
Additionally, since each window will have a hit, the number of hits is extremely large and computing an intersect between them and large datasets (such as Affy) is intractable in practical time.
This can be resolved by filtering down to "top hits" before doing the intersect (using the hit scores), rather than filtering post-intersect.
This is a more fair way of evaluating your computational prediction method anyway (because you would not be using real data to filter out bad predictions), and should be done regardless.
(based on results of v03)
Using probabilities/rates estimated from Rfam (instead of the Pfold model) results in picking up
less known ncRNAs, although the total number of hits goes up dramatically.
Therefore, this model is less sensitive and less specific, therefore worse overall.
(based on comparing results of v02 versus v08)
-- Created by
AndrewUzilov on 15 Sep 2007