Main >
TwelveFlyScreenArchive
caf1screen_v01
How to re-score hits using other models
The following is useful for re-scoring hits from this screen under other models
(e.g. to get log-odds scores).
This applies
only to this screen.
Later screens use other means that simplify this process (e.g. storing tree in database).
Get alignments corresponding to hits
Dump all hits, in alignment coordinates, to GFF file.
We force strand to
+ so that
alignmonkey.pl doesn't revcomp when extracting alignments downstream.
Remember, alignments presented to xrate were never revcomped; strand selection was built into the model, so to re-create the original result we'll need to give xrate the
original alignment and use the
#=GC STRAND annotation to specify which strand to use.
TODO: why are we getting slightly more hits this way than in
hitsAlign_ncRnaDualStrand.dmel.align.gff?
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v01 -t hitsAlign_ncRnaDualStrand --seqid segment --start start --end end strand=+ > /tmp/v01.gff
Now, extract the hit alignments.
Make sure you SSH into
kosh, otherwise
alignmonkey.pl will consume all your machine's memory and crash!
We also replace
id with
ID for compatibility with a downstream script (
get-trees-into-hits.pl).
ssh kosh
cat /tmp/v01.gff | /nfs/src/mercator-perl/alignmonkey.pl -align segment.stock -mavid /nfs/projects/caf1screen/sge/caf1screen_v01/in | sed 's/;id=/;ID=/' > /tmp/v01.stock
Insert trees into hit alignments
Get trees from the window in which the hit was produced
(i.e. trees
after xrate adjusted the branch length for the window!)
/nfs/projects/pipeline/perl/get-trees-into-hits.pl /tmp/v01.stock /nfs/projects/caf1screen/sge/caf1screen_v01/out/ 4684 ncRnaDualStrand.xrate > /tmp/v01.stock.withtree
Now we replace the strand annotation in GFF col 7 (that we formerly forged so that
alignmonkey.pl wouldn't take the revcomp) with the proper annotation taken from the
strand attribute in col 9.
cat /tmp/v01.stock.withtree | perl -pe 'use strict; use warnings; if (/^(\#=GF.+\.\t)[+-](\t\..+strand=)([+-])(.+)$/) {$_ = join ("",$1,$3,$2,$3,$4,"\n")}' > /tmp/v01.stock.withtree.strandfixed
mv /tmp/v01.stock.withtree.strandfixed /tmp/v01.stock.withtree
Re-score using another model
You are now ready to do that.
When running on xrate, make sure you
do NOT re-adjust the branch lengths!
The trees we are using came from the original window, i.e. after the adjustment was already done.
That adjustment was done with an independent model (
dart/grammars/jukescantor.eg) and for consistency, you must use this tree on all models that produce an annotation.
Recover original annotation
You can also recover the original annotation to re-score using the original model, or another model that produces annotations in the same fashion.
Here is how to restore the
#=GC SS_cons and
#=GC STRAND Stockholm metadata.
It is stored in GFF field 9 attributes, so we extract it using a Perl one-liner.
cat v01.stock.withtree | perl -pe 'if (/secStruct=([_<>]+);strand=([+-]);/) { $ssCons = $1; $strand = $2 x length $ssCons; $_ .= "#=GC SS_cons $ssCons\n#=GC STRAND $strand\n" }' > /tmp/v01.stock.xrate
Now you can re-run the alignment through xrate again.
caf1screen_v00
Our first crack at a genome screen.
Suffers from a lot of problems, so I'm rerunning it as
#caf1screen_v01.
Regardless, I'm listing some preliminary results from it.
These should give us a baseline against which to measure progress.
- Starting point is Mavid/Mercator 12-fly alignment
- alignment has 4684 orthology segments, but only 4136 segments (those that have dmel) are screened
- split into 1,389,831 windows (120 cols each, step 40 cols)
To Do
These should be done for subsequent screens, also.
- Map joined-up hits to release 5 D.mel coords (
); send to Chris Smith (?)
- cluster our hits using an all-vs-all BLAST (& potentially stemloc)
- BLAST our hits against transposon families
- start putting xrate hits and Affy/Nimblegen transcript data into the genome browser so we can start investigating this all manually (and provide Mitch with some local users...)
Potentially would be fun to do, but not a priority:
- generate secondary structure plots for Dmel predictions for:
- eyeballing prediction quality
- potentially putting them into an AJAX GBrowse track.
Problems and improvement suggestions
- Used a 9-species tree (/nfs/data/genome/fly/tree/Pollard-median-time.hacked-CAF1.nh), which resulted in addtree.pl silently dropping the other 3 flies (dper, dsec, dwil); effectively, we ran a 9-fly screen instead of 12-fly screen.
- should be using /nfs/data/genome/fly/tree/median.nh tree from now on
- addtree.pl has been fixed to give a loud warning message about dropping sequences
- Dropped 548 Mavid/Mercator segments because they didn't have dmel sequence in them; shouldn't do that, as we might want to do other flies also (e.g. for the Nimblegen dataset)
- Filtered out a lot of windows using RNAz heuristics; shouldn't do that, our methods are different enough to not worry about that. Also, a lot of filtering was based on lack of dmel sequence - shouldn't do that, because we care about other flies, too.
- Should collect data for each window, then we can see if crappy alignments are even a problem - maybe they're not affecting quality at all.
- Storing appx. 1.4 million windows (2.8 million counting rev comps) on disk is wasteful and idiotic.
- windowlicker.pl has been written to generate windows (and their rev comps) on-the-fly and pass them to xrate
Lessons learned
- We don't really know much about coverage of the whole genome alignments, which makes it harder to say why the sensitivity is low. I did some estimates, but coverage is something to think about when doing these screens.
- Overly descriptive column names that lead to stupid errors, especially when doing intersects (for example, why say "mercator_refseq_name" for a column name when just "refseq" will do? this kind of verbosity leads to more mistakes than it fixes)
- perhaps start using Bio::DB::SeqFeature, or some other standardized schema? take a deep dive into Chado?
- 100,000 refseq naming conventions... this needs to be fixed EARLY IN THE PROJECT and at the SOURCE DATA LEVEL - make them all standard and make an "aliases" table for converting one to another, maybe with a little Perl wrapper for parsing it from any conceivable case... (Makefile, etc.)
- In a hurry, wrote my own table intersection code rather than take the time to figure out how to optimize the MySQL built-in stuff... which requires a bit of wizardry, but could save a LOT of time...
- Arguably, my intersect code has extra features (fractions of overlap computations)... but isn't there a way to do this kind of simple computation strictly in MySQL though? You can do math... can you do conditionals (if/then/else)?
Preliminary results
putative structurally conserved RNAs per "ncRnaDualStrand" model
raw hit counts
9248 unique hits (includes overlapping hits) in Mavid/Mercator alignment (see database.table
caf1screen.hits_ncRnaDualStrand).
Because a lot of species have gap stretches in M/M alignment, here is how the hits project down to the individual species (basically, we throw out all species hits with 100% gaps):
| species |
number of hits |
database.table |
| dana |
8482 |
caf1screen.hits_ncRnaDualStrand_dana |
| dmel |
9225 |
caf1screen.hits_ncRnaDualStrand_dmel |
| dmoj |
7777 |
caf1screen.hits_ncRnaDualStrand_dmoj |
| dpse |
8163 |
caf1screen.hits_ncRnaDualStrand_dpse |
| dsim |
8210 |
caf1screen.hits_ncRnaDualStrand_dsim |
| dvir |
7787 |
caf1screen.hits_ncRnaDualStrand_dvir |
| dyak |
8920 |
caf1screen.hits_ncRnaDualStrand_dyak |
NOTE that in the
above table:
- only 7 species are shown here because those are the ones for which we have transcription data
- no repeat masking has been done, so these hit counts are inflated
- this is a very crude estimate: no filtering has been done based on, e.g.:
- how well the individual species sequence aligns to the consensus structure
- how gappy the species sequence is (clearly a hit containing
A----------------U as its whole sequence is not correct)
- re-scoring the window using larger bounds
known ncRNAs/functional RNAs found
Define
known ncRNA as FlyBase
dmel release 4.3 features of type:
ncRNA.
Our hits overlap with
13 out of 116 known ncRNAs (11.2%).
Redefine
known ncRNA to also include features of types:
rRNA,
miRNA,
snoRNA,
tRNA,
tRNAscan-SE (I consider it to be good enough to count alongside experimental predictions),
snRNA.
Now, our hits overlap
63 out of 974 known ncRNAs (6.5%).
153 hits also landed in 3' UTRs (there are 13,561 3' UTRs in FlyBase for dmel)
The complete set of known ncRNAs and 3' UTRs is in database.table
caf1screen.known_ncRNA_dmel.
The intersection data between our
dmel predictions and that is in database.table
isect_known_affy_VS_ncRnaDualStrand_dmel.
correlation with other computational predictions
Specifically, with
these (choice Rfam models, snoRNA, tRNA, and miRNA pipeline predictions) - stored in database.table
caf1screen.predictions_Rfam_dmel.
We find
88 out of 842 ncRNAs (10.5%) predicted by other computational methods.
Our predictions are mostly tRNA, snoRNA, with some miRNA and ncRNA sprinkled in.
(see database.table
caf1screen.isect_predictions_Rfam_VS_ncRnaDualStrand_dmel).
Please note that only a small number of covariance models in Rfam was actually run on
dmel, which is why more transcripts aren't coming up.
Details on what goes into the Rfam et al AAA predictions are
ere.
correlation with experimental data
- Affymetrix tiling array data for dmel
- 496,587 out of 504,456 transfrags (98.4%) are in the Mavid/Mercator alignment
- 5319 out of 9225 dmel hits (57.6%) overlap with a transfrag
- 27.6% of euchromatic, nonrepeat dmel sequence is covered by a transfrag... does this represent significant enrichment?
- Nimblegen expression arrays for 6 species
- probes are made for gene predictions, including ESTs (details here, at the bottom of the page)
- >95% of the Nimblegen probes are in the Mavid/Mercator alignment
- although I have no idea how many made it into the windows, as about 10% of the segments got lost due to not having dmel as ref seq... so this could be dropping probes
- below, overlapping = overlapping with an expressed Nimblegen probe
Once again, note that due to no repeat masking or removal of "crappy" predictions, this percentage is really
lower than what it should be.
-- Created by:
AndrewUzilov on 15 Sep 2007

Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback