This is a top-level overview of the project.
See other pages for details.
TwelveFlyScreenPredictions
gives a short overview of our prediction sets.
See also
Selecting hits for experimental verification
Exact procedure
Specifically for caf1screen_v12/ncRnaDualStrand_v12.
We built a table (
top_hitsAlign_ncRnaDualStrand_v12) from the top 5% (108,168) of our hits by requiring:
- length >= 20
- dmel sequence length after gaps removed (not length in alignment columns)
- bp >= 5
- number of base pairs in the annotation (not base pairs in dmel)
- covar >= 2
- number of covariant columns as output by
colorstock.pl
- seqs >= 3
- number of sequences that are not 100% gaps in the hit annotation
- no overlap with a mRNA, cDNA or EST, ncRNA or pseudogene
- overlap with an Affy transfrag
Here's the SQL for it:
create table top_hitsAlign_ncRnaDualStrand_v12 select ha.* from hitsAlign_ncRnaDualStrand_v12 ha inner join
filterGenomic_ncRnaDualStrand_v12_dmel fg on ha.id=fg.id inner join
filterAlign_ncRnaDualStrand_v12 fa on fa.id=fg.id
where fg.length >= 20 and fg.ntInKnown = 0 and fg.ntInMRNA = 0 and fg.ntInClone = 0 and fg.ntInAffy > 0 and fg.ntInPseudo = 0
and fa.rank <= 108168 and fa.bp >= 5 and fa.covar >= 2 and fa.seqs >= 3;
This gave us 359 hits. We then eye-balled each of these hits using
colorstock.pl. We rejected 170 (field
rejected), most commonly due to gappiness in Dmel (or other species), presence of simple repeats or poor stem structure. We assigned a
gutscore (
1 = passable,
2 = good,
3 = warm fuzzy feeling) to the others and took the top-scoring 100 to compile the prediction list attached to this email.
We compiled a list of the top 100 hits by ranking accepted hits on
gutscore, then
lgOdds.
We've also compiled some statistics on our hits. All statistics were computed using the top 5% of hits (ranked by log-odds).
- 62.1% (67,163 of 108,168) overlap a mRNA
- 68.7% (74,292 of 108,168) overlap a cDNA or EST
- 35.2% (38,072 of 108,168) overlap an Affy transfrag
- 0.69% (741 of 108,168) overlap a ncRNA
- 0.01% (14 of 108,168) overlap a pseudogene
- 1.9% (2056 of 108,168) overlap a 5' UTR
- 3.5% (3,763 of 108,168) overlap a 3' UTR
Adjacent windows may generate overlapping hits.
Duplicate hits with exactly same coordinates (i.e. matching start and end) were removed in the SGE-to-database step, but we may still have overlap.
We create a new table for non-overlapping hits only:
First, copy into it hits that don't overlap with any other hits:
Second, copy hits that
do overlap, selecting from the overlapping pair the hit with the higher
gutscore:
INSERT INTO uniq
SELECT
# select the hit with the higher gutscore from the overlapping pair
IF(t1.gutscore >= t2.gutscore,t1.id,t2.id) AS id,
IF(t1.gutscore >= t2.gutscore,t1.gutscore,t2.gutscore) AS gutscore
FROM
# this join finds all pairs of overlapping hits
top_hitsAlign_ncRnaDualStrand_v12 AS t1
INNER JOIN
top_hitsAlign_ncRnaDualStrand_v12 AS t2
ON
t1.start <= t2.end
AND
t1.end >= t2.start
AND
t1.segment = t2.segment
AND
# every hit overlaps itself, which is an artifact;
this check will throw out hits that overlap themselves
t1.id <> t2.id
AND
# we only care about non-rejected hits overlapping with non-rejected hits
t1.gutscore IS NOT NULL
AND
t2.gutscore IS NOT NULL
AND
# when hit A overlaps hit B, two intersections will be produced:
# (A,B) and (B,A); this is redundant, as it is the same intersection,
# so we keep only the intersect where hit1.start < hit2.start;
# note that we reject the opposite case by return "0" from the
# "else" of the conditional, which will make the entire join
# conditional fail
IF(t1.start < t2.start,
1,
IF(t1.start = t2.start,
# if starts are the same, sort on end
IF (t1.end < t2.end,
1,
IF(t1.end = t2.end,
# if end is the same, the hits must have same coords but are
# on different strands; we arbitrarily keep the one on the
# plus strand because selecting the higher-scoring one is
# (1) pointless, since coords are the same, and (2) would add
# too much complexity just to take care of a single case
IF (t1.strand = '+', 1, 0),
0
)
),
0
)
)
ON DUPLICATE KEY UPDATE
# if hit A overlaps hit B and hit C and has a higher gutscore than either,
# then hit A will be inserted into "uniq" for the (A,B) intersection,
# then inserted again for the (A,C) intersection, leading to a duplicate
# key error and failing the query; to prevent this, we update duplicate
# keys with the one with a higher gutscore (the hit is exactly the same,
# but we should keep the more favorable opinion of it)
uniq.id = uniq.id,
uniq.gutscore = IF(uniq.gutscore >= t1.gutscore,uniq.gutscore,t1.gutscore)
;
Now add the columns we will want for the GFF dump:
ALTER TABLE uniq
ADD COLUMN (
segment INT UNSIGNED,
start INT UNSIGNED,
end INT UNSIGNED,
strand CHAR(1),
lgOdds FLOAT
)
;
UPDATE
uniq AS u,
hitsAlign_ncRnaDualStrand_v12 AS a
SET
u.segment = a.segment,
u.start = a.start,
u.end = a.end,
u.strand = a.strand,
u.lgOdds = a.lgOdds
WHERE
u.id = a.id
;
Use
alignmonkey.pl to get the sequences for our hits, because unlike
get-hit-alignments.pl it will revcomp sequence for hits on the minus strand:
cd /nfs/projects/caf1screen/gff/caf1screen_v12/
# get everything with a high gutscore
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t uniq --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore >= 2" > for-experiment.ncRnaDualStrand_v12.gff
# how many was that?
grep -cv ^# for-experiment.ncRnaDualStrand_v12.gff
# for those with a low gutscore, sort by lgOdds, appending only enough to the list to make 100 hits
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t uniq --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore = 1 ORDER BY lgOdds DESC" | grep -v ^# | head -n 22 >> for-experiment.ncRnaDualStrand_v12.gff
# alignmonkey need kosh for its memory
ssh kosh
/nfs/src/mercator-perl/alignmonkey.pl -gff for-experiment.ncRnaDualStrand_v12.gff -align segment.stock -mavid /nfs/projects/caf1screen/sge/caf1screen_v12/in/ > for-experiment.ncRnaDualStrand_v12.stock
# convert to FASTA
perl -e 'use strict; use warnings; use Stockholm::Database; my $in = Stockholm::Database->from_file ("for-experiment.ncRnaDualStrand_v12.stock"); foreach my $stk (@$in) {my ($id) = $stk->get_gf ("GFF") =~ /id=(\w+[+-])/; my $seq = $stk->seqdata->{DroMel_CAF1}; $seq =~ s/-//g; print ">$id\n$seq\n"}' > for-experiment.ncRnaDualStrand_v12.fasta
The uniq table doesn't have secondary structure information or window coordinates, so we get those by giving dump-to-gff.pl a join instead of just a table; e.g.:
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t "uniq u inner join hitsAlign_ncRnaDualStrand_v12 h on u.id=h.id" --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore >= 2" > for-experiment-ss.ncRnaDualStrand_v12.gff
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t "uniq u inner join hitsAlign_ncRnaDualStrand_v12 h on u.id=h.id" --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore = 1 ORDER BY u.lgOdds DESC" | grep -v ^# | head -n 22 >> for-experiment-ss.ncRnaDualStrand_v12.gff
This is the preferred way to do things (so that the gffs contain all the desired info).
Troubleshooting the SQL
This applies only to the table
caf1screen_v12.top_hitsAlign_ncRnaDualStrand_v12.
Later genome screens will yield different special case hits, but they should fall in the same
types of special cases, so no extra SQL will need to be written... hopefully.
When you're changing the above self-intersect SQL, make sure the following hits get done correctly.
They were the most troublesome ones that created the need for special checking.
-
2012_827708_827828_+ and 2012_827708_827826_-
- same start coord, so choosing on smaller start to fix (A,B) vs (B,A) problem doesn't work - that's why we introduced checking on end coord and, failing that, picking the plus strand hit
-
3057_192927_193052_+ and 3057_192927_193052_-
- same coordinates, except different strands
- also has a gutscore of 1 in one window (minus strand), gutscore of 3 in another (plus strand); ideally we should pick the higher one, but arbitrarily we pick the plus strand (lucky in this case, but this is not guaranteed)
-
2866_223529_223635_+ and 2866_223597_223635_+ and 2866_223608_223631_+
- the only case of 3 hits that all overlap each other (producing 6 intersections == 3 non-redundant intersections), and also have 3 different gut scores
-
2866_223529_223635_+ has the highest gut score, so make sure that is the only one that's kept
How to just view a list of overlapping hit pairs (for debugging):
SELECT
t1.id,t1.gutscore,t2.id,t2.gutscore
FROM
top_hitsAlign_ncRnaDualStrand_v12 AS t1
INNER JOIN
top_hitsAlign_ncRnaDualStrand_v12 AS t2
ON
t1.start <= t2.end
AND
t1.end >= t2.start
AND
t1.segment = t2.segment
AND
t1.id <> t2.id
AND
t1.gutscore IS NOT NULL
AND
t2.gutscore IS NOT NULL
;
Looking at the hits in GBrowse
Here's how to get the chromosome names and genomic coordinates of the top hits in order to check them out in
FlyBase GBrowse:
select map.genbank_name, hg.start, hg.end from top_hitsAlign_ncRnaDualStrand_v12 top inner join
hitsGenomic_ncRnaDualStrand_v12_dmel hg on top.id = hg.id inner join
caf1screen.map_dmel_scaffold_names map on hg.refseq = map.genbank_accnVer
where top.gutscore > 2;
Alternately, you could just take the sequence and use
FlyBase BLAST, then click on
BLAST HIT on Genome Map, which is faster if you have just one sequence.
Computational genome screens
Data for each screen is stored in a database with the same name as the screen (e.g.
caf1screen_v12).
Summary
Unless otherwise noted, all screens below use:
- intial tree from /nfs/data/genome/fly/tree/median.nh
- branch lengths then adjusted by xrate for each window
- the grammar is in /nfs/projects/12fly-analysis/grammars/
| screen |
SGE done |
alignment source |
model |
window size/step |
pre-filtering |
other notes |
| caf1screen_v00 |
|
Mavid/Mercator |
ncRnaDualStrand |
120/40 |
RNAz heuristics |
first crack at a screen |
| caf1screen_v01 |
|
Mavid/Mercator |
ncRnaDualStrand |
120/40 |
windowlicker: -g 0.25 -b 0 -r DroMel_CAF1 |
replay of v00 with slightly less filtering, using new pipeline tools (e.g. windowlicker) |
| caf1screen_v02 |
|
Pecan/Mercator |
ncRnaDualStrand |
120/40 |
windowlicker: -g 0.25 -b 0 -r DroMel_CAF1 |
|
| caf1screen_v02.unfiltered |
|
Pecan/Mercator |
ncRnaDualStrand |
120/40 |
none |
|
| caf1screen_v03 |
|
Mavid/Mercator |
ncRnaDualStrand_force1hit |
120/40 |
none |
analysis POSTPONED in favor of caf1screen_v12 |
caf1screen_v04 |
N/A |
Pecan/Mercator |
ncRnaDualStrand_force1hit |
120/40 |
none |
CANCELED in favor of other screens |
caf1screen_v05 |
N/A |
Mavid/Mercator |
ncRnaDualStrand |
300/100 |
none |
CANCELED in favor of other screens |
caf1screen_v06 |
N/A |
Pecan/Mercator |
ncRnaDualStrand |
300/100 |
none |
CANCELED in favor of other screens |
| caf1screen_v07 |
4578 out of 4684 jobs done |
Mavid/Mercator |
ncRnaDualStrand_Rfam |
120/40 |
none |
POSTPONED/FORSAKEN since Pecan is turning out better |
| caf1screen_v08 |
|
Pecan/Mercator |
ncRnaDualStrand_Rfam |
120/40 |
windowlicker: -g 0.25 -b 0 -r DroMel_CAF1 |
|
caf1screen_v09 |
N/A |
Mavid/Mercator |
ncRnaDualStrand_eqProb |
120/40 |
|
equal intergenic/ncRNA transition probs; CANCELED, partial run showed it to be total crap (>95% columns marked as ncRNA) |
caf1screen_v10 |
N/A |
Pecan/Mercator |
ncRnaDualStrand_eqProb |
120/40 |
|
equal intergenic/ncRNA transition probs; CANCELED, partial run showed it to be total crap (>95% columns marked as ncRNA) |
caf1screen_v11 |
skipping |
| caf1screen_v12 |
|
Pecan/Mercator |
ncRnaDualStrand_v12 |
300/100 |
windowlicker: -g 0.9 -b 0 -r DroMel_CAF1 |
Inside score for Pfold subgrammar, force 1 hit |
| caf1screen_v13 |
|
Pecan/Mercator |
ncRnaDualStrand_v13 |
300/100 |
windowlicker: -g 0.9 -b 0 -r DroMel_CAF1 |
just like v12, except using Rfam-trained params |
Screen details
caf1screen_v00
See
TwelveFlyScreenArchive.
caf1screen_v01
A rerun of
caf1screen_v00 that fixes some of its annoyances (i.e. losing 3 flies due to wrong tree, throwing away a lot of segments, filtering out many of the windows).
This will put this screen's data on an equal level with subsequent screens, allowing us to judge quality without bias from early, naive errors.
Another reason for this re-run is that this one uses the standardized, elegant
Makefile.sge and
windowlicker.pl, whereas
v00 used ad hoc, throw-away scripts and other non-reproducible crap.
- Mavid/Mercator alignment
- model: dart/grammars/ncRnaDualStrand.eg
- 120nt windows, step 40nt
- tree: /nfs/data/genome/fly/tree/median.nh
- branch lengths adjusted for each window
caf1screen_v12
- Pecan/Mercator alignment
- gap threshold: windows >90% gappy in dmel get dropped
- window size/step: 300/100
- model: /nfs/projects/12fly-analysis/grammars/ncRnaDualStrand_v12.eg
- forces exactly 1 hit
- minlen 20
- max non-suffix length = 130
- sum over Pfold states instead of max, getting Inside score
- intergenic null model score produced by xrate (see for lgInside(FWD_INTERGENIC) and lgInside(REV_INTERGENIC) in GFF col 9) in addition to model score (GFF col 6), makes taking log-oddds easy
- problems:
- minimum 1 column in each intergenic flanking region
- (not a problem, just a caveat) the intergenic end state penalizes inter
caf1screen_v13
Exactly the same as
caf1screen_v12, except using Rfam-trained probabilities.
Accessing the project
These instructions are specifically for the 12-fly screen.
See
HowToRunPipeline for the generic framework.
CVS
You can get it
Via ViewCVS or via CVS checkout from
cvs.biowiki.org:
export CVS_RSH=ssh
cvs -d:ext:cvs.biowiki.org:/usr/local/cvs co caf1screen
NFS
/nfs/projects/caf1screen/
Alot of stuff is easier to process without dealing with NFS overhead, so bypass NFS and access the RAID directly using:
ssh lorien
cd /home/projects/caf1screen/
Database
There are multiple databases (currently on
sheridan) corresponding to various screens (e.g. databases
caf1screen_vNN, where NN is the version number for a CAF1 fly screen described on this page).
ssh sheridan
mysql
# the MySQL command-line console should open
use caf1screen_vNN;
Data sources
TODO: make sure to have VERSIONS OF EVERYTHING
TODO: keep writing!
TODO: something to be said about
http://www.fruitfly.org/annot/release4.html
Mavid/Mercator
How the alignments were created (especially
heed the section on AGP Files, which describes how CAF1 scaffolds were assembled into Mercator scaffolds)
TODO: link to tools I've written to do the coordinate remapping for species affected by this (also potentiall for dsim and dyak whose scaffols were also messed with)
Coordinates are CAF1 (for
dmel, release 4).
Mercator coverage estimates
These are only for the 9 flies that were also assembled by Mercator (those that have
.agp files in
/nfs/data/genome/fly/align/12fly/mercator/).
The point was to figure out how many scaffolds were getting dropped during Mercator assembly, and what fraction of the genome they comprise.
| genome |
assembly size (nt) |
% assembly covered by Mercator maps |
| dana |
230,993,012 |
73.8 |
| dere |
152,712,140 |
83.3 |
| dgri |
200,467,819 |
69.7 |
| dmoj |
193,826,310 |
82.6 |
| dper |
188,374,079 |
80.5 |
| dpse |
152,738,921 |
91.1 |
| dsec |
166,577,145 |
74.6 |
| dvir |
206,026,697 |
81.4 |
| dwil |
236,693,122 |
76.5 |
Obtained by:
ssh lorien
cd /home/projects/12fly-analysis/misc/mercator-quality/
make all
cat lost.d*
Pecan/Mercator
TODO: just out of curiosity, how did Thomas Down engineer this alignment?
- Based on the same Mercator orthology segments by Colin Dewey as the Mavid/Mercator alignment, which means:
- sequences are CAF1 release (dmel release 4)
- half-open, 0-based coords in file names (just like in the Mercator map file)
- we can hack Mavid/Mercator tools (CVS project
mercator-perl) to work with this
- TODO: add PROJECTDIR/perl/make-pecan-like-mavid.pl to
mercator-perl
- Contains 4077 segments, which correspond to the first 4077 lines in /nfs/data/genome/fly/align/12fly/mercator/map;
- these are all euchromatin sequences containing dmel orthology
- dmel heterochromatin sequence (segments 4078-4136) was not aligned
- segments 4137-4684, which contained no orthology to dmel, were not alignment
Issues and weirdnesses
- Why are these empty files?
- 2R-pecan-CAF1/2R_10042054_10110527.mfa
- 3L-pecan-CAF1/3L_11048797_11055964.mfa
- There are 69 alignments that have exactly one tail (or maybe head) column missing, leading featurevole to produce errors like this:
-
# Alignment 1046: no. of chars (67800) in row DroAna_CAF1 doesn't match length (67801) of mapfile line 1046; invalidating map!
- This means featurevole will drop all hits in those segments, which for the time being is ACCEPTABLE because it is generally only tens of thousands (34K for caf1screen_v12) features out of about 2 million.
- The affected segments are:
-
9 36 163 260 399 419 430 438 458 628 650 808 809 945 1029 1090 1140 1155 1172 1369 1510 1515 1561 1711 1718 1768 1890 1904 1940 2139 2213 2227 2279 2319 2377 2471 2536 2585 2651 2821 2875 2884 2987 3081 3157 3168 3263 3356 3373 3410 3411 3413 3456 3492 3522 3588 3589 3599 3676 3710 3765 3766 3788 3874 3921 3958 3968 4053 4061
Transposons
From
RobertBradley.
See the
FlyBase page for notes on the data.
Affymetrix
The tiling array dataset showing transcrpition fragments (transfrags) for each of 12 timepoints in the first 24 hours of
dmel development.
Here is an excerpt from an e-mail by Sujit Dike of Affymetrix describing the file formats on the Affy site:
If you just want to use the release 4, apr 2004, UCSC genome ver dm2
then you can use the files as it is.
1. The tpmaps contain the following 9 fields:
Probe_Seq
f/t(u can ignore this)
Chr
Genomic position of the start of the probe
X coordinate of the PM probe on the chip
Y coordinate of the PM probe on the chip
X coordinate of the corresponding MM probe on the chip
Y coordinate of the correponding MM probe on the chip
Something related to feature size that you can ignore
2. The graphs files contain two fields
Genomic position of the start of probe
Signal Intensity as determined by the method described in the paper and
the references therein.
You can upload the graph files onto IGB to visualize the expression in
defferent points of development. Please note files are grouped based on
chr. In each chr directory, there are 4 files for each time point
assayed during development. Files that have T1/T2/T3 refer to technical
replicates and the C01 refers to the composite generated from the three
reps. There are 12 timepoints ( 1..12) that refer to RNA sample
collected from 0-2hrs,2-4hrs..22-24-hrs into development.
The dap_del and other del directories refer to graphs generated from the
bacterial negative control regions that were used to calculate false
positive rates. Please refer to the paper for details.
3. The transfrag files have two filds
Start
Stop
Of a region that is called expressed based on the parameters descibed in
the paper.
There are twelve files for each chr corresponding to the 12 timepoints
assayed as mentioned above.
Nimblegen
AAA:Gene_Validation
- why are there duplicate probes? (e.g.
grep -n ANA00P0000000019 OLIV-ana.v3.0.gff)
- explicit duplicate count is now maintained (see /nfs/projects/12fly-analysis/nimble/load-d*.log)
- these duplicates are included in the "# probes CAF1" column of the "remapping" table on AAA:Gene_validation
- where is the cutoff for getting OLIV-EXPR2 probes in the Validation results table on the AAA:Gene_validation page?
Relevant papers
TODO: move these into the sections (e.g. source data) that they are relevant to.
- Affymetrix developmental tiling array
- Piwi-RNAs & transposons
Prior art
- Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006 Apr;2(4):e33. Epub 2006 Apr 21.
- Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A. 2005 Feb 15;102(7):2454-9. Epub 2005 Jan 21.
- Washietl S et al. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res. 2007 Jun;17(6):852-64.
- Backofen R et al. RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zool B Mol Dev Evol. 2007 Jan 15;308(1):1-25.
-- Created by
AndrewUzilov on 27 Mar 2007