Twelve Fly Screen

From Biowiki
(Redirected from TwelveFlyScreen)
Jump to: navigation, search

---

This is a top-level overview of the project. See other pages for details. Twelve Fly Screen Predictions gives a short overview of our prediction sets.

See also

---

Selecting hits for experimental verification

Exact procedure

Specifically for caf1screen_v12/ncRnaDualStrand_v12.

We built a table (top_hitsAlign_ncRnaDualStrand_v12) from the top 5% (108,168) of our hits by requiring:

  1. length >= 20
    • dmel sequence length after gaps removed (not length in alignment columns)
  1. bp >= 5
    • number of base pairs in the annotation (not base pairs in dmel)
  1. covar >= 2
    • number of covariant columns as output by colorstock.pl
  1. seqs >= 3
    • number of sequences that are not 100% gaps in the hit annotation
  1. no overlap with a mRNA, cDNA or EST, ncRNA or pseudogene
  2. overlap with an Affy transfrag

Here's the SQL for it:

create table top_hitsAlign_ncRnaDualStrand_v12 select ha.* from hitsAlign_ncRnaDualStrand_v12 ha inner join
 filterGenomic_ncRnaDualStrand_v12_dmel fg on ha.id=fg.id inner join
 filterAlign_ncRnaDualStrand_v12 fa on fa.id=fg.id
where fg.length >= 20 and fg.ntInKnown = 0  and fg.ntInMRNA = 0 and fg.ntInClone = 0 and fg.ntInAffy > 0 and fg.ntInPseudo = 0
 and fa.rank <= 108168 and fa.bp >= 5 and fa.covar >= 2 and fa.seqs >= 3;

This gave us 359 hits. We then eye-balled each of these hits using colorstock.pl. We rejected 170 (field rejected), most commonly due to gappiness in Dmel (or other species), presence of simple repeats or poor stem structure. We assigned a gutscore (1 = passable, 2 = good, 3 = warm fuzzy feeling) to the others and took the top-scoring 100 to compile the prediction list attached to this email.

We compiled a list of the top 100 hits by ranking accepted hits on gutscore, then lgOdds.

We've also compiled some statistics on our hits. All statistics were computed using the top 5% of hits (ranked by log-odds).

  • 62.1% (67,163 of 108,168) overlap a mRNA
  • 68.7% (74,292 of 108,168) overlap a cDNA or EST
  • 35.2% (38,072 of 108,168) overlap an Affy transfrag
  • 0.69% (741 of 108,168) overlap a ncRNA
  • 0.01% (14 of 108,168) overlap a pseudogene
  • 1.9% (2056 of 108,168) overlap a 5' UTR
  • 3.5% (3,763 of 108,168) overlap a 3' UTR

Adjacent windows may generate overlapping hits. Duplicate hits with exactly same coordinates (i.e. matching start and end) were removed in the SGE-to-database step, but we may still have overlap.

We create a new table for non-overlapping hits only:

<a style="color: #bb1100;"># give it a very different name to prevent typos screwing up our original table</a>
CREATE TABLE uniq (id VARCHAR(32) NOT NULL, gutscore SMALLINT NOT NULL, PRIMARY KEY (id));

First, copy into it hits that don't overlap with any other hits:

INSERT INTO uniq
  SELECT
	 t1.id,
	 t1.gutscore
  FROM
	 <a style="color: #bb1100;"># this self-join finds all pairs of overlapping hits</a>
	 top_hitsAlign_ncRnaDualStrand_v12 AS t1
	 LEFT OUTER JOIN
	 top_hitsAlign_ncRnaDualStrand_v12 AS t2
	 ON
		t1.start <= t2.end
		AND
		t1.end >= t2.start
		AND
		t1.segment = t2.segment
		AND
		<a style="color: #bb1100;"># every hit overlaps itself, which is an artifact;
		this check will throw out hits that overlap themselves</a>
		t1.id <> t2.id
  WHERE
	 <a style="color: #bb1100;"># we only care about hits that were not rejected</a>
	 t1.gutscore >= 1
	 AND
	 <a style="color: #bb1100;"># this filters down to hits that don't overlap with anything;
	 # if a hit in t1 does not overlap any other hit, it will have
	 # a NULL in t2 due to us using a left outer join</a>
	 t2.id IS NULL
;

Second, copy hits that do overlap, selecting from the overlapping pair the hit with the higher gutscore:

INSERT INTO uniq
  SELECT
	 <a style="color: #bb1100;"># select the hit with the higher gutscore from the overlapping pair</a>
	 IF(t1.gutscore >= t2.gutscore,t1.id,t2.id) AS id,
	 IF(t1.gutscore >= t2.gutscore,t1.gutscore,t2.gutscore) AS gutscore
  FROM
	 <a style="color: #bb1100;"># this join finds all pairs of overlapping hits</a>
	 top_hitsAlign_ncRnaDualStrand_v12 AS t1
	 INNER JOIN
	 top_hitsAlign_ncRnaDualStrand_v12 AS t2
	 ON
		t1.start <= t2.end
		AND
		t1.end >= t2.start
		AND
		t1.segment = t2.segment
		AND
		<a style="color: #bb1100;"># every hit overlaps itself, which is an artifact;
		this check will throw out hits that overlap themselves</a>
		t1.id <> t2.id
		AND
		<a style="color: #bb1100;"># we only care about non-rejected hits overlapping with non-rejected hits</a>
		t1.gutscore IS NOT NULL
		AND
		t2.gutscore IS NOT NULL
		AND
		<a style="color: #bb1100;"># when hit A overlaps hit B, two intersections will be produced:
		# (A,B) and (B,A); this is redundant, as it is the same intersection,
		# so we keep only the intersect where hit1.start < hit2.start;
		# note that we reject the opposite case by return "0" from the
		# "else" of the conditional, which will make the entire join
		# conditional fail</a>
		IF(t1.start < t2.start,
		  1,
		  IF(t1.start = t2.start,
			 <a style="color: #bb1100;"># if starts are the same, sort on end</a>
			 IF (t1.end < t2.end,
				1,
				IF(t1.end = t2.end,
				  <a style="color: #bb1100;"># if end is the same, the hits must have same coords but are
				  # on different strands; we arbitrarily keep the one on the
				  # plus strand because selecting the higher-scoring one is
				  # (1) pointless, since coords are the same, and (2) would add
				  # too much complexity just to take care of a single case</a>
				  IF (t1.strand = '+', 1, 0),
				  0
				)
			 ),
			 0
		  )
		)
ON DUPLICATE KEY UPDATE
  <a style="color: #bb1100;"># if hit A overlaps hit B and hit C and has a higher gutscore than either,
  # then hit A will be inserted into "uniq" for the (A,B) intersection,
  # then inserted again for the (A,C) intersection, leading to a duplicate
  # key error and failing the query; to prevent this, we update duplicate
  # keys with the one with a higher gutscore (the hit is exactly the same,
  # but we should keep the more favorable opinion of it)</a>
  uniq.id = uniq.id,
  uniq.gutscore = IF(uniq.gutscore >= t1.gutscore,uniq.gutscore,t1.gutscore)
;

Now add the columns we will want for the GFF dump:

ALTER TABLE uniq
ADD COLUMN (
  segment INT UNSIGNED,
  start INT UNSIGNED,
  end INT UNSIGNED,
  strand CHAR(1),
  lgOdds FLOAT
)
;
UPDATE			  
  uniq AS u,
  hitsAlign_ncRnaDualStrand_v12 AS a 
SET
  u.segment = a.segment,
  u.start = a.start,
  u.end = a.end,
  u.strand = a.strand,
  u.lgOdds = a.lgOdds
WHERE
  u.id = a.id
;

Use alignmonkey.pl to get the sequences for our hits, because unlike get-hit-alignments.pl it will revcomp sequence for hits on the minus strand:

cd /nfs/projects/caf1screen/gff/caf1screen_v12/
# get everything with a high gutscore
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t uniq --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore >= 2" > for-experiment.ncRnaDualStrand_v12.gff
# how many was that?
grep -cv ^# for-experiment.ncRnaDualStrand_v12.gff
# for those with a low gutscore, sort by lgOdds, appending only enough to the list to make 100 hits
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t uniq --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore = 1 ORDER BY lgOdds DESC" | grep -v ^# | head -n 22 >> for-experiment.ncRnaDualStrand_v12.gff
# alignmonkey need kosh for its memory
ssh kosh
/nfs/src/mercator-perl/alignmonkey.pl -gff for-experiment.ncRnaDualStrand_v12.gff -align segment.stock -mavid /nfs/projects/caf1screen/sge/caf1screen_v12/in/ > for-experiment.ncRnaDualStrand_v12.stock
# convert to FASTA
perl -e 'use strict; use warnings; use Stockholm::Database; my $in = Stockholm::Database->from_file ("for-experiment.ncRnaDualStrand_v12.stock"); foreach my $stk (@$in) {my ($id) = $stk->get_gf ("GFF") =~ /id=(\w+[+-])/; my $seq = $stk->seqdata->{DroMel_CAF1}; $seq =~ s/-//g; print ">$id\n$seq\n"}' > for-experiment.ncRnaDualStrand_v12.fasta

The uniq table doesn't have secondary structure information or window coordinates, so we get those by giving dump-to-gff.pl a join instead of just a table; e.g.:

/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t "uniq u inner join hitsAlign_ncRnaDualStrand_v12 h on u.id=h.id" --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore >= 2" > for-experiment-ss.ncRnaDualStrand_v12.gff
/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t "uniq u inner join hitsAlign_ncRnaDualStrand_v12 h on u.id=h.id" --seqid segment --start start --end end --strand strand --score lgOdds type=ncRNA --sql "WHERE gutscore = 1 ORDER BY u.lgOdds DESC" | grep -v ^# | head -n 22 >> for-experiment-ss.ncRnaDualStrand_v12.gff

This is the preferred way to do things (so that the gffs contain all the desired info).

Troubleshooting the SQL

This applies only to the table caf1screen_v12.top_hitsAlign_ncRnaDualStrand_v12. Later genome screens will yield different special case hits, but they should fall in the same types of special cases, so no extra SQL will need to be written... hopefully.

When you're changing the above self-intersect SQL, make sure the following hits get done correctly. They were the most troublesome ones that created the need for special checking.

  • 2012_827708_827828_+ and 2012_827708_827826_-
    • same start coord, so choosing on smaller start to fix (A,B) vs (B,A) problem doesn't work - that's why we introduced checking on end coord and, failing that, picking the plus strand hit
  • 3057_192927_193052_+ and 3057_192927_193052_-
    • same coordinates, except different strands
    • also has a gutscore of 1 in one window (minus strand), gutscore of 3 in another (plus strand); ideally we should pick the higher one, but arbitrarily we pick the plus strand (lucky in this case, but this is not guaranteed)
  • 2866_223529_223635_+ and 2866_223597_223635_+ and 2866_223608_223631_+
    • the only case of 3 hits that all overlap each other (producing 6 intersections == 3 non-redundant intersections), and also have 3 different gut scores
    • 2866_223529_223635_+ has the highest gut score, so make sure that is the only one that's kept

How to just view a list of overlapping hit pairs (for debugging):

SELECT
  t1.id,t1.gutscore,t2.id,t2.gutscore
FROM
  top_hitsAlign_ncRnaDualStrand_v12 AS t1
  INNER JOIN
  top_hitsAlign_ncRnaDualStrand_v12 AS t2
  ON
	 t1.start <= t2.end
	 AND
	 t1.end >= t2.start
	 AND
	 t1.segment = t2.segment
	 AND
	 t1.id <> t2.id
	 AND
	 t1.gutscore IS NOT NULL
	 AND
	 t2.gutscore IS NOT NULL
;

Looking at the hits in GBrowse

Here's how to get the chromosome names and genomic coordinates of the top hits in order to check them out in FlyBase GBrowse:

select map.genbank_name, hg.start, hg.end from top_hitsAlign_ncRnaDualStrand_v12 top inner join
 hitsGenomic_ncRnaDualStrand_v12_dmel hg on top.id = hg.id inner join
 caf1screen.map_dmel_scaffold_names map on hg.refseq = map.genbank_accnVer
where top.gutscore > 2;

Alternately, you could just take the sequence and use FlyBase BLAST, then click on BLAST HIT on Genome Map, which is faster if you have just one sequence.

---

Computational genome screens

Data for each screen is stored in a database with the same name as the screen (e.g. caf1screen_v12).

Summary

Unless otherwise noted, all screens below use:

  • intial tree from /nfs/data/genome/fly/tree/median.nh
    • branch lengths then adjusted by xrate for each window
  • the grammar is in /nfs/projects/12fly-analysis/grammars/

<noautolink>

screen SGE done alignment source model window size/step pre-filtering other notes
caf1screen_v00 TWikiDocGraphics.choice-yes.gif Mavid/Mercator ncRnaDualStrand 120/40 RNAz heuristics first crack at a screen
caf1screen_v01 TWikiDocGraphics.choice-yes.gif Mavid/Mercator ncRnaDualStrand 120/40 windowlicker: -g 0.25 -b 0 -r Dro Mel_CAF1 replay of v00 with slightly less filtering, using new pipeline tools (e.g. windowlicker)
caf1screen_v02 TWikiDocGraphics.choice-yes.gif Pecan/Mercator ncRnaDualStrand 120/40 windowlicker: -g 0.25 -b 0 -r Dro Mel_CAF1
caf1screen_v02.unfiltered TWikiDocGraphics.choice-yes.gif Pecan/Mercator ncRnaDualStrand 120/40 none
caf1screen_v03 TWikiDocGraphics.choice-yes.gif Mavid/Mercator ncRnaDualStrand_force1hit 120/40 none analysis POSTPONED in favor of caf1screen_v12
caf1screen_v04 N/A Pecan/Mercator ncRnaDualStrand_force1hit 120/40 none CANCELED in favor of other screens
caf1screen_v05 N/A Mavid/Mercator ncRnaDualStrand 300/100 none CANCELED in favor of other screens
caf1screen_v06 N/A Pecan/Mercator ncRnaDualStrand 300/100 none CANCELED in favor of other screens
caf1screen_v07 4578 out of 4684 jobs done Mavid/Mercator ncRnaDualStrand_Rfam 120/40 none POSTPONED/FORSAKEN since Pecan is turning out better
caf1screen_v08 TWikiDocGraphics.choice-yes.gif Pecan/Mercator ncRnaDualStrand_Rfam 120/40 windowlicker: -g 0.25 -b 0 -r Dro Mel_CAF1
caf1screen_v09 N/A Mavid/Mercator ncRnaDualStrand_eqProb 120/40 equal intergenic/ncRNA transition probs; CANCELED, partial run showed it to be total crap (>95% columns marked as ncRNA)
caf1screen_v10 N/A Pecan/Mercator ncRnaDualStrand_eqProb 120/40 equal intergenic/ncRNA transition probs; CANCELED, partial run showed it to be total crap (>95% columns marked as ncRNA)
caf1screen_v11 skipping
caf1screen_v12 TWikiDocGraphics.choice-yes.gif Pecan/Mercator ncRnaDualStrand_v12 300/100 windowlicker: -g 0.9 -b 0 -r Dro Mel_CAF1 Inside score for Pfold subgrammar, force 1 hit
caf1screen_v13 TWikiDocGraphics.choice-yes.gif Pecan/Mercator ncRnaDualStrand_v13 300/100 windowlicker: -g 0.9 -b 0 -r Dro Mel_CAF1 just like v12, except using Rfam-trained params

</noautolink>

Screen details

caf1screen_v00

See Twelve Fly Screen Archive.

caf1screen_v01

A rerun of caf1screen_v00 that fixes some of its annoyances (i.e. losing 3 flies due to wrong tree, throwing away a lot of segments, filtering out many of the windows). This will put this screen's data on an equal level with subsequent screens, allowing us to judge quality without bias from early, naive errors.

Another reason for this re-run is that this one uses the standardized, elegant Makefile.sge and windowlicker.pl, whereas v00 used ad hoc, throw-away scripts and other non-reproducible crap.

  • Mavid/Mercator alignment
    • all 4684 segments
  • model: dart/grammars/ncRnaDualStrand.eg
  • 120nt windows, step 40nt
    • no windows filtered
  • tree: /nfs/data/genome/fly/tree/median.nh
    • branch lengths adjusted for each window

caf1screen_v12

  • Pecan/Mercator alignment
  • gap threshold: windows >90% gappy in dmel get dropped
  • window size/step: 300/100
  • model: /nfs/projects/12fly-analysis/grammars/ncRnaDualStrand_v12.eg
    • forces exactly 1 hit
    • minlen 20
  • max non-suffix length = 130
  • sum over Pfold states instead of max, getting Inside score
  • intergenic null model score produced by xrate (see for lgInside(FWD_INTERGENIC) and lgInside(REV_INTERGENIC) in GFF col 9) in addition to model score (GFF col 6), makes taking log-oddds easy
  • problems:
    • minimum 1 column in each intergenic flanking region
    • (not a problem, just a caveat) the intergenic end state penalizes inter

caf1screen_v13

Exactly the same as caf1screen_v12, except using Rfam-trained probabilities.

---

Accessing the project

These instructions are specifically for the 12-fly screen. See How To Run Pipeline for the generic framework.

CVS

You can get it Via ViewCVS or via CVS checkout from cvs.biowiki.org:

export CVS_RSH=ssh
cvs -d:ext:cvs.biowiki.org:/usr/local/cvs co caf1screen

NFS

/nfs/projects/caf1screen/

Alot of stuff is easier to process without dealing with NFS overhead, so bypass NFS and access the RAID directly using:

ssh lorien
cd /home/projects/caf1screen/

Database

There are multiple databases (currently on sheridan) corresponding to various screens (e.g. databases caf1screen_vNN, where NN is the version number for a CAF1 fly screen described on this page).

ssh sheridan
mysql
# the [[My SQL]] command-line console should open
use caf1screen_vNN;

---

Data sources

TODO: make sure to have VERSIONS OF EVERYTHING

TODO: keep writing!

TODO: something to be said about http://www.fruitfly.org/annot/release4.html

Mavid/Mercator

How the alignments were created (especially heed the section on AGP Files, which describes how CAF1 scaffolds were assembled into Mercator scaffolds)

TODO: link to tools I've written to do the coordinate remapping for species affected by this (also potentiall for dsim and dyak whose scaffols were also messed with)

Coordinates are CAF1 (for dmel, release 4).

Mercator coverage estimates

These are only for the 9 flies that were also assembled by Mercator (those that have .agp files in /nfs/data/genome/fly/align/12fly/mercator/). The point was to figure out how many scaffolds were getting dropped during Mercator assembly, and what fraction of the genome they comprise.

genome assembly size (nt) % assembly covered by Mercator maps
dana 230,993,012 73.8
dere 152,712,140 83.3
dgri 200,467,819 69.7
dmoj 193,826,310 82.6
dper 188,374,079 80.5
dpse 152,738,921 91.1
dsec 166,577,145 74.6
dvir 206,026,697 81.4
dwil 236,693,122 76.5

Obtained by:

ssh lorien
cd /home/projects/12fly-analysis/misc/mercator-quality/
make all
cat lost.d*

Pecan/Mercator

TODO: just out of curiosity, how did Thomas Down engineer this alignment?

  • Based on the same Mercator orthology segments by Colin Dewey as the Mavid/Mercator alignment, which means:
    • sequences are CAF1 release (dmel release 4)
    • half-open, 0-based coords in file names (just like in the Mercator map file)
    • we can hack Mavid/Mercator tools (CVS project mercator-perl) to work with this
      • TODO: add PROJECTDIR/perl/make-pecan-like-mavid.pl to mercator-perl
  • Contains 4077 segments, which correspond to the first 4077 lines in /nfs/data/genome/fly/align/12fly/mercator/map;
    • these are all euchromatin sequences containing dmel orthology
      • dmel heterochromatin sequence (segments 4078-4136) was not aligned
      • segments 4137-4684, which contained no orthology to dmel, were not alignment

Issues and weirdnesses

  • Why are these empty files?
    • 2R-pecan-CAF1/2R_10042054_10110527.mfa
    • 3L-pecan-CAF1/3L_11048797_11055964.mfa
  • There are 69 alignments that have exactly one tail (or maybe head) column missing, leading featurevole to produce errors like this:
    • # Alignment 1046: no. of chars (67800) in row Dro Ana_CAF1 doesn't match length (67801) of mapfile line 1046; invalidating map!
  • This means featurevole will drop all hits in those segments, which for the time being is ACCEPTABLE because it is generally only tens of thousands (34K for caf1screen_v12) features out of about 2 million.
  • The affected segments are:
    • 9 36 163 260 399 419 430 438 458 628 650 808 809 945 1029 1090 1140 1155 1172 1369 1510 1515 1561 1711 1718 1768 1890 1904 1940 2139 2213 2227 2279 2319 2377 2471 2536 2585 2651 2821 2875 2884 2987 3081 3157 3168 3263 3356 3373 3410 3411 3413 3456 3492 3522 3588 3589 3599 3676 3710 3765 3766 3788 3874 3921 3958 3968 4053 4061

Transposons

From Robert Bradley.

Fly Base

See the Fly Base page for notes on the data.

Affymetrix

The tiling array dataset showing transcrpition fragments (transfrags) for each of 12 timepoints in the first 24 hours of dmel development.

Here is an excerpt from an e-mail by Sujit Dike of Affymetrix describing the file formats on the Affy site:

If you just want to use the release 4, apr 2004, UCSC genome ver dm2 then you can use the files as it is.

1. The tpmaps contain the following 9 fields:

Probe_Seq f/t(u can ignore this) Chr Genomic position of the start of the probe X coordinate of the PM probe on the chip Y coordinate of the PM probe on the chip X coordinate of the corresponding MM probe on the chip Y coordinate of the correponding MM probe on the chip Something related to feature size that you can ignore

2. The graphs files contain two fields Genomic position of the start of probe Signal Intensity as determined by the method described in the paper and the references therein. You can upload the graph files onto IGB to visualize the expression in defferent points of development. Please note files are grouped based on chr. In each chr directory, there are 4 files for each time point assayed during development. Files that have T1/T2/T3 refer to technical replicates and the C01 refers to the composite generated from the three reps. There are 12 timepoints ( 1..12) that refer to RNA sample collected from 0-2hrs,2-4hrs..22-24-hrs into development. The dap_del and other del directories refer to graphs generated from the bacterial negative control regions that were used to calculate false positive rates. Please refer to the paper for details.

3. The transfrag files have two filds Start Stop Of a region that is called expressed based on the parameters descibed in the paper. There are twelve files for each chr corresponding to the 12 timepoints assayed as mentioned above.

Nimblegen

AAA:Gene_Validation

  • why are there duplicate probes? (e.g. grep -n ANA00P0000000019 OLIV-ana.v3.0.gff)
    • explicit duplicate count is now maintained (see /nfs/projects/12fly-analysis/nimble/load-d*.log)
    • these duplicates are included in the "# probes CAF1" column of the "remapping" table on AAA:Gene_validation
  • where is the cutoff for getting OLIV-EXPR2 probes in the Validation results table on the AAA:Gene_validation page?

---

Relevant papers

TODO: move these into the sections (e.g. source data) that they are relevant to.

Prior art

---

-- Created by Andrew Uzilov on 27 Mar 2007