Twelve Fly Screen Predictions

From Biowiki
Revision as of 17:04, 1 December 2008 by Ian Holmes (talk | contribs) (Imported from TWiki)

Structured RNA predictions for Drosophila melanogaster

Brief description of methodology

This page contains links to our predictions of conserved structured RNA features in Drosophila. The predictions were made by sliding a window across multiple sequence alignments of the twelve Drosophila genomes described by Clark et al.: Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007;450:203-18.

Our methodology was as follows. We used multiple alignments built with the PECAN program. We then used the XRATE phylogrammar engine with a custom-designed phylogrammar ("ClosingBp") to predict and score conserved structured RNA-like features, using the WINDOWLICKER xrate wrapper script to slide a 300-nucleotide window across the alignments (in 100-nucleotide steps), and making a prediction in every window.
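
The windowing step can be sketched as follows. This is a minimal illustration of a 300-column window moved in 100-column steps, not the WINDOWLICKER implementation; the function name `sliding_windows` is hypothetical, and truncating the final window at the end of the alignment is an assumption.

```python
def sliding_windows(alignment_length, window=300, step=100):
    """Yield (start, end) half-open column ranges covering an alignment.

    Assumption: the last window is truncated at the alignment end rather
    than dropped or shifted; the real wrapper script may behave differently.
    """
    if alignment_length <= 0:
        return
    start = 0
    while True:
        end = min(start + window, alignment_length)
        yield (start, end)
        if end == alignment_length:
            break
        start += step


# Print the column ranges for a 650-column alignment.
for start, end in sliding_windows(650):
    print(start, end)
```

Each window would then be passed to the scoring model, so that every window yields one prediction.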

Finally, we ranked the predictions by log-odds score, taking only the top 5% (corresponding to a score cutoff of 21.49 bits); performed some filtering operations (described below); and took the top 100 filtered hits. We also provide versions of our prediction sets that are separated by genomic region.
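
A minimal sketch of the ranking step, assuming each prediction is a dictionary with a `score` field in bits. The function name `top_fraction` and the record layout are hypothetical; rounding the number of kept predictions up is also an assumption.

```python
import math


def top_fraction(predictions, fraction=0.05):
    """Return the highest-scoring `fraction` of predictions and the implied cutoff.

    The cutoff is the score of the lowest-ranked prediction that is kept.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    kept = ranked[:keep]
    cutoff = kept[-1]["score"] if kept else None
    return kept, cutoff
```

Applied to the full set of window predictions with `fraction=0.05`, the returned cutoff plays the role of the 21.49-bit threshold quoted above.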

Overview of file formats

The following file formats are offered for all prediction sets:

  • GFF format file of D.melanogaster co-ordinates for predictions (FlyBase release 5.4 coordinates)
  • FASTA format file of D.melanogaster sequences for predictions
  • Stockholm format file of PECAN subalignments across the twelve species
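
For readers who want to load the GFF files programmatically, the following is a minimal sketch that relies only on the standard nine tab-separated GFF columns. The function name `parse_gff_line` is hypothetical, and the example line is made up for illustration; it is not taken from the prediction files.

```python
def parse_gff_line(line):
    """Parse one GFF feature line into a dict; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqid, source, feature, start, end, score, strand, phase, attributes = fields
    return {
        "seqid": seqid,
        "source": source,
        "type": feature,
        "start": int(start),  # GFF coordinates are 1-based and inclusive
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


# Hypothetical example line; all values are invented for illustration.
example = "2L\txrate\tncRNA_prediction\t1001\t1300\t25.3\t+\t.\tID=example_1"
record = parse_gff_line(example)
print(record["seqid"], record["start"], record["end"], record["score"])
```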

Some of these are compressed using gzip.

Our predictions will soon be viewable through a GBrowse instance.

"Top 100" filtered predictions

These predictions are located in intergenic sequence: they have no overlap with annotated genes (ncRNA or protein-coding), transposons or pseudogenes in FlyBase. Each overlaps a transcriptional fragment ("Transfrag") as reported by Manak et al.: Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat. Genet. 2006;38:1151-8. Additional filtering described here.

(Nomenclature aside, these files actually contain only 98 predictions. These are planned to be our first targets for experimental validation by sequencing.)

All predictions, by genomic region

(Filtered: 15,539 predicted features. Filtering procedure described here)

(Unfiltered: 56,092 predicted features)

Infernal screen of unique intergenic hits



Stemloc Clustering of intergenic hits

Note: these can take a while to render on first viewing. Loop lengths seem short in general; rerunning with longer minlength constraints in the grammar may provide more realistic results.

The data

The PECAN and MAVID alignments used in our analysis can be downloaded from the following URL:

The method

We used the generic phylo-grammar engine XRATE to conduct a whole-genome screen of twelve Drosophila genomes (using Window Licker together with an xrate grammar we call ClosingBp).

The models

We compared eight different grammars and two alignment tools before arriving at the final grammar used, ClosingBp.

ROC plots of sensitivity vs predicted specificity (see paper) were generated using the following grammars:

  1. Pfold (ncRna_v17): original PFOLD rates; single-nucleotide (context-independent) null model of intergenic sequence
  2. Dinuc (ncRna_v18): original PFOLD rates; dinucleotide (nearest-neighbor context dependence) null model
  3. PfoldRetrained (ncRna_v19): mix80-trained rates; single-nucleotide null model
  4. ClosingBp (ncRna_v22): mix80-trained rates; separate substitution rate matrix for the closing base-pair of stems; single-nucleotide null model. This is the model that was used for the prediction sets linked on this page.
  5. SymmetricStemGaps (ncRna_v21): original PFOLD rates; gaps in stems permitted only if both nucleotides of the base-pair are gaps; single-nucleotide null model
  6. NoStemGaps (ncRna_v20): original PFOLD rates; no gaps in stems; single-nucleotide null model
  7. GapLinks (ncRna_v23): mix80-trained rates; approximate TKF92-based (Thorne et al. 1992) "links" models for runs of gaps in stems, loops and intergenic sequence; single-nucleotide null model
  8. GapSub (ncRna_v24): mix80-trained rates; gaps are treated as a fifth character in both ncRNA and intergenic sequence; single-nucleotide null model

Synthetic datasets were generated using GSIMULATOR.

Our selection criteria

"Filtered" predictions were required to satisfy the following criteria:

  1. Conserved structures were required to include at least ten base-paired columns, at least two of which had to display compensatory mutations.
  2. Alignment segments predicted to contain conserved RNA structure were discarded unless they contained at least 20 bases of melanogaster sequence as well as sequence from at least four other species with gaps in no more than 7.5% of predicted base-pairs.
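
The two criteria above can be expressed as a simple predicate. This is a sketch under stated assumptions, not the filtering code used for the released sets: the function name and argument names are hypothetical, the per-prediction summary counts are assumed to have been computed already, and the 7.5% gap limit is interpreted here as a single fraction of gapped positions among predicted base-pairs.

```python
def passes_filter(n_paired_columns, n_compensatory_pairs,
                  mel_bases, n_other_species, paired_gap_fraction):
    """Return True if a prediction meets both filtering criteria.

    Criterion 1: at least ten base-paired columns, at least two of them
    showing compensatory mutations.
    Criterion 2: at least 20 bases of melanogaster sequence, sequence from
    at least four other species, and gaps in no more than 7.5% of
    predicted base-pairs (interpretation of the gap limit is an assumption).
    """
    has_structure = n_paired_columns >= 10 and n_compensatory_pairs >= 2
    has_coverage = (mel_bases >= 20
                    and n_other_species >= 4
                    and paired_gap_fraction <= 0.075)
    return has_structure and has_coverage
```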

Predictions which overlapped by more than 80% were resolved by retaining the highest-scoring prediction and discarding the other(s).
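
One way to implement this overlap resolution is a greedy pass in descending score order, sketched below. The text does not say what the 80% is measured against, so this sketch assumes it is the fraction of the shorter prediction; it also assumes all predictions lie on the same chromosome arm and use half-open coordinates. The function names are hypothetical.

```python
def overlap_fraction(a, b):
    """Fraction of the shorter interval covered by the overlap of a and b."""
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    shorter = min(a["end"] - a["start"], b["end"] - b["start"])
    return max(overlap, 0) / shorter if shorter > 0 else 0.0


def resolve_overlaps(predictions, max_overlap=0.8):
    """Keep the highest-scoring prediction among those overlapping by more than max_overlap."""
    kept = []
    for pred in sorted(predictions, key=lambda p: p["score"], reverse=True):
        # Keep a prediction only if it does not heavily overlap a better one.
        if all(overlap_fraction(pred, k) <= max_overlap for k in kept):
            kept.append(pred)
    return kept
```

Because candidates are visited from best to worst, any prediction that is discarded has always lost to a higher-scoring one that was retained.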

See our methods paper for more information.


Bradley RK, Uzilov AK, Skinner M, Bendana YR, Varadarajan A, Holmes I. "Non-Coding RNA Gene Predictors in Drosophila." Submitted.

Please contact Ian Holmes, Robert Bradley or Mitch Skinner with questions.

For internal use only: ClosingBp was uploaded on 3/6/08.

-- Robert Bradley - 24 Oct 2007