Twelve Fly Roc Curve Estimation

From Biowiki
Jump to: navigation, search

ROC curves for RNA gene prediction in Drosophila

This page is a work in progress

A ROC curve is a plot of sensitivity versus specificity for a score-based classification tool.

The following page describes the common procedures for generating such ROC curves that were used for the following experiments:

  1. Comparison of RNA gene prediction grammars (Twelve Fly Screen Predictions);
    • For this experiment, multiple different xrate-format grammars were evaluated on a single synthetic dataset.
  1. Comparison of simulation tools for genome evolution (Simulation Tools).
    • For this experiment, a single xrate-format grammar was evaluated on multiple different synthetic datasets.

Generation of simulated data

  1. For the comparison of RNA gene prediction grammars (Twelve Fly Screen Predictions):
    • A single synthetic dataset was generated with SIMGENOME.
  1. For the comparison of simulation tools for genome evolution (Simulation Tools):
    • Multiple synthetic datasets were generated with GSIMULATOR (using a variety of different settings) and SIMGENOME.

The GSIMULATOR and SIMGENOME programs were parameterized by maximum likelihood from real Drosophila genome alignments as follows:

  1. A training dataset was assembled by taking a random 1% of the Pecan alignments of twelve Drosophila genomes (Clark et al. 2007).
  2. The XRATE program, which includes an algorithm for maximum likelihood estimation of substitution rates from alignments, was used to estimate substitution rates for the different submodels which comprise the SIMGENOME grammar for simulating syntenic blocks of genome alignments. All submodels described below were trained with XRATE (where relevant, the actual grammar files are attached):
    • neutral: Single-nucleotide, strand-symmetric model of neutral evolution. Gaps are modeled as a 5th character. [ null_nuc_symmetric_gapsub_trained.eg]
    • conserved: Single-nucleotide, strand-symmetric ungapped model of evolution of conserved regions. Substitution rates were estimated from the same training data as for the neutral model and then scaled by a factor of 1/10. [ caf1screen/training/null_nuc_symmetric_trained.eg]
    • coding: Codon-substitution model of coding sequence. The codon frequencies and substitution rates are taken from the empirically-estimated model of Kosiol, Holmes and Goldman 2007. The rates were modified by hand to allow for frame-preserving indels (any codon can mutate to all gaps at a fixed rate). This was made into a gene model by modeling UTRs and splice sites as conserved regions and introns as neutral regions with optional conserved regions within.
    • pseudogene: Only the coding region is modeled (no UTRs or introns). Codons are generated from the codon frequencies estimated in Kosiol, Holmes and Goldman 2007, then evolved as triplets of nucleotides, each of which evolves independently as under the neutral model.
    • ncrna: A model of constrained structural evolution of ncRNAs. Gaps are modeled as a 5th character. [ ncRna_v24.eg]
    • transposon: A model of DNA transposons with terminal inverted repeats (TIRs). TIR dinucleotide frequencies and substitution rates are identical to those of base-pairs in ncrna. Inner transposon sequence is modeled as neutral, with the exception of optional pseudogenes.

The empirical abundance of these genomic features in D. melanogaster, as reported by the following sources, was used to estimate appropriate parameters for the SIMGENOME grammar:

  1. Fly Base 4.3 annotation.
  2. "The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective" (2002), Genome Biology.
  3. "Evolution of genes and genomes on the Drosophila phylogeny" (2007), Nature 450, 203-218.
  4. "Minos as a Genetic and Genomic Tool in Drosophila melanogaster" (2005), Genetics 171(2), 571-581.
  5. "Minos, a new transposable element from Drosophila hydei, is a member of the Tc1-like family of transposons" (1991), Nucleic Acids Res., 19(23):6646.

SIMGENOME was run 20 times to generate simulated alignments as

simgenome.pl -rnd -d /nfs/src/dart -s /nfs/src/gsimulator 
	-rt trained-models/branch_model_singlet_2_12flymod100-pecan -bt trained-models/branch_model_gotoh_mixture_2_12flymod30-pecan-95 median.nh

Here is a detailed explanation of these options:

  1. -rnd: Seed the random number generator with the current time (to ensure good simulations)
  2. -d /nfs/src/dart: Path to DART (so that SIMGENOME can call the program simgram
  3. -s /nfs/src/gsimulator: Path to the XRATE grammar file describing the evolution of syntenic blocks of the genome.
  4. -rt trained-models/branch_model_singlet_2_12flymod100-pecan: Relative path to the GSIMULATOR transducer used to generate ancestral sequence.
  5. -bt trained-models/branch_model_gotoh_mixture_2_12flymod30-pecan-95: Relative path to the GSIMULATOR transducer used to evolve sequence (ancestor -> descendant).
  6. median.nh: The phylogenetic tree.

more here: statistics of synthetic alignments (size, length, etc.)

Gene prediction grammar

  1. For the comparison of RNA gene prediction grammars (Twelve Fly Screen Predictions):
  1. For the comparison of simulation tools for genome evolution (Simulation Tools):

"True positive" structural RNAs

We generated a dataset of "true positives" (known ncRNAs) as follow. We used Fly Base release 5.4 to construct a list of all D. melanogaster sequence annotated as ncRNA (these includes sequence annotated as one of "ncRNA tRNA rRNA miRNA snoRNA snRNA" in the Fly Base GFF file, where "ncRNA" is a catch-all for ncRNAs which do not fall into one of the other categories). We mapped these ncRNA annotations, which are in release 5 coordinates, back to the release 4 coordinates which were used to generate alignments of all twelve Drosophila genomes, and then extracted subalignments of the Pecan alignments (including flanking regions around the annotated ncRNA). The final dataset contained subalignments of twelve Drosophila genomes, where each subalignment contained a single ncRNA annotated in D. melanogaster in Fly Base.

Window-based scanning procedure

The WINDOWLICKER and XRATE programs were used to scan the alignments. We ran on each segment (7.stock) of the multiple alignment as follows:

addtree.pl median.nh 7.stock >7.withtree.stock
windowlicker.pl -w 300 -d 100 -c 100 -gff 7.gff -gr 7 -r [[Dro Mel]]_CAF1 -g 0.9 -b 0 -x /nfs/src/dart/bin/xrate segment.withtree.stock 
	-- -s -l 130 -e $(JUKES_CANTOR) -g $(GRAMMARS)/$(word 2,$(subst .,,$*)).eg > segment.xrate 2>segment.xrate-err

Here is a detailed explanation of these commands and options:

  1. The tree median.nh was added to the alignment with the Perl script [ addtree.pl].
  2. -w 300: Break the input alignment segment.stock into overlapping windows of length 300 nucleotides; annotate each separately with XRATE
  3. -d 100: Use a step size of 100 nucleotides when creating windows (so adjacent windows overlap by 200 nucleotides)
  4. -c 100: Run 100 windows per XRATE process (this "chunking" of windows helps to reduce system overhead)
  5. -gff 7.gff: Write special information output by XRATE as a GFF file (such as the probability that a particular subsequence was produced by the INTERGENIC state)
  6. -gr 7: Name to use in the first column of the output gff file
  7. -r DroMel_CAF1: Reference sequence (D. melanogaster) in input alignment
  8. -g 0.9: Maximum permitted gap fraction in the reference sequence; window is not annotated if this fraction is exceeded
  9. -b 0: Minimum information content (number of bits per dinucleotide) of the reference sequence; winow is not annotated if this limit is not met (this is intended to filter out repetitive sequence)
  10. -x /nfs/src/dart/bin/xrate: Path to the XRATE executable file
  11. -- -s -l 130 -e /nfs/src/dart/grammars/jukes_cantor.eg -g ncRna_v22.eg: These are arguments which WINDOWLICKER uses when calling XRATE to annotate each window.
    1. -s: Report probabilities as calculated by the Inside algorithm (summing over all possible RNA structures).
    2. -l 130: Consider only structures of maximum length 130 nucleotides.
    3. -e /nfs/src/dart/grammars/jukes_cantor.eg: Use the Jukes-Cantor substitution model to re-estimate branch lengths of the phylogenetic tree (done independently for each window).
    4. -g ncRna_v22.eg: Annotate the window using the ncRna_v22.eg grammar.

Generation of ROC plots

We ran the WINDOWLICKER and XRATE programs as described above on both the simulated alignments and the dataset of "true positives." This generated a single best-scoring predicted structural RNA with an associated score for every window of the multiple alignment. We calculated the log-odds ratio that the predicted structural RNA was in fact a true ncRNA as log ( P(ncRNA)/P(intergenic) ), using the probabilities P(ncRNA) and P(intergenic) returned by XRATE in the GFF output file. For each possible discovery threshold for the log-odds ratio, we estimated a point on the ROC curve as:

  1. false-positive rate: We calculated the fraction of windows of simulated data which we annotated as containing a true ncRNA.
  2. sensitivity: We calculated the number of ncRNAs in our "true positive" dataset which we annotated as a true ncRNA.

We used the R statistics package to plot this data and thereby create ROC plots.

more here, including Makefiles, R or Perl scripts, if appropriate

Example ROC plots

The following example ROC plots specifically illustrate the results of the above-described procedure.

more here: a ROC plot from each paper