Twelve Fly Scan

From Biowiki
Jump to: navigation, search

This page has been superceded by Twelve Fly Screen

Tasks

  • TWikiDocGraphics.choice-yes.gif Divide Mavid alignment into windows. (Andrew - see Fly Mavid Windows for the procedure)
  • TWikiDocGraphics.choice-yes.gif Develop scripts to submit genome scan to cluster.
    • To avoid overloading the cluster, each job will do 10k windows. This gives about 300 jobs.
  • TWikiDocGraphics.choice-yes.gif Do scan using RNAz.
  • Develop ncRna grammar
    • TWikiDocGraphics.choice-yes.gif Start with pfold grammar and rates (dart/data/pfold.eg). Add a bifurcation to an intergenic state. Output is ncRna.eg.
    • Training of grammar
      • TWikiDocGraphics.choice-yes.gif Intergenic sites

* Create intergenic alignments and train intergenic state separately. * Use jukescantor.eg to estimate the tree. Use the dart/data/nullrna.eg grammar as a template to estimate parameters. Output is intergenic.eg.

      • TWikiDocGraphics.choice-yes.gif Genic sites

* Create set of Drosophila ncRna alignments. Use this to train a state representing a ncRna gene. * Extract Mavid alignments corresponding to Flybase ncRna annotations. Filtered out columns >90% gaps ('N' is counted as a gap). * Use jukescantor.eg to estimate the tree. Use the nullrna.eg grammar as a template to estimate parameters. Output is ncRnaGene.eg.

      • TWikiDocGraphics.choice-yes.gif Transition probabilities

* Add flanking region to ncRna alignments. Flanking region is same size as ncRna gene, half on each side. * Annotate for intergenic state and ncRna state. * Create a version of ncRna.eg called ncRnaTrans.eg which has only two chains, one for genic sites and one for intergenic sites. Use the previously trained rates and set update-rates attribute to 0 (update-rules is 1). Replace the pfold subgrammar with a single genic state. Use this to train transition probabilities. Output is ncRnaTransOut.eg.

  • TWikiDocGraphics.choice-yes.gif Do scan using xrate.
  • Modify ncRna grammar to have forward and reverse subgrammars.
  • Develop coding region grammar.
  • Train ncRna grammar on curated Drosophila ncRNA dataset.
  • Create grammar incorporating 3 subgrammars of rna, prot, and null.
  • Redo scan using the combined grammar.
  • Develop more grammars. Loop.

Grammar

Starting with the Pfold grammar, a new bifurcation state B was added as well as a new start state E and intergenic state N.

  • E -> N | B | End (Start state can go to intergenic or B or terminate)
  • B -> S E (B bifurcates to the Pfold S state to annotate an ncRna and back to the start state.)
  • N -> nN* (annotate intergenic site)
  • N* -> N | B | End (can stay in intergenic state or begin a ncRna or end)

Windows

For creating the sliding windows, we basically followed the RNAz genome scan model. We wanted windows which would work reasonably well with both xrate and RNAz. Windows are located in /mnt/nfs/data/genome/fly/windows. Workflow (see windows/NOTES.TXT):

  • Ran perl/make-windows.pl (see windows/*.gff for parameter values). Output is in fly/windows.etc/make-windows.log.
  • Ran dart/perl/drop-gappy-columns.pl for cols that are 100% gappy.
  • Ran rnazSelectSeqs.pl (RNAz script) to optimize alignment for RNAz.

For the future, it might be better to create a GenomeScan object which will load an alignment segment and iterate thru the windows in memory. It would have attributes to handle filtering the alignment and keeping track of the indices.

Training set

  • Drosophila intergenic dataset
    • Use UCSC table browser to download Multiz alignments and filter for intergenic regions. This didn't work out because annotations aren't up to date. Got back alignments in regions with Dmel gene annotations.
    • Start with Flybase gene annotations for Dmel, v4.3.
      • Compute a merged gff set of intersections.
      • Compute a set of regions that lie in between these intersecting sets.
      • For regions at are >n bases long (initially set n=1000), extract Mavid alignment and compute %gaps for each seq.
      • Sort intergenic gffs by number of seqs in alignment and number of gappy seqs (%gaps > 25). Initially, pick 10k sites to train on.
    • After training on intergenic alignments, used trainingSites.m to compute number of sites needed for training based on initial probs, rate matrix, and fractional error of 0.05. Minimum rate (g<->c) was 0.0399 and sites = 5015.
  • ncRna dataset
    • Use Rfam predictions on Dmel. Which ones map to Flybase annotations?
    • Filter with Andrew's set of Rfam alignments with published structures in /nfs/data/db/rfam/pub-trees-annot. Get annotation from these.
    • Extract from Mavid an alignment corresponding to Rfam coordinates. Map annotation onto it. Add flanking data and annotate it as intergenic.

Issues

  • Why is one branch length about 20 times larger than others for intergenic region?
  • Getting "Warning: duplicate sequence name" message when training intergenic.stk. Comes from seq/stockholm.cc, line 732.
  • Annotation counts in gff files don't match counts in v4.3 release notes.

-- Yuri Bendana - 13 Oct 2006