Fly Nc Rna

From Biowiki
Jump to: navigation, search

Notes on 12-fly ncRNA stuff

---

---

Status Overview

  • Made alignments with cmalign (see [[1]])
  • Predicted substitution rates using the cmalign alignments (see [[2]]).
  • Working on making fresh stemloc alignments. Rate measurements will soon follow.

---

Rate Measurements

Species trees can be found at [[3]], but these are really only useful to us if we have a family in which there are no paralogs. In this case we know the "true" gene tree and can measure absolute substitution rates in stem and loop regions rather than just the ratio.

Alignments

Successful sequence alignment files (in CVS and in /nfs/projects/12fly-ncRNA/) suitable for input to xrate:

analysis/RF00001_all.tedious.LAST.stock
analysis/RF00003_all.slow1000.LAST.stock
analysis/RF00009_all.tedious1000.LAST.stock
analysis/RF00015_all.slow.LAST.stock
analysis/RF00017_all.tedious.LAST.stock 
analysis/RF00020_all.slow.LAST.stock
analysis/RF00026_all.slow.LAST.stock
analysis/RF00031_all.slow.LAST.stock
analysis/RF00161_all.slow.LAST.stock
analysis/RF00168_all.slow.LAST.stock
analysis/RF00207_all.slow.LAST.stock
analysis/RF00227_all.slow.LAST.stock
analysis/RF00433_all.slow.LAST.stock
analysis/RF00437_all.slow.LAST.stock
analysis/RF00485_all.slow.LAST.stock
analysis/RF00491_all.slow.LAST.stock

Grammars

All of these are in CVS and in /nfs/projects/12fly-ncRNA/rates/grammars/, but also here just in case.

Pfold grammar with rates estimated by Knudsen and Hein: pfold.eg

Pfold grammar with rates estimated by Andrew Uzilov: rfam-training-set.eg. Estimation was from 71 representative families and can be re-created as follows:

cd /nfs/projects/xgram-paper/ncrna/33__extention-of-32/

xgram Training/rfam-training-set.stock --grammar Grammars/pfold-flat-seed__0.001.eg --pseudinitial 0.001 --pseudmutate 0.001 --mininc 0.0001 --forgive 10 --train Rates/rfam-training-set.eg --noannotate -rndtime

The above estimation procedure yields the maximum P(alignment,annotation|param_set) on training data out of all the trials.

---

Sequence Alignments (stemloc)

Click on DONE to get the stemloc output (alignment and secondary structure in Stockholm format).

Or, see the colorized versions.

family # of seqs status stemloc flags notes
RF00001 601 DONE -tedious --progressive only runs on hi-mem node; -slow will also align 100% of sequences, but since we happen to have -tedious, I am posting that one here as it should be better; TODO: rerun after bugfix to ensure alignment is the same! (use -slow, reserve 11 days for it... although -tedious will take the same amount of time, though... maybe run both?)
RF00002 638 cancelled, won't get done -fast flag only aligns 2 seqs; -slow will take weeks to finish and will require more memory than we can provide even on the hi-mem node
RF00003 98 DONE -slow -na 1000 --progressive
RF00004 95 partially done (87/95 seqs; WARNING: produced error of some sort, this alignment may not be valid!) -slow --progressive keeps producing "Paired columns ... don't match" error (does it on -slow --progressive and -slow -na 1000 --progressive): NOT TO BE TRUSTED!
RF00009 12 DONE -tedious -na 1000 --progressive aligns 100% of input sequences using -tedious alone, but since I also ran it with the looser envelope, I am posting that result, as it should be a better alignment
RF00015 36 DONE -slow --progressive
RF00017 24 DONE -tedious --progressive
RF00020 83 DONE -slow --progressive this alignment is from the pre-bugfix (of the memory iterator) stemloc, as the post-bugfix version fails to do 100% of the alignment, outputting errors
RF00026 43 DONE -slow --progressive
RF00028 6 probably cancelled -slow --progressive refuses to produce any output for unknown reasons; TODO: get to the bottom of this!
RF00031 14 DONE -slow --progressive
RF00032 1266 cancelled, won't get done -slow --progressive fubared for multiple known and unknown reason (probably because aligment has too many sequences, currently unfeasible)
RF00161 4 DONE -slow --progressive
RF00168 2 DONE -slow --progressive
RF00207 3 DONE -slow --progressive
RF00227 7 DONE -slow --progressive
RF00433 5 DONE -slow --progressive
RF00437 5 DONE -slow --progressive
RF00485 67 DONE -slow --progressive this alignment is from the pre-bugfix (of the memory iterator) stemloc, as the post-bugfix version fails to do 100% of the alignment, outputting errors
RF00491 2 DONE -slow --progressive
miRNA 915  ??? -slow --progressive died part way through writing output... probably ran out of memory (rerun on kosh?) - may be unfeasible to complete
snoRNA 2515  ??? -slow --progressive died part way through writing output... probably ran out of memory (rerun on kosh?) - may be unfeasible to complete
tRNA 3879 cancelled, won't get done any was fubared with memory problems; currently unfeasible to do an alignment that large

RF00240 was dropped because it only has 1 sequence in it.

Issues

  • stemloc is not catching memory allocation exceptions... or is it? what's happening here (and why is it not setting a nonzero exit status):
terminate called after throwing an instance of 'std::bad_alloc'
  what():  St9bad_alloc
  • what is stemloc's "Randomising degenerate characters"? is it just picking a nuc out of a hat?
  • There are many instances where stemloc will not produce a multiple alignment, but won't tell you why (although a strong hint is all the "-infinity alignment score" warnings). The output will show the pairwise alignments attempted, but no multiple alignment. Our envelopes are probably too restrictive.
    • there should be an explicit error message listing all the input sequences that could not be added to the final multiple alignment
  • RF00240 has been thrown out, as it contains only one sequence among all the species.
    • When you run it on stemloc, it just exits with no output. It should either generate a structure or an error.

---

Sequence extraction

To Do

  • TWikiDocGraphics.choice-yes.gif Extract predicted ncRNAs from CAF1 assemblies.
    • TWikiDocGraphics.choice-yes.gif Filter out predictions from foreign scaffolds.
    • TWikiDocGraphics.choice-yes.gif Do the extraction (to Stockholm format).
    • TWikiDocGraphics.choice-yes.gif Spot-check to make sure extracted sequences are similar to alignments in Rfam for their family.
    • TWikiDocGraphics.choice-yes.gif Convert to FASTA format (for input to Stem Loc).
    • TWikiDocGraphics.choice-yes.gif Download or create "readmes" for all of our data sources and put them in source/.
  • Write makefile to automate the creation of datasets we'll need.
    • Rules to include:
      • TWikiDocGraphics.choice-yes.gif filter out contaminated scaffolds
      • TWikiDocGraphics.choice-yes.gif run extraction program (with logging) and move/organize resulting data to extractions/ dir
      • TWikiDocGraphics.choice-yes.gif aggregate extractions/predictions to have all predictions for each family for all the 12 species
      • TWikiDocGraphics.choice-yes.gif extract the raw sequence with some ID from a Stockholm file (for Gabriel's quality control)

Issues

  • Why are there exon feature types in GFF3 predictions for tRNA? (currently, these are dropped when extracting sequence)
  • There are 2 feature types for miRNA predictions: stem_loop (long, stem-loop precursor?) and miRNA (short, just the guide portion?) - presumably we want to include just the stem-loop in the alignment, right? (this is what are currently doing)
  • Why are there so many dmel sequences in Rfam seed alignments that aren't being found by the Infernal scan?
    • eg:
      • Rfam:RF00437 has 3 seqs from dmel in Rfam, but only 1 is found in dmel by Infernal
      • Rfam:RF00031 has 4 seqs from dmel in Rfam, but only 3 are found in dmel by Infernal

* the 1 that's not found can actually be found by grep, so we know it's in there somewhere...

    • this is odd... shouldn't the CM pick up in the genome the very data it was trained on?
      • IH: not necessarily. This is clearly a desirable property of a profile, but unfortunately not a guaranteed one.
      • maybe this sequence is getting thrown out at some other stage in their pipeline... this is trouble, it could be biasing our data (maybe towards higher-identity ncRNA?)

---

Organization

Project dir on cluster: /nfs/projects/12fly-ncRNA/

It is also in the buffy CVS.

Note: the source/*.fasta files and some stuff in test/ aren't in CVS because they are:

  • too massive and won't change, or
  • are temporary.

The former are the GenBank-processed CAF1 FASTA files that can be downloaded from many places, such as the AAA wiki.


++++ CVS checkout instructions

Assuming you're using bash:

export CVS_RSH=ssh
cvs -d:ext:<username>@buffy:/usr/local/cvs co 12fly-ncRNA

---

Interesting trends observed in the data

  • Rfam:RF00032 (histone 3' UTR element) seems to be more prevalent than anything else across multiple species
    • IH: makes sense, it's part of the transcriptional machinery... what about Rfam:RF00161 (nanos TCE)?
      • AU: one hit in each of dmel, dsec, and dvir for RF00161... but dyak and dsim aren't done yet (harder to extract due to Genbank breaking up many of the scaffolds), so I can't vouch for those yet

---

-- Created on: 01 Dec 2006