Fly Nc Rna
Contents
Notes on 12-fly ncRNA stuff
---
---
Status Overview
- Made alignments with cmalign (see [[1]])
- Predicted substitution rates using the cmalign alignments (see [[2]]).
- Working on making fresh stemloc alignments. Rate measurements will soon follow.
---
Rate Measurements
Species trees can be found at [[3]], but these are really only useful to us if we have a family in which there are no paralogs. In this case we know the "true" gene tree and can measure absolute substitution rates in stem and loop regions rather than just the ratio.
Alignments
Successful sequence alignment files (in CVS and in /nfs/projects/12fly-ncRNA/) suitable for input to xrate:
analysis/RF00001_all.tedious.LAST.stock analysis/RF00003_all.slow1000.LAST.stock analysis/RF00009_all.tedious1000.LAST.stock analysis/RF00015_all.slow.LAST.stock analysis/RF00017_all.tedious.LAST.stock analysis/RF00020_all.slow.LAST.stock analysis/RF00026_all.slow.LAST.stock analysis/RF00031_all.slow.LAST.stock analysis/RF00161_all.slow.LAST.stock analysis/RF00168_all.slow.LAST.stock analysis/RF00207_all.slow.LAST.stock analysis/RF00227_all.slow.LAST.stock analysis/RF00433_all.slow.LAST.stock analysis/RF00437_all.slow.LAST.stock analysis/RF00485_all.slow.LAST.stock analysis/RF00491_all.slow.LAST.stock
Grammars
All of these are in CVS and in /nfs/projects/12fly-ncRNA/rates/grammars/, but also here just in case.
Pfold grammar with rates estimated by Knudsen and Hein: pfold.eg
Pfold grammar with rates estimated by Andrew Uzilov: rfam-training-set.eg. Estimation was from 71 representative families and can be re-created as follows:
cd /nfs/projects/xgram-paper/ncrna/33__extention-of-32/
xgram Training/rfam-training-set.stock --grammar Grammars/pfold-flat-seed__0.001.eg --pseudinitial 0.001 --pseudmutate 0.001 --mininc 0.0001 --forgive 10 --train Rates/rfam-training-set.eg --noannotate -rndtime
The above estimation procedure yields the maximum P(alignment,annotation|param_set) on training data out of all the trials.
---
Sequence Alignments (stemloc)
Click on DONE to get the stemloc output (alignment and secondary structure in Stockholm format).
Or, see the colorized versions.
family | # of seqs | status | stemloc flags | notes |
RF00001 | 601 | DONE | -tedious --progressive | only runs on hi-mem node; -slow will also align 100% of sequences, but since we happen to have -tedious, I am posting that one here as it should be better; TODO: rerun after bugfix to ensure alignment is the same! (use -slow, reserve 11 days for it... although -tedious will take the same amount of time, though... maybe run both?) |
|
638 | cancelled, won't get done | -fast flag only aligns 2 seqs; -slow will take weeks to finish and will require more memory than we can provide even on the hi-mem node | |
RF00003 | 98 | DONE | -slow -na 1000 --progressive | |
RF00004 | 95 | partially done (87/95 seqs; WARNING: produced error of some sort, this alignment may not be valid!) | -slow --progressive | keeps producing "Paired columns ... don't match" error (does it on -slow --progressive and -slow -na 1000 --progressive): NOT TO BE TRUSTED! |
RF00009 | 12 | DONE | -tedious -na 1000 --progressive | aligns 100% of input sequences using -tedious alone, but since I also ran it with the looser envelope, I am posting that result, as it should be a better alignment |
RF00015 | 36 | DONE | -slow --progressive | |
RF00017 | 24 | DONE | -tedious --progressive | |
RF00020 | 83 | DONE | -slow --progressive | this alignment is from the pre-bugfix (of the memory iterator) stemloc, as the post-bugfix version fails to do 100% of the alignment, outputting errors |
RF00026 | 43 | DONE | -slow --progressive | |
RF00028 | 6 | probably cancelled | -slow --progressive | refuses to produce any output for unknown reasons; TODO: get to the bottom of this! |
RF00031 | 14 | DONE | -slow --progressive | |
|
1266 | cancelled, won't get done | -slow --progressive | fubared for multiple known and unknown reason (probably because aligment has too many sequences, currently unfeasible) |
RF00161 | 4 | DONE | -slow --progressive | |
RF00168 | 2 | DONE | -slow --progressive | |
RF00207 | 3 | DONE | -slow --progressive | |
RF00227 | 7 | DONE | -slow --progressive | |
RF00433 | 5 | DONE | -slow --progressive | |
RF00437 | 5 | DONE | -slow --progressive | |
RF00485 | 67 | DONE | -slow --progressive | this alignment is from the pre-bugfix (of the memory iterator) stemloc, as the post-bugfix version fails to do 100% of the alignment, outputting errors |
RF00491 | 2 | DONE | -slow --progressive | |
miRNA | 915 | ??? | -slow --progressive | died part way through writing output... probably ran out of memory (rerun on kosh?) - may be unfeasible to complete |
snoRNA | 2515 | ??? | -slow --progressive | died part way through writing output... probably ran out of memory (rerun on kosh?) - may be unfeasible to complete |
|
3879 | cancelled, won't get done | any | was fubared with memory problems; currently unfeasible to do an alignment that large |
RF00240 was dropped because it only has 1 sequence in it.
Issues
- stemloc is not catching memory allocation exceptions... or is it? what's happening here (and why is it not setting a nonzero exit status):
terminate called after throwing an instance of 'std::bad_alloc' what(): St9bad_alloc
- what is stemloc's "Randomising degenerate characters"? is it just picking a nuc out of a hat?
- There are many instances where stemloc will not produce a multiple alignment, but won't tell you why (although a strong hint is all the "-infinity alignment score" warnings). The output will show the pairwise alignments attempted, but no multiple alignment. Our envelopes are probably too restrictive.
- there should be an explicit error message listing all the input sequences that could not be added to the final multiple alignment
- RF00240 has been thrown out, as it contains only one sequence among all the species.
- When you run it on stemloc, it just exits with no output. It should either generate a structure or an error.
---
Sequence extraction
To Do
-
Extract predicted ncRNAs from CAF1 assemblies.
-
Filter out predictions from foreign scaffolds.
-
Do the extraction (to Stockholm format).
-
Spot-check to make sure extracted sequences are similar to alignments in Rfam for their family.
-
Convert to FASTA format (for input to Stem Loc).
-
Download or create "readmes" for all of our data sources and put them in source/.
-
- Write makefile to automate the creation of datasets we'll need.
- Rules to include:
-
filter out contaminated scaffolds
-
run extraction program (with logging) and move/organize resulting data to extractions/ dir
-
aggregate extractions/predictions to have all predictions for each family for all the 12 species
-
extract the raw sequence with some ID from a Stockholm file (for Gabriel's quality control)
-
- Rules to include:
Issues
- Why are there exon feature types in GFF3 predictions for tRNA? (currently, these are dropped when extracting sequence)
- There are 2 feature types for miRNA predictions: stem_loop (long, stem-loop precursor?) and miRNA (short, just the guide portion?) - presumably we want to include just the stem-loop in the alignment, right? (this is what are currently doing)
- Why are there so many dmel sequences in Rfam seed alignments that aren't being found by the Infernal scan?
- eg:
- Rfam:RF00437 has 3 seqs from dmel in Rfam, but only 1 is found in dmel by Infernal
- Rfam:RF00031 has 4 seqs from dmel in Rfam, but only 3 are found in dmel by Infernal
- eg:
* the 1 that's not found can actually be found by grep, so we know it's in there somewhere...
- this is odd... shouldn't the CM pick up in the genome the very data it was trained on?
- IH: not necessarily. This is clearly a desirable property of a profile, but unfortunately not a guaranteed one.
- maybe this sequence is getting thrown out at some other stage in their pipeline... this is trouble, it could be biasing our data (maybe towards higher-identity ncRNA?)
- this is odd... shouldn't the CM pick up in the genome the very data it was trained on?
---
Organization
Project dir on cluster: /nfs/projects/12fly-ncRNA/
It is also in the buffy CVS.
Note: the source/*.fasta files and some stuff in test/ aren't in CVS because they are:
- too massive and won't change, or
- are temporary.
The former are the GenBank-processed CAF1 FASTA files that can be downloaded from many places, such as the AAA wiki.
++++ CVS checkout instructions
Assuming you're using bash:
export CVS_RSH=ssh cvs -d:ext:<username>@buffy:/usr/local/cvs co 12fly-ncRNA
---
Interesting trends observed in the data
- Rfam:RF00032 (histone 3' UTR element) seems to be more prevalent than anything else across multiple species
- IH: makes sense, it's part of the transcriptional machinery... what about Rfam:RF00161 (nanos TCE)?
- AU: one hit in each of dmel, dsec, and dvir for RF00161... but dyak and dsim aren't done yet (harder to extract due to Genbank breaking up many of the scaffolds), so I can't vouch for those yet
- IH: makes sense, it's part of the transcriptional machinery... what about Rfam:RF00161 (nanos TCE)?
---
-- Created on: 01 Dec 2006