Eight Fly Scan

From Biowiki
Jump to: navigation, search


The goal is to identify non-coding RNA genes using a 8 species multiple alignment of the Drosophila genome. The alignment was created using MAVID from Lior Pachter's group.

The method uses pfoldscan.pl to scan the alignment in 200 bp sliding windows which slide by 50 bp. The script calls the Pfold program to predict RNA secondary structure. The Pfold program was modified slightly to output the posterior probability of a predicted structure. A fixed mean phylogenetic tree from Dan Pollard was used as input. The pfoldscan.pl output consists of the start position, end position, and probability score for the window. The scan2gff.pl script is used to convert the scan coordinates into genome coordinates in GFF format using the orthology map file which comes with the MSA. scan2gff.pl will output lines for a given input species, by default it's D.Mel. To filter overlapping regions a self-intersect on the GFF output was done using the gffintersect.pl script. When regions overlap, the region with the minimum score is most significant so it is retained using the intersectlookup.pl script.

Then the histogram of the scores was generated to see the distribution. Ideally the annotated ncRna scores would be shifted to the right relative to the complete scan output. More likely the ncRna curve will fall under the total curve; ie, it will be in the noise.

The annotated Drosophila Melanogaster genome in GFF format version 4.1 was downloaded from Flybase. The annotations are used to evaluate the accuracy of the predictions. All the ncRna annotations were combined into one file. All the gene elements (eg, exon, intron) were combined into another file. We are looking for ncRnas outside of annotated genes because those that are in introns will be harder to work with experimentally. To ensure this, a negative intersection was run to filter regions of the scan output which intersect with known genes. The output of this intersect is then intersected with the ncRna annotations. This produces the positive and negative actual ncRna. An ROC curve is then generated from these evalutions. The optimum score threshold is where the tangent to the ROC curve has a slope of 1.

The top positive and negative actual sequences were then extracted from the MSA. The sequences were then scanned against Rfam to determine whether any ncRna profiles matched. Perhaps some of the unannotated predictions are false negatives.

The sequences of the top hits were converted to Stockholm format. The sscolor.pl script was run to color the secondary structure. This gives visual confirmation if covariant substitutions are present in an alignment. Since covariant substitutions are more conserved, this evolutionary signal would indicate that an RNA-like structure is present. If the stem is very long (> 100 nt), this could indicate a transposon which has long inverted repeats. Also, loops tend to be more conserved in ncRnas since this is likely the functional region.

This analysis was repeated using the Xfold program. However, in this case, the tree branch lengths are estimated from fixed topology for every window using the baseml program from the PAML package.


The Pfold scan of 4/18/2005 produced ambiguous output. The distribution of scores was ?. The ROC curve is shown below. The optimum threshold was determined to be a score of about -97 with TPF = ? and FPF = ?. No matches to Rfam profiles from the top predictions were generated. The coloring of the secondary structure of the top predictions were not convincing evidence of Rna structure. It was concluded that using a fixed tree is not the best method because the fixed branch lengths can overestimate in some cases (1.2 vs 0.2) the evolutionary distance.

The Pfold scan of 8/17/2005 has a distribution of scores that is also more normally distributed than before. Only 17 positive actuals were obtained versus 1355 negative actuals. The optimum threshold from the ROC plot is roughly about thr=-62 with FPF=0.37 and TPF=0.65. Above this threshold, the slope of the line is always < 1. Below this threshold, the slope may be >1 or <1. There are 36 subseq below the threshold. Rfam+Blast returned no hits. Complete Rfam scan of top 10 subseq returned ?. The top subseq had score of -182, length of 155, and in chr 3R. The next best hit had a score of -162, length=200, and is same region of 3L chr as best hit of Xfold. This subseq was submitted to the Rfam web server and 0 hits were returned.

The Xfold scan of 01/31/2006 has a distribution of scores that is more normally distributed. When intersecting against the annotated ncRnas, only 16 positive actuals were obtained. There were 1356 negative actuals. The ROC plot shows 16 steps pertaining to these ncRnas. The optimum threshold was determined approximately to be around a threshold = -148, with a TPF of 0.63 and FPF = 0.26. There are 35 subsequences below this threshold, including one of length 4 and 8. These were removed. The subsequences were retrieved from the alignment and scanned against Rfam. The Blast scan of Rfam returned no hits. The complete scan of Rfam using top 10 subsequences returned also 0 hits. The top subsequence with length of 188 had score of -377, located in the 3L chr. It was submitted to the Rfam webserver and no hits were returned.



  • Run scan on genome using pfoldscan.pl. Output is directory of .nfoldscan files.
  • Create gff output running scan2gff.pl. Extract lines for Dro Mel species.
    • nohup scan2gff.pl ~/8fly/scan ~/8fly/align > dmelPf.gff {Default species is 'DroMel_4'}
  • Edit gff output by removing the 'chr' from chromosome names.
    • cat dmelPf.gff | tr -d 'chr' > pfTr.gff
  • Sort the gff file by name, start and end fields.
    • nohup gffsort.pl pfTr.gff > pfSorted.gff
  • For large files, if low on memory, break up the sorted file by chromosome:
    • grep '^2L' pfSorted.gff > pfSorted2L.gff
  • To detect overlapping regions, do a self-intersect on sorted output:
    • nohup gffintersect.pl -self pfSorted.gff > pfSelfInt.gff &
  • Filter the self intersect output to get the minimum score of each set of overlaps:
    • nohup intersectlookup.pl -self -minscore pfSelfInt.gff > pfMinscore.gff &
  • Plot histogram of scores in R to see their distribution.
    • ~/R/Dmel/createHist.r
  • Split the Dro Mel chromosome files by type:
    • gffbytype.pl -o All Chr/Type Chr2L/dmel-2L-r4.1.gff Chr2R/dmel-2R-r4.1.gff Chr3L/dmel-3L-r4.1.gff Chr3R/dmel-3R-r4.1.gff Chr4/dmel-4-r4.1.gff Chr X/dmel-X-r4.1.gff &
  • Combine the non-coding RNA (tRNA, ncRNA, snRNA, snoRNA, rRNA) files to form a concatenated non-coding RNA file:
    • gffsort.pl tRNA.gff .... > catNcRna.gff
  • Create gff for all gene elements:
    • gffsort.pl CDS.gff DNA_motif.gff aberration_junction.gff enhancer.gff exon.gff five_prime_UTR.gff gene.gff insertion_site.gff intron.gff mRNA.gff match_tRNAscan-SE.gff match_part_tRNAscan-SE.gff mature_peptide.gff polyA_site.gff protein_binding_site.gff pseudogene.gff regulatory_region.gff repeat_region.gff signal_peptide.gff three_prime_UTR.gff transcription_start_site.gff transposable_element.gff transposable_element_insertion_site.gff > catGene.gff &
  • Since we are primarily interested in ncRna's outside of genes, run neg intersect with gene seqs to filter them out of the Pfold seqs:
    • gffintersect.pl -not catGene.gff pfMinscore.gff > pfMinScoreNonGene.gff &
  • Do intersect of Pfold output (with or without gene seqs) with annotated ncRNAs to determine positive and negative actuals.
    • gffintersect.pl catNcRna.gff pfMinscore.gff > catNcRnaPa.gff
    • gffintersect.pl -not catNcRna.gff pfMinscore.gff > catNcRnaNa.gff
    • gffintersect.pl catNcRna.gff pfMinscoreNonGene.gff > catNcRnaPaNG.gff (check that it is 0 size)
    • gffintersect.pl -not catNcRna.gff pfMinscoreNonGene.gff > catNcRnaNaNG.gff
  • Plot ROC curve in R:
    • createRoc.r (use the unfiltered data since positive actuals will be inside known genes)
  • Determine the optimum score threshold, defined as point above which the slope is always < 1: rocPlotAvg.r
  • Extract sequences for top negative actuals in nongene sequences.
  • Example of getting a set of alignments with a score above a threshold determined by ROC curve:
    • gffsort.pl -minSc catNcRnaNaNG.gff > catNcRnaNaNGSort.gff
    • gfffilter.pl '$gffscore < -97' catNcRnaNaNGSort.gff > catNcRnaNaNGThr-97.gff
    • nohup ~/pfold/gff2alignSeq.pl -ng ~/8fly/align catNcRnaNaNGThr-97.gff > catNcRnaNaNGThr-97.fa &
  • Scan top hits against Rfam to see if they match a profile.
  • Use the Blast search when several seqs are scanned. --noclean leaves files in /tmp for debugging.
    • nohup ~/Rfam/rfam_scan.pl --noclean -d ~/Rfam/Db -f gff catNcRnaNaNGThr-97.fa > catNcRnaNaNGThr-97_rfam.gff &
  • Scan top 10 non-gene hits against complete Rfam (no Blast):
    • nohup ~/Rfam/rfam_scan.pl --noblast --noclean -d ~/Rfam/Db -f gff catNcRnaNaNGTop10.fa > catNcRnaNaNGTop10_rfam.gff &
  • Get msa's of top hits:
    • ~/pfold/gff2alignSeq.pl -msa ~/8fly/align catNcRnaNaNGTop10.gff> catNcRnaNaNGTop10.afa
  • Run doTest.pl script to get tree and alignment in stockholm format (verifies score too):
    • Xfold: ~/pfold/test/doTest.pl -x -paml topHit.afa -o topHit
    • Pfold: ~/pfold/test/doTest.pl -paml topHit.afa -o topHit
  • To get predicted secondary structure for Pfold, submit alignment to Pfold Web Server.
  • Color alignment for secondary structure. Make sure the secondary struc is in line "#GC SS_cons".
    • ~/dart/perl/sscolor.pl -cols 50 topHit.stk > topHit.html

</noautolink> -- Yuri Bendana - 18 Oct 2006