Bacterial Gene Prediction


  • Find open reading frames (ORFs) in a bacterial genome
    • Find the protein translations
  • Compare different gene prediction methods:
    • Compare predicted ORFs to "trusted" annotation
    • Compare to a bacterial genefinding program (Glimmer)
    • Compare to a translated-protein homology search program (exonerate)
  • Build experience manipulating genomes and genome annotations
    • some basic guided data-processing tasks
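As a concrete picture of what "finding ORFs and their protein translations" involves, here is a minimal Python sketch. It is not the course's Perl code or Glimmer's method. It makes several assumptions of its own: an ORF runs from the first ATG in a reading frame to the next in-frame stop codon, only ATG counts as a start (bacteria also use GTG and TTG), and the names find_orfs, translate and min_len are illustrative.

```python
from itertools import product

# Standard codon table, built in TCAG order (same amino-acid assignments
# as the bacterial table 11; only the permitted start codons differ).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(seq, min_len=90, starts=("ATG",)):
    """Yield (start, end, strand, protein) for ORFs in all six frames.

    Coordinates are 1-based, inclusive, on the forward strand, and include
    the stop codon. min_len is the minimum ORF length in nucleotides
    (an assumed default, not a value from the course material).
    """
    seq = seq.upper()
    n = len(seq)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None  # index of the first start codon seen, if any
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon in starts:
                    orf_start = i
                elif orf_start is not None and codon in STOPS:
                    end = i + 3  # exclusive end, stop codon included
                    if end - orf_start >= min_len:
                        protein = translate(s[orf_start:i])
                        if strand == "+":
                            yield orf_start + 1, end, strand, protein
                        else:
                            # map reverse-strand indices back to forward coords
                            yield n - end + 1, n - orf_start, strand, protein
                    orf_start = None
```

For example, list(find_orfs("CCATGAAATTTTAGCC", min_len=12)) yields a single forward-strand ORF at positions 3 to 14 whose translation is MKF.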



  • Various Perl scripts. You can download them from the in-text links, or copy or run them directly from /home5/be131/BacterialGenePrediction/
  • Glimmer (in /home5/be131/src/)
  • exonerate (in /home5/be131/)


  • Download the B. subtilis genome from here: Genbank:NC_000964
    • Note that NC_000964 is the Genbank accession number for a particular B. subtilis genome: strain 168, published in Nature in 1997.
      • What other info can you gather from the Genbank record?
    • Next to the "Display" label, select "Genbank(Full)".
    • Select "Show sequence" under Display options and click "Update View".
    • Next to the "Send" label, select "File" and click "Create File".
    • Save to a file on your local disk, named e.g. "NC_000964.genbank".
  • GenbankFormat is a very rich (and messy) format, containing sequence information, features that have been annotated on the sequence (such as genes) and literature references. The first thing to do is to extract the sequence (FastaFormat) and annotated features (GffFormat).
    • Download this Perl script into your working directory:
    • Make the script executable: chmod +x followed by the script's filename
    • Run the script with the genome file as its argument: NC_000964.genbank
    • This should create files called NC_000964.fasta and NC_000964.gff
    • Type cat NC_000964.gff | cut -f 3 | sort -u to get a quick list of the feature types present in the GFF file
    • Type grep gene NC_000964.gff >NC_000964.genes.gff to extract the protein-coding gene coordinates into a separate file (grep keeps every line that contains the string "gene")
  • Download Glimmer from here (or use the copy in /home5/be131/src/ and skip the following steps)
    • Uncompress the downloaded file by typing tar -xvzf glimmer302.tar.gz
    • This should create a directory glimmer3.02
    • Go into the glimmer3.02/src directory and type make
    • If you are running your own downloaded Glimmer, set the awkpath and glimmer path in the Glimmer scripts to the appropriate values. This has already been done in /home5/be131/src/glimmer3.02/scripts/g3-from-scratch.csh, so if you use that version you can skip this step.
  • The long-orfs program is one of the steps in the Glimmer analysis process. It outputs a list of long potential genes, where a potential gene is the portion of an ORF from the first start codon to the stop codon at its end. Run long-orfs on its own and inspect the results (see the readme for more details on what long-orfs does):
    • Type /home5/be131/src/glimmer3.02/bin/long-orfs NC_000964.fasta longOutput.txt
  • Run Glimmer as follows: /home5/be131/src/glimmer3.02/scripts/g3-from-scratch.csh NC_000964.fasta tag
    • What, in outline, are the various steps performed by this shell script? (Hint: The glimmer3.02/doc/glim302notes.pdf file for the Glimmer program might be helpful here)
    • The output goes into the file tag.predict. Type less tag.predict to inspect this file.
  • Convert the Glimmer coordinates to GFF. First download this Perl script into your working directory:
    • As usual, make the script executable: chmod +x
    • Run the script on the Glimmer coordinates and save the converted GFF output to your working directory, using the arguments: NC_000964 tag.predict >output.gff
  • Compare your output to the Genbank annotation:
    • Make the comparison script executable (chmod +x), then run it with the arguments: NC_000964.gff output.gff >intersected.gff
    • See the options the script accepts by running it with -h
    • How many genes are (i) annotated in Genbank, (ii) predicted by Glimmer?
    • How many of the Glimmer-predicted genes have overlap with the Genbank annotations?
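The counting questions above can also be checked independently of the provided comparison script. The following Python sketch is not that script. It assumes standard tab-separated GFF columns (sequence name, source, type, start, end, score, strand, ...), treats two features as overlapping if they share at least one base, and by default requires them to be on the same strand. The function names parse_gff and count_overlapping are illustrative.

```python
from bisect import bisect_right
from collections import defaultdict


def parse_gff(lines, feature_type=None):
    """Return (seqid, start, end, strand) tuples from GFF lines.

    If feature_type is given, only rows whose third column equals it are kept.
    """
    feats = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a feature row
        if feature_type is not None and cols[2] != feature_type:
            continue
        start, end = sorted((int(cols[3]), int(cols[4])))
        feats.append((cols[0], start, end, cols[6]))
    return feats


def count_overlapping(predicted, reference, same_strand=True):
    """Count predicted features that overlap at least one reference feature."""
    groups = defaultdict(list)
    for seqid, start, end, strand in reference:
        groups[(seqid, strand if same_strand else None)].append((start, end))

    # Per group: sorted start positions plus the running maximum of the ends,
    # so each query is answered with a single binary search.
    index = {}
    for key, intervals in groups.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        max_ends, running = [], 0
        for _, e in intervals:
            running = max(running, e)
            max_ends.append(running)
        index[key] = (starts, max_ends)

    hits = 0
    for seqid, start, end, strand in predicted:
        key = (seqid, strand if same_strand else None)
        if key not in index:
            continue
        starts, max_ends = index[key]
        k = bisect_right(starts, end)  # references starting at or before our end
        if k and max_ends[k - 1] >= start:
            hits += 1
    return hits
```

To apply it to the files from this exercise, one could read NC_000964.genes.gff and output.gff with open(...), pass each file handle to parse_gff, and take len() of each result for questions (i) and (ii). The feature type names written to output.gff are not stated here, so leaving feature_type=None for that file is the safer choice. The sequence names in the first column must match between the two files for any overlaps to be found.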


Continue to work on the sequence alignment homework from last week!

Broken Telephone homework due Friday!
