Bacterial Gene Prediction

Goals

  • Find open reading frames (ORFs) in a bacterial genome
    • Find the protein translations
  • Compare different gene prediction methods:
    • Compare predicted ORFs to "trusted" annotation
    • Compare to a bacterial genefinding program (Glimmer)
    • Compare to a translated-protein homology search program (exonerate)
  • Build experience manipulating genomes and genome annotations
    • some basic guided data-processing tasks
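To make the first goal concrete, here is a minimal Python sketch of a six-frame ORF scan with translation. It is not how Glimmer or long-orfs works internally. The function names, the `min_len` threshold, and the ATG-only start rule are assumptions made for illustration; real bacterial genes may also start with codons such as GTG or TTG.

```python
# Illustrative ORF finder (hypothetical helper names, not part of the course scripts).

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(seq, min_len=90, starts=("ATG",)):
    """Yield (start, end, strand, protein) for each ORF of at least min_len bases.

    An ORF here runs from the first start codon in a frame to the next
    in-frame stop codon (stop included in the coordinates, not in the protein).
    Coordinates are 1-based, inclusive, and given on the forward strand.
    """
    seq = seq.upper()
    n = len(seq)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon in starts:
                    orf_start = i
                elif orf_start is not None and CODON_TABLE.get(codon) == "*":
                    end = i + 3  # exclusive end, stop codon included
                    if end - orf_start >= min_len:
                        protein = translate(s[orf_start:i])
                        if strand == "+":
                            yield (orf_start + 1, end, strand, protein)
                        else:
                            # Map reverse-strand offsets back to forward coordinates.
                            yield (n - end + 1, n - orf_start, strand, protein)
                    orf_start = None
```

For a whole genome you would read the sequence from the FASTA file into one string and pass it to `find_orfs`; a larger `min_len` reduces the number of spurious short ORFs.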

Data

Software

  • Various Perl scripts. You can download them from the in-text links, copy them from ~be131/BacterialGenePrediction/, or run them directly from that directory.
  • Glimmer ( in ~be131/src/ )
  • exonerate ( in ~be131/src/ )

Procedure

  • Download the B. subtilis genome from here: Genbank:NC_000964
    • Note that NC_000964 is the Genbank accession number for a particular B. subtilis genome: strain 168, published in Nature in 1997...
      • What other info can you gather from the Genbank record?
    • Next to the "Display" label, select "Genbank(Full)".
    • Unselect the Hide checkboxes for "Sequence" and "all but gene, CDS and mRNA features".
    • Next to the "Show" label, change "Send to" to "File". Ignore the first file it tries to send you.
    • Click the "Refresh" button and save the result to a file on your local disk, named e.g. "NC_000964.genbank". The file size should be >8MB. Verify that the file contains both the sequence and the annotations.
  • GenbankFormat is a very rich (and messy) format, containing sequence information, features that have been annotated on the sequence (such as genes) and literature references. The first thing to do is to extract the sequence (FastaFormat) and annotated features (GffFormat).
    • Download this Perl script into your working directory: parse-genbank.pl
    • Make the script executable: chmod +x parse-genbank.pl
    • Run the script on the genome file: parse-genbank.pl NC_000964.genbank
    • This should create files called NC_000964.fasta and NC_000964.gff
    • Type cat NC_000964.gff |cut -f 3|sort -u to get a quick indication of the types of feature in the GFF file
    • Type grep gene NC_000964.gff >NC_000964.genes.gff to extract the protein-coding gene co-ordinates into a separate file
  • Download Glimmer from here (or just use the copy in ~be131/src/)
    • Uncompress the downloaded file by typing tar -xvzf glimmer302.tar.gz
    • This should create a directory glimmer3.02
    • Go into the glimmer/src directory and type make
    • Set the awkpath and glimmer path to the appropriate values in the Glimmer scripts. I've already done this in ~be131/src/glimmer3.02/scripts/g3-from-scratch, so you only need to do it yourself if you are running your own downloaded copy of Glimmer.
  • Move (or copy) the FASTA file for the B.subtilis genome into the Glimmer directory (where you should now be): mv ../NC_000964.fasta .
  • The long-orfs program is used as one of the steps in the Glimmer analysis process. It outputs a list of long potential genes, where a potential gene is the portion of an ORF from the first start codon to the stop codon at the end. Run long-orfs on its own and inspect the results:
    • Type long-orfs NC_000964.fasta longOutput.txt
  • Run Glimmer as follows: setenv PATH .:$PATH; ./bin/g3-from-scratch NC_000964.fasta tag
    • Have a look at the shell script g3-from-scratch
      • What does the first part of the above command do (setenv PATH .:$PATH)? Why do you need this - or do you? (Try opening a new terminal session, and running the command without the initial setenv)
      • What, in outline, are the various steps performed by this shell script? (Hint: The glimmer/doc/glim302notes.pdf file for the Glimmer program might be helpful here)
      • Note on setenv: this is a built-in command of the C shell. In Bash (Bourne Again SHell), use export PATH=.:$PATH as the first part of your command instead.
    • The output goes into the file tag.predict. Type less tag.predict to inspect this file.
  • Convert the Glimmer coordinates to GFF. First download this Perl script into your working directory: glimmer2gff.pl
    • As usual make the script executable: chmod +x glimmer2gff.pl
    • Run the script on the Glimmer coordinates and save the converted GFF output to your working directory: glimmer2gff.pl NC_000964 tag.predict >output.gff
  • Compare your output to Genbank annotation using gffintersect.pl:
    • chmod +x gffintersect.pl; gffintersect.pl NC_000964.gff output.gff >intersected.gff
    • Check out the options of the gffintersect.pl script by typing gffintersect.pl -h
    • How many genes are (i) annotated in Genbank, (ii) predicted by Glimmer?
    • How many of the Glimmer-predicted genes have a significant overlap (50% or more) with the Genbank annotations?
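As a cross-check on the counts asked for above, the comparison can also be sketched directly in Python. This is not a reimplementation of gffintersect.pl. The function names are hypothetical, and "50% overlap" is assumed here to mean that at least half of the predicted gene's length is covered by a single annotated gene on the same strand; other definitions are equally reasonable and will give different numbers.

```python
from collections import Counter


def parse_gff(lines):
    """Return (feature_type, start, end, strand) tuples from GFF lines.

    Assumes tab-separated GFF with the feature type in column 3,
    start/end in columns 4-5 (1-based, inclusive) and strand in column 7.
    """
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # skip malformed lines
        features.append((cols[2], int(cols[3]), int(cols[4]), cols[6]))
    return features


def overlap_fraction(a, b):
    """Fraction of inclusive interval a = (start, end) that is covered by b."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    return max(shared, 0) / (a[1] - a[0] + 1)


def count_overlapping(predicted, annotated, min_frac=0.5):
    """Count predicted features covered at least min_frac by a same-strand annotation.

    Simple all-against-all scan: fine for a few thousand genes,
    too slow for much larger feature sets.
    """
    hits = 0
    for _, p_start, p_end, p_strand in predicted:
        if any(a_strand == p_strand
               and overlap_fraction((p_start, p_end), (a_start, a_end)) >= min_frac
               for _, a_start, a_end, a_strand in annotated):
            hits += 1
    return hits
```

A possible use: read NC_000964.genes.gff and output.gff with `open(...)`, pass each file to `parse_gff`, use `Counter(f[0] for f in features)` to tally feature types, and call `count_overlapping(predicted, annotated)` for the overlap count.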

Homework

Do both of the following homeworks:
