
Bacterial Gene Prediction

Goals

  • Find open reading frames (ORFs) in a bacterial genome
    • Find the protein translations
  • Compare different gene prediction methods:
    • Compare predicted ORFs to "trusted" annotation
    • Compare to a bacterial genefinding program (Glimmer)
    • Compare to a translated-protein homology search program (exonerate)
  • Build experience manipulating genomes and genome annotations
    • some basic guided data-processing tasks

Data

Software

  • Various Perl scripts. You can download them from the in-text links, copy them from /home5/be131/BacterialGenePrediction/, or run them directly from that directory.
  • Glimmer (in /home5/be131/src/)
  • exonerate (in /home5/be131/)

Procedure

  • Download the B. subtilis genome from Genbank: NC_000964
    • Note that NC_000964 is the Genbank accession number for a particular B. subtilis genome: strain 168, published in Nature in 1997.
      • What other info can you gather from the Genbank record?
    • Next to the "Display" label, select "Genbank(Full)".
    • Select "Show sequence" under "Display options" and click "Update View".
    • Next to the "Send" label, select "File" and click "Create File".
    • Save the result to a file on your local disk, named e.g. "NC_000964.genbank".
  • GenbankFormat is a very rich (and messy) format, containing sequence information, features that have been annotated on the sequence (such as genes) and literature references. The first thing to do is to extract the sequence (FastaFormat) and annotated features (GffFormat).
    • Download this Perl script into your working directory: parse-genbank.pl
    • Make the script executable: chmod +x parse-genbank.pl
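    • Run the script on the genome file: ./parse-genbank.pl NC_000964.genbank (the leading ./ is needed if your working directory is not on your PATH; the same applies to the other downloaded scripts below)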
    • This should create files called NC_000964.fasta and NC_000964.gff
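    • Type cut -f 3 NC_000964.gff | sort -u to get a quick list of the feature types in the GFF file
    • Type grep gene NC_000964.gff > NC_000964.genes.gff to extract the protein-coding gene coordinates into a separate file. Note that grep matches "gene" anywhere on a line, not only in the feature-type column, so check that the result contains only the features you expect.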
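The two shell one-liners above can also be written as a short script, which makes the difference between "the line contains the word gene" and "column 3 is exactly gene" easy to see. This is only a sketch: the function names are made up for illustration, the example records are invented (they are not real NC_000964 annotation), and it assumes tab-separated GFF with the feature type in column 3.

```python
from collections import Counter


def parse_gff_line(line):
    """Return the tab-separated columns of a GFF line, or None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    return line.split("\t")


def feature_type_counts(lines):
    """Count how often each feature type (column 3) occurs."""
    counts = Counter()
    for line in lines:
        cols = parse_gff_line(line)
        if cols is not None and len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


def select_features(lines, feature_type):
    """Keep only the lines whose column 3 equals feature_type exactly."""
    kept = []
    for line in lines:
        cols = parse_gff_line(line)
        if cols is not None and len(cols) >= 3 and cols[2] == feature_type:
            kept.append(line)
    return kept


# Invented example records, for illustration only.
example = [
    "demo\tgenbank\tgene\t100\t400\t.\t+\t.\tID=g1",
    "demo\tgenbank\tCDS\t100\t400\t.\t+\t0\tParent=g1;note=gene product",
    "demo\tgenbank\ttRNA\t500\t575\t.\t-\t.\tID=t1",
]

print(sorted(feature_type_counts(example).items()))
# [('CDS', 1), ('gene', 1), ('tRNA', 1)]
print(len(select_features(example, "gene")))          # 1: exact match on column 3
print(sum("gene" in line for line in example))        # 2: substring match, as grep does
```

To run it on the real file, pass open("NC_000964.gff") in place of the example list.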
  • Download Glimmer (or alternatively use the copy in /home5/be131/src/ and skip the following steps)
    • Uncompress the downloaded file by typing tar -xvzf glimmer302.tar.gz
    • This should create a directory glimmer3.02
    • Go into the glimmer3.02/src directory and type make
    • Alter the awkpath and glimmer path to the appropriate values in the Glimmer scripts. This has already been done in /home5/be131/src/glimmer3.02/scripts/g3-from-scratch.csh, so you only need to do it if you are running your own downloaded copy of Glimmer.
  • The long-orfs program is one of the steps in the Glimmer analysis process. It outputs a list of long "potential genes", where a potential gene is the portion of an ORF from its first start codon to the stop codon at its end. Run long-orfs on its own and inspect the results (see the readme for more details on what long-orfs does):
    • Type /home5/be131/src/glimmer3.02/bin/long-orfs NC_000964.fasta longOutput.txt
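To get a feel for what "the portion of an ORF from the first start codon to the stop codon" means, here is a minimal ORF scanner with translation. It is a teaching sketch, not the long-orfs algorithm (it does no overlap filtering or scoring). The start-codon set ATG/GTG/TTG, the function names and the toy sequence are assumptions made for this example; coordinates are reported 1-based and inclusive, as in GFF.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons for this sketch

# Standard genetic code, built from the usual TCAG-ordered amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse-complement a DNA string (characters other than ACGT are left as-is)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq, min_len):
    """Return 0-based half-open (start, end) spans from first start codon to stop, per frame."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None and i + 3 - start >= min_len:
                    found.append((start, i + 3))
                start = None  # begin looking for the next ORF in this frame
            elif start is None and codon in STARTS:
                start = i     # remember only the first start codon after a stop
    return found


def find_orfs(seq, min_len):
    """Scan both strands; return sorted (start, end, strand) in 1-based inclusive coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s + 1, e, "+") for s, e in orfs_in_strand(seq, min_len)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_len):
        orfs.append((n - e + 1, n - s, "-"))  # map back to forward-strand coordinates
    return sorted(orfs)


def translate(orf_seq):
    """Translate an ORF up to its stop codon; the initiating codon is reported as M."""
    protein = []
    for i in range(0, len(orf_seq) - 2, 3):
        aa = CODON_TABLE.get(orf_seq[i:i + 3], "X")  # X for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    if protein and orf_seq[:3] in STARTS:
        protein[0] = "M"
    return "".join(protein)


toy = "CCATGAAATTTGGGTAACC"  # invented sequence, for illustration only
print(find_orfs(toy, 12))    # [(3, 17, '+')]
print(translate(toy[2:17]))  # MKFG
```

On a real genome you would use a much larger min_len than 12; the small value here just lets the toy sequence produce a hit.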
  • Run Glimmer as follows: /home5/be131/src/glimmer3.02/scripts/g3-from-scratch.csh NC_000964.fasta tag
    • What, in outline, are the various steps performed by this shell script? (Hint: The glimmer3.02/doc/glim302notes.pdf file for the Glimmer program might be helpful here)
    • The output goes into the file tag.predict. Type less tag.predict to inspect this file.
  • Convert the Glimmer coordinates to GFF. First download this Perl script into your working directory: glimmer2gff.pl
    • As usual, make the script executable: chmod +x glimmer2gff.pl
    • Run the script on the Glimmer coordinates and save the converted GFF output to your working directory: glimmer2gff.pl NC_000964 tag.predict >output.gff
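If you are curious what such a conversion involves, the sketch below turns prediction lines into GFF records. It is not the glimmer2gff.pl script. It assumes each prediction line has the columns ID, start, end, frame and score, that reverse-strand genes are listed with start greater than end, and that header lines begin with ">"; check these against your own tag.predict. The function name, the "glimmer3" source label, the "gene" feature type and the example lines are all made up for illustration.

```python
def predict_to_gff(lines, seq_id, source="glimmer3"):
    """Convert Glimmer-style prediction lines to 9-column GFF lines (a sketch, see assumptions)."""
    gff = []
    for line in lines:
        fields = line.split()
        if not fields or line.startswith(">"):
            continue  # skip blank lines and the sequence header
        orf_id, start, end = fields[0], int(fields[1]), int(fields[2])
        frame, score = fields[3], fields[4]
        strand = "-" if frame.startswith("-") else "+"
        # GFF wants start <= end on both strands, so order the coordinates.
        # Caveat: a gene that wraps around the origin of a circular genome
        # would be mishandled by this simple min/max.
        low, high = min(start, end), max(start, end)
        gff.append("\t".join(
            [seq_id, source, "gene", str(low), str(high), score, strand, ".", f"ID={orf_id}"]
        ))
    return gff


# Invented example lines in the assumed format, for illustration only.
example_predict = [
    ">demo",
    "orf00001      100      400  +1     9.50",
    "orf00002      900      600  -3     7.25",
]

for record in predict_to_gff(example_predict, "demo"):
    print(record)
```

The key design point is that GFF stores the strand in its own column and always lists the smaller coordinate first, whereas the prediction file (under the assumption above) encodes strand in the coordinate order and the frame sign.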
  • Compare your output to the Genbank annotation using gffintersect.pl:
    • chmod +x gffintersect.pl; gffintersect.pl NC_000964.gff output.gff >intersected.gff
    • See the options for the gffintersect.pl script by typing gffintersect.pl -h
    • How many genes are (i) annotated in Genbank, (ii) predicted by Glimmer?
    • How many of the Glimmer-predicted genes have overlap with the Genbank annotations?
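As a cross-check on the counts you get from gffintersect.pl, the overlap question can be answered with a few lines of code. The sketch below is not how gffintersect.pl works internally; it is a plain all-against-all comparison, which is slow but simple. It assumes 9-column tab-separated GFF (start in column 4, end in column 5, strand in column 7); the function names and the example records are invented.

```python
def gff_intervals(lines):
    """Extract (start, end, strand) from GFF lines, skipping comments and blank lines."""
    intervals = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        intervals.append((int(cols[3]), int(cols[4]), cols[6]))
    return intervals


def count_overlapping(predicted, annotated, same_strand=True):
    """Count predicted intervals that overlap at least one annotated interval."""
    count = 0
    for p_start, p_end, p_strand in predicted:
        for a_start, a_end, a_strand in annotated:
            overlaps = p_start <= a_end and a_start <= p_end  # inclusive coordinates
            if overlaps and (not same_strand or p_strand == a_strand):
                count += 1
                break  # one hit is enough for this prediction
    return count


# Invented example records, for illustration only.
annotated = gff_intervals([
    "demo\tgenbank\tgene\t100\t400\t.\t+\t.\tID=g1",
    "demo\tgenbank\tgene\t500\t575\t.\t-\t.\tID=g2",
])
predicted = gff_intervals([
    "demo\tglimmer3\tgene\t120\t400\t9.5\t+\t.\tID=orf00001",
    "demo\tglimmer3\tgene\t560\t700\t7.2\t+\t.\tID=orf00002",
    "demo\tglimmer3\tgene\t1000\t1200\t8.1\t-\t.\tID=orf00003",
])

print(len(annotated), len(predicted))                          # 2 3
print(count_overlapping(predicted, annotated))                 # 1
print(count_overlapping(predicted, annotated, same_strand=False))  # 2
```

Whether an overlap on the opposite strand should count as a match is a judgment call, which is why it is exposed as a parameter here. Any overlap at all counts as a hit in this sketch; you may want a stricter criterion, such as requiring the same stop codon position.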

Homework

Continue to work on the sequence alignment homework from last week!

Broken Telephone homework due Friday!

Copyright © 2008-2014 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.