click on the Biowiki logo to go to homepage
Edit Raw Print
Links Diffs RSS
About Stats Recent


Research Teaching Blog
Fall09 | Sandbox
Biowiki > Teaching > BacterialGenePredictionF05

Search

Advanced search...

Topics

PageRank Checker

Goals

  • Find open reading frames (ORFs) in a bacterial genome
    • Find the protein translations
  • Compare different gene prediction methods:
    • Compare predicted ORFs to "trusted" annotation
    • Compare to a bacterial genefinding program (Glimmer)
    • Compare to a translated-protein homology search program (exonerate)
  • Build experience manipulating genomes and genome annotations
    • some basic guided data-processing tasks

Data

Software

  • GffTools: gffintersect.pl
    • gfftools available at ~be131/gfftools/ (you can either use the scripts directly from there or copy them to your own directory)
  • Glimmer
  • gff2ps
  • exonerate

Procedure

  • Once again, assessed tasks for the homework are in boldface.
  • Download B.subtilis genome from here: Genbank:NC_000964
    • note that NC_000964 is the Genbank accession number for a particular B.subtilis genome: strain 168, published in Nature in 1997...
      • What other info can you gather from the Genbank record?
    • next to the "Display" button, select "Genbank(Full)"
    • next to the "Send" button, select "all to file"
    • click on "Send", and save to a file on your local disk, entitled e.g. "NC_000964.genbank"
  • GenbankFormat is a very rich (and messy) format, containing sequence information, features that have been annotated on the sequence (such as genes) and literature references. The first thing to do is to extract the sequence (FastaFormat) and annotated features (GffFormat).
    • Download this Perl script into your working directory: parse-genbank.pl
    • Make the script executable: chmod +x parse-genbank.pl
    • Run the script on the genome file: ./parse-genbank.pl NC_000964.genbank
    • This should create files called NC_000964.fasta= and =NC_000964.gff
    • Type cat NC_000964.gff |cut -f 3|sort -u to get a quick indication of the types of feature in the GFF file
    • Type grep gene NC_000964.gff >NC_000964.genes.gff to extract the protein-coding gene co-ordinates into a separate file
  • Download Glimmer from here
    • Uncompress the downloaded file by typing tar -xvzf glimmer213.tar.gz
    • This should create a directory glimmer2.13
    • Go into this directory and build Glimmer: cd glimmer2.13; make
  • Move (or copy) the FASTA file for the B.subtilis genome into the Glimmer directory (where you should now be): mv ../NC_000964.fasta .
  • Run Glimmer as follows: setenv PATH .:$PATH; run-glimmer2 NC_000964.fasta
    • Have a look at the shell script run-glimmer2
      • What does the first part of the above command do (setenv PATH .:$PATH)? Why do you need this - or do you? (Try opening a new terminal session, and running the command without the initial setenv)
      • What, in outline, are the various steps performed by this shell script? (Hint: The README files for the Glimmer program might be helpful here)
      • Note on the built-in command setenv. This command is defined in the C shell. For Bash (Bourne Again SHell) try export PATH=.:$PATH as the first part of your command.
    • The output goes into the file g2.coord. Type less g2.coord to inspect this file.
  • Convert Glimmer coords to GFF by first downloading this relevant perl script into your working directory: glimmer2gff.pl
    • As usual make the script executable: chmod +x glimmer2gff.pl
    • Run the script on the Glimmer coordinates and save the converted GFF output to your working directory: ./glimmer2gff.pl NC_000964 ./g2.coord >output.gff
  • Compare your output to Genbank annotation using gffintersect.pl:
    • chmod +x gffintersect.pl; ./gffintersect.pl NC_000964.gff output.gff >intersected.gff
    • Check out the options associated with gffintersect.pl script by typing gffintersect.pl -h
    • How many genes are (i) annotated in Genbank, (ii) predicted by Glimmer?
    • How many of the Glimmer-predicted genes have a significant overlap (50% or more) with the Genbank annotations?
  • Try long-orfs on its own before trying glimmer and inspect the results.
    • The program takes a sequence file in FASTA format and outputs a list of all long potential genes on the portion of an ORF from the first start codon to stop codon at the end (from ./long-orfs.readme file).
  • (If time) Download exonerate executable compressed file here and try comparing/matching the sequences containing genes with the original complete sequence of the B.subtilis genome:
    • Unzip the zipped file: tar -xvzf exonerate-1.0.0-linux.tar.gz
    • A file named exonerate-1.0.0-linux will be saved onto your current directory.
    • exonerate executable file is located at exonerate-1.0.0-linux/exonerate/bin/exonerate. Move that to your working directory for the future.
      • Note that exonerate accepts sequence files (i.e. FASTA format files) to compare and align. Your recent output file is in GFF format; so you will need to change the format to FASTA. Use gff2seq.pl perl script for this matter. The script needs three different perl modules as follows. Save them into the same folder as you saved gff2seq.pl script: SeqFileIndex.pm, FileIndex.pm, and GFFTransform.pm.
      • Make the script executable: chmod +x gff2seq.pl
      • Run the script and save the output as follows: ./gff2seq.pl NC_000964.fasta output.gff >output.fasta (Note that the NC_000964.fasta file needs to be in the same directory as gff2seq.pl, so make sure you move/copy that to the right place if things are not working)
      • Match your query output file with the target file containing the complete genome using exonerate: ./exonerate output.fa NC_000964.fasta --showtargetgff yes --model coding2coding --bigseq yes --softmasktarget
      • You may want to save your output to see and inspect the details of your matching.
  • (If time) Run gff2ps to plot the output
    • Download from here
    • Unzip the .gz file using gunzip
    • Then, using your favorite text editor, open up the gff2ps_v0.98d file, find the line
      GAWK="/usr/local/bin/gawk";
      
      and replace it with
      GAWK="/usr/bin/gawk";
      
    • Run the gff2ps_v0.98d program and redirect your results to a .ps file

Procedure

  • What other info can you gather from the Genbank record?
    The Genbank format contains a lot of data about genomes, including the number of base pairs, the various identifiers for this genome, as well as a list of reference literature. It also lists features found in the genome such as genes, coding sequences, introns, exons, tRNAs, etc. For each features, it shows the location, names associated with the feature, and links to the individual records for the features. At the end of the Genbank file, you can find the complete sequence of the genome.

  • What, in outline, are the various steps performed by this shell script? (Hint: The README files for the Glimmer program might be helpful here)
    • The script first uses the program long-orfs to find a list of potential genes in the genome FASTA file, using sequence lengths between start and stop codons as the main criterion.
    • The results of long-orfs are first processed to an appropriate format (by get-putative) and then passed to extract, which retrieves the actual sequence for each of the potential genes found.
    • The sequences of the potential genes are next passed to build-icm, which constructs an interpolated Markov model based on this set of genes.
    • Finally, using the model and the original genome sequence, the program glimmer2 predicts genes in the genome over a number of iterations.

  • How many genes are (i) annotated in Genbank, (ii) predicted by Glimmer?
    The number of genes annotated in Genbank can be determined by counting the number of lines in NC_000964.genes.gff using the UNIX command wc. The result is 2198 genes.

    The number of genes predicted by Glimmer is printed to the screen when you run run-glimmer2 and you can also count the number of lines in output.gff. The result is 4623.

  • How many of the Glimmer-predicted genes have a significant overlap (50% or more) with the Genbank annotations?
    Using the script gffintersect.pl, we can determine the number of Glimmer-predicted genes that overlap with Genbank annotations. Since we would like to check each of the Glimmer genes, the file generated by Glimmer (output.gff) should be passed in as the second file. To specify we want at least 50% overlap, we should use the -minfrac2 0.5 option. (I accepted the answer from using the -minfrac1 0.5 option as well, since it seems like you can also interpret the question this way.) The instructions above says to compare the predictions to NC_000964.gff but actually, we should have compared to NC_000964.genes.gff since we're interested only in the annotations for genes.

    $ ./gffintersect.pl -minfrac2 0.5 NC_000964.genes.gff output.gff | wc

    would give 2194 as a result.

    $ ./gffintersect.pl -minfrac1 0.5 NC_000964.genes.gff output.gff | wc

    would give 2114 as a result (or 2117 if you compared it to NC_0009634.gff).

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...