Bacterial Gene Prediction
Goals
- Find open reading frames (ORFs) in a bacterial genome
- Find the protein translations
- Compare different gene prediction methods:
- Compare predicted ORFs to "trusted" annotation
- Compare to a bacterial genefinding program (Glimmer)
- Compare to a translated-protein homology search program (exonerate)
- Build experience manipulating genomes and genome annotations
- some basic guided data-processing tasks
Data
Software
- Various Perl scripts. You can download them from the links in the text, copy them from
~be131/BacterialGenePrediction/, or run them directly from that directory
- Glimmer ( in
~be131/src/ )
- exonerate ( in
~be131/src/ )
Procedure
- Download the B. subtilis genome from here: Genbank:NC_000964
- Note that NC_000964 is the Genbank accession number for a particular B. subtilis genome: strain 168, published in Nature in 1997...
- What other info can you gather from the Genbank record?
- Next to the "Display" label, select "Genbank(Full)".
- Unselect the Hide checkboxes for "Sequence" and "all but gene, CDS and mRNA features".
- Next to the "Show" label, change "Send to" to "File". Ignore the first file it tries to send you.
- Click the "Refresh" button, and save to a file on your local disk, named e.g. "NC_000964.genbank". The file size should be >8MB. Verify that the file contains both the sequence and the annotations.
- GenbankFormat is a very rich (and messy) format, containing sequence information, features that have been annotated on the sequence (such as genes) and literature references. The first thing to do is to extract the sequence (FastaFormat) and annotated features (GffFormat).
- Download this Perl script into your working directory: parse-genbank.pl
- Make the script executable:
chmod +x parse-genbank.pl
- Run the script on the genome file:
parse-genbank.pl NC_000964.genbank
- This should create files called
NC_000964.fasta and NC_000964.gff
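To make the extraction step less of a black box, here is a minimal Python sketch of the same idea. It is not the parse-genbank.pl script: the function names (parse_genbank, to_fasta, to_gff) and the "genbank" source label are made up for illustration, and it assumes a single-record file with simple locations (start..end or complement(start..end)). Joined or otherwise compound locations are skipped.

```python
import re

# Feature key lines start with exactly 5 spaces; qualifier lines start with 21.
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?\s*$")

def parse_genbank(text):
    """Return (sequence, features) from the text of one GenBank record.

    features is a list of (type, start, end, strand) tuples.
    """
    features, seq_chunks, section = [], [], None
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif line.startswith("//"):
            section = None
        elif section == "features":
            m = FEATURE_RE.match(line)
            if m:  # compound locations such as join(...) do not match
                key, comp, start, end = m.groups()
                features.append((key, int(start), int(end), "-" if comp else "+"))
        elif section == "origin":
            # Sequence lines look like "  1 acgtacgtac gtacgtacgt"
            seq_chunks.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(seq_chunks).upper(), features

def to_fasta(seqid, sequence, width=60):
    """Format a sequence as FASTA text, wrapped at `width` columns."""
    lines = [">" + seqid]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"

def to_gff(seqid, features, source="genbank"):
    """Format features as 9-column, tab-separated GFF lines."""
    return ["\t".join([seqid, source, key, str(start), str(end), ".", strand, ".", "."])
            for key, start, end, strand in features]
```

Comparing the output of a sketch like this with the files produced by the real script is a quick way to see which feature types and location styles the simple version misses.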
- Type
cat NC_000964.gff |cut -f 3|sort -u to get a quick indication of the types of feature in the GFF file
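- Type
grep gene NC_000964.gff >NC_000964.genes.gff to extract the protein-coding gene co-ordinates into a separate file. Note that grep matches "gene" anywhere on a line, not just in the feature-type column, so check that the output contains only the feature types you intended.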
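The two shell one-liners above can also be sketched in Python, which makes the column handling explicit. This is only an illustration with made-up function names (feature_type_counts, select_by_type); it assumes tab-separated GFF with the feature type in column 3, and it matches the type exactly rather than as a substring.

```python
from collections import Counter

def feature_type_counts(gff_lines):
    """Count how often each feature type (GFF column 3) occurs."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

def select_by_type(gff_lines, wanted):
    """Keep only lines whose feature type is exactly `wanted`."""
    selected = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == wanted:
            selected.append(line)
    return selected
```

For example, select_by_type(open("NC_000964.gff"), "gene") keeps a line only when column 3 is exactly gene, whereas a substring match would also keep any line that merely mentions "gene" in its attributes.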
- Download Glimmer from here ( or just use from
~be131/src/ )
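- Uncompress the downloaded file by typing
tar -xvzf glimmer213.tar.gz
- This should create a directory
glimmer2.13
- Note that these names are for Glimmer 2.13, whereas the rest of this page uses Glimmer 3.02 (g3-from-scratch, glim302notes.pdf). Substitute the names of the file and directory you actually downloaded.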
- Go into the glimmer/src directory and type
make
- Set the awkpath and glimmer path in the Glimmer scripts to the appropriate values. I've already done this in
~be131/src/glimmer3.02/scripts/g3-from-scratch so you only need to do it if you're running your own downloaded copy of Glimmer
- Move (or copy) the FASTA file for the B. subtilis genome into the Glimmer directory (where you should now be):
mv ../NC_000964.fasta .
- The long-orfs program is used as one of the steps in the Glimmer analysis process. It outputs a list of long potential genes, where a potential gene is the portion of an ORF from the first start codon to the stop codon at the end. Run
long-orfs on its own and inspect the results:
- Type
long-orfs NC_000964.fasta longOutput.txt
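To see what "the portion of an ORF from the first start codon to the stop codon" means in practice, here is a small Python sketch of a six-frame scan. It is not the long-orfs algorithm (which, among other things, deals with overlapping candidates). The start codon set, the min_len default and the function names are assumptions chosen for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def strand_orfs(seq, min_len):
    """Yield (start, end), 1-based inclusive, for each potential gene on one strand.

    A potential gene runs from the first start codon after the previous
    in-frame stop up to and including the next in-frame stop codon.
    """
    for frame in range(3):
        first_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if first_start is not None and i + 3 - first_start >= min_len:
                    yield (first_start + 1, i + 3)
                first_start = None
            elif codon in STARTS and first_start is None:
                first_start = i

def find_orfs(seq, min_len=90):
    """Return sorted (start, end, strand) tuples in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s, e, "+") for s, e in strand_orfs(seq, min_len)]
    # Map coordinates found on the reverse complement back to the forward strand.
    orfs += [(n - e + 1, n - s + 1, "-") for s, e in strand_orfs(revcomp(seq), min_len)]
    return sorted(orfs)
```

Running find_orfs on the genome sequence with a large min_len and comparing the count with the number of lines in longOutput.txt gives a feel for how much extra filtering long-orfs does.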
- Run Glimmer as follows:
setenv PATH .:$PATH; ./bin/g3-from-scratch NC_000964.fasta tag
- Have a look at the shell script
g3-from-scratch
- What does the first part of the above command do (
setenv PATH .:$PATH)? Why do you need this - or do you? (Try opening a new terminal session, and running the command without the initial setenv)
- What, in outline, are the various steps performed by this shell script? (Hint: The
glimmer/doc/glim302notes.pdf file for the Glimmer program might be helpful here)
- Note on the built-in command
setenv: this command is defined in the C shell. In Bash (Bourne Again SHell), use export PATH=.:$PATH as the first part of your command instead.
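If the PATH question is puzzling, the lookup the shell performs can be imitated in Python. The sketch below uses the standard shutil.which function with an explicit list of directories. The helper name find_command is made up for illustration, and real shells add details (hashing, built-ins, aliases) that are ignored here.

```python
import os
import shutil

def find_command(command, search_dirs):
    """Return the path a PATH-style lookup would find, or None.

    search_dirs plays the role of the directories listed in $PATH.
    A command containing a slash (e.g. ./script.pl) is checked directly
    and the search directories are not consulted.
    """
    return shutil.which(command, path=os.pathsep.join(search_dirs))
```

With an executable script in the current directory, find_command("script.pl", ["/usr/bin"]) returns None unless /usr/bin happens to contain a script.pl, find_command("script.pl", [".", "/usr/bin"]) finds it, and find_command("./script.pl", ["/usr/bin"]) finds it without "." being in the search list at all.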
- The output goes into the file
tag.predict. Type less tag.predict to inspect this file.
- Convert the Glimmer coordinates to GFF. First download this Perl script into your working directory: glimmer2gff.pl
- As usual, make the script executable:
chmod +x glimmer2gff.pl
- Run the script on the Glimmer coordinates and save the converted GFF output to your working directory:
glimmer2gff.pl NC_000964 tag.predict >output.gff
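For a rough idea of what such a conversion involves, here is a Python sketch. It is not glimmer2gff.pl. It assumes each prediction line has five whitespace-separated fields (ID, start, end, frame, score), that header lines begin with ">", and that reverse-strand genes are listed with a negative frame. Check these assumptions against your own tag.predict. The function name, the "glimmer" source label and the ID= attribute are made up for illustration.

```python
def predict_to_gff(predict_lines, seqid, source="glimmer"):
    """Convert Glimmer-style prediction lines to 9-column GFF lines."""
    gff = []
    for line in predict_lines:
        fields = line.split()
        if line.startswith(">") or len(fields) < 5:
            continue  # skip header and malformed lines
        orf_id, a, b, frame, score = fields[:5]
        a, b = int(a), int(b)
        strand = "-" if frame.startswith("-") else "+"
        # GFF wants start <= end on both strands.
        # Caveat: a gene wrapping around the origin of a circular genome
        # would be mangled by this min/max step.
        start, end = min(a, b), max(a, b)
        gff.append("\t".join([seqid, source, "gene", str(start), str(end),
                              score, strand, ".", "ID=" + orf_id]))
    return gff
```

Something like predict_to_gff(open("tag.predict"), "NC_000964") then yields lines that can be compared by eye with output.gff.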
- Compare your output to Genbank annotation using gffintersect.pl:
-
chmod +x gffintersect.pl; gffintersect.pl NC_000964.gff output.gff >intersected.gff
- Check out the options of the
gffintersect.pl script by typing gffintersect.pl -h
- How many genes are (i) annotated in Genbank, (ii) predicted by Glimmer?
- How many of the Glimmer-predicted genes have a significant overlap (50% or more) with the Genbank annotations?
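One way to sanity-check your answer to the overlap question is to compute it independently. The Python sketch below assumes one particular definition of "50% overlap": at least half of the predicted gene's length is covered by a single annotated gene, ignoring strand. gffintersect.pl may define overlap differently, so treat the function names and the definition as illustrative assumptions.

```python
def overlap_fraction(a, b):
    """Fraction of interval a covered by interval b.

    Intervals are (start, end) tuples, 1-based and inclusive, as in GFF.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    return max(shared, 0) / (a[1] - a[0] + 1)

def count_overlapping(predicted, annotated, min_fraction=0.5):
    """Count predicted intervals covered at least min_fraction by some annotation."""
    return sum(
        1 for p in predicted
        if any(overlap_fraction(p, g) >= min_fraction for g in annotated)
    )
```

This compares every prediction against every annotation, which is slow but simple and should be tolerable for a few thousand genes. Sorting both lists by start position would allow a faster sweep.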
Homework
Do both of the following homeworks: