Pathway Mining
Tools
This practical involves using getuniprot.pl, a script that extracts, in FASTA format, the subset of UniProt that
- has one or more annotations to a given GO term or to any of its descendants; and
- belongs to an organism at or below a given taxon ID in the NCBI taxonomy database.
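To run the getuniprot.pl script, you will also need, in your home directory, copies of (or symbolic links to) the following files: the GO flatfile, the pathlist, the GO gene associations, the UniProt database and the NCBI taxonomy database.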
The syntax to invoke the script is as follows:
getuniprot.pl 0015837 2
where "0015837" is the Gene Ontology identifier and "2" is the NCBI taxonomic node identifier.
You will compare these protein sequences directly to bacterial genomes using programs like exonerate and BLAST.
The general idea is to practise running these tools from the Unix command line rather than through their web interfaces (though you can try the web interfaces if you like).
Procedure
You can use this lab as a chance to practise some pathway mining skills
(which, depending on your plan, may or may not be relevant to your final project).
- Download, or make symbolic links to, the files required for getuniprot.pl (the GO flatfile, the pathlist, the GO gene associations, the UniProt database and the NCBI taxonomy database).
- To make a symbolic link in your current directory to a file located at /path/filename, type ln -s /path/filename . (don't forget the final dot).
- Inspect the NCBI taxonomy database.
- Browse the readme.txt, names.dmp and nodes.dmp files by typing the following, replacing filename with each file name in turn: tar -Oxzf taxdump.tar.gz filename | more
- Type tar --help to see more tar options.
- Identify the numeric IDs of the nodes of the NCBI taxonomy tree associated with the following groups
- Eubacteria (i.e. bacteria);
- Actinobacteria;
- Chlamydiae.
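As an alternative to paging through names.dmp by eye, the lookup can be scripted. The sketch below assumes the usual names.dmp layout (tax ID, name, unique name and name class, separated by a tab, a vertical bar and a tab); the function name taxid_for_name is made up for illustration.

```shell
# Print the tax ID(s) whose name in names.dmp matches the given name exactly
# (case-insensitive). Reads names.dmp on standard input.
# Assumed layout: tax_id <tab>|<tab> name <tab>|<tab> unique name <tab>|<tab> name class
taxid_for_name() {
    awk -F '\t[|]\t' -v name="$1" \
        'tolower($2) == tolower(name) { print $1 }' | sort -un
}

# Usage, streaming the file out of the archive as in the step above:
# tar -Oxzf taxdump.tar.gz names.dmp | taxid_for_name "Chlamydiae"
```

A name can appear under several name classes (scientific name, synonym and so on), so the output is de-duplicated; if more than one ID is printed, check nodes.dmp or the NCBI taxonomy browser to see which node you want.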
- Using the AmiGO browser, or otherwise, identify the numeric Gene Ontology identifiers associated with
- cell wall biosynthesis in bacteria (note that GO uses "sensu Bacteria" to denote that the process is located in bacteria);
- cellulose catabolism;
- quorum sensing;
- pathogenesis.
- For each of the above GO terms, use the getuniprot.pl script to retrieve all bacterial proteins in UniProt annotated with that term (or its descendants).
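Running the script once per GO term is easy to automate with a loop. The following is only a sketch: it assumes getuniprot.pl writes its FASTA output to standard output (not confirmed here), and the GETUNIPROT variable, the fetch_go_proteins function and the output file names are all invented for illustration.

```shell
# Run getuniprot.pl once per GO identifier and save each result to its own file.
# GETUNIPROT is a hypothetical variable naming the script; adjust the path as needed.
# Assumption: the script prints FASTA to standard output.
GETUNIPROT=${GETUNIPROT:-getuniprot.pl}

fetch_go_proteins() {
    taxon=$1    # NCBI taxonomic node identifier
    shift       # remaining arguments are GO identifiers
    for go_id in "$@"; do
        out="GO_${go_id}_taxon_${taxon}.fasta"
        "$GETUNIPROT" "$go_id" "$taxon" > "$out" || echo "failed: $go_id" >&2
        echo "$out"    # report the file that was written
    done
}

# Example with the identifiers from the syntax line in the Tools section;
# append the GO identifiers you found to the argument list.
# fetch_go_proteins 2 0015837
```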
- Find and download the following bacterial genomes and extract the genomic DNA sequence in FASTA format:
- Escherichia coli;
- Bacillus subtilis;
- Bacillus anthracis (Anthrax);
- Chlamydia trachomatis (Chlamydia);
- Neisseria meningitidis (bacterial meningitis).
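Once a genome has been downloaded, it is worth checking that the file really is FASTA and has a plausible size before searching against it. A minimal sketch of such a check (the function name fasta_stats is hypothetical):

```shell
# Print the number of sequences and the total number of residues in a FASTA file.
fasta_stats() {
    awk '/^>/ { n++; next }                            # header line: count a record
         { gsub(/[ \t\r]/, ""); len += length($0) }    # sequence line: add its length
         END { printf "%d sequences, %d residues\n", n, len }' "$1"
}

# Usage: fasta_stats GENOME.fasta
```

A complete bacterial genome would typically show a handful of sequences (chromosome plus any plasmids) totalling a few million residues; a very different result suggests that the wrong file was downloaded.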
- For each of the above GO terms and each of the above bacteria (or, at your option, some subset of these, plus any other bacteria or GO terms that interest you), try comparing the protein sequences directly to all six conceptual translation frames of the genome sequence.
- You can do this, for example, using the exonerate program as follows: ~be131/exonerate/bin/exonerate --model protein2dna PROTEIN.fasta GENOME.fasta, where PROTEIN.fasta is the file of protein sequences and GENOME.fasta is the genome (both in FASTA format).
- Alternatively, you could use tblastn.
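With several protein sets and several genomes, the comparisons can be run in a nested loop. The sketch below assumes the protein files sit in one directory and the genomes in another, all with a .fasta suffix; the EXONERATE and RUN variables, the compare_all function and the output naming scheme are made up for illustration.

```shell
# Location of exonerate; e.g. set EXONERATE=~be131/exonerate/bin/exonerate
EXONERATE=${EXONERATE:-exonerate}
# Hypothetical dry-run switch: with RUN=echo, each command line is written
# to the output file instead of being executed.
RUN=${RUN:-}

compare_all() {
    # $1: directory of protein FASTA files, $2: directory of genome FASTA files
    for prot in "$1"/*.fasta; do
        for genome in "$2"/*.fasta; do
            out="$(basename "$prot" .fasta)_vs_$(basename "$genome" .fasta).exonerate"
            $RUN "$EXONERATE" --model protein2dna "$prot" "$genome" > "$out"
        done
    done
}

# Usage: compare_all proteins genomes
```

Keeping one output file per protein set and genome makes it easier to compare hit counts and scores afterwards.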
- Each time you run exonerate (or tblastn), inspect the results and keep an eye on the following:
- the number of hits you get;
- the scores of those hits.
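Hit counts and scores can be pulled out of saved exonerate output rather than read off the screen. This sketch assumes exonerate was run with its default output, where each alignment is followed by a summary line starting with "vulgar:" whose tenth whitespace-separated field is the raw score; check this against your own output before relying on it. The function name summarise_hits is hypothetical.

```shell
# Summarise exonerate output: number of hits and the range of raw scores.
# Assumption: default "vulgar:" lines are present, with the score in field 10.
summarise_hits() {
    awk '$1 == "vulgar:" {
             s = $10 + 0                  # force numeric comparison
             n++
             if (n == 1 || s > max) max = s
             if (n == 1 || s < min) min = s
         }
         END {
             if (n) printf "%d hits, score range %d-%d\n", n, min, max
             else   print "0 hits"
         }' "$1"
}

# Usage: summarise_hits results.exonerate
```

To list the individual hits from best to worst instead, something like grep '^vulgar:' results.exonerate | sort -k10,10nr should work under the same assumption.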
- Read the exonerate beginner's guide and manual page to get an idea of some of the command-line options you can use with exonerate.
- Try re-running exonerate with different command-line options. To what extent does this affect (i) the results, (ii) the computation time?
- Review the concept of an operon. Try searching PubMed for "operon prediction". Can you find any (Unix) tools for operon prediction that are either (i) web-accessible or (ii) freely downloadable?
Homework
Do the following homework:

Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.