|
| -- DanielYang? - 07 Dec 2009
Protein Alignment Tutorial
Feel free to contribute/edit/correct anything.
This is all stuff I learned from Professor Sjolander's Bio E 144? Class
Some of the protocols are taken from her lab practicals.
1. Multiple Sequence Alignments
Multiple sequence alignments can be constructed given a few sequences to examine the conservation levels at particular regions/ amino acids. There are several algorithms out there to construct MSAs. I would suggest running either SATCHMO or MAFFT. Both are very easy to use and have online versions that require no pre-installation.
1. SATCHMO: SATCHMO uses a progressive alignment, meaning that it will align a sequence to your query, then align another sequence to the MSA you have just constructed, and so on without going back and changing any of the previous MSAs.
- Online server available at http://phylogenomics.berkeley.edu/cgi-bin/satchmo/input_satchmo.py
- Input: several FASTA files of sequences you would like to align (simply cut and paste into the input box)
- Output: a MSA. You can download this file into a text editor and save it as FILE.satchmo. If you are on a DECF machine. You can view and edit this alignment using a pre-installed alignment viewer called Belvu. At the command window type:
belvu FILE.satchmo
2. MAFFT: MAFFT uses a combination of progressive and iterative approaches. It first produces a progressive alignment as the intial alignment. Afterwards, it will go back and re-edit the alignments iteratively to produce refined MSAs. The performance difference between SATCHMO and MAFFT are not significant for our purposes.
- Online server availabe at http://align.bmr.kyushu-u.ac.jp/mafft/software/
- MAFFT is pre-installed on the DECF machines. To run mafft. You should first save the FASTA files you wish to compare into a single sequence file. Just open up a text editor and copy and paste in. Then save as NAME.seqs
- Now, to run MAFFT. at the command window, type:
mafft NAME.seqs > NAME2.mafft
- The produced .mafft file is a postscript file that you will again be able to view in Belvu
2. Gathering Homologs
I would suggest using BLAST or PSI-BLAST to gather homologs. The online version is available at: http://blast.ncbi.nlm.nih.gov/Blast.cgi
For protein alignments, you'll want to select the protein blast. This will lead you to a page where you can paste in your query sequence. Just copy and paste in your sequence in FASTA format. Under algorithm selection, I would suggest using PSI-BLAST instead of BLAST. The difference between PSI-BLAST and BLAST is that PSI-BLAST runs multiple iterations of BLAST. This allows you to find sequences that are not immediately alignable to your query sequence, but may be alignable to some intermediate sequence between your query and the sequences returned by PSI-BLAST. This lets you find more distant homologs that what you would discover using only BLAST. However, you don't want to run PSI-BLAST for too many iterations as you may run into the problem of profile drift or you will be gathering sequences that do not really have a meaningful homologous relationship with your original query protein.
If you would like to gather the homologous sequences you've found using BLAST into a single file. You can do this by running BLAST from the command prompt on the DECF machines. The command to do so is as follows:
blastpgp -i QUERY.FASTA -o HOMOLOGS.OUT -m 9 -j 3 -e 0.0005
UPDATE (12/8/09): I tried running PSI-BLAST through the command prompt. This step takes quite a while. Depending on your sequence, it may take ~30 mins to finish.
This means we are using a input/query sequence of QUERY.FASTA and blasting against the default NR database using
3 iterations and a e score cutoff of 0.0005. The gathered homologs will be output in a postscript file which we have named HOMOLOGS.OUT in the format specified using -m 9. I don't really understand the exact details of this format, but what we are really interested in are the sequence IDs only, we can extract these out using:
cut psi.out -s -d '|' -f 6 > HOMOLOGS.IDS
So at the end of this, you will get a postscript file named HOMOLOGS.IDS which contains the IDS of all your gathered homologs. So Now we can use the fastacmd (which we are familiar with from our Bio E 131 labs) to get the fasta sequences for these homologs and store that in another postscript file.
3. Constructing HMM
In the case when we have only a few sequences to align. There may not be too much meaningful information to be obtained from only aligning the sequences to each other. Instead, we can approach the alignment using a more sophisticated approach by building Hidden Markov Models (HMM). Hidden Markov model (Wikipedia) We may be able to build more meaningful alignments using this approach.
In this tutorial, we will use an iterative approach to build an HMM from an initial seed sequence. You will first need to gather homologs using PSI-BLAST as I've mentioned earlier The mathematics of building a HMM is complicated. Luckily, there's a pre-installed program on the DECF machine that will build HMM for us. The program is SAM w0.5. You can run it using the following command.
w0.5 input.fasta iter0.mod
Note that all the files we are dealing with here are post script files. I am simply choosing to call them .fasta to remind me that its a fasta sequence and .mod to remind me that its an HMM model. Now, you can use hmmscore to score the HMM against a database to evaluate your HMM. The command to do so is.
hmmscore iter0 -i iter0.mod -db homologs.seqs -sw 2 -dbsize 100000 -select_align4
The -sw 2 arguments indicates that we will be using local-local alignment, which allows us to look for conserved domains/regions. Running this will give you 2 post script output files: iter0.dist and iter0.a2m. You will now use the iter0.a2m as the input file to run w0.5 again to produce a new model and then again use hmmscore to score the model against your database of homologs. You can iterate this process multiple times to produce a HMM. Your HMM will be refined with each iteration. However, 3-5 iterations are generally enough to build a decent HMM.
After running the iterations (say 3), you can now align your other sequences of interest to the HMM to see if there are any conserved regions. You can do so using the align2model command. First though, it will be easier if you save all your sequences into a single file (ex. PROTEINS.fasta) The command for align2model is as follows:
align2model iter3-proteins -i iter3.mod -db proteins.fasta -sw2
This will return a postscript file titled iter3-proteins.a2m. This is your alignment to the HMM! You can now view and edit this alignment in Belvu in the same way as a multiple sequence alignment
4. Infering Information
From the alignments we can also infer function such as possible motifs, domains, and catalytic sites by examining the conservation patterns and also taking into account our prior knowledge about biology. There are also many prediction servers out there for domain architecture, catalytic sites, structure, etc. If you are interested, PFam (http://pfam.sanger.ac.uk/), which is a database of domains based on homology from MSAs and HMMs, is a good one to look at for domains. PHYRE (http://www.sbg.bio.ic.ac.uk/phyre/) is a good one to look at for structure prediction. However, keep in mind that bioinformatics analysis rely on principles of homology. However, we cannot safely infer function based on sequence homology alone. In the case of structure prediction, where there are no close homologous PDB structures, PHYRE predictions can be way off/meaningless. |