By the end of this lab, you should:
- be able to find similar proteins to your protein of interest
- have an idea of how to perform sequence alignments using available online tools
- know where to look for further help on using these tools
Scroll down to the end of the page for the (two-part) homework.
0. Before we get started...
- We need to have some sequences to work with in this lab. In the Biological Databases lab, we learned how to look up sequence information in one of the many databases available online. Look up the protein sequences of the genes of your choice and download them to your home directory as FASTA files.
Why not the nucleotide sequence? Well, many different triplet codons translate to the same amino acid and what ultimately determines the structure and thus, function of the protein is its amino acid sequence. So most of the time, it makes more sense to align protein sequences rather than nucleotide sequences. Obviously, this is not a strict rule -- there are definitely times when you want to align nucleotide sequences, e.g. for non-protein-coding sequences, regulatory regions of DNA, etc etc.
1. Finding similar sequences using BLAST
First of all, why do you want to find similar sequences? As an example, say you stumbled upon a new protein and somehow got a hold of its sequence. The only problem is that you have no idea what the protein's function is! One way to start figuring out the function using bioinformatics (instead of wet-lab experiments) is to find a bunch of other sequences that are similar to your protein. If other people already figured out the functions for that set of proteins, then you can infer
that your protein probably has a similar function, purely based on sequence similarity. Obviously, this is not a guarantee that your protein actually has that function (the ultimate proof is through experiments), but it's a pretty good start....
In lecture, we've been talking about aligning two sequences and assigning scores to an alignment such that it will give us information about how similar the two sequences are. We've also looked at the Needleman-Wunsch algorithm and started investigating the basis for the scoring schemes. While the Needleman-Wunsch algorithm is relatively easy to implement using dynamic programming, as Prof Holmes mentioned, it is quite time-consuming and memory-intensive to run on sequences of any significant lengths.
Yet, all bioinformatics students learn about this algorithm, because it gives a good intuition of how alignments can be performed and more importantly, what the scores mean. When you want alignments of "real" sequences, you would probably turn to some of the already available alignment tools. (But the idea behind the scores still hold - so that's why it's very important to understand what alignment scores means!)
Most of you have probably heard of BLAST
as a program for finding similar sequences, available online from the NCBI website. BLAST is linked to many of the sequence databases, so when you enter a query sequence, it essentially performs many local
alignments for you against the sequences in the database you choose and gives you the set of highest scoring alignments. Obviously, it's not performing thousands and thousands of Needleman-Wunschs or you would never get a result! In fact, BLAST uses a heuristics-based approach by first splitting up the sequence you enter into a bunch of "words", scanning for these words in the sequence database, and then using the search results to seed alignments extending from the found words. You can learn a lot more about the actual workings of BLAST this chapter of the NCBI Handbook
Having an available search tool available doesn't necessarily mean life becomes simple! Look at all the choices you have before you even enter a query sequence! Luckily, the people at NCBI have spent a lot of time writing documentation on BLAST, so check out this table
to see what the differences between all the various BLAST programs are. For now, just focus on the ones for protein queries.
- Since our goal here is to look for sequences similar to the query sequence and our query sequence is most likely longer than 15 residues, protein blast (blastp) seems like a decent choice. So select protein blast (blastp) from the table (or from the main BLAST page).
- You should now see a page with a form containing many boxes for you to fill in. The main text box labeled 'Enter accession number, gi, or FASTA sequence' is where you put in your query sequence. Not sure what kind of format the sequence should be in? Click on the link next to 'Enter accession number, gi, or FASTA sequence' for an explanation.
- How do we tell BLAST where to look for similar sequences? That's specified by the database we use for the query, chosen by the 'Database' dropdown box. To learn what the different databases are, click on the link next to the 'Database' dropdown. Choosing an appropriate database is an important step, since the results you get completely depend on which databases you choose to query. If you only cared about finding similar sequences in human, you may want to use the 'Organism' text box to narrow down or exclude organisms rather than searching the entire database, which would search all sequences regardless of organism.
- There are many other settings for blastp and I suggest clicking on the explanation for each to find out what that setting is for. If you're really itching to do your first alignment, just leave everything at its default value and click on the 'Blast' button.
- Once you submit your request, you'll be directed to a page with your Request ID. Wait a bit to let your request go through the queue. By default, BLASTP will also run a search against the Conserved Domain Database (CDD) and display the results graphically while it performs the Blast search. The graphic shows protein domains that may be present in your query. Click on the graphic to get more information about the search results. Here's a page with help on CDD.
- When the server finishes processing your request, it'll show you the set of sequences it found. Congrats! You just performed your first BLAST search! Below where the results show your query and its length is a graphic titled "Distribution of BLAST Hits". This graphic shows you at a glance what were the significant matches and where they match up with your query. You should be able to tell from this whether the matches are local or global with respect to your query. These matches are also referred to as "target" sequences. Scroll down a bit and you'll see a big list of whole sequences or sequence fragments that matched your query sequence.
- Associated with each sequence is a 'score', which is the pairwise alignment score between that sequence and your query sequence.
- You'll also see something called an 'E-value', which is the 'expect value', and essentially tells you how statistically significant your alignment is (more here). The E-value is calculated from the length of the query sequence and the database size. The smaller the E-value the more significant the match. A rule of thumb is that E-values less than 1e-3 are significant matches.
- Below the line showing the Expect value are three numbers identified as "Identities", "Positives", and "Gaps". Identities gives the number of exact matches between the query and the target sequence over the length of this alignment. A general rule of thumb for %id is that it should be 25-30% over an alignment of at least 80-100 amino acids to be able to assert that the sequences are homologous, meaning that they are evolutionarily related. If sequences are homologous, then you have a better case for asserting they share a common function. "Positives" takes into account both exact matches and conservative substitutions (ie, similar amino acids). "Gaps" refers to the number of gaps in the alignment. Of course, the lower the number of gaps, the better the alignment.
- By default, BLAST will filter out low-complexity regions which have biased amino-acid composition (eg, sequences of repeated amino acids) because they will skew the results. In the output, BLAST will display these regions grayed out and in lower case.
Note that the alignment scores are based on the scoring matrix that BLAST used - for protein sequence alignments, this is a very important setting and how you choose an appropriate matrix depends on many things, including what kind of sequence similarities you want to detect, how long your sequences are, etc. Here's a longer explanation
of substitution matrices.
(If you're interested in a much more in-depth look at sequence alignments, here's a great BLAST tutorial written by Dr. Stephen Altschul, one of the main guys behind BLAST.)
The gene DCC
is deleted in colorectal cancer and is located on human chromosome 18q21.3. It encodes for a tumor suppressor protein. Expression of the gene is reduced significantly in most colorectal carcinomas. The protein sequence of human DCC has the Refseq accession number of NP_005206.
- Locate the Genbank record for this protein. Note the length of the amino acid sequence.
- Perform a BLASTP search using this protein as the query and Swiss-Prot as the target database. Limit the search to mammalian species only and use BLOSUM62 as the scoring matrix.
- The DCC protein from human is most closely related to the DCC protein from what other mammal?
- What percent identity do they share?
- What is their percent similarity?
- What is the length of the alignment? Were both proteins aligned along their entire length?
- Does the DCC protein contain any low-complexity regions that have been masked out by BLASTP? If so, where?
- Look at the results for protein with Swiss-Prot id of P97798.
- What percent identity does it share with the query?
- What is the alignment length? Is it a global or local alignment for the query and the target?
- Would you say this protein is homologous to the query?
- Look at the results for protein with Swiss-Prot id of O60469.
- What's the %id and length of the alignment? Can you assert homology?
- Is this a global or a local alignment with respect to the query? How about with respect to the target?
- Based on the BLASTP results, can any general observations be made regarding the putative function or cellular role of DCC?
2. Pairwise alignment using Exonerate
While BLAST is great for many applications, sometimes what you really want to do is align two sequences you have. Exonerate
is a great tool for this purpose. It can align sequences of nucleotides or amino acids. For this lab, you can use the copy installed in the
home directory. For example, if you have two FASTA files in your home directory named
, you can perform a very basic ungapped alignment of them by:
$ ~be131/exonerate/bin/exonerate one.fasta two.fasta
There are many many options available for running Exonerate (see the manual page
), including running the full exhaustive alignments like the Needleman-Wunsch example we did in class, specifying a gap penalty model, etc etc. Try out a couple of these options and see what happens. The beginner's guide
and advanced guide
are great resources for learning about Exonerate. (If you want to play around with Exonerate more at home, it should be pretty easy to install since they have binaries for most operating systems on the site.)
Try aligning the human DCC protein and the protein with the Swiss-Prot id of O60469 from the BLAST exercise.
- First produce a gapped Smith-Waterman-Gotoh alignment by typing:
$ ~be131/exonerate/bin/exonerate --model affine:local one.fasta two.fasta
- You will see all the fragments Exonerate can align. Exonerate is running in heuristic mode and the matches it picks up are not high-scoring enough. You can run it in exhaustive mode to get the highest scoring alignment:
$ ~be131/exonerate/bin/exonerate --model affine:local one.fasta two.fasta --exhaustive > exonerate.out
- Take a look at the positions of the highest scoring alignment. How does it compare to the alignment returned by BLASTP?
3. Multiple alignment using MUSCLE
The natural extension from pairwise alignment is to do alignment of more than 2 sequences; ie, multiple alignment. It is possible to use Exonerate and other pairwise aligners to do multiple alignment by wrapping them in a custom program. However, the easy way is to use a program already built for this purpose. One example is MUSCLE
, a fast multiple alignment program strictly for protein sequences. By default MUSCLE will produce a gapped alignment. Here's how you call MUSCLE to align sequences in FASTA format in file input.fa and output them in output.fa:
$ muscle -in input.fa -out output.fa
There are many other multiple alignment programs. Some popular ones are ClustalW
. A newer program that reportedly gets good results is PROBCONS
Myosin is a protein that is important for muscle function. Myosins from several organisms have been sequenced. Download the following protein sequences in FASTA format and place them in one file: MYH1_HUMAN, MYH3_RAT, MYH3_MOUSE, MYH4_RABIT, MYH6_MESAU, MYH6_MOUSE, MYH7_PIG, MYH9_CHICK. Construct a multiple alignment of these sequences using MUSCLE. View the multiple alignment in Belvu (we covered Belvu briefly in the UnixTutorial
- What is the length of the alignment?
- The sequence for MYH3 from mouse is a sequence fragment. Does this fragment align to the N- or C-terminus of MYH3 from rat? What is the %id between these two sequences? (Use the "Sort->By identity to highlighted sequence" menu option. Right click on the Sort button to see the menu.)
- Which two sequences in the alignment are most similar to each other? Which are the least similar? (Use the "File->Compare all vs all, list identity" menu option. The output will appear in the command line.)
- The human sequence is most similar by %id to the sequence from which organism? Use the "Sort->By tree order" to generate a neighbor-joining phylogenetic tree for the sequences. Locate the human sequence in the tree and note the organism to which it is most closely related. Does it confirm the similarity by %id?
- There is a highly conserved block of ten amino acids within the first 200 positions of the alignment. What is the sequence of that block? (In the default Belvu color scheme, the aqua blue denotes the highest conservation, followed by dark blue and gray.)
- Manually remove the MYH3 sequence from mouse. After doing so, how many sequences remain if sequences that are more than 90% related to one another are removed from the alignment? (Use the "Edit->Make non-redundant" menu option.)
4. Protein Databases and Prediction Servers
But what if the sequence you're interested in just isn't similar enough to any other known sequences, so by sequence alignments, you can't really figure out what your protein does? Earlier in class, Prof Holmes mentioned that there are a number of databases which contain information about protein families, domains, motifs, etc. Some of these databases can also help you find similar proteins.
Let's first take a look at PROSITE
. Either by pasting in a query sequence or an accession number of the sequence, you can scan the database to see if your protein contains any of the known protein domains and sites, with the idea that this will help you figure out what the function of your protein is. Depending on what your protein is, you may/may not have detected some binding sites, phosphorylation sites, etc etc.
Another thing you can do with your protein sequence is to try to predict its secondary structure using one of the few prediction servers: PSI-PRED
, and JPRED
. On most of these sites, you submit your protein sequence along with your email address, and when the results are ready, you get a notification email.
Finally, there are a couple of databases which contain classifications of structural elements in proteins, such as SCOP
. These databases try to group families of proteins together by common structural elements, so you can view all the proteins with coiled-coil domains, for example.
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback