Protein Visualization Lab
- To find and visualize functional residues using ET and DARV, two different methods for amino acid conservation detection.
By the end of this lab...
By the end of this lab, you should be able to:
- do basic protein manipulations in Chimera.
- perform an Evolutionary Trace using ETV.
- perform a DART visualization using DARV.
Part 1: Background
Before we start, anything in parentheses or hyperlinked can be ignored unless otherwise stated.
Protein Structure Data
To analyze the structures of proteins, we need to know the positions of the atoms that make up the protein in three-dimensional space. The most common method used to do this is x-ray crystallography. (Other methods, such as NMR spectroscopy and electron microscopy, are also used. X-ray crystallography is a labor-intensive process, which you can learn more about here.) Once the positions of the protein atoms have been determined, this information is distributed to the rest of the world in a data file. The most common format for this data file is the pdb file format. There are other formats like mmCIF (Macromolecular Crystallographic Information File), but pdb is the one you will most likely encounter in the computational biology field. We'll stick with it.
Now that we understand "what" we need, we move on to "where" to find it. A major resource for crystal structures in the pdb format is the Protein Data Bank. You've actually seen this database already in the "Biological Databases Lab." Go here and look at section 3 if you need a reminder. Otherwise, we can go straight to the Protein Data Bank and get to work. Go ahead and click the link. You may want to open a new window or tab at this point.
We're going to "look" at the Src SH2 domain for our example. The Src SH2 domain is one part of a larger tyrosine kinase protein known as Src. (Tyrosine kinases, remember, add phosphate groups onto specific tyrosine residues in a protein chain. These phosphates act as signals that activate processes related to cellular structure, cell communication, and cellular growth. Src's connection to cellular growth is what led to its discovery. The src gene was originally isolated from the Rous sarcoma virus, which is known to cause tumors in chickens. Because of this fact, the src gene is known as an oncogene.)
So, let's go ahead and find the Src SH2 domain in the Protein Data Bank. Now we're at the home page of the "RCSB PDB." (RCSB stands for "Research Collaboratory for Structural Bioinformatics," by the way. Find out more here.) Let's go to the search box at the top of the page and type in "src sh2 domain" to look for our protein. The search returns a long list of structures. Many of these are Src SH2 domains from other species. Some are from the same species, but have been crystallized by different methods or with different ligands. Many are just redundant structures. We want a structure from Waksman, et al. with the pdb id "1SPS". Search directly for that id now, which will take you to the Structure Explorer page. There you'll find a lot of information relating to the specific protein that was crystallized. Go ahead and browse through it, but what we want is the pdb file. The upper right corner of the center tile, where it says "1SPS," has a little icon that looks like a piece of paper. Clicking there will either display the contents of the file or download the file. If the file only opens in your browser and you'd rather have it downloaded onto your computer, go to the left side of the screen and click on the arrow next to where it says "Download Files." Then click on PDB text file. A download window should appear. If you're asked, save it into the appropriate folder with a helpful name like "1SPS.pdb." Keeping the extension ".pdb" is a good habit.
Good, now we have a pdb file. If you're interested in another protein, go ahead and search for it the same way as we did for the Src SH2 domain. Now, let's go ahead and see what to do with our pdb structure.
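If you'd rather fetch the file from a script than click through the website, here is a minimal Python sketch. The URL pattern (files.rcsb.org/download/) is an assumption that is not part of this lab, and the function names are made up for illustration.

```python
import urllib.request


def pdb_download_url(pdb_id):
    """Build the (assumed) RCSB download URL for a four-character pdb id."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a valid pdb id: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def download_pdb(pdb_id, destination=None):
    """Save the pdb file locally, keeping the .pdb extension."""
    url = pdb_download_url(pdb_id)
    if destination is None:
        destination = f"{pdb_id.strip().upper()}.pdb"
    urllib.request.urlretrieve(url, destination)
    return destination


# Usage (needs network access): download_pdb("1SPS") should write 1SPS.pdb
# into the current directory.
```

The download step is kept in its own function so that the URL can be checked without touching the network.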
Alright, now we know a little bit about protein structure data. First off, pdb files are text files. Each file contains positions in three-dimensional space for every atom in the protein it describes. So, to translate this information into something more comprehensible to our visual senses, we need a program that will turn all the positions into actual points in space. Conveniently (for you), there are quite a few programs that do this already, such as PyMOL, RasMol, Chimera, Jalview, and many others. For this lab, we are going to use Chimera.
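To make the "pdb files are text files" point concrete, here is a small Python sketch that pulls atom positions out of ATOM records. It assumes the standard fixed-column layout of the pdb format; the three sample lines are made up for illustration and are not taken from 1SPS.

```python
def parse_atoms(pdb_lines):
    """Return one dict per ATOM record, read from the fixed pdb columns."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skip HETATM, REMARK, and every other record type
        atoms.append({
            "name": line[12:16].strip(),     # atom name, e.g. CA
            "residue": line[17:20].strip(),  # three-letter residue name
            "chain": line[21],               # chain id, e.g. A
            "number": int(line[22:26]),      # residue sequence number
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


# Made-up sample records (not real 1SPS coordinates).
SAMPLE = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504",
    "ATOM      2  CA  ALA A   1      12.560   6.210  -6.320",
    "ATOM      3  CA  GLY B   2       1.000   2.000   3.000",
]

atoms = parse_atoms(SAMPLE)
alpha_carbons = [atom for atom in atoms if atom["name"] == "CA"]
for atom in alpha_carbons:
    print(atom["chain"], atom["residue"], atom["number"], atom["xyz"])
```

Keeping only the atoms named CA is the same idea as the chain trace display you will use in Chimera below.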
The following tutorial is an adaptation of the introductory tutorial that can be found in the User's Guide. You may use the original tutorial instead of this one, if you prefer.
Start Chimera by opening up a terminal window and typing:
Once the basic Chimera window appears, you can also open up the Side View tool, which is useful for scaling and clipping.
... Viewing Parameters... Side View. (In some versions of Chimera, Viewing Parameters is called Viewing Controls.)
Rearranging and resizing windows and tools can be done using the normal drag/drop and corner pull methods that you may be familiar with already.
Alright, let's open our structure, 1SPS.pdb.
Navigate the resulting dialog by moving to the location where you downloaded your pdb file and choose it. The structure will appear in the main window and in the Side View tool, although it will be much smaller in the Side View tool. In the Side View, move the small yellow square (known as the eye position) and the yellow vertical lines (known as the clipping plane positions) around by clicking on them and see what happens. (Note that the Side View "normalizes" itself, so the yellow square seems to return to its original position after being moved, even though the change affects the main window.)
Let's simplify the display.
... chain trace only
You should now be looking at only the alpha-carbons (CA), connected to each other as a chain trace. Go ahead and play with the picture a bit. The left-click controls rotation, and the middle-click controls 2D translation. Remember, you can also adjust the image by changing the Side View tool.
Now, let's thicken the lines.
... wire width
You can select an atom or a bond by left-clicking while holding down the Ctrl key. To make additional selections without losing the previous selections, hold down the Shift key in addition to the Ctrl key while left-clicking. Try to pick out three alpha-carbons, one from each peptide. The selection is highlighted in green, and its contents are reported on the button near the lower right corner of the graphics window.
The Actions menu applies to whatever is selected. When nothing is selected, the Actions menu applies to everything.
Let's label our atoms by atom name.
Now let's label them by residue name and number. (First, we turn off the previous label.)
... name + specifier
Each residue label is of the form:
It is now evident that one peptide is chain A, one is chain B, and the other is chain C. Clear your selections by picking a region inside the main window away from any atoms. (Or alternatively, from the menu: go to Select... Clear Selection.)
Let's undisplay the residue labels:
Alright, it turns out the three chains are the same. There's some use in keeping two, but we don't need the third. Let's just hide it for now.
Oops, there's still something left. That's the ligand. Let's hide that, too.
Excellent! Now just rearrange the remaining two peptide chains so that they fit nicely in the main window. (Don't forget the Side View tool--it may help.)
Let's color the two chains different colors.
Repeat the process to color chain B yellow. Another way to select an entire chain is to pick an atom or bond in the chain and then hit the up arrow key twice, once to expand the selection to the entire residue and another time to expand it to the entire chain.
There is actually another "chain" in this model, not currently displayed: water. This chain ID was assigned automatically when the structure was read in.
(Alternatively, the water could have been selected using Select... Structure... solvent or Select... Residue... HOH.)
Say we wanted to display all atoms of the A chain only.
... Clear Selection
To show the backbone only:
... backbone only
Only the A chain's backbone is displayed because the A chain was selected when the action was performed.
To display all the atoms and color them according to element:
... Clear Selection
... by element
Generally, each structure opened is treated as a model in Chimera. Models are listed in the left side of the Model Panel (Tools... Inspectors... Model Panel; in some versions, "Inspectors" is known as "General Controls"). A checkbox in the Active column of the Model Panel shows that the model is activated for motion; unchecking the box makes it impossible to move the model. Checking the box again restores the movable state. Make sure 1SPS.pdb is highlighted on the left side of the Model Panel (if not, click on it) and then click close in the list of functions on the right side. Next, use the Close button at the bottom to close the Model Panel.
Next, try some different molecular representations. They can be translated, rotated, and scaled interactively. Multiple representation types can be combined with each other and with surfaces (more on surfaces below). Remember that when nothing is selected, the Actions menu applies to everything.
... Clear Selection
Finally, let's have some fun with surfaces. There are built-in categories within structures such as main and ligand; when nothing is selected, Actions... Surface... show displays the surface of main. Surfaces can be translated, rotated, and scaled interactively.
... Clear Selection
A Chimera session may be ended using File... Quit.
Okay, enough with the background information already. Move on to part 2 and let's start looking at functional residues.
Part 2: Functional Residue Analysis
Excellent, now we know a little something about protein structures and how to visualize them. Who cares? What's the point? Well, there are a lot of things that we can do with structure knowledge and visualization tools. We can look at secondary structures like alpha helices and beta sheets. We can look at tertiary structure and see how those secondary structure motifs interact with each other. We can make pretty pictures and post them on our Facebook pages. One of the areas that particularly piques the interest of some is the active site, or more specifically the functional residues. Now, as a concept, the notion of the active site is well understood. It is the "pocket" in which the protein's ligand binds. However, when asked on a per-residue level, the question becomes a bit more ambiguous. Are functional residues those that interact directly with the ligand? Are they the residues that form the "pocket"? Are they all the residues that "live" within a certain proximity of the ligand? Also, from an in vitro perspective, the added ambiguity of multiple ligands creates more complexity. All of these questions are important, and one way of answering them is through the notion of sequence conservation or mutation rate.
In this lab, we're going to take a look at two methods of active site visualization.
Evolutionary Trace (ET)
Evolutionary Trace (ET) is a method originally developed by Lichtarge et al. to detect functionally relevant amino acid residues. Since its publication in 1996, Olivier Lichtarge's lab at the Baylor College of Medicine has used and refined the method. You can read the abstract to the original 1996 paper here.
ET relies on two assumptions about functionally important amino acid sites. One, these sites maintain a constant or near-constant position in the protein. And, two, these sites mutate more slowly than non-functionally important amino acid sites.
Implementing the ET method requires a multiple sequence alignment (MSA) and a tree for a protein of interest as inputs. Once you have an alignment and a tree, you can perform a "trace" on the data. The figure below shows you what the tree will look like. Don't worry if you can't see the details. This is just to give you some idea of your starting materials.
Lichtarge(1996) Figure 1:
This next figure demonstrates how a "trace" is actually performed. First, take a look at the tree. You'll notice that it is separated into groups using different colors. Each of these groups is a functional group. In the figure below, three are chosen and numbered in the first column. Notice that there are four proteins in the first and third groups, and three sequences in the second group.
Lichtarge(1996) Figure 2:
In the second column, the sequences for each group are condensed into a single consensus sequence. All columns that show no variation are labeled with the respective residue. For example, in the first column of the first group, "A" (for alanine) shows no variation. So, the column is labeled with an "A" (in red). The same is true for the next column, "E" (for glutamate). In the third column, the fourth sequence has a "K" instead of an "R". The variation in this column is recorded as an underscore (_) in the consensus sequence.
Once a consensus sequence is constructed for each group, the next step is to construct a trace. First, you align the consensus sequences, and then you look for variation in each column. This is similar to how the consensus sequences were constructed, but there is a slight change. Just as for consensus sequence building, you label each column that shows no variation with the respective residue. For example, the "T" and the "K" columns show no variation, so we say that these residues are "invariant." So far so good, nothing's different. But the next step is a little different. Here, we look at any remaining column that contains only amino acid residues (no underscores or periods), and we label those columns with "x"'s. These residues are known as "class-specific." We ignore any columns that have non-residues in them.
Finally, the conserved (blue) and class-specific (red) residues are mapped onto the appropriate structure.
And that's essentially it. You now know how to do an Evolutionary Trace on a protein.
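The consensus and trace steps described above can be sketched in a few lines of Python. This is only a toy illustration of the rules as stated in this lab, not the code that ETV runs; the function names, the "-" marker for ignored columns, and the three small groups of sequences are all made up.

```python
def consensus(group):
    """Condense one group of aligned sequences into a consensus sequence.

    A column with no variation keeps its symbol; any variation becomes "_".
    """
    return "".join(
        column[0] if len(set(column)) == 1 else "_"
        for column in zip(*group)
    )


def trace(consensus_sequences):
    """Label each column of the aligned consensus sequences.

    residue letter = invariant, "x" = class-specific, "-" = ignored
    (the "-" marker is a choice made for this sketch).
    """
    labels = []
    for column in zip(*consensus_sequences):
        if not all(symbol.isalpha() for symbol in column):
            labels.append("-")        # contains an underscore or a period
        elif len(set(column)) == 1:
            labels.append(column[0])  # same residue in every group
        else:
            labels.append("x")        # residues only, but they differ by group
    return "".join(labels)


# Three made-up groups of aligned sequences.
group1 = ["ATRLK", "ATRMK", "ATRLK"]
group2 = ["GTRVK", "GTRVK"]
group3 = ["ATHVK", "ATHIK"]

consensus_sequences = [consensus(g) for g in (group1, group2, group3)]
print(consensus_sequences)         # ['ATR_K', 'GTRVK', 'ATH_K']
print(trace(consensus_sequences))  # xTx-K
```

In this toy example the "T" and "K" columns come out invariant, the first and third columns are class-specific, and the fourth column is ignored because two of the groups vary there.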
A Real Trace
Great! Now, let's do one. Don't worry, it's not as bad as it looks. We're going to let the program we'll use do the alignment and tree building for us, but we still need to get a list of sequences to align. One way to get these for our SH2 example is to go to the SH2 PFAM profile and download the seed multiple sequence alignment for the profile in fasta format (without gaps; you have to go to "further alignment options"). You also need to paste the fasta sequence from the 1SPS pdb file (available on the RCSB page for 1SPS under downloads) at the top of the fasta file, rename it to just "1SPSA", and delete the first residue, Q (the first residue doesn't appear in the crystal structure, and this confuses the trace program we'll be using). Or, you can grab the prepared file from here.
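If you build the fasta file yourself, the editing steps above (put the query first, rename it to "1SPSA", drop the first residue) can be scripted. Here is a minimal Python sketch; the function names are hypothetical, and the short sequences are placeholders, not the real 1SPS or PFAM sequences.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from fasta-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def prepend_query(query_fasta, seed_fasta, new_name="1SPSA", drop_leading=1):
    """Put the renamed, trimmed query sequence at the top of the seed file."""
    query_sequence = parse_fasta(query_fasta)[0][1]
    trimmed = query_sequence[drop_leading:]  # e.g. drop the leading Q
    return f">{new_name}\n{trimmed}\n{seed_fasta}"


# Placeholder sequences for illustration only.
query = ">some long header from the RCSB page\nQACDE\nFGH\n"
seed = ">seqA\nACDEFGH\n>seqB\nACDEYGH\n"

combined = prepend_query(query, seed)
print(combined)
```

The name written after the ">" is what you will later type into the "Query sequence name" box, so it has to match exactly.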
Now, we'll go here to the Evolutionary Trace Viewer. This is a user-friendly interface that implements the Evolutionary Trace method. Follow the link, and click on the title "Evolutionary Trace Server 2.0" under the BCM picture.
Go to the Utils menu and choose the ET Wizard.
ETV will ask if you are running Evolutionary Trace via a local or a remote server. Choose Remote (default) from the pull down menu and click OK to go to the next screen.
Then, there will be an access agreement. Choose I agree from the pull down menu and click OK to go to the next screen.
The next screen asks for a structure file. Choose the Download PDB File option and type in the pdb id in the box labeled "PDB code." Now, it's important to specify a chain. For example, 1SPS has six chains. You can only run a trace on a single chain. Since the three main chains are all the same, we just need to pick one. Let's choose chain "A." So, in the "PDB code" box, type "1SPSA" and then click Next to go to the next screen.
The next screen asks for a custom sequence list. This is where you input the FASTA file you made earlier. Click on Yes and insert the path to your FASTA file in the "Sequence file" box. In the "Query sequence name" box, enter exactly what appears after the ">" for the reference sequence in the FASTA file. In this case, "1SPSA" is the name of the reference sequence. Click Next to go to the next screen.
On the next screen, in the "Download path" box, provide a path to the directory in which you want to save your data. ETV will create a new folder in this directory and stuff it with a bunch of files to be used later. Click Next to go to the next screen.
The next screen offers some "Advanced" features. For now, we'll just use the default settings. (Go ahead and click on the Advanced button, if you want. You'll notice it gives you options to specify sequence extraction, BLAST performance, and rate matrix selection, among others. To get back to the previous screen, click OK.) Click on Finish and let the trace begin!
The ET Wizard will give you some feedback. If you get a message (after a while) that says, "Trace completed successfully," then you've completed a trace. Congrats!
Go to the directory you chose previously, and go find the new zip file that wasn't there before you ran ETV. It will be named after the protein, but then it will be followed by a bunch of numbers, e.g. 1SPSA1159253834909.zip. Extract the files.
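If you end up with several result archives after repeated runs, a short Python sketch like the one below can pick the newest one and unpack it. The function name is hypothetical; the only things taken from this lab are the naming pattern (protein name followed by numbers, ending in .zip) and the .etvx extension used in the next step.

```python
import zipfile
from pathlib import Path


def extract_latest_trace(download_dir, prefix="1SPSA"):
    """Unpack the newest <prefix>*.zip in download_dir; return its .etvx files."""
    archives = sorted(
        Path(download_dir).glob(f"{prefix}*.zip"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    if not archives:
        raise FileNotFoundError(f"no {prefix}*.zip found in {download_dir}")
    newest = archives[-1]
    target = newest.with_suffix("")  # folder named after the archive
    with zipfile.ZipFile(newest) as archive:
        archive.extractall(target)
    return sorted(target.rglob("*.etvx"))


# Usage: extract_latest_trace("/path/to/your/download/directory")
```

The returned list points at the file or files you can open from the viewer afterwards.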
Next, go back to the "ET Viewer v. 2.0" window. Click on the File menu and choose Open ETV Results from the drop-down menu. Go to the file directory that you specified earlier and look at the folder that you just extracted. Choose the file that has the .etvx extension.
Your protein will appear in the viewer window. Slide the blue cursor around and see what happens. What do you think is happening? Do you see any patterns?
Look in the ETV generated folder and see what other files might be useful. Remember, we're looking for functionally significant residues.
Once you're done looking at your results, move on to the next section.
DART Visualizer (DARV)
DART Visualizer (DARV) is a functional residue visualization method based on xrate. Originally developed by Klosterman et al., xrate is a computational tool that uses probabilistic models to estimate amino acid residue mutation rates. (Actually, xrate is a much more flexible tool that can be used for training and annotating different types of biological sequences using standard or custom-designed models. You can read the abstract to the paper here. If you're particularly interested in xrate and its applications, you should speak with Prof. Holmes. If you're wondering about the "DART" name, you can go here.)
DARV makes the same assumptions concerning functionally important amino acid sites as ET. Namely, (1) these sites stay in the same place, and (2) they mutate more slowly than functionally neutral sites.
Implementation of DARV requires (1) a multiple sequence alignment (MSA), (2) a tree, and (3) a model for a protein of interest. Inputs (1) and (2) should be understandable; they are the same as for ET. Input (3) is new. Let us try to explain.
ET is the implementation of one specific model to determine functional residue sites. The model divides sequences into functionally similar groups and identifies residues conserved among these groups as functionally important. Xrate is a more general tool: sequence analysis can be done with any number of models. The model you will use creates a variable number of rate bins in which each residue belongs to one and only one bin. The bins are ranked from slowest to fastest, and the residues in the slowest-ranking bin are predicted to be functional residues.
Here's a picture:
The details get a bit dense, and probably unnecessarily so. So, we'll see if we can just skip them. But, we can't leave without a little bit of explanation. So, keep in mind that there is a sorting algorithm that determines how quickly each amino acid residue is mutating and places each amino acid residue in the appropriate rate bin. Okay, that's it. Now, let's try it.
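To picture what the rate bins give you, here is a tiny Python sketch. It does not estimate any rates and is not how xrate works internally; it just assumes you already have one bin label per residue (the labels below are invented, with 0 as the slowest bin) and groups the residue positions by bin.

```python
from collections import defaultdict


def residues_by_bin(bin_labels):
    """Group 1-based residue positions by their rate-bin label."""
    bins = defaultdict(list)
    for position, label in enumerate(bin_labels, start=1):
        bins[label].append(position)  # each residue lands in exactly one bin
    return dict(bins)


# Invented labels for a 10-residue protein and 3 bins (0 = slowest).
labels = [2, 0, 1, 0, 2, 2, 1, 0, 1, 2]

bins = residues_by_bin(labels)
slowest = bins[min(bins)]  # residues in the slowest-ranking bin
print("predicted functional residues:", slowest)
```

With more bins, the slowest bin gets more selective, which is worth keeping in mind when you choose the number of bins below.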
Before we start, however, we need to set up environment variables.
For tcsh (You'll probably use this!), type in the following:
$ setenv DARTDIR ~be131/dart
$ setenv PATH "$PATH":~be131/dart/bin:~be131/dart/perl
$ setenv PERL5LIB ~be131/dart/perl
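For bash, type in the following:
$ export DARTDIR=~be131/dart
$ export PATH="$PATH":~be131/dart/bin:~be131/dart/perl
$ export PERL5LIB=~be131/dart/perl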
Don't know what shell you're using? Type:
$ echo $SHELL
Then, follow the directions that are appropriate.
Good. Now, we're ready. Let's make the model. There's a program that will do that for you. The only thing you need to specify is the number of bins you would like. For starters, let's try two bins.
$ ~be131/xgram-viz/phylohmmgenerator.pl -n 2
If everything is working properly, this will result in a lot of output. You should be seeing lots of parentheses and the word "mutate" repeatedly. The last line should also have the words "end alphabet Protein" somewhere.
If that works, then you're on the right track.
The next step is to save this output to a file. Before we do that, make sure you're in the directory where you would like to save your files. If you're not, go there!
Good, now let's use seven bins this time; it makes for better results. Why do you think seven bins are better than two?
To make a seven bin model, type the following:
$ ~be131/xgram-viz/phylohmmgenerator.pl -n 7 > mut7rates.eg
where, "mut7rates.eg" is the name of the file you want to save. You can name the file whatever you like, just make sure you keep track. Also, keeping the .eg extension is a good habit.
Remember, there are three things you need to perform this analysis. We just generated the model. The other two are a multiple sequence alignment (MSA) and the tree. The multiple sequence alignment is available at
. For future reference, if your MSA is not in Stockholm format (for example, MUSCLE writes its output as aligned fasta (.afa) by default), then you need to convert it to Stockholm before we continue. There's an online program here that will do it for you. Follow the instructions on the page.
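If you can't get to the online converter, the conversion is simple enough to sketch in Python. This is a bare-bones version that writes only the required parts of a Stockholm file (the "# STOCKHOLM 1.0" header, one "name sequence" line per sequence, and the closing "//"), with no annotation lines. The function name and the two example sequences are made up.

```python
def fasta_to_stockholm(aligned_fasta):
    """Convert aligned fasta text into minimal Stockholm-format text."""
    records = []
    name, chunks = None, []
    for line in aligned_fasta.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Stockholm names cannot contain spaces: keep the first word only.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    if len({len(sequence) for _, sequence in records}) != 1:
        raise ValueError("sequences must be aligned (all the same length)")

    width = max(len(name) for name, _ in records)
    lines = ["# STOCKHOLM 1.0"]
    for name, sequence in records:
        lines.append(f"{name.ljust(width)} {sequence}")
    lines.append("//")
    return "\n".join(lines) + "\n"


# Made-up aligned sequences for illustration.
example = ">1SPSA\nAC-DE\n>seqB some description\nACYDE\n"
print(fasta_to_stockholm(example))
```

The length check is there because Stockholm is an alignment format; unaligned fasta (like the seed file without gaps) has to be aligned first.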
Assuming that everything is in the right format, we now take our newly created model and our properly formatted multiple sequence alignment and load them into xrate, which is the sorting algorithm mentioned earlier. This will distribute our amino acid residues into our rate bins. Run xrate by typing the two commands below.
The first command makes a tree (this takes a bit to run; you can also just get the output file from the ~be131/labProtViz/ folder), and the second annotates the alignment:
$ ~be131/dart/bin/xrate -e ~be131/dart/grammars/nullprot.eg --grammar mut7rates.eg --noannotate sh2.stk > sh2_tree.stk
$ ~be131/dart/bin/xrate --grammar mut7rates.eg --annotate sh2_tree.stk > sh2_tree_annot.stk
So, now we have an annotated stockholm file. Annotated, in this sense, just means that each amino acid has been given a label that identifies its proper rate bin.
We need to tidy up a little bit before we can get to visualizing. Remember the reference sequence problem we had earlier with ETV? The same problem exists here. So, make sure you know the exact name of your reference sequence; otherwise, this might not work. Then do this:
$ ~be131/xgram-viz/xgram2dgc.pl --reference 1SPSA sh2_tree_annot.stk > sh2_tree_annot_tidy1.stk
where, 1SPSA is the exact name found in the alignment, sh2_tree_annot.stk is the output file from the previous step, and sh2_tree_annot_tidy1.stk is the output of this step.
We need one more tidy step:
$ ~be131/dart/perl/drop-gappy-columns.pl -g REFSEQ sh2_tree_annot_tidy1.stk > sh2_tree_annot_tidy2.stk
where, sh2_tree_annot_tidy1.stk is the output from the previous step and sh2_tree_annot_tidy2.stk is the output for this step.
Alright, we're almost there. Run this program to generate a Chimera-readable file.
$ ~be131/xgram-viz/chimera-parser.pl -n 7 -c A -o chimera --structure 1SPS.pdb sh2_tree_annot_tidy2.stk > MutRateA
where, sh2_tree_annot_tidy2.stk is the output file from the previous step, and MutRateA is the output from this step. Note that "1SPS.pdb" should include the entire path to the pdb file.
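Since each step reads the file written by the step before it, the whole chain of commands can be lined up in one place. The Python sketch below only rearranges the commands shown in this lab (same programs, same flags, same file names); the function names are hypothetical, and actually running it only makes sense on a machine where the ~be131 tools are installed.

```python
import os
import subprocess

DART = "~be131/dart"
VIZ = "~be131/xgram-viz"


def darv_steps(alignment="sh2.stk", grammar="mut7rates.eg", reference="1SPSA",
               structure="1SPS.pdb", bins=7, chain="A"):
    """Return (command, output file) pairs mirroring the commands in this lab."""
    return [
        ([f"{DART}/bin/xrate", "-e", f"{DART}/grammars/nullprot.eg",
          "--grammar", grammar, "--noannotate", alignment],
         "sh2_tree.stk"),
        ([f"{DART}/bin/xrate", "--grammar", grammar,
          "--annotate", "sh2_tree.stk"],
         "sh2_tree_annot.stk"),
        ([f"{VIZ}/xgram2dgc.pl", "--reference", reference,
          "sh2_tree_annot.stk"],
         "sh2_tree_annot_tidy1.stk"),
        ([f"{DART}/perl/drop-gappy-columns.pl", "-g", "REFSEQ",
          "sh2_tree_annot_tidy1.stk"],
         "sh2_tree_annot_tidy2.stk"),
        ([f"{VIZ}/chimera-parser.pl", "-n", str(bins), "-c", chain,
          "-o", "chimera", "--structure", structure,
          "sh2_tree_annot_tidy2.stk"],
         "MutRateA"),
    ]


def run_steps(steps):
    """Run each command, sending its standard output to the named file."""
    for command, output in steps:
        # subprocess does not expand "~user" the way the shell does.
        command = [os.path.expanduser(part) for part in command]
        with open(output, "w") as handle:
            subprocess.run(command, stdout=handle, check=True)


steps = darv_steps()
for command, output in steps:
    print(" ".join(command), ">", output)
```

Note that the number of bins passed to the last step should match the number of bins in the model file you generated earlier.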
This file can be loaded directly into Chimera. So, go back to your Chimera window, or relaunch it, if necessary.
First thing we need to do is load a structure.
Then, click on the 1SPS.pdb file.
Once the structure is loaded, define the attribute MutRateA.
... Define Attribute
The attribute MutRateA is now defined. Now, color according to this attribute.
... Render by Attribute
A new window will pop up. Change the "Attributes of" box to residues. Then, from the "Render" tab, select MutRateA from the "Attribute" box. Three bars will appear in the histogram below the "Attribute" box. You can adjust these to change the coloring system. You can also change the color of the bars by clicking on the bar and then clicking on the "Color" box below the histogram.
From this analysis can you identify the predicted functionally important residues? How do these predictions compare to the Evolutionary Trace method?
- 17 Nov 2008
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.