By the end of this lab, you should have:
- visited some of the more commonly-used databases in computational biology
- learned how to search for information in these databases
- gotten an idea of the huge amount of biological data that's out there!!
0. Before we get started...
- This lab only requires the use of a web browser. To make things easier, it might be good to have two browser windows open - one kept on this page so you can follow along with the instructions and the second one to do the actual navigating. Alternatively, you can open up a new tab in Firefox by using the File->New Tab menu option.
- This lab is actually pretty short if you just follow the instructions below. But since we obviously can't cover everything about any particular database, you should take the opportunity to look and play around on some of the sites that look interesting to you. What are all the search options? Are there tutorials that might come in handy later? (For example, the PubMed tutorial?) How do you get additional help?
1. Finding information
Most likely many of you, especially the biologists among you, have heard of a protein from jellyfish called GFP (or Green Fluorescent Protein) before. It's one of the most indispensable proteins in biology today as a reporter for all sorts of things, like protein expression, protein localization, etc. Since it's such a useful protein, let's see what we can find out about it, using all the websites and databases we have at our disposal.
So let's start with some general information about GFP. We probably don't need to teach you how to use Google or Wikipedia, so we won't. But aside from those two sources of information, we should also think about looking into scientific literature databases such as PubMed:
- Open up a browser and go to NCBI's PubMed site
- Type in a search phrase - something like "fluorescent protein" should be good - into the textbox at the top of the page
Whoa! So many listings! And if you check out the titles of some of them, "fluorescent protein" is not even in there!?! But before you call up NCBI to complain, remember that fluorescent proteins are widely used as reporters in biology, so lots of papers probably are related to the topic of "fluorescent protein", just by virtue of having used one in some of their experiments. But for our purposes, that's not too useful, so we should refine our search a little bit. Sometimes, researchers will write review articles that will summarize a lot of findings - nice! To view only the review articles related to fluorescent proteins, simply click on the Review tab a little bit below the search box.
Again, we've got lots of titles, but if you look closely at the 2nd line of each result, you'll notice that they're all marked as reviews. But what if we still don't want to read through all these articles?
Maybe we can find a review article in a journal we like, say Science. But if we just try the search phrase "fluorescent protein science", we won't actually be searching in the journal Science. This has to do with the way PubMed searches are done (you can learn about this in the PubMed tutorial), so we actually have to qualify our search term.
- Search for "fluorescent protein AND Science[ta]", then click on the Review tab. Note that the AND really does need to be in all caps for this to work properly.
The [ta] immediately following the word Science specifies that we want the journal title to be "science". Now we see that someone wrote a review on fluorescent proteins and got it published in Science a while back. That would be a good place to start to look for information.
Instead of relying on shortcuts like [ta], we can also turn to the Limits tab of the search page. Here, we can specify a number of useful restrictions on our search such as publication type. In addition, we can use the Advanced search to build our search string specifying keywords, title, author and/or journal.
- Click on the Advanced search tab. All Fields should already be selected in the Search Builder; type "fluorescent protein" in the text box, then click the Add to Search Box button. Now select Journal from the Search Builder drop down menu. You will see that as you begin to type the journal name "Science", suggested journals will appear. Select the first suggestion, "Science (New York, N.Y.)", and click the Add to Search Box button. You have constructed your search string and can now click Search.
PubMed is probably the literature database that most biologists use. But you should be aware that since it's targeted at that audience, it only has articles from biology- and medical-oriented journals. Fields outside of biology have different ways of searching their fields' journals and we won't be discussing them here.
But have no fear.... you can always use Google Scholar, which searches a very broad range of journals, and provides the additional advantage of allowing much craftier searches, if you know what you're doing (look up "google search hacks" sometime for a flavor of what you can do ...).
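For the curious, the qualified PubMed searches from this section can also be written down as plain strings. The snippet below is only a rough sketch: the function names are made up for illustration, the Review[pt] publication-type tag and the NCBI E-utilities "esearch" address are not part of this lab, and the code only builds the search string and URL without contacting NCBI.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Assumption: NCBI's E-utilities search endpoint (not used in this lab).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(phrase, journal=None, reviews_only=False):
    """Join search terms with an all-caps AND, as PubMed expects."""
    parts = [phrase]
    if journal:
        # [ta] restricts the preceding word to the journal title field
        parts.append("%s[ta]" % journal)
    if reviews_only:
        # [pt] restricts the preceding word to the publication type field
        parts.append("Review[pt]")
    return " AND ".join(parts)


def build_esearch_url(term, retmax=20):
    """Return (but do not fetch) a URL that would run the search."""
    params = [("db", "pubmed"), ("term", term), ("retmax", str(retmax))]
    return ESEARCH_URL + "?" + urlencode(params)


query = build_pubmed_query("fluorescent protein", journal="Science",
                           reviews_only=True)
print(query)  # fluorescent protein AND Science[ta] AND Review[pt]
print(build_esearch_url(query))
```

Pasting the printed query into the PubMed search box should behave like the search-plus-Review-tab steps above, though the website remains the easiest way to explore.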
2. Finding sequences
OK, we've found some general information about GFP (or at least we know where to look for some). But what about its sequence? To find sequence data, there are a couple of sites we can use. First, let's try out the NCBI Nucleotide database.
- Search for "GFP", and select the "Core Nucleotide" results (this is selected by default).
Again, we get way too many results to be useful. To filter the results further, we should add more search terms. Let's add the organism from which we got GFP, Aequorea victoria, and search again.
- Somewhere around the sixth page in the list of results, you should see a promising match, something like A.victoria mRNA for green fluorescent protein. Take a look at that entry and check out what sort of information is available. The data is currently shown in the GenBank format, which is discussed in class. A few key things to note about the GenBank record:
- The first line labelled LOCUS tells you that this is an mRNA record, gives the date the record was entered into the database, and specifies the number of basepairs as "714 bp".
- The SOURCE and ORGANISM fields tell you the name of the organism and its taxonomy (species and so forth)
- The FEATURES section gives more information about the sequence. In this case, you will see in the CDS (coding sequence) field, the amino acid translation of the nucleotide sequence. For DNA sequences this features section can be lengthy and includes the locations of introns and exons for a gene.
- Finally, the ORIGIN section gives the nucleotide sequence with the numbers corresponding to nucleotide position.
- Using the Display dropdown box at the top, you can change to various other formats, FASTA included. The nearby dropdown box allows you to show the data as text only or to save it to a file.
- Go back to the GenBank format and find the field titled 'ACCESSION'. Copy this accession number for GFP because we'll use it soon.
Let's check out another site where we can get sequence information: the EMBL database in Europe.
- Go to EMBL and, using the accession number you got from NCBI, search for the GFP sequence. Expand the "Nucleotide Sequences" section, and choose the EMBL-Bank option. Finally, click on the accession number.
You'll probably notice that the entry you get back has pretty much the same information as the site from NCBI. That's because NCBI and EMBL (as well as the DNA Databank of Japan) work as a team to collect sequence data, and then share them with each other daily.
What about protein sequences? Of course, one way to get the protein sequence is to just translate the DNA sequence you got above using a codon table (which has been done for you in the EMBL site). But another way is to utilize some of the protein sequence databases that are out there. (It's not just out of pure laziness that we want to do this either, since the protein sequence databases give you lots of other data as well). As you might expect, NCBI also has a protein sequence DB.
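To make the GenBank fields and the codon-table translation mentioned in this section a bit more concrete, here is a minimal sketch in Python. The record below is a made-up toy (it is not the real GFP entry, and its accession "TOY000001" is invented), the function names are hypothetical, and the parser only handles the handful of fields discussed above. The translation uses the standard genetic code.

```python
# A made-up, GenBank-style toy record (NOT the real GFP entry).
TOY_RECORD = """LOCUS       TOY000001                 15 bp    mRNA    linear   INV 01-JAN-2000
ACCESSION   TOY000001
ORIGIN
        1 atggccaaat ggtaa
//
"""


def parse_genbank(text):
    """Pull the accession, stated length and sequence out of a record."""
    record = {"accession": None, "length": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            # the sequence length is the token right before "bp"
            record["length"] = int(fields[fields.index("bp") - 1])
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # drop the leading position number, keep the bases
            record["sequence"] += "".join(line.split()[1:])
    return record


# Standard genetic code, built by looping over the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {}
index = 0
for b1 in BASES:
    for b2 in BASES:
        for b3 in BASES:
            CODON_TABLE[b1 + b2 + b3] = AMINO_ACIDS[index]
            index += 1


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


record = parse_genbank(TOY_RECORD)
print(record["accession"], record["length"])
print(translate(record["sequence"]))  # MAKW
```

A real record has many more fields (FEATURES, SOURCE and so on), so treat this as an illustration of the layout rather than a general-purpose parser.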
- You should get another list of results but if you look carefully, they're different from the DNA sequence results. Click on one of them and check out what kind of information is available in a protein sequence.
Reading through the entry is sometimes very helpful. For example, can you see the difference between this protein sequence for GFP and this one? (Hint: Look at the comment line.)
Another popular protein sequence database is UniProt, which is actually a central repository that merges the data in three previously separate databases: Swiss-Prot, TrEMBL, and PIR.
With sequence in hand, we will be able to do a lot of things like looking for homologs, aligning sequences, etc. But of course, the sequence of a protein is not its whole story.
3. Finding structures
Protein structures are usually solved by labs using methods like X-ray crystallography and nuclear magnetic resonance (NMR) and then submitted to public databases. One such database is PDB. PDB gives you a lot of information about what a protein looks like when it's folded, and if you have the right software installed (e.g. Rasmol), you can even virtually explore the structure of a protein in 3D.
- Go to PDB and search for GFP. You can either type in "GFP" as the search phrase, which will give you another long list of results to look through, or just use GFP's PDB ID: 1EMA.
The summary page gives you some information about the protein in general and the method by which the structure was solved. Most often, we're interested in seeing the structure:
- On the right hand side of the Structure Summary page, you should see a picture of the structure. Click on one of the Display Options, for example Jmol or SimpleViewer. These applets allow you to rotate the structure by using your mouse (hold the left button down as you move the mouse). Note that browser plugins are required for some viewers.
Since the DECF computers should have standalone protein structure viewers like Rasmol or Chimera installed, you could also download the PDB file and look at the structure in there. The standalone programs are nice because they have more features and allow more flexibility in viewing the structure (coloring, etc).
You'll notice that GFP has a sort of cylindrical shape. In fact, it has a beta barrel fold, which is mentioned in lecture. Beta barrels are composed of a bunch of beta strands that line up as a sheet, which is then folded back upon itself to form a cylinder. Neat!
Another database with structural data is BioMagResBank, which contains only structures solved by NMR.
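If you do download a PDB file, it helps to know that it is just fixed-width text: each atom sits on an ATOM line with its name, residue, chain and x/y/z coordinates at set column positions. The sketch below reads those columns. The atoms and coordinates are made up for illustration (they are not taken from 1EMA), and the function names are hypothetical.

```python
# Fixed-width layout of a PDB ATOM line: serial number in columns 7-11,
# atom name in 13-16, residue name in 18-20, chain in 22, residue number
# in 23-26, and x, y, z in columns 31-38, 39-46 and 47-54.
ATOM_FORMAT = "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f"

# Made-up atoms (NOT from 1EMA):
# (serial, atom name, residue, chain, residue number, x, y, z)
TOY_ATOMS = [
    (1, " N  ", "MET", "A", 1, 0.0, 0.0, 0.0),
    (2, " CA ", "MET", "A", 1, 1.5, 0.0, 0.0),
    (3, " N  ", "SER", "A", 2, 3.0, 1.0, 0.0),
    (4, " CA ", "SER", "A", 2, 4.0, 2.0, 0.0),
]
TOY_PDB_LINES = [ATOM_FORMAT % atom for atom in TOY_ATOMS]


def parse_atom_line(line):
    """Slice one ATOM line by column position (Python slices start at 0)."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20],
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def alpha_carbons(lines):
    """Return the parsed alpha-carbon (CA) atoms, one per residue."""
    atoms = [parse_atom_line(l) for l in lines if l.startswith("ATOM")]
    return [atom for atom in atoms if atom["name"] == "CA"]


for atom in alpha_carbons(TOY_PDB_LINES):
    print(atom["res_name"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

Viewers like Rasmol and Chimera read these same columns to draw the structure, so a rough trace of the backbone is essentially the list of alpha-carbon positions.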
4. A whole genome
Finally, it's sometimes useful to be able to look at whole genomes. Unfortunately, as far as we know, Aequorea victoria's genome hasn't been sequenced yet, so we won't be able to follow our favorite GFP protein anymore. But there are many, many genomes that have been sequenced, yours included (well, not yours specifically, but the human genome). A couple of common genome sites are listed below. Try looking at the genome of your favorite organism on each site and seeing how each site displays the genome, etc. (If you can't think of one, you can use an interesting little bacterium, Vibrio fischeri.)
In addition to the above sites, a lot of organisms have their own specific sites, usually hosted by lab(s) that are working on that organism. One list of such sites can be found at the 123Genomics website. Sometimes, sites dedicated to a particular organism or a set of related organisms are particularly useful, as they present very well linked and annotated data. Let's see what that means with a practical example:
- Go to Entrez Genome and search for "Escherichia coli K12". Follow the link for the "Reference genome" labeled "Escherichia coli str. K-12 substr. MG1655".
- Go to the bottom of the page under the "Genome Region" section where you see a box with many green dots inside of it. Choose the "Graphics" link next to where "Nucleotide" is already chosen.
- In the new view, find the "Tools" button, hit it, and choose "Search." Search for AdhE (an enzyme near and dear to many of our hearts). Choose the result so it shows up in the main viewer, close the search box, and then fiddle around with the result. Click on it, zoom in, zoom out, and play around a bit. Note the somewhat minimal information you get from this.
- Now go to the coliBASE website and search for AdhE (by gene name).
- Select the AdhE in Escherichia coli K-12 MG1655
- Note the plethora of information available now! Stay on this page for a bit ...
5. Metabolic and Regulatory databases
So what if you wanted to know exactly what reaction AdhE catalyzed, or how its expression is regulated? Have no fear, EcoCyc is here! You can do this by going to the EcoCyc website and searching for AdhE. If you explore this page, you can discover that AdhE converts an alcohol to an aldehyde or ketone, and reduces NAD+ in the process. You can also see that expression of the gene for AdhE is controlled by something called NarL (just click on it for more information ...)
As demonstrated by the above example, there are not only databases that have genomic sequence information, but also databases that describe what reactions various gene products carry out, and how those genes are regulated. For a more complete list of such databases, see the appropriate 123genomics site. These kinds of information-rich, highly structured databases are only possible because of ontologies (discussed in lecture) that delimit how various entities interact, and what can happen when they do.
Given the plethora of information out there, it's important that you familiarize yourself with as many resources as you can, so you have the right tools at hand when you want to do some interesting science/computations!
Preview of next week
Next week, we are going to begin programming in earnest. Our language of choice will be Python, a language quickly becoming the primary language for bioinformatics purposes.
While Professor Holmes will discuss high-level programming concepts in lecture, he will not delve into much of the nuts and bolts of actually writing Python code; that will come during lab time.
To prepare for this, I strongly encourage you to take some time before next week's lab and go through a basic tutorial and acquaint yourself with Python. The website Codecademy, an instructive and interactive programming tutorial website, contains a Python track with a handful of introductory lessons. While I will try to give you some of the basics in the lab, going through at least the first two or three (preferably three) of these tracks will be highly advantageous to understanding the basic nature of Python and how to use it. If you have never programmed before, I strongly recommend you go ahead and work through these on your own time. They are not too long, they are very user-friendly, and being familiar with these concepts will allow you to spend your time in lab next week on more substantive content. Even if you have programmed before but in another language, this discusses a lot of elements of Python in terms of how they differ from other languages - noting those differences will be valuable to you.
If you choose another tutorial or want to begin fiddling with Python outside of the boxes on these websites, there are a few different ways to do it.
- ON THE LAB COMPUTERS: Simply type "python" at the terminal to enter the Python interpreter - essentially a Python command line.
- ON YOUR OWN COMPUTER: You can install Python onto your own computer from the main Python website. Make sure you install Python 2.7.3 if you do this - not Python 3.x. It will come bundled with IDLE - a development environment that has a Python command line (like the one described above for the UNIX terminal) and the ability to edit and save scripts. For an introduction on how exactly to use IDLE, check out this short but good tutorial from here at Berkeley: https://hkn.eecs.berkeley.edu/~dyoo/python/idle_intro/index.html (Don't worry about the fact that it's written for Python 2.4 - the various versions of Python 2 are not different enough for now to make a difference.)
A list of further tutorials can also be found here - these are labeled as being for "non-programmers," though they do vary in level of sophistication. Make sure you choose tutorials for Python 2, not Python 3.
For those with little background, try the ones that say explicitly that they are for those with no programming background or completely new to programming or something like that. If you pick one that's too confusing, forget it and try another one!
-- AngiChau - 11 Oct 2005
-- MohammadAzimi - 08 Sep 2010 (minor edits due to site/URL changes)
-- BenjaminEpstein - 06 Sep 2012