- Get a feel for the information content of real genomes, and sources of redundancy (repeats, low complexity, etc)
- Analyze composition (nucleotide, dinucleotide, trinucleotide) and entropy of a real genome
- Dotplot vs self
- Get a feel for microsatellites, repeats, low-complexity, inversions, duplications
- Try compressing a genome
- Use a sliding-window entropy filter to mask out low-complexity regions
- Get exposure to GFF and FASTA formats
- Dotplot program(s)
- Dotter (linux binary, runs locally, no size restriction)
- JDotter (Java, runs on server, has size restriction)
- Dotlet (Java)
- xgraph - download from ISI or Radford Neal's site
- Various Perl scripts linked below, including GffTools
On the DECF computers, the following programs are installed and you may use them directly:
In addition, the following files related to this lab are available in the specified directory to be copied to your own directory:
$ ~be131/xgraph/bin/xgraph
(manual pages can be viewed with: man ~be131/xgraph/man/manm/xgraph.man)
- In order to use the gfftools, you will first need to tell Perl where to find some of the libraries that gfftools needs. First determine your Unix shell by typing echo $SHELL.
- In tcsh, type the following:
setenv PERL5LIB $HOME/gfftools
- In bash, you need to type
export PERL5LIB=$HOME/gfftools
- Often in subsequent practicals, we will give the tcsh syntax only.
- other Perl scripts for this practical:
- EMBL file for Ehrlichia ruminantium and FASTA file for Bifidobacterium:
- The practical is split into two parts. The first part involves generally playing around with complexity analysis tools including dotplots, compositional analysis, sliding-window entropy and compression. The second part involves using one of these tools (sliding-window entropy) to filter out low-complexity sequences from the genome of a Bifidobacterium.
- The second part of the practical (repeat-masking Bifidobacterium) is quick, but essential for future practicals (wherein we will attempt to predict genes in the same genome). Be sure to finish the second part, even at the cost of not following up every last suggestion in the first part.
- First part.
- Your submitted homework assignment should address all the points in bold typeface.
- Tools for analyzing sequence complexity.
- Familiarise yourself with the concept of a dotplot for analyzing biological sequence similarity. Several free dotplot programs are available. Play around with e.g. the Dotlet examples and ensure you are able to sketch the dotplots for direct repeats, inverted repeats and regions of high or low sequence complexity.
- You may find it useful to examine a highly repetitive genome such as that of Ehrlichia ruminantium.
- The above link is to an EMBL database entry, which has a somewhat idiosyncratic format (like most file formats in bioinformatics); see if you can make sense of it. Use this embl2fasta.pl script to extract the sequence in FastaFormat (most of the tools for DNA analysis that we will be using take FastaFormat as an input, so you'll need to convert from EMBL format in most cases.)
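If you want to see the gist of what embl2fasta.pl does, here is a minimal Python sketch (my own illustration, not the course script): in an EMBL entry, the sequence sits between the SQ header line and the // terminator, as blocks of lowercase bases with a trailing coordinate. The function name and the toy entry below are made up for the example.

```python
def embl_to_fasta(embl_text, name="sequence"):
    """Minimal sketch: collect the sequence between the 'SQ' line and '//'."""
    in_seq = False
    seq = []
    for line in embl_text.splitlines():
        if line.startswith("SQ"):
            in_seq = True
        elif line.startswith("//"):
            in_seq = False
        elif in_seq:
            # sequence lines contain groups of bases plus a trailing coordinate
            seq.append("".join(ch for ch in line if ch in "acgtnACGTN"))
    return f">{name}\n" + "".join(seq).upper()

embl = "ID   TEST\nSQ   Sequence 12 BP;\n     acgtacgtaatt    12\n//\n"
print(embl_to_fasta(embl, "TEST"))  # -> ">TEST" then "ACGTACGTAATT"
```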
- What is the nucleotide composition of the Ehrlichia genome? (Use e.g. this script, or write something better.) What is the entropy of this probability distribution?
- How about the dinucleotide composition? (Use the switch "-n 2" in the composition.pl script.)
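If you'd rather write something yourself than use composition.pl, a minimal Python sketch of overlapping k-mer composition plus Shannon entropy might look like this (function names are mine, not the course script's; n=1 gives nucleotide and n=2 dinucleotide composition, matching the "-n 2" switch mentioned above):

```python
from collections import Counter
from math import log2

def composition(seq, n=1):
    """Frequencies of overlapping n-mers (n=1: nucleotides, n=2: dinucleotides)."""
    counts = Counter(seq[i:i+n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def entropy(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)

print(entropy(composition("ACGT")))  # -> 2.0 (uniform over four bases)
```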
- Use a sliding-window entropy program to scan across various parts of the Ehrlichia genome, including the linked microsatellite and tandem repeat regions. Visualize the results by piping them into xgraph. Try playing with the word-length and -w (window-size) parameters. Compare the results to a dotplot. What sorts of repeat is the sliding-window entropy method good at picking up, and what does it miss?
- Try compressing the entire Ehrlichia genome using a standard data compression tool, for example
bzip2. What's the result? (If you have time, try compressing several different parts of the genome.)
- Second part.
- Review the following background info on Bifidobacteria. If there is something you don't know, consider it a test of your Google skills...
- genome structure of prokaryotes in general
- biomedical relevance of Bifidobacteria
- sequencing and assembly process: reads, contigs, scaffolds
- why are there e.g. N's in the sequence?
- Check out the online annotation:
- Download the genome in FastaFormat
- (optional, might make things easier) Split multi-sequence FASTA file into individual sequences using e.g. attached Perl script splitseq.pl
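The splitting step is simple enough to sketch; here is an illustrative Python version of what a splitter like splitseq.pl has to do (the function name is mine), keeping only the first word of each header as the record name:

```python
def split_fasta(text):
    """Split a multi-sequence FASTA string into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header line
            records[name] = []
        elif name:
            records[name].append(line.strip())
    return {n: "".join(parts) for n, parts in records.items()}

print(split_fasta(">a desc\nACGT\nTTAA\n>b\nGGGG\n"))
# -> {'a': 'ACGTTTAA', 'b': 'GGGG'}
```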
- As preparation for subsequent exercises in prokaryotic genefinding, you'll need to mask out low-complexity regions from this genome.
- "Masking out" means replacing certain nucleotides with N's or X's. You can use a sliding-window entropy method to do this with the cfilter.pl ("complexity filter") script, which has similar syntax to seqentropy.pl (the other sliding-window entropy script that you've been using above).
- As practice for later work, if you have time, try splitting this process into two steps. First use the '-gff' option (for the cfilter.pl program) to return the co-ordinates of the low-entropy regions in GFF format, then use the gffmask.pl script to mask out the GFF-specified co-ords. Why might this method be (slightly) more flexible?
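The masking step itself is conceptually simple. Here is an illustrative Python sketch (not gffmask.pl itself), assuming you've already pulled (start, end) pairs out of GFF columns 4 and 5, which are 1-based and inclusive as the GFF format defines them:

```python
def mask_regions(seq, regions, mask_char="N"):
    """Replace bases inside each (start, end) region (1-based, inclusive, as in GFF)."""
    s = list(seq)
    for start, end in regions:
        s[start-1:end] = mask_char * (end - start + 1)
    return "".join(s)

# regions as they might come from GFF columns 4 (start) and 5 (end)
print(mask_regions("ACGTACGTAC", [(3, 5), (8, 9)]))  # -> "ACNNNCGNNC"
```

Keeping the co-ordinates in a separate GFF file means you can inspect, edit, or combine them before masking, rather than committing to the mask in one step.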
- sketch the dotplots for direct repeats, inverted repeats and regions of high or low sequence complexity.
If we plot a sequence against itself on a dotplot, we will always see a solid main diagonal, since a sequence always matches itself. Note also that these self-plots are always symmetrical across the diagonal. Direct repeats appear as lines parallel to the main diagonal, while inverted repeats are not detected in this plot. Low-complexity regions appear as shaded blocks, and high-complexity regions show no distinguishable features.
If we plot a sequence against its complement (not reverse complement), inverted repeats appear as lines perpendicular to the main diagonal. The main diagonal itself does not appear on this plot, since no base matches its own complement.
If we plot a sequence against its reverse complement, inverted repeats appear as lines parallel to the main diagonal, but again the main diagonal does not show up.
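These geometries are easy to verify on toy sequences. A tiny illustrative Python sketch (my own, not one of the dotplot programs linked above) that prints a character-matrix dotplot:

```python
def dotplot(x, y, match="*", blank="."):
    """Rows run down y, columns across x; a '*' wherever the two bases match."""
    return "\n".join("".join(match if a == b else blank for a in x) for b in y)

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s[::-1].translate(str.maketrans("ACGT", "TGCA"))

seq = "ACGTTT"
print(dotplot(seq, seq))           # self-plot: solid main diagonal
print()
print(dotplot(seq, revcomp(seq)))  # vs reverse complement: inverted repeats
                                   # would run parallel to the diagonal
```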
- What is the nucleotide composition of the Ehrlichia genome?
> ./composition.pl ehrlichia.fasta
# Sequence 'gi|57160810|emb|CR767821.1|'
- What sorts of repeat is the sliding-window entropy method good at picking up, and what does it miss?
Many students had trouble with this question, so let me try to explain the sliding window entropy method a little more thoroughly than just giving you an answer. The following is adapted from an email I sent to some students about this question, so don't be surprised if this sounds eerily familiar!
The sliding window entropy method gives you a series of numbers - let's make sure you understand what each of these numbers means before going further. What this method does is take a sequence and calculate the entropy within a window, depending on the window size and the word size. For a simple example, think about a sequence with 20 bases. If we perform sliding window entropy with a window size of 10 bases and a word size of 1 base, the method will take the first 10 bases (bases 1-10), count up how many times each base (A,C,T,G) occurs in that 10-base chunk, and calculate the entropy using the formula you've seen in class. That gives you one number. Then the window slides down by one base, the method looks at bases 2-11 and does the same thing, and so on. So in the end you'll have a series of numbers representing entropies, as a window of size 10 is slid over the full sequence.

If you use a word size of 2 bases, then instead of counting A,C,T,G frequencies, it counts dinucleotide frequencies instead (AA, AC, AT, ...). So in the results you're visualizing with xgraph, the first point is the entropy of bases 1-10, the second point is the entropy of bases 2-11, and so on.
(Aside: entropy is calculated using the formula you've seen in class. When you use a log base 2, the resulting summation, i.e. the entropy, will be in units of bits, so it has a more 'well-defined' meaning than say, if you used some other base for the log. But the relative behavior of entropy, such as whether it increases or decreases, should be the same no matter what base you use for the log.)
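The windowed calculation described above can be sketched in a few lines of Python (an illustration of the idea, not the seqentropy.pl script you've been running):

```python
from collections import Counter
from math import log2

def sliding_entropy(seq, window=10, word=1):
    """Entropy (bits) of each window, sliding one base at a time.
    Words of length `word` are counted with overlaps within each window."""
    out = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i+window]
        counts = Counter(chunk[j:j+word] for j in range(len(chunk) - word + 1))
        total = sum(counts.values())
        out.append(-sum((c/total) * log2(c/total) for c in counts.values()))
    return out

# a run of A's (entropy 0) followed by a more mixed stretch
print(sliding_entropy("AAAAAAAAAACGTACGTACG", window=10, word=1))
```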
So what does low-complexity mean? If you have low complexity regions, then you would expect highly skewed frequencies/probabilities (for example, in the worst case, you have all As so the probability of an A is 1 and the probability of anything else is 0), resulting in low entropy when you plug them into the equation. Remember one point that Prof Holmes made in class - a uniform probability distribution, e.g. 1/4 for each of the four bases, gives you the maximum entropy. So high entropy -> more uniform probabilities -> more randomness -> higher sequence complexity. When you plot the entropy as a function of "window number", you're basically looking at how complex the sequence is within each of the windows, i.e. each little chunk of the sequence.
If you see an increase in entropy, that means the sequence within the window you're looking at seems more complex/less repetitious. But of course, this depends on how you're calculating entropies. If within your window, you have all As, then you would get a zero entropy no matter what word size you use. But if you have ACTGACTGACTGACTGACTG, and you used a word size of 1, you would still get 1/4 for each of the bases and that section of the sequence would actually have a very high entropy! If you increase your word size to 4, you would be able to pick up on the repetition because the entropy will be low compared to that of a completely random sequence (although this entropy won't be 0 - do you see why?).
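You can check the ACTG example numerically; this snippet (mine, for illustration) computes the overlapping-word entropy of ACTG repeated five times:

```python
from collections import Counter
from math import log2

def word_entropy(seq, word):
    """Entropy (bits) of the overlapping `word`-mer frequencies of seq."""
    counts = Counter(seq[i:i+word] for i in range(len(seq) - word + 1))
    total = sum(counts.values())
    return -sum((c/total) * log2(c/total) for c in counts.values())

rep = "ACTG" * 5                # ACTGACTGACTGACTGACTG
print(word_entropy(rep, 1))     # exactly 2.0 bits: A,C,T,G equally frequent
print(word_entropy(rep, 4))     # ~1.99 bits: only the four rotations
                                # ACTG, CTGA, TGAC, GACT ever occur
```

With word size 4 the entropy is not 0, because overlapping windows see the four rotations of the repeat unit; but ~2 bits is far below the log2(17) ≈ 4.09 bits that 17 all-distinct 4-mers could reach, so the repetition still stands out.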
On your entropy graphs, then, if you see areas of low entropy, it would give you a clue that you've found an area in the sequence that has some sort of repetition that is being picked up by the window size and word size you're using. The important thing to note here is that how well you detect repeats depends a lot on the window size and the word length you choose and how well they match up with the repeating structure in the sequence of interest.
So to sum up, sliding window entropy can pick up direct repeats (if you choose appropriate window sizes and word lengths) but will miss inverted repeats because a sequence like ACT...AGT won't register as a repeat no matter what window size/word length you choose. Also, if you have a repeating sequence that's scattered across the genome or just doesn't occur that often in series, the sliding entropy method would have a hard time picking this up as well.
- Try compressing the entire Ehrlichia genome using a standard data compression tool, for example
bzip2. Comment on the results.
The uncompressed FASTA file is about 1.5 MB and the compressed (gzipped) version is about 500 kB. Since there's a significant reduction in file size upon compression, this implies that the file, and therefore the genome, contains a lot of repetitious information that can be represented more compactly. Although this is not how most compression algorithms actually work (they're much more complicated), here is one intuitive way to think about compression: if you had 1000 A's in a sequence, you could either write out all 1000 A's or just write Ax1000. The latter obviously reduces the size of your file.
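The "Ax1000" idea is just run-length encoding; a toy Python sketch of that intuition:

```python
from itertools import groupby

def run_length_encode(seq):
    """Store each run of identical bases as a (base, count) pair."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

print(run_length_encode("AAAAACCGTTT"))
# -> [('A', 5), ('C', 2), ('G', 1), ('T', 3)]
```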
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.