Before we begin
- This lab assumes you are now comfortable navigating a Linux environment and working with Python files. You will also run several Perl programs, as you have done in previous labs. If you run into basic problems, consult previous labs, a neighbor, or, if needed, the GSI.
- Several new tools are introduced in this lab. Spend a little bit of time familiarizing yourself with them both by reading about them on their source websites, and by tinkering with them.
Goals
- Get a feel for the information content of real genomes, and sources of redundancy (repeats, low complexity, etc)
- Analyze composition (nucleotide, dinucleotide, trinucleotide) and entropy of a real genome
- Make a dotplot of a genome against itself
- Get a feel for microsatellites, repeats, low-complexity, inversions, duplications
- Try compressing a genome
- Use a sliding-window entropy filter to mask out low-complexity regions
- Get exposure to GFF and FASTA formats
Data
Software
For your general reference:
- Dotplot program(s)
- Dotter (linux binary, runs locally, no size restriction)
- JDotter (Java, runs on server, has size restriction)
- Dotlet (Java)
- xgraph - download from ISI or Radford Neal's site
- Various Perl scripts linked below, including GffTools
On the DECF computers, the following programs are installed and you may use them directly:
- Dotter:
$ ~be131/src/dotter [This one should work. --BE]
- xgraph:
$ ~be131/src/xgraph-12.1/xgraph (manual pages can be viewed by $ man ~be131/src/xgraph-12.1/xgraph.man) [This may not work currently - working on fixing installation. --BE]
- The relevant Perl files are in ~be131/src/perl
- Note: To get usage information for most of the Perl scripts used in this lab, use the -h option. For example, type:
composition.pl -h
- EMBL file for Ehrlichia ruminantium and FASTA file for Bifidobacterium are also in
~be131/InformationContentOfDNA/
Procedure
- The practical is split into two parts. The first part involves generally playing around with complexity analysis tools including dotplots, compositional analysis, sliding-window entropy and compression. The second part involves using one of these tools (sliding-window entropy) to filter out low-complexity sequences from the genome of a Bifidobacterium.
- The second part of the practical (repeat-masking Bifidobacterium) is quick, but essential for future practicals (wherein we will attempt to predict genes in the same genome). Be sure to finish the second part, even at the cost of not following up every last suggestion in the first part.
- First part.
- The key points are highlighted in boldface.
- Tools for analyzing sequence complexity.
- Familiarise yourself with the concept of a dotplot for analyzing biological sequence similarity. Several free dotplot programs are available. Play around with e.g. the Dotlet examples and ensure you are able to sketch the dotplots for direct repeats, inverted repeats and regions of high or low sequence complexity.
- You may find it useful to examine a highly repetitive genome such as that of Ehrlichia ruminantium.
- The above link is to an EMBL database entry, which has a somewhat idiosyncratic format (like most file formats in bioinformatics); see if you can make sense of it. Use this embl2fasta.pl script to extract the sequence in FastaFormat (most of the tools for DNA analysis that we will be using take FastaFormat as an input, so you'll need to convert from EMBL format in most cases.)
-
embl2fasta.pl er.embl > er.fasta
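To help make sense of the EMBL layout, here is a minimal Python sketch of the conversion. It is not the embl2fasta.pl script; it assumes a single-entry file in which the identifier is the first token on the ID line, the sequence lines sit between the SQ line and the terminating //, and each sequence line ends with a position counter. The function name `embl_to_fasta`, the toy record and the choice to upper-case the output are illustrative assumptions.

```python
def embl_to_fasta(embl_text, width=60):
    """Convert the text of a single-entry EMBL record to a FASTA string."""
    name = "unknown"
    in_sequence = False
    residues = []
    for line in embl_text.splitlines():
        if line.startswith("ID"):
            # The first token after the ID tag is taken as the entry identifier.
            name = line.split()[1].rstrip(";")
        elif line.startswith("SQ"):
            in_sequence = True       # sequence lines follow the SQ header
        elif line.startswith("//"):
            in_sequence = False      # '//' terminates the entry
        elif in_sequence:
            # Keep letters only: drops the spacing and the trailing counter.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(residues).upper()
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


# Toy record (hypothetical, not the Ehrlichia entry).
example = (
    "ID   TEST1; SV 1; linear; genomic DNA; STD; PRO; 24 BP.\n"
    "XX\n"
    "SQ   Sequence 24 BP;\n"
    "     acgtacgtac gtacgtacgt acgt        24\n"
    "//\n"
)
print(embl_to_fasta(example), end="")
```

Annotation lines (feature table, references, and so on) are ignored here, which is usually what you want when only the sequence is needed.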
- What is the nucleotide composition of the Ehrlichia genome? Use e.g. composition.pl, or write something better. What is the entropy of this probability distribution?
-
composition.pl er.fasta
- How about the dinucleotide composition? (Use the switch "-n 2" in the composition.pl script.)
-
composition.pl -n 2 er.fasta
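If you would rather write your own composition tool, the following Python sketch counts overlapping words of length n and computes the Shannon entropy, H = -sum(p * log2(p)), of the resulting distribution. It is not a reimplementation of composition.pl; the function names are illustrative, and skipping words that contain anything other than A, C, G or T is an assumption of this sketch.

```python
import math
from collections import Counter


def read_fasta_sequence(path):
    """Concatenate the sequence lines of a FASTA file, skipping '>' headers."""
    with open(path) as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def kmer_frequencies(seq, n=1):
    """Relative frequencies of overlapping length-n words made of A, C, G, T."""
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= alphabet
    )
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def shannon_entropy(freqs):
    """Shannon entropy in bits of a frequency table."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


# Toy sequence; for the real genome use read_fasta_sequence("er.fasta").
toy = "ACGTACGTAAAAAAAA"
for n in (1, 2):
    freqs = kmer_frequencies(toy, n)
    print(n, sorted(freqs.items()), round(shannon_entropy(freqs), 3))
```

A uniform distribution over the four nucleotides gives the maximum of 2 bits, and over the sixteen dinucleotides the maximum is 4 bits, so the gap between the observed entropy and these maxima is one measure of compositional bias.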
- Use a sliding-window entropy program (e.g. seqentropy.pl) to scan across various parts of the Ehrlichia genome, including microsatellite and tandem repeat regions. In particular try here, here, here and here. Visualize the results by piping them into xgraph. Try playing with the -n and -w parameters to change the word length and window size (respectively). Compare the results to a dotplot. What sorts of repeats is the sliding-window entropy method good at picking up, and what does it miss?
-
embl2fasta.pl er_477.embl > er_477.fa
-
seqentropy.pl er_477.fa > er_477.ent
-
cat er_477.ent | xgraph &
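The idea behind a sliding-window entropy scan can be sketched in a few lines of Python. This is an illustration of the technique rather than the algorithm used by seqentropy.pl, whose windowing details and output format may differ. The parameters `n` and `w` mirror the word length and window size mentioned above; `step` and the function names are assumptions of this sketch.

```python
import math
from collections import Counter


def window_entropy(window, n=1):
    """Shannon entropy (bits) of the overlapping length-n words in a window."""
    counts = Counter(window[i:i + n] for i in range(len(window) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_entropy(seq, n=1, w=20, step=1):
    """Yield (window_start, entropy) for each length-w window of seq."""
    for start in range(0, len(seq) - w + 1, step):
        yield start, window_entropy(seq[start:start + w], n)


# Toy sequence: a homopolymer run followed by a more complex stretch.
toy = "A" * 20 + "ACGT" * 5
for start, entropy in sliding_entropy(toy, n=1, w=20, step=5):
    # One "x y" pair per line, a layout that plotting tools can usually read.
    print(start, round(entropy, 3))
```

Note that with n=1 a perfect tandem repeat such as ACGTACGT... scores the maximum 2 bits, because each nucleotide is equally frequent; this is worth keeping in mind when comparing against the dotplot.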
- Try compressing the entire Ehrlichia genome using a standard data compression tool, for example gzip or bzip2. What's the result? (If you have time, try compressing several different parts of the genome.)
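To put the compressed size in context, it helps to express it in bits per base: a plain-text sequence uses 8 bits per base, while four equally likely nucleotides need only 2. The Python sketch below uses the standard-library gzip and bz2 modules on two toy sequences (a seeded random one and a perfect repeat); the helper name `bits_per_base` and the toy data are assumptions, and real genomes will fall somewhere in between.

```python
import bz2
import gzip
import random


def bits_per_base(seq, compress):
    """Compressed size in bits divided by the number of bases."""
    raw = seq.encode("ascii")
    return 8 * len(compress(raw)) / len(raw)


random.seed(0)  # fixed seed so the toy example is repeatable
random_seq = "".join(random.choice("ACGT") for _ in range(10000))
repeat_seq = "ACGT" * 2500

for label, seq in [("random", random_seq), ("repeat", repeat_seq)]:
    print(
        label,
        "gzip:", round(bits_per_base(seq, gzip.compress), 3),
        "bzip2:", round(bits_per_base(seq, bz2.compress), 3),
    )
```

The same helper can be applied to slices of a real genome string to compare how compressible different regions are.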
- Second part.
- Review the following background info on Bifidobacteria. If there is something you don't know, consider it a test of your Google skills...
- genome structure of prokaryotes in general
- biomedical relevance of Bifidobacteria
- sequencing and assembly process: reads, contigs, scaffolds
- why are there e.g. N's in the sequence?
- Download the genome in FastaFormat (use the "Send to" link at the top-right), or copy it from
~be131/InformationContentOfDNA/.
- When files become large, you may want to reduce processing time for any given step by splitting a multi-sequence FASTA file into individual sequences, using e.g. the attached Perl script splitseq.pl. For this lab, it's probably OK to do it all at once; just be patient! You can also make smaller FASTA files from single-sequence files using the head command, provided the FASTA sequence is spread over multiple lines.
- As preparation for subsequent exercises in prokaryotic genefinding, you'll need to mask out low-complexity regions from this genome.
- "Masking out" means replacing certain nucleotides with N's or X's. You can use a sliding-window entropy method to do this with the cfilter.pl ("complexity filter") script, which has similar syntax to seqentropy.pl (the other sliding-window entropy script that you've been using above).
-
cfilter.pl genome.fasta > genomeMask.fasta &
- Verify the masking worked by typing:
grep -i 'x' genomeMask.fasta | more
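As a rough picture of what an entropy-based complexity filter does, the Python sketch below replaces every window whose single-nucleotide entropy falls below a threshold with a mask character. It is an illustration only: the window size, the threshold, the mask character and the function names are all assumptions, and cfilter.pl may choose and merge regions differently.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the nucleotide composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, w=20, threshold=1.0, mask_char="N"):
    """Mask every length-w window whose entropy is below the threshold."""
    masked = list(seq)
    for start in range(len(seq) - w + 1):
        # Entropy is always computed on the original, unmasked sequence.
        if window_entropy(seq[start:start + w]) < threshold:
            masked[start:start + w] = mask_char * w
    return "".join(masked)


# Toy sequence: a poly-A run flanked by more complex sequence.
toy = "ACGT" * 10 + "A" * 30 + "ACGT" * 10
print(mask_low_complexity(toy))
```

Because whole windows are masked, the masked region extends slightly into the flanking sequence; a smaller window or a lower threshold makes the filter more conservative.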
- As practice for later work, if you have time, try splitting this process into two steps. First use the '-gff' option (for the cfilter.pl program) to return the co-ordinates of the low-entropy regions in GFF format.
-
cfilter.pl -gff genome.fasta > entropy.gff &
- When you do this, the cfilter.pl script also returns the coordinates of the high-entropy regions. To remove those from the GFF file, use
-
gfffilter.pl '$gfffeature eq "low"' entropy.gff > lowEntropy.gff
- Then use the gffmask.pl script to mask out the GFF-specified co-ords. gffmask.pl reads the sequence from standard input, so cat the genome file first and then pipe it into gffmask.pl. Why might this method be (slightly) more flexible?
-
cat genome.fasta | gffmask.pl lowEntropy.gff > genomeMask.fasta &
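The two-step approach can be sketched in Python as follows. GFF is a tab-separated format whose third, fourth and fifth columns hold the feature type and the 1-based, inclusive start and end co-ordinates. The sketch keeps only features of a chosen type and masks those intervals; it is not the gfffilter.pl or gffmask.pl code. The feature name "low" comes from the filter expression above, while the source name "cfilter", the feature name "high" and the toy records are hypothetical.

```python
def parse_gff_line(line):
    """Extract the fields needed for masking from one GFF record."""
    fields = line.rstrip("\n").split("\t")
    return {
        "seqname": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),   # 1-based, inclusive
        "end": int(fields[4]),     # 1-based, inclusive
    }


def mask_from_gff(seq, gff_lines, feature="low", mask_char="N"):
    """Mask the intervals of all GFF records with the given feature type."""
    masked = list(seq)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        record = parse_gff_line(line)
        if record["feature"] != feature:
            continue
        # Clamp to the sequence and convert to 0-based slice bounds.
        start = max(record["start"], 1)
        end = min(record["end"], len(seq))
        if start <= end:
            masked[start - 1:end] = mask_char * (end - start + 1)
    return "".join(masked)


# Hypothetical records for a 10-base toy sequence.
gff = [
    "seq1\tcfilter\tlow\t3\t5\t.\t+\t.\t.",
    "seq1\tcfilter\thigh\t6\t10\t.\t+\t.\t.",
]
print(mask_from_gff("ACGTACGTAC", gff))
```

Keeping the co-ordinates in a separate file means the same genome can be masked with intervals from any source, or the intervals can be edited and combined before masking.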
Homework
PythonPairwiseAlignmentLab

Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.