Homework: Information Content of DNA

This homework builds on the InformationContentOfDNA lab. Here we will study a cosmid from the Mycobacterium leprae genome, which has been downloaded to ~be131/Teaching.InformationContentOfDNA/mleprae.fasta. Use the methods from the lab to answer the following questions:

  1. Sketch the dotplots for direct repeats, inverted repeats and regions of high or low sequence complexity.
  2. What is the nucleotide composition of the Mycobacterium cosmid? What is the entropy of this distribution? What is the dinucleotide composition?
  3. Use a sliding-window entropy program to scan across various parts of the Mycobacterium cosmid, including microsatellite and tandem repeat regions. In particular, try here. Visualize the results by piping them into xgraph. Try varying the -n and -w parameters to change the word length and window size, respectively. Compare the results to a dotplot. What sorts of repeats is the sliding-window entropy method good at picking up, and which does it miss?
  4. This journal article discusses a TTC repeat in the M. leprae genome. Does the sliding-window entropy method identify it in the cosmid?
  5. Try compressing the Mycobacterium cosmid using a standard data compression tool, for example gzip or bzip2. How does the compressed size compare with the original?
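
As a rough illustration of the quantities in questions 2, 3 and 5, here is a minimal Python sketch. It is not the program used in the lab: the function names are hypothetical, and the n and w arguments only mirror the -n (word length) and -w (window size) options mentioned above. It assumes a single-record FASTA file and measures Shannon entropy in bits over overlapping words.

```python
import bz2
import gzip
import math
from collections import Counter


def read_fasta(path):
    """Return the concatenated, upper-cased sequence from a FASTA file.

    Assumption: header lines start with '>' and all records are joined.
    """
    with open(path) as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def word_counts(seq, n=1):
    """Count overlapping words of length n (n=1: nucleotides, n=2: dinucleotides)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by a table of counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def sliding_entropy(seq, n=1, w=100, step=1):
    """Yield (window_start, entropy) for each window of w bases along seq."""
    for start in range(0, len(seq) - w + 1, step):
        yield start, entropy_bits(word_counts(seq[start:start + w], n))


def compressed_bits_per_base(seq):
    """Return bits per base after gzip and bzip2 compression of the raw sequence."""
    raw = seq.encode("ascii")
    return {
        "gzip": 8 * len(gzip.compress(raw)) / len(raw),
        "bzip2": 8 * len(bz2.compress(raw)) / len(raw),
    }


if __name__ == "__main__":
    # Toy sequence, not real data: a mixed stretch followed by a poly-T run.
    toy = "ACGT" * 5 + "T" * 20
    print("composition:", dict(word_counts(toy, 1)))
    print("entropy (bits):", entropy_bits(word_counts(toy, 1)))
    for start, h in sliding_entropy(toy, n=1, w=20, step=20):
        # Two columns per line, the kind of x/y text a plotting tool can read.
        print(start, h)
```

To run this on the cosmid, pass the path from the assignment to read_fasta (expanding the ~ first, for example with os.path.expanduser). Windows covering a low-complexity stretch should show a drop in entropy, which can then be compared against the same region in a dotplot.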

