Biowiki content is in the public domain.

Biowiki . Teaching . ProbabilityAndInformationExercises

Try to answer the following questions:

  1. Consider the sequence motif ATGGCTA. Roughly how frequently (giving an answer of the form "every N bases", and explaining your reasoning) would you expect to see this motif occurring in (a) a uniform IID DNA sequence, (b) an IID DNA sequence with GC content 60%? How well might you expect a naturally occurring genomic sequence to conform to these models, and in what ways would it deviate from the models? How would your answers change if the motif were the 8-mer ATATATAT instead of the 7-mer ATGGCTA?
  2. How would you use the PerlSequenceSimulator and the gzip program to empirically estimate the relative entropy D(P||Q) where P is an IID model for human genomic DNA, and Q is the implicit probability distribution underlying the Lempel-Ziv algorithm?
  3. What similar method could you use if Q is, instead, the implicit probability distribution underlying Burrows-Wheeler compression? (Hint: what Unix program implements Burrows-Wheeler compression?)
  4. The following exercises are from David MacKay's book Information Theory, Inference, and Learning Algorithms, page 201:
    1. What is the shortest the address on a typical international letter could be, if it is to get to a unique human recipient? (Assume the permitted characters are [A-Z,0-9].) How long are typical email addresses?
    2. How long does a piece of text need to be for you to be pretty sure that no human has written that string of characters before? How many notes are there in a new melody that has not been composed before?
    3. Some proteins produced in a cell have a regulatory role. A regulatory protein controls the transcription of specific genes in the genome. This control often involves the protein's binding to a particular DNA sequence in the vicinity of the regulated gene. The presence of the bound protein either promotes or inhibits transcription of the gene.
      1. Use information-theoretic arguments to obtain a lower bound on the size of a typical protein that acts as a regulator specific to one gene in the whole human genome. Assume that the genome is a sequence of 3 × 10^9 nucleotides drawn from a four-letter alphabet {A,C,G,T}; a protein is a sequence of amino acids drawn from a twenty-letter alphabet. [Hint: establish how long the recognized DNA sequence has to be in order for that sequence to be unique to the vicinity of one gene, treating the rest of the genome as a random sequence. Then discuss how big the protein must be to recognize a sequence of that length uniquely.]
      2. Some of the sequences recognized by DNA-binding regulatory proteins consist of a subsequence that is repeated twice or more, for example the sequence is a binding site found upstream of the alpha-actin gene in humans. Does the fact that some binding sites consist of a repeated subsequence influence your answer to part 1?
  5. Let x be a DNA sequence of length L. Let y be another sequence of length L, of which N nucleotides are identical to the corresponding positions of x, and the remaining L-N nucleotides are different.
    1. Let X1 be a random variable that is a uniform IID DNA sequence of length L. Similarly let Y1 be another random uniform IID DNA sequence that is independent of X1. What is the joint probability P(X1=x,Y1=y) ?
    2. Let X2 be a random variable that is a uniform IID DNA sequence of length L. Let Y2 be a random variable that is also a DNA sequence of length L, obtained as follows: for any given position of Y2, with probability P the nucleotide at that position is identical to the corresponding position of X2; otherwise (with probability 1-P), it is sampled randomly from a uniform distribution. What is the joint probability P(X2=x,Y2=y) ?
    3. Let M be a random variable taking values in {1,2}, and let X3 and Y3 be random sequences of length L, determined as follows. With probability Q, set M=1, X3=X1 and Y3=Y1; otherwise (with probability 1-Q), set M=2, X3=X2 and Y3=Y2. (We can say that X3=X[M] and Y3=Y[M].) What are the following probabilities:
      1. P(M=2)
      2. P(X3=x,Y3=y,M=1)
      3. P(X3=x,Y3=y,M=2)
      4. P(X3=x,Y3=y)
      5. P(M=2|X3=x,Y3=y)
      6. What does the last of these, P(M=2|X3=x,Y3=y), signify?
      7. Which of the above is (a) a prior probability for the value of M, (b) a posterior probability for the value of M?
  6. Roughly (to 1 significant figure) how many (a) nucleotides and (b) genes are there in the genomes of the following organisms:
    1. HIV
    2. E. coli
    3. S. cerevisiae
    4. D. melanogaster
    5. H.sapiens
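
Answers to questions 1 and 2 can be checked empirically by simulation. The following is a minimal Python sketch, not the PerlSequenceSimulator mentioned above. The function names (`simulate_iid_dna`, `count_motif`, `compressed_bits_per_base`) are hypothetical, and the sketch assumes that "GC content" means G and C are equally likely and together have the stated probability, with A and T sharing the remainder equally. It uses Python's `zlib` module, which implements DEFLATE, the LZ77-based format that gzip uses, as a stand-in for running the gzip program.

```python
import random
import zlib


def simulate_iid_dna(length, gc_content=0.5, seed=0):
    """Return an IID DNA string of the given length.

    Assumption: G and C each have probability gc_content / 2,
    and A and T each have probability (1 - gc_content) / 2.
    """
    rng = random.Random(seed)
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def count_motif(seq, motif):
    """Count occurrences of motif in seq, including overlapping ones."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        # Advance by one base only, so overlapping matches are counted.
        pos = seq.find(motif, pos + 1)
    return count


def compressed_bits_per_base(seq, level=9):
    """Bits per nucleotide after DEFLATE compression of the ASCII sequence."""
    compressed = zlib.compress(seq.encode("ascii"), level)
    return 8.0 * len(compressed) / len(seq)


if __name__ == "__main__":
    seq = simulate_iid_dna(1_000_000, gc_content=0.6, seed=1)
    hits = count_motif(seq, "ATGGCTA")
    print("occurrences of ATGGCTA:", hits)
    if hits:
        print("roughly one every", len(seq) // hits, "bases")
    print("bits per base after DEFLATE:", round(compressed_bits_per_base(seq), 3))
```

The motif counter deliberately counts overlapping matches, which matters for a self-overlapping motif such as ATATATAT. The compressed size includes a few bytes of format overhead, so the bits-per-base figure is only meaningful for long sequences.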

-- IanHolmes - 29 Oct 2010

----- Revision r10 - 2010-11-17 - 00:21:47 - IanHolmes