Home - this site is powered by TWiki(R)
Teaching > BioE131 > BioE131WeeklyExercises > AlignmentAndPhylogenyExercises
TWiki webs: Main | TWiki | Sandbox   Log In or Register

Try to answer the following questions:
  1. What are the memory and runtime requirements (in big-O notation) of each of the following tasks, on a dataset of N sequences each of length L:
    1. simultaneous multiple sequence alignment multidimensional dynamic programming;
    2. progressive alignment of the same set of sequences using a guide tree that is already known;
    3. estimation of a guide tree from a distance matrix, using an algorithm like UPGMA or neighbor-joining, as a precursor to progressive alignment;
    4. estimation of a distance matrix by aligning every pair of sequences and then using the Jukes-Cantor distance.
  2. What is the difference between a linear gap penalty & an affine gap penalty? Which one produces more realistic alignments, and why?
  3. An individual column of a multiple alignment contains nucleotides of type k (so adenines, cytosines, guanines and thymines).
    1. What is the probability that a randomly sampled column would have these counts, assuming a uniform distribution over nucleotides, with each row of the alignment being sampled independently?
    2. What is the probability if one assumes a 60% GC content, instead of a uniform distribution?
    3. What about arbitrary probabilities ?
      1. What values of would maximize this probability?
      2. What is the maximal value of the likelihood?
        • How does this relate to the entropy of the column?
    4. How do these results relate to the Wikipedia:Multinomial_distribution ?
  4. The following question assumes the Jukes-Cantor model for sequence evolution.
    1. Describe the assumptions behind this model.
    2. Start with a sequence of 5 nucleotides and let the model run for a very long (near-infinite) time. What is the probability that, after this time, the sequence is AAGTC ?
    3. Reset the clock, start the sequence at AAGTC and let the model run for another t units of evolutionary time (calibrated to one expected substitution per site, per unit of time; NB a substitution here means an observable difference, rather than a replacement event which may, or may not, be observable). What is the probability, P, that, after time t and given that the sequence started at AAGTC, it ends up as AGGGC ?
    4. Let's phrase that last question slightly differently. What is the probability, Q, that, after time t , the starting and ending sequences were identical at 3 of their 5 nucleotide positions, and different at the remaining 2 positions? Why is this probability different from the probability (P), requested by the previous question?
    5. Generalize both probabilities (P and Q), requested by the previous two questions, to a sequence of length L that (after time t ) is identical to the ancestral sequence at N positions, and different at L-N positions. Find the value t that maximizes this probability (if such a maximal value of t exists).
    6. How (and why) are the assumptions behind the Jukes-Cantor model violated by real data?
  5. What is Wikipedia:BLAST ?
    • Describe the following steps/data structures in the BLAST algorithm: (a) low-complexity masking (the SEG and DUST programs), (b) word-matching, (c) high-scoring segment pairs.
    • What is the Wikipedia:Extreme_value_distribution ? What is its relationship to BLAST?
    • What kind of alignments can BLAST detect, and what will it miss?

-- IanHolmes - 30 Oct 2010

Edit | Attach | Print version | History: r10 < r9 < r8 < r7 < r6 | Backlinks | Raw View | Raw edit | More topic actions

This site is powered by the TWiki collaboration platformCopyright © 2008-2014 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
TWiki Appliance - Powered by TurnKey Linux