Research Teaching
 Fall12 | Sandbox
 Biowiki > Teaching > BioE131 > BioE131WeeklyExercises > AlignmentAndPhylogenyExercises

 Search all webs web groupthis web topic names Advanced search...

 Topics
Try to answer the following questions:
1. What are the memory and runtime requirements (in big-O notation) of each of the following tasks, on a dataset of N sequences each of length L:
1. simultaneous multiple sequence alignment multidimensional dynamic programming;
2. progressive alignment of the same set of sequences using a guide tree that is already known;
3. estimation of a guide tree from a distance matrix, using an algorithm like UPGMA or neighbor-joining, as a precursor to progressive alignment;
4. estimation of a distance matrix by aligning every pair of sequences and then using the Jukes-Cantor distance.
2. What is the difference between a linear gap penalty & an affine gap penalty? Which one produces more realistic alignments, and why?
3. An individual column of a multiple alignment contains nucleotides of type k (so adenines, cytosines, guanines and thymines).
1. What is the probability that a randomly sampled column would have these counts, assuming a uniform distribution over nucleotides, with each row of the alignment being sampled independently?
2. What is the probability if one assumes a 60% GC content, instead of a uniform distribution?
3. What about arbitrary probabilities ?
1. What values of would maximize this probability?
2. What is the maximal value of the likelihood?
• How does this relate to the entropy of the column?
4. How do these results relate to the Wikipedia:Multinomial_distribution ?
4. The following question assumes the Jukes-Cantor model for sequence evolution.
1. Describe the assumptions behind this model.
2. Start with a sequence of 5 nucleotides and let the model run for a very long (near-infinite) time. What is the probability that, after this time, the sequence is `AAGTC` ?
3. Reset the clock, start the sequence at `AAGTC` and let the model run for another t units of evolutionary time (calibrated to one expected substitution per site, per unit of time; NB a substitution here means an observable difference, rather than a replacement event which may, or may not, be observable). What is the probability, P, that, after time t and given that the sequence started at `AAGTC`, it ends up as `AGGGC` ?
4. Let's phrase that last question slightly differently. What is the probability, Q, that, after time t , the starting and ending sequences were identical at 3 of their 5 nucleotide positions, and different at the remaining 2 positions? Why is this probability different from the probability (P), requested by the previous question?
5. Generalize both probabilities (P and Q), requested by the previous two questions, to a sequence of length L that (after time t ) is identical to the ancestral sequence at N positions, and different at L-N positions. Find the value t that maximizes this probability (if such a maximal value of t exists).
6. How (and why) are the assumptions behind the Jukes-Cantor model violated by real data?
5. What is Wikipedia:BLAST ?
• Describe the following steps/data structures in the BLAST algorithm: (a) low-complexity masking (the SEG and DUST programs), (b) word-matching, (c) high-scoring segment pairs.
• What is the Wikipedia:Extreme_value_distribution ? What is its relationship to BLAST?
• What kind of alignments can BLAST detect, and what will it miss?

-- IanHolmes - 30 Oct 2010

 Actions:
 Teaching.AlignmentAndPhylogenyExercisesr10 - 2010-11-18 - 19:41:37 - IanHolmes Biowiki content is in the public domain. Comments on this site? AlignmentAndPhylogenyExercises">Send feedback