Try to answer the following questions:
- What are the memory and runtime requirements (in big-O notation) of each of the following tasks, on a dataset of N sequences each of length L:
- simultaneous multiple sequence alignment multidimensional dynamic programming;
- progressive alignment of the same set of sequences using a guide tree that is already known;
- estimation of a guide tree from a distance matrix, using an algorithm like UPGMA or neighbor-joining, as a precursor to progressive alignment;
- estimation of a distance matrix by aligning every pair of sequences and then using the Jukes-Cantor distance.
- What is the difference between a linear gap penalty & an affine gap penalty? Which one produces more realistic alignments, and why?
- An individual column of a multiple alignment contains
nucleotides of type k (so
adenines,
cytosines,
guanines and
thymines).
- What is the probability that a randomly sampled column would have these counts, assuming a uniform distribution over nucleotides, with each row of the alignment being sampled independently?
- What is the probability if one assumes a 60% GC content, instead of a uniform distribution?
- What about arbitrary probabilities
?
- What values of
would maximize this probability?
- What is the maximal value of the likelihood?
- How does this relate to the entropy of the column?
- How do these results relate to the Wikipedia:Multinomial_distribution ?
- The following question assumes the Jukes-Cantor model for sequence evolution.
- Describe the assumptions behind this model.
- Start with a sequence of 5 nucleotides and let the model run for a very long (near-infinite) time. What is the probability that, after this time, the sequence is
AAGTC ?
- Reset the clock, start the sequence at
AAGTC and let the model run for another t units of evolutionary time (calibrated to one expected substitution per site, per unit of time; NB a substitution here means an observable difference, rather than a replacement event which may, or may not, be observable). What is the probability, P, that, after time t and given that the sequence started at AAGTC, it ends up as AGGGC ?
- Let's phrase that last question slightly differently. What is the probability, Q, that, after time t , the starting and ending sequences were identical at 3 of their 5 nucleotide positions, and different at the remaining 2 positions? Why is this probability different from the probability (P), requested by the previous question?
- Generalize both probabilities (P and Q), requested by the previous two questions, to a sequence of length L that (after time t ) is identical to the ancestral sequence at N positions, and different at L-N positions. Find the value t that maximizes this probability (if such a maximal value of t exists).
- How (and why) are the assumptions behind the Jukes-Cantor model violated by real data?
- What is Wikipedia:BLAST ?
- Describe the following steps/data structures in the BLAST algorithm: (a) low-complexity masking (the SEG and DUST programs), (b) word-matching, (c) high-scoring segment pairs.
- What is the Wikipedia:Extreme_value_distribution ? What is its relationship to BLAST?
- What kind of alignments can BLAST detect, and what will it miss?
--
IanHolmes - 30 Oct 2010

Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback