
 Try to answer the following questions:
 What are the memory and runtime requirements (in bigO notation) of each of the following tasks, on a dataset of N sequences each of length L:
 simultaneous multiple sequence alignment multidimensional dynamic programming;
 progressive alignment of the same set of sequences using a guide tree that is already known;
 estimation of a guide tree from a distance matrix, using an algorithm like UPGMA or neighborjoining, as a precursor to progressive alignment;
 estimation of a distance matrix by aligning every pair of sequences and then using the JukesCantor distance.
 What is the difference between a linear gap penalty & an affine gap penalty? Which one produces more realistic alignments, and why?
 An individual column of a multiple alignment contains nucleotides of type k (so adenines, cytosines, guanines and thymines).
 What is the probability that a randomly sampled column would have these counts, assuming a uniform distribution over nucleotides, with each row of the alignment being sampled independently?
 What is the probability if one assumes a 60% GC content, instead of a uniform distribution?
 What about arbitrary probabilities ?
 What values of would maximize this probability?
 What is the maximal value of the likelihood?
 How does this relate to the entropy of the column?
 How do these results relate to the Wikipedia:Multinomial_distribution ?
 The following question assumes the JukesCantor model for sequence evolution.
 Describe the assumptions behind this model.
 Start with a sequence of 5 nucleotides and let the model run for a very long (nearinfinite) time. What is the probability that, after this time, the sequence is
AAGTC ?
 Reset the clock, start the sequence at
AAGTC and let the model run for another t units of evolutionary time (calibrated to one expected substitution per site, per unit of time; NB a substitution here means an observable difference, rather than a replacement event which may, or may not, be observable). What is the probability, P, that, after time t and given that the sequence started at AAGTC , it ends up as AGGGC ?
 Let's phrase that last question slightly differently. What is the probability, Q, that, after time t , the starting and ending sequences were identical at 3 of their 5 nucleotide positions, and different at the remaining 2 positions? Why is this probability different from the probability (P), requested by the previous question?
 Generalize both probabilities (P and Q), requested by the previous two questions, to a sequence of length L that (after time t ) is identical to the ancestral sequence at N positions, and different at LN positions. Find the value t that maximizes this probability (if such a maximal value of t exists).
 How (and why) are the assumptions behind the JukesCantor model violated by real data?
 What is Wikipedia:BLAST ?
 Describe the following steps/data structures in the BLAST algorithm: (a) lowcomplexity masking (the SEG and DUST programs), (b) wordmatching, (c) highscoring segment pairs.
 What is the Wikipedia:Extreme_value_distribution ? What is its relationship to BLAST?
 What kind of alignments can BLAST detect, and what will it miss?
 IanHolmes  30 Oct 2010 