- What are the memory and runtime requirements (in big-O notation) of each of the following tasks, on a dataset of N sequences each of length L:
- simultaneous multiple sequence alignment multidimensional dynamic programming;
- progressive alignment of the same set of sequences using a guide tree that is already known;
- estimation of a guide tree from a distance matrix, using an algorithm like UPGMA or neighbor-joining, as a precursor to progressive alignment;
- estimation of a distance matrix by aligning every pair of sequences and then using the Jukes-Cantor distance.

- What is the difference between a linear gap penalty & an affine gap penalty? Which one produces more realistic alignments, and why?
- An individual column of a multiple alignment contains nucleotides of type k (so adenines, cytosines, guanines and thymines).
- What is the probability that a randomly sampled column would have these counts, assuming a uniform distribution over nucleotides, with each row of the alignment being sampled independently?
- What is the probability if one assumes a 60% GC content, instead of a uniform distribution?
- What about arbitrary probabilities ?
- What values of would maximize this probability?
- Can you prove this? ( Hint: use Wikipedia:Lagrange_multipliers )

- What is the maximal value of the likelihood?
- How does this relate to the entropy of the column?

- What values of would maximize this probability?
- How do these results relate to the Wikipedia:Multinomial_distribution ?

- The following question assumes the Jukes-Cantor model for sequence evolution.
- Describe the assumptions behind this model.
- Start with a sequence of 5 nucleotides and let the model run for a very long (near-infinite) time. What is the probability that, after this time, the sequence is
`AAGTC`

? - Reset the clock, start the sequence at
`AAGTC`

and let the model run for another*t*units of evolutionary time (calibrated to one expected**substitution**per site, per unit of time; NB a substitution here means an observable difference, rather than a replacement event which may, or may not, be observable). What is the probability, P, that, after time*t*and given that the sequence started at`AAGTC`

, it ends up as`AGGGC`

? - Let's phrase that last question slightly differently. What is the probability, Q, that, after time
*t*, the starting and ending sequences were identical at 3 of their 5 nucleotide positions, and different at the remaining 2 positions? Why is this probability different from the probability (P), requested by the previous question? - Generalize both probabilities (P and Q), requested by the previous two questions, to a sequence of length
*L*that (after time*t*) is identical to the ancestral sequence at*N*positions, and different at*L-N*positions. Find the value*t*that maximizes this probability (if such a maximal value of*t*exists). - How (and why) are the assumptions behind the Jukes-Cantor model violated by real data?

- What is Wikipedia:BLAST ?
- Describe the following steps/data structures in the BLAST algorithm: (a) low-complexity masking (the SEG and DUST programs), (b) word-matching, (c) high-scoring segment pairs.
- What is the Wikipedia:Extreme_value_distribution ? What is its relationship to BLAST?
- What kind of alignments can BLAST detect, and what will it miss?

-- IanHolmes - 30 Oct 2010

Edit | Attach | Print version | History: r10 < r9 < r8 < r7 < r6 | Backlinks | Raw View | Raw edit | More topic actions

Copyright © 2008-2014 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding TWiki? Send feedback

Ideas, requests, problems regarding TWiki? Send feedback

TWiki Appliance - Powered by TurnKey Linux