|
| Substitution model lab
Model
A point substitution model with uniform base frequencies that distinguishes between transitions (purine-to-purine or pyrimidine-to-pyrimidine substitutions) and transversions (all other nucleotide substitutions).
The rate matrix for this model, over the alphabet (A,C,G,T), is where K=A+2B.
(Note that this matrix does not obey the "standard" time scaling of having one substitution per unit of time, i.e. K=1)
Goals
Attempt the following:
- write code to...
- simulate pairwise alignment from model given (A,B,L,T)
- estimate T given pairwise alignment and (A,B,L)
- investigate accuracy of T-estimation over the following parameter range
- A = 1
- for B in (0.1, 0.5, 1)
- for L = 10 to 100 step 10
- for T = .1 to 2 step .1
Approaches
There are several different approaches to solving the problem posed:
- use an existing implementation of simulation and inference under this model, such as the XRATE software in the DART package. While this does not in itself constitute a 100% complete assignment (since the idea is to actually write your own program, even if you use external libraries to do the heavy lifting) this is still a good independent test of any other code that you write, and a natural first step.
- Kimura's two-parameter model is implemented as the xrate-format grammar file DartGrammar:kimura2.eg
- the SimGram program can be used to sample alignments from the probabilistic model in a given grammar file, for a given phylogeny
- you can then use xrate's -e option to estimate the branch lengths of the tree, which (if the tree contains only one branch of length T) amounts to fitting the maximum likelihood value of T
- you will need to install dart, of course, and it's probably useful to be able to do some sort of scripting (Perl, shell scripting, etc.)
- the most generalizable (for this course) solution here is to implement a general matrix exponential function, by hooking into an appropriate library that either implements exp(A) directly for a matrix A, or implements other standard linear algebra routines (such as diagonalization algorithms) from which a matrix exponential can be constructed.
- a C++ library that implements the requisite linalg routines is NewMat. This library is included with DART (and augmented to include matrix exponential)
- there are other such libraries e.g. LAPACK (Wikipedia) (some of Dart's matrix handlers are derived from this rather than Newmat)
- an alternate approach which may be simpler (depending on your perspective), and is certainly a good independent test, is to solve exp(R*t) analytically for Kimura model's rate matrix R
- one way to do this is to solve the characteristic polynomial (Wikipedia) to find the eigenvalues, then solve for the eigenvectors
- the simulation can use your matrix exponential function too; an alternative is to simulate continuous-time Markov chain trajectories by sampling the wait time to the next change in state (from an exponential distribution, e.g. by Inverse transform sampling (Wikipedia) )
- this is a useful technique that you'll probably use at some time or other, so worth thinking through at this stage
-- IanHolmes - 25 Jan 2010 |