click on the Biowiki logo to go to homepage
Edit Raw Print
Links Diffs RSS
About Stats Recent


Research Teaching Blog
Fall09 | Sandbox
Biowiki > Teaching > BioE241SubstitutionLab

Search

Advanced search...

Topics

PageRank Checker

Substitution model lab

Model

A point substitution model with uniform base frequencies that distinguishes between transitions (purine-to-purine or pyrimidine-to-pyrimidine substitutions) and transversions (all other nucleotide substitutions).

The rate matrix for this model, over the alphabet (A,C,G,T), is R=\left(\begin{array}{llll}-K&B&A&B\\B&-K&B&A\\A&B&-K&B\\B&A&B&-K\end{array}\right) where K=A+2B.

(Note that this matrix does not obey the "standard" time scaling of having one substitution per unit of time, i.e. K=1)

Goals

Attempt the following:
  • write code to...
    • simulate pairwise alignment from model given (A,B,L,T)
    • estimate T given pairwise alignment and (A,B,L)
  • investigate accuracy of T-estimation over the following parameter range
    • A = 1
    • for B in (0.1, 0.5, 1)
    • for L = 10 to 100 step 10
    • for T = .1 to 2 step .1

Approaches

There are several different approaches to solving the problem posed:
  1. use an existing implementation of simulation and inference under this model, such as the XRATE software in the DART package. While this does not in itself constitute a 100% complete assignment (since the idea is to actually write your own program, even if you use external libraries to do the heavy lifting) this is still a good independent test of any other code that you write, and a natural first step.
    • Kimura's two-parameter model is implemented as the xrate-format grammar file DartGrammar:kimura2.eg
    • the SimGram program can be used to sample alignments from the probabilistic model in a given grammar file, for a given phylogeny
    • you can then use xrate's -e option to estimate the branch lengths of the tree, which (if the tree contains only one branch of length T) amounts to fitting the maximum likelihood value of T
    • you will need to install dart, of course, and it's probably useful to be able to do some sort of scripting (Perl, shell scripting, etc.)
  2. the most generalizable (for this course) solution here is to implement a general matrix exponential function, by hooking into an appropriate library that either implements exp(A) directly for a matrix A, or implements other standard linear algebra routines (such as diagonalization algorithms) from which a matrix exponential can be constructed.
    • a C++ library that implements the requisite linalg routines is NewMat. This library is included with DART (and augmented to include matrix exponential)
    • there are other such libraries e.g. LAPACK (Wikipedia) (some of Dart's matrix handlers are derived from this rather than Newmat)
  3. an alternate approach which may be simpler (depending on your perspective), and is certainly a good independent test, is to solve exp(R*t) analytically for Kimura model's rate matrix R
    • one way to do this is to solve the characteristic polynomial (Wikipedia) to find the eigenvalues, then solve for the eigenvectors
  4. the simulation can use your matrix exponential function too; an alternative is to simulate continuous-time Markov chain trajectories by sampling the wait time to the next change in state (from an exponential distribution, e.g. by Inverse transform sampling (Wikipedia) )
    • this is a useful technique that you'll probably use at some time or other, so worth thinking through at this stage

-- IanHolmes - 25 Jan 2010

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...