click on the Biowiki logo to go to homepage
Edit Raw Print
Links Diffs RSS
About Stats Recent


Research Teaching Blog
Fall09 | Sandbox
Biowiki > Teaching > BioE241PhyloHMMLab

Search

Advanced search...

Topics

PageRank Checker

Bio E 241 phylo-HMM lab

Model

  • A hidden Markov model (HMM) with N states, where each state emits columns of a multiple sequence alignment, according to the following generative model...
    • a phylogenetic tree, T
    • a Kimura 2-parameter substitution rate matrix (see BioE241 substitution lab). To begin with try A=.8 (transitions) and B=.1 (transversions)
    • a state-specific scaling on the rate matrix. These should span an "appropriate" set R of rate multipliers, e.g. R = { 0.2, 0.5, 1.0, 2.0, 5.0 }, where |R|=N (in this case N=5)
  • The transition probabilities of the HMM should be P for staying in the same state, and (1-P)/(N-1) for going into any other state.
    • Try P=0.9 to begin with
    • Strictly speaking this is not normalized because we have not included the transitions from the Start state, or to the End state. In practice, for a given alignment, we know exactly when the transitions to End will occur, as they are implied by the fixed length of the alignment. So as long as the transitions to End don't depend on which state you're transiting from, they can be omitted. (An extreme counterexample, where the End transitions do depend on the state, would be if some states had zero probability of transiting to End). The transitions from Start are more significant (since they define a prior over which state the first column of the alignment was emitted by); the obvious thing to do here would be to make that prior uniform (i.e. the transition probability from Start to any state is 1/N). If you want to put these transitions into the model explicitly, you can - it shouldn't affect the outcome of your inference

Note that this model is essentially similar to those described in the following papers (among others)

Goals

  • Write code to implement the Viterbi algorithm to parse a Stockholm-format multiple alignment of nucleotide sequences into regions of varying mutation rate.
  • Use your program to find (a) the Viterbi log-likelihoods, (b) the slowest-evolving regions of the following 18-species alignment to human chromosome 21: 18-species.stk
    • Note that the alignment contains both lower- and upper-case characters; you might want to put everything in either upper or lower case, for your purposes.
    • Note also that your program will need a tree. A suitable 28-species tree is linked below; however, note that the 18 species in the given alignment correspond to a subset of that tree; also, the names in the tree do not quite match up to those in the alignment.
    • Dealing with these sorts of data-scrubbing issues is part of the daily grind of bioinformatics. I leave it to you to figure out the most expedient way to deal with the above issues (hint: you can edit the tree, or edit the alignment...)
    • A more serious issue is that the tree lacks branch lengths. You can use xrate to estimate these, once you have a tree. Use the "-e" option with e.g. the Jukes-Cantor model
      • xrate's tree-estimation routine has two steps: it first estimates a tree topology using neighbor-joining, then it uses the EM algorithm to refine estimates of the branch lengths. If you supply an initial guess at the tree, xrate will skip the first step (topology estimation) and, instead, use the tree topology you supplied. This can save time and, even more importantly, can be used to constrain the estimated tree to a particular topology.
  • A few more notes on the data:

-- IanHolmes - 28 Feb 2010

Attachment sort Action Size Date Who Comment
18-species.stk manage 25.1 K 28 Feb 2010 - 11:43 IanHolmes 18-vertebrate genome alignment

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...