Bendana Chao Jaworski Temme Graduate Project

From Biowiki
Jump to: navigation, search

Graduate Students: Yuri Bendana, Sharon Chao, Karsten Temme


The diverse evolutionary patterns of different regions in the genome can be used to help identify genes. Evolutionary Hidden Markov Model (EHMM) capitalize on this property by modelling both gene structure and evolutionary pattern [Pederson and Hein, 2003]. Furthermore,EHMMs are particularly appropriate for gene finding with their ability to handle a variable number of sequences along the alignment. In addition, the emergence of cutting edge phylogenetic models also allows for the adoption of new methods for applying evolutionary models. David Haussler's context dependent model of substitution, for example, is a recent advance that adjusts substitution rates for each position according to its neighboring bases.


The evogene algorithm uses EHMM, which comprises a probabilistic model for both gene structure and sequence evolution. It is composed of an HMM and region specific evolutionary models based on a phylogenetic tree. Given the phylogenetic tree and an alignment, our model will then predict genes by finding the maximum a posterior estimate using Viterbi's algorithm [Pederson and Hein, 2003].

Proposed Methodology

Our aim is to use the xgram program from the DART software package [Holmes, 2005] to reproduce the functionality of evogene. xgram will allow us to use stochastic context free grammars in conjunction with HMMs for gene finding. We will compare our program using the same human and mouse ortholog genes that Pederson and Hein used to evaluate their algorithm. evogene currently models the following gene structural elements: start/end codon, exon, intron and splice donor/acceptor sites. If time permits, we will add additional states to the HMM to model another gene structural element, for example, a ribosome binding site or promoter region. This increased model complexity should result in improved sensitivity of gene identification, as evaluated on the house/mouse genes.


  • Holmes, 2005. DART software.



  • State-specific continuous Markov chain E(x) characterized by instantaneuous rate matrix R(x)
  • Phylogenetic tree T
  • Probability of alignment column in state x = probability of observing character pattern c on the leaves of the tree

e sub x (c) = P(c|E(x), T)

  • Dynamic programming pruning algorithm used to calculate the likelihood as a sum over all possible configurations of the unknown inner nodes of the tree
  • Evogene: train with alignment, annotation and tree; search on one sequence providing a tree, or an msa
  • Pederson figure for HMM
  • Baum-Welch, Powell used for parameter estimation
  • Viterbi used to give a MAP estimate of predicted genes
  • Gaps treated as missing data


  • xgram
  • SCFGs specified in Lisp
  • Markov chain which includes the initial probabilities and transition rate matrix and defines the list of pseudoterminals
  • Null, Emit, and Bifurcation states
  • xfold and xprot have built-in grammars for RNA and proteins

Testing methods

  • Verify mouse/human alignment vs evogene
  • mreB: no introns, no intergenic regions
  • mreB + intergenic regions
  • actin + introns
  • actin + introns + intergenic flanking regions

Grammar extensions

  • UTR modeling
  • promoter modeling
  • Intron splice sites
  • If time permits, frameshift

Gene finding

  • If we train on a single gene with alignments from multiple species, our grammar becomes specialized for searching for a similar gene in a new species.
  • On the contrary, if we train on many genes from a single species, our grammar is tailored for locating new genes in the same species.
  • The combination of many genes aligned over many species would hopefully generate a global genefinding algorithm.
  • Presentation.ppt: Rough draft of presentation