Graduate Students: Yuri Bendana, Sharon Chao, Karsten Temme
BendanaChaoTemmeGraduateProjectReport
Motivation
The diverse evolutionary patterns of different regions in the genome can be used to help identify genes. Evolutionary Hidden Markov Model (EHMM) capitalize on this property by modelling both gene structure and evolutionary pattern [Pederson and Hein, 2003]. Furthermore,EHMMs are particularly appropriate for gene finding with their ability to handle a variable number of sequences along the alignment. In addition, the emergence of cutting edge phylogenetic models also allows for the adoption of new methods for applying evolutionary models. David Haussler's context dependent model of substitution, for example, is a recent advance that adjusts substitution rates for each position according to its neighboring bases.
Summary
The evogene algorithm uses EHMM, which comprises a probabilistic model for both gene structure and sequence evolution. It is composed of an HMM and region specific evolutionary models based on a phylogenetic tree. Given the phylogenetic tree and an alignment, our model will then predict genes by finding the maximum a posterior estimate using Viterbi's algorithm [Pederson and Hein, 2003].
Methodology
We used the
xgram program from the
DART software package [Holmes, 2005] to reproduce the functionality of evogene. xgram allowed us to use stochastic context free grammars in conjunction with evolutionary information for identifying intronic regions, their splice acceptor and donor sites, frameshift mutations, and codon annotations including the start and stop codons. We generated a dataset of eight genes using the same human and mouse ortholog genes that Pederson and Hein used to evaluate their algorithm. Evogene currently models the following gene structural elements: start/end codon, exon, intron and splice donor/acceptor sites.
References
- Holmes, 2005. DART software.
Presentation
EHMMs
- State-specific continuous Markov chain E(x) characterized by instantaneous rate matrix R(x)
- Phylogenetic tree T
- Probability of alignment column in state x = probability of observing character pattern c on the leaves of the tree e sub x (c) = P(c|E(x), T)
- Dynamic programming pruning algorithm used to calculate the likelihood as a sum over all possible configurations of the unknown inner nodes of the tree
- Evogene: train with alignment, annotation and tree; search on one sequence providing a tree, or an msa
- Pederson figure for HMM
- Baum-Welch, Powell used for parameter estimation
- Viterbi used to give a MAP estimate of predicted genes
- Gaps treated as missing data
Main.DART
- xgram
- SCFGs specified in Lisp
- Markov chain which includes the initial probabilities and transition rate matrix and defines the list of pseudoterminals
- Null, Emit, and Bifurcation states
- xfold and xprot have built-in grammars for RNA and proteins
Testing methods
- Verify mouse/human alignment vs evogene
- mreB: no introns, no intergenic regions
- mreB + intergenic regions
- actin + introns
- actin + introns + intergenic flanking regions
Grammar extensions
- UTR modeling
- promoter modeling
- Intron splice sites
- If time permits, frameshift
Gene finding
- If we train on a single gene with alignments from multiple species, our grammar becomes specialized for searching for a similar gene in a new species.
- On the contrary, if we train on many genes from a single species, our grammar is tailored for locating new genes in the same species.
- The combination of many genes aligned over many species would hopefully generate a global genefinding algorithm.
to see old changes: *
BendanaChaoJaworskiTemmeGraduateProject

Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback