Graduate Students: Yuri Bendana, Sharon Chao, Karsten Temme

BendanaChaoTemmeGraduateProjectReport

Motivation

Different regions of the genome evolve under different patterns, and these differences can be used to help identify genes. Evolutionary Hidden Markov Models (EHMMs) capitalize on this property by modelling both gene structure and evolutionary pattern [Pedersen and Hein, 2003]. EHMMs are also particularly appropriate for gene finding because they can handle a variable number of sequences along the alignment. In addition, the emergence of new phylogenetic models allows new methods for applying evolutionary models to be adopted. David Haussler's context-dependent model of substitution, for example, is a recent advance that adjusts the substitution rate at each position according to its neighboring bases.

Summary

The evogene algorithm uses an EHMM, which comprises a probabilistic model of both gene structure and sequence evolution. It is composed of an HMM and region-specific evolutionary models based on a phylogenetic tree. Given the phylogenetic tree and an alignment, the model predicts genes by finding the maximum a posteriori (MAP) estimate using the Viterbi algorithm [Pedersen and Hein, 2003].
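The Viterbi step can be illustrated with a minimal sketch in log space. This is not evogene's code: the two states ("N" for non-coding, "E" for exon), the transition probabilities, and the per-column emission values are all hypothetical. In an EHMM the emission value for a state and column would come from the state's evolutionary model and the tree rather than from a fixed table.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Return (most probable state path, its log probability).

    log_init[s]     : log probability of starting in state s
    log_trans[s][t] : log probability of moving from state s to state t
    log_emit[i][s]  : log likelihood of alignment column i under state s
    """
    states = list(log_init)
    score = {s: log_init[s] + log_emit[0][s] for s in states}
    back = []  # back[i][t] = best predecessor of state t at column i + 1
    for col in log_emit[1:]:
        prev = score
        pointers, score = {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + log_trans[s][t])
            pointers[t] = best
            score[t] = prev[best] + log_trans[best][t] + col[t]
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

# Hypothetical two-state model and five alignment columns.
log = math.log
log_init = {"N": log(0.5), "E": log(0.5)}
log_trans = {"N": {"N": log(0.9), "E": log(0.1)},
             "E": {"N": log(0.1), "E": log(0.9)}}
column_likelihoods = [{"N": 0.9, "E": 0.1}, {"N": 0.9, "E": 0.1},
                      {"N": 0.1, "E": 0.9}, {"N": 0.1, "E": 0.9},
                      {"N": 0.1, "E": 0.9}]
log_emit = [{s: log(p) for s, p in col.items()} for col in column_likelihoods]

path, log_prob = viterbi(log_init, log_trans, log_emit)
print(path)
```

Working in log space avoids numerical underflow on long alignments, where the product of many small probabilities would otherwise round to zero.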

Methodology

We used the xgram program from the DART software package [Holmes, 2005] to reproduce the functionality of evogene. xgram allowed us to use stochastic context-free grammars (SCFGs) in conjunction with evolutionary information to identify intronic regions, their splice acceptor and donor sites, frameshift mutations, and codon annotations including the start and stop codons. We generated a dataset of eight genes using the same human and mouse orthologous genes that Pedersen and Hein used to evaluate their algorithm. Evogene currently models the following gene structural elements: start/stop codons, exons, introns, and splice donor/acceptor sites.

References

  • Holmes, 2005. DART software.

Presentation

EHMMs

  • State-specific continuous-time Markov chain E(x), characterized by instantaneous rate matrix R(x)
  • Phylogenetic tree T
  • Probability of alignment column in state x = probability of observing character pattern c on the leaves of the tree: e_x(c) = P(c | E(x), T)
  • Dynamic programming pruning algorithm used to calculate the likelihood as a sum over all possible configurations of the unknown inner nodes of the tree
  • Evogene: train with an alignment, annotation, and tree; search on one sequence (providing a tree) or on a multiple sequence alignment (MSA)
  • Pedersen figure for HMM
  • Baum-Welch and Powell's method used for parameter estimation
  • Viterbi used to give a MAP estimate of predicted genes
  • Gaps treated as missing data
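The emission probability e_x(c) = P(c | E(x), T) and the pruning recursion above can be sketched as follows. This is an illustration under stated assumptions, not the evogene or xgram implementation: the Jukes-Cantor model stands in for a state-specific rate matrix R(x) because its transition probabilities have a closed form, the root distribution is uniform, and the tree encoding, branch lengths, and function names are hypothetical.

```python
import math

BASES = "ACGT"

def jc69_prob(a, b, t):
    """Jukes-Cantor probability of base b given base a after branch length t
    (t in expected substitutions per site). Stand-in for exp(R(x) * t)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def partial_likelihoods(node, column):
    """Return {base: P(observed leaves below node | node has this base)}.

    A node is a leaf name (str) or a tuple of (child, branch_length) pairs.
    `column` maps leaf names to the character observed in one alignment column.
    """
    if isinstance(node, str):
        observed = column.get(node, "-")
        if observed not in BASES:
            # Gap or unknown character: missing data, compatible with any base.
            return {b: 1.0 for b in BASES}
        return {b: 1.0 if b == observed else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial_likelihoods(child, column)
        for a in BASES:
            # Sum over the unknown base at the lower end of this branch.
            result[a] *= sum(jc69_prob(a, b, t) * below[b] for b in BASES)
    return result

def column_likelihood(tree, column):
    """P(column | model, tree) with a uniform distribution at the root."""
    root = partial_likelihoods(tree, column)
    return sum(0.25 * root[b] for b in BASES)

# Hypothetical trees: ((mouse, rat), human) and (human, mouse).
tree3 = (((("mouse", 0.1), ("rat", 0.1)), 0.2), ("human", 0.3))
tree2 = (("human", 0.1), ("mouse", 0.3))

print(column_likelihood(tree3, {"human": "A", "mouse": "A", "rat": "G"}))
print(column_likelihood(tree3, {"human": "A", "mouse": "-", "rat": "G"}))
```

Treating a gap as a vector of ones at the leaf is what makes it "missing data": the gapped sequence contributes no evidence, so the same tree can score columns in which only some of the sequences are present.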

Main.DART

  • xgram
  • SCFGs specified in Lisp
  • Markov chain which includes the initial probabilities and transition rate matrix and defines the list of pseudoterminals
  • Null, Emit, and Bifurcation states
  • xfold and xprot have built-in grammars for RNA and proteins

Testing methods

  • Verify mouse/human alignment vs evogene
  • mreB: no introns, no intergenic regions
  • mreB + intergenic regions
  • actin + introns
  • actin + introns + intergenic flanking regions

Grammar extensions

  • UTR modeling
  • promoter modeling
  • Intron splice sites
  • If time permits, frameshift modeling

Gene finding

  • If we train on a single gene with alignments from multiple species, our grammar becomes specialized for finding a similar gene in a new species.
  • Conversely, if we train on many genes from a single species, our grammar is tailored to locating new genes in the same species.
  • Training on many genes aligned over many species would, we hope, yield a general-purpose gene-finding algorithm.

To see old changes: BendanaChaoJaworskiTemmeGraduateProject

Attachment | Size | Date | Who | Comment
2Alignments.zip | 34.3 K | 2005-12-19 08:05 | TWikiGuest | Human and mouse alignment for comparison with human, mouse and rat
8Genes.zip | 1201.8 K | 2005-12-19 07:34 | TWikiGuest | Alignments, annotations, etc for 8 genes (MUSACASA is in ACTA1)
Creatingdatasets.txt | 1.9 K | 2005-12-21 20:23 | TWikiGuest | Updated to include gff2ps and sto2gff.pl
Datasets.zip | 473.9 K | 2005-12-18 03:37 | TWikiGuest | COX6A2 and OXT genes with codon annotations, scripts included
FinalPresentation-1.ppt | 428.0 K | 2005-12-07 18:58 | TWikiGuest | Last notes from Karsten
FinalPresentation.ppt | 401.0 K | 2005-12-07 18:25 | TWikiGuest | Minor mod to Evogene slide [YB]
Training_and_testing.zip | 233.5 K | 2005-12-19 08:04 | TWikiGuest | Some training and testing data by KT
Two_gene_results.zip | 64.7 K | 2005-12-18 05:49 | TWikiGuest | Training and testing over COX6A2 and OXT
batch_sre.pl.txt | 1.0 K | 2005-12-15 17:11 | TWikiGuest | Batch conversion of phylip to stockholm format via sreformat
codonframeshift.eg.txt | 223.0 K | 2005-12-19 06:30 | TWikiGuest | RNA model with frameshift
codonintron.eg.txt | 224.2 K | 2005-12-19 05:18 | TWikiGuest | Codon model with Intron splice acceptor and donor sites
codonstartstop.eg.txt | 224.3 K | 2005-12-19 06:19 | TWikiGuest | Codon model with start and stop codons
evogene_data.zip | 781.6 K | 2005-12-15 09:16 | TWikiGuest | Evogene data converted to Stockholm (PFam) format
gff2ps.rc | 0.6 K | 2005-12-21 01:43 | TWikiGuest | gff2ps custom parameters
hsu66875.gff | 0.9 K | 2005-12-16 22:13 | TWikiGuest | GFF annotation
hsu66875_mmu63716_gapExtr.stock | 3.1 K | 2005-12-16 22:12 | TWikiGuest | Human seq as reference with 2 introns
humsap01.gff | 0.9 K | 2005-12-16 22:09 | TWikiGuest | GFF annotation
humsap01_mussaprb_gapExtr.stock | 2.8 K | 2005-12-16 22:08 | TWikiGuest | Human seq as reference with 1 intron
musoxyneui.gff | 0.9 K | 2005-12-16 22:21 | TWikiGuest | GFF annotation
musoxyneui_humotnpi_gapExtr.stock | 4.0 K | 2005-12-16 22:21 | TWikiGuest | Mouse seq as reference with 2 intron
mussaprb.gff | 0.4 K | 2005-12-16 22:17 | TWikiGuest | GFF annotation
mussaprb_humsap01_gapExtr.stock | 2.7 K | 2005-12-16 22:16 | TWikiGuest | Mouse seq as reference with 1 intron
queryPandit.pl.txt | 2.1 K | 2005-12-16 07:30 | TWikiGuest | Update to print sequence LNK lines.
sto2gff.pl.txt | 6.7 K | 2005-12-21 01:42 | TWikiGuest | Updates to work with codon-only alignment
stoAnnot.pl.txt | 7.7 K | 2005-12-19 02:46 | TWikiGuest | Slight correction to error check of inputs