- Title: Probabilistic Modeling in Computational Biology
- Instructor: Ian Holmes
- Class: BioE 241
- When: Spring semester, 2010
- Lectures: Tue/Thurs 12:30-2, 105 Latimer
- Labs: Mon 12-3, 1171 Etcheverry
- Introduction and overview. Simulations of evolving DNA and RNA sequences and structures. Definition of the ancestral sequence reconstruction problem and domain (genes, genomes, RNA structures, families/pathways, phenotypic quantities). Some related problems in bioinformatics: phylogeny, alignment, annotation. Tools we will use, theoretical (ODEs, PDEs, matrices, graphs) and computational (various software packages).
- Continuous-time Markov chains. Finite & infinite state spaces, discrete & continuous times. Equilibrium, reversibility, matrix exponentiation. DNA substitution and indel models. Taylor series for matrix exponentiation.
- Eigenspaces. Solution of the matrix exponential by eigenvalue/eigenvector decomposition. Basic matrix theorems (real matrices have real or complex conjugate evals/evecs; in symmetric matrices they are real, and the evecs can be chosen to be orthonormal). Introduce notation for Markov chains on trees (graphical models), with a simple example (one root + two leaves). Briefly describe some common point substitution models: Jukes-Cantor (1969), Kimura (1980), Felsenstein (1981), Hasegawa-Kishino-Yano (1985), REV/IRREV, Dayhoff (1979).
- Pruning. Felsenstein pruning, likelihood calculation, Elston-Stewart peeling, posterior probability calculations; conceptualization as message-passing on factor graphs. Hardware acceleration (the Beagle library).
- HMMs. Isochores; HMMs; the Forward, Viterbi, Backward & Baum-Welch algorithms.
- EM. Definition of the Expectation Maximization algorithm; proof that it improves the likelihood at every step (via relative entropy, Jensen's inequality, Gibbs' inequality); the Neal-Hinton "variational free energy" view; and example applications (simple multinomial, mixture of Gaussians/K-means).
- Conjugate priors: gamma/Normal, multinomial/Dirichlet, binomial/beta. Dirichlet mixtures.
- Notable HMMs in bioinformatics.
- Alignment benchmarks and simulations.
- Phylo-HMMs and Pair-HMMs.
- EM algorithm for continuous-time Markov chains.
- MCMC. Basic ideas (movesets, ergodicity, reversibility, mixing, burn-in time), Metropolis-Hastings, Gibbs & reversible-jump approaches. The Lawrence, Liu et al sampler for ungapped motifs.
- Simulation. Gillespie algorithm, Yule process, coalescent simulations. MCMC over tree topologies; common moves. Data compression; Kraft's inequality, arithmetic coding.
- Structural EM. The algorithm of Friedman, Pupko et al applying structural EM to infer the topology of phylogenetic trees.
- The TKF91 model. Evolutionary grammars, symmetry properties (specifically, reversibility & homomorphic maps), the point indel model, the separation of an alignment into independently evolving zones, the TKF91 formulation in terms of "mortal links" & "immortal links", the birth-death process equations for the zone fates (and the ubiquity of birth-death processes in general), the geometric form of the solutions for the zone fate probabilities, the mapping of these probabilities onto a Pair HMM, and three different ways of arriving at the solutions: trial solution, Metzler arguments, generating functions.
- Sequence transducers. Formulation. Composition.
- Statistical alignment: TKF91, Holmes-Bruno, Redelings-Suchard, other samplers.
- Heuristic algorithms for indel reconstruction. Indelign, and analysis thereof. Ortheus.
- Breakpoint graphs. Haussler et al's "infinite sites" model.
- SCFGs. CYK. Inside-Outside. Pair SCFGs. Acceleration/approximation (i.e. the "constrained Sankoff" algorithm).
- Evolution of RNA. The TKF Structure Tree model as a set of recursively nested birth-death (TKF91) processes.
- Modeling multiple alignments of RNA and reconstructing ancient RNA. Tree transducers and composition algorithms. Analysis of grammars. Ortheus-like approaches.
- Viral genomes, genefinding, evolution and statistical alignment.
- Gene ontology; phylogenetic inference of gene annotations. SIFTER et al.
- Gene family evolution. Arvestad et al. Hahn et al.
- Fokker-Planck equation. A taste of continuous-state systems: the Kramers-Moyal expansion, the Fokker-Planck equation, and notable special cases (Liouville process, Brownian motion, Ornstein-Uhlenbeck process, continuous approximation to birth-death process). Application: Samoilov, Arkin et al's treatment of how noisy forward-enzyme concentrations perturb the Michaelis-Menten kinetics of the enzyme futile cycle.
All class materials (lecture slides and lab handouts) are available at github: https://github.com/ihh/bioe241
Primary texts (probabilistic modeling and bioinformatics)
- The MacKay Book. MacKay. Information Theory, Inference and Learning Algorithms. ISBN:0521642981
- Can be downloaded from here (copyright grants permission to view but not print)
Background on molecular evolution & algorithms for doing it:
For continuous-valued Markov processes, Gaussian & otherwise:
- Rasmussen and Williams. Gaussian Processes for Machine Learning. ISBN:026218253X
- The SpikesBook. Rieke, Warland, de Ruyter van Steveninck and Bialek. Spikes: Exploring the Neural Code. ISBN:0262681080
- An excellent introduction to information theory & Bayesian analysis in computational neuroscience.
- Eliasmith and Anderson. Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. ISBN:0262050714
- Matrix exponentiation
- Coalescent theory
- Chapter 1 of Hein, Schierup and Wiuf (2005). Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. ISBN:0198529961 (Amazon)
- Network evolution
- That Stanford ribo-happening video
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback