Soren Mork

  • Name: Søren Mørk (a.k.a. Soeren Moerk/Soren Mork), first name as in Søren Kierkegaard, last name as in the Danish word for "Dark".
  • Email: soer@ruc.dk
  • [http://akira.ruc.dk/~soer/ Homepage]

I'm a Ph.D. student visiting from Roskilde University in Denmark. I work on biological sequence analysis using probabilistic logic programming as part of the LoSt project.

Being a stamp collector (in the Rutherfordian way) suffering from serious physics envy, I have always been attracted to the fundamental side of how Life works.

Originally a molecular/evolutionary biologist, I did my master's on the evolutionary conservation of alternative spliceform expression, but over the years I have poked my nose into various topics such as astrobiology, the origin of life, ancient DNA, etc.

Now, a converted bioinformatician, I have left my math anxiety somewhat behind and surrendered myself to computational/statistical modeling of the molecular evolution of biological sequences.

Bioinformatics/computational biology is the epitome of the two most important trends in present-day science: (1) the explosive growth in the amount of data available (the data deluge) and (2) the eerie convergence of the underlying theories used to describe various previously unconnected phenomena. This is great stuff! You can use logic, set theory, recursion theory, graph theory, probability theory and game theory, plus methods from Artificial Intelligence, Statistical Physics, Linguistics, Neuroscience, Robotics, Operations Research, etc., and it really works quite well for living stuff too!

To my mind, molecular evolution is a fitting biological analog of quantum-relativity, the union of the smallest and the biggest into one grand unified theory of Life! And it's all about strings, too! Although biostrings are somewhat different from superstrings.

DNA, RNA and proteins are sequential strings, enabling (1) information transfer in the linear phase and (2) catalysis/4D-interactions in the folded phase, the two hallmarks of self-reproducibility that characterize Life.

Statistical models of various aspects of biological sequences are becoming well established. Current efforts focus on making these models more realistic by relaxing some of the simplifying assumptions (some sort of iid model for starters) that were adopted to get nice, rigorous, quantitative models in the first place, without losing their explanatory power.

A few examples include the incorporation of time-irreversible models of nucleotide substitution, the modeling of indel processes on phylogenetic trees, and the use of variable-order Markov chains or stochastic context-free grammars in gene finding. All of these expand the dependencies of a single site, through time (evolution) and space ("neighboring" sites).
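As a rough illustration of what relaxing the iid assumption means in the "space" direction, here is a minimal Python sketch that scores a toy sequence under an iid model and under a first-order Markov chain, where each site depends on its left neighbor. The function names and the toy sequence are made up for illustration, and the parameters are simply maximum-likelihood estimates from the sequence itself. Note that the Markov score conditions on the first symbol, so the two numbers are not strictly comparable.

```python
import math
from collections import Counter


def iid_loglik(seq):
    """Log-likelihood under an iid model with ML base frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return sum(c * math.log(c / n) for c in counts.values())


def markov1_loglik(seq):
    """Log-likelihood under a first-order Markov chain with ML transition
    probabilities, conditioning on the first symbol."""
    pair_counts = Counter(zip(seq, seq[1:]))   # counts of (previous, current)
    prev_counts = Counter(seq[:-1])            # how often each symbol is a predecessor
    return sum(
        c * math.log(c / prev_counts[prev])
        for (prev, _), c in pair_counts.items()
    )


seq = "ACGCGCGTATATATACGCGCG"  # toy sequence, not real data
print("iid:      ", iid_loglik(seq))
print("1st order:", markov1_loglik(seq))
```

Higher-order or variable-order chains extend the same idea by conditioning on longer (or context-dependent) stretches of preceding sites.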

Ideally, probabilistic models of biological sequences should be able to perform simultaneous alignment, phylogeny and structure/function decoding, as these different aspects of biological sequences are truly intermingled.

Taken separately, the main parts of this grand-scale problem are known as:

  • the alignment problem,
  • the phylogeny problem,
  • the decoding problem,
  • the reconstruction problem,
  • the annotation problem, and
  • the folding problem.

And if doing all of this at once isn't enough, a lot of (most?) biological sequences do more than one thing at a time, i.e. a single residue can be subject to constraints from binding motifs in its DNA sequence, RNA structure constraints and protein structure constraints (with different spliceform functions), all of which need to be taken into account even when dealing with a single site.

Understandably, most efforts tackle each of these problems separately, fixing the other problems for the time being.

However, by combining different models on the dependencies of single sites it should be possible to construct more realistic models, to examine the interplay between the different functions, as well as to perform better classification/prediction of various biological sequences.

Motivating example: There has been a major surge of interest in RNAs in recent decades, sparked by the discovery of catalytic RNAs, the RNA world hypothesis for the origin of Life, the fact that RNAs undergo a range of post-transcriptional modifications (such as alternative splicing) that vastly enlarge the transcriptome, and the widespread functions and importance of various regulatory RNAs.

Yet perhaps the best-known group of RNAs, the mRNAs, are rarely treated as RNAs. Instead, almost all models of mRNA deal exclusively with their protein-coding potential, effectively modeling them only as mirages of the protein structure/function they encode.

My current focus is on modeling overlapping functions (e.g. RNA structure + protein coding of mRNAs) using factorial HMMs & factorial grammars formulated via probabilistic logic programming. Probabilistic logic programming languages (e.g. PRISM) combine the strength of declarative programming with probabilistic inference. This allows one to "easily" formulate and (almost) freely combine various discrete probabilistic models, spanning probabilistic graphical models, stochastic grammars and stochastic relational models, at the small cost of using excessive amounts of memory.
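To make the factorial HMM idea a bit more concrete, here is a minimal sketch in plain Python (not PRISM, and not the actual models used in this work). It assumes two hidden chains that evolve independently, a hypothetical "coding" chain A and a hypothetical "structure" chain B, which jointly determine the emission at each site. The forward algorithm is simply run over the product state space; all state names and probabilities below are invented toy values.

```python
from itertools import product


def factorial_forward(obs, init_a, trans_a, init_b, trans_b, emit):
    """Likelihood of obs under a two-chain factorial HMM.

    Chains A and B have independent transitions; emit[(a, b)][symbol]
    is the only place where the two chains are coupled.
    """
    states = list(product(init_a, init_b))
    # Initialise with the product of the two initial distributions.
    fwd = {
        (a, b): init_a[a] * init_b[b] * emit[(a, b)][obs[0]]
        for a, b in states
    }
    for sym in obs[1:]:
        # The joint transition factorises into one term per chain.
        fwd = {
            (a2, b2): emit[(a2, b2)][sym] * sum(
                fwd[(a1, b1)] * trans_a[a1][a2] * trans_b[b1][b2]
                for a1, b1 in states
            )
            for a2, b2 in states
        }
    return sum(fwd.values())


# Toy parameters (assumptions for illustration only).
init_a = {"coding": 0.5, "noncoding": 0.5}
trans_a = {"coding": {"coding": 0.9, "noncoding": 0.1},
           "noncoding": {"coding": 0.1, "noncoding": 0.9}}
init_b = {"paired": 0.5, "unpaired": 0.5}
trans_b = {"paired": {"paired": 0.8, "unpaired": 0.2},
           "unpaired": {"paired": 0.2, "unpaired": 0.8}}
emit = {
    ("coding", "paired"): {"A": 0.1, "C": 0.4, "G": 0.4, "U": 0.1},
    ("coding", "unpaired"): {"A": 0.3, "C": 0.2, "G": 0.2, "U": 0.3},
    ("noncoding", "paired"): {"A": 0.2, "C": 0.3, "G": 0.3, "U": 0.2},
    ("noncoding", "unpaired"): {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25},
}

print(factorial_forward("GCGCAUAU", init_a, trans_a, init_b, trans_b, emit))
```

The product state space grows multiplicatively with every added chain, which gives a feel for where the memory cost comes from once several such models are combined.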

On my list of things to do (some of it done, some of it possibly impossible) is to implement (and utilize) as probabilistic logic programs: HMMs, pair-HMMs, transducers, SCFGs, pair-SCFGs, transducer SCFGs, Bayesian Networks, factorial HMMs, factorial SCFGs, higher-order HMMs/SCFGs/transducers, and all sorts of combinations of different models nested within other models etc., most of which will probably be intractable and/or scale like a bitch.

PRISM has a built-in EM algorithm that runs on explanation graphs of probabilistic logic programs, meaning that it is possible to learn from almost any model formulated as a logic program (given enough RAM and a few sometimes important conditions), i.e. most (if not all) of the models mentioned above.

Other probabilistic logic programming languages support MCMC, with some nice properties for proposal mechanisms (just go back up the proof tree and down again) that I'll explore in some detail if EM does not do the trick.

People I hang around using my characteristic walk and stalk tactics include:

  • Henning Christiansen
  • Ian Holmes
  • John Huelsenbeck
  • Bjarne Knudsen
  • Anders Krogh
  • Jakob Skou Pedersen
  • Kasper Munch
  • Rasmus Nielsen
  • Taisuke Sato
  • Ole Skovgaard
  • Ole Torp Lassen
  • Eske Willerslev

Being the father of 3 (aged 1, 3 and 6), my main hobbies are: laundry, cleaning, cooking and grocery shopping.

And yes, my brother really is the Ponk & Roll musician Thomas Mœrk...

