Difference between revisions of "Xrate Tutorial"
(Imported from TWiki) 
m (Move page script moved page XrateTutorial to Xrate Tutorial: Rename from TWiki to MediaWiki style) 
(No difference)

Latest revision as of 23:43, 1 January 2017
Contents
 1 XRATE tutorial
 1.1 Overview
 1.2 Creating a JukesCantor model by hand
 1.3 Using macro loops to create a JukesCantor model
 1.4 Changing the alphabet from DNA to proteins
 1.5 Introducing rate variation with hidden classes (Phast Conslike model)
 1.6 Using Scheme function calls to model Gammadistributed rate classes (Phast Cons  animal style)
XRATE tutorial
Overview
This tutorial describes the basic ingredients of an XRATE grammar, and details how one can go from an ultrasimple ( JukesCantor model, all columns IID, defined by hand ) model to progressively more complex models (loopdefined JukesCantor, Phast Consstyle phyloHMM, Gammadistributed rates using loops+Scheme calls) with just a couple steps.
It should be noted that all of this is (in theory) covered on the Xrate Software and Xrate Format pages, with more detail provided on each of those pages. It is our (the developers/contributors/users of XRATE and DART in general) hope that this tutorial (and recent paper in submission) will make the modelspecifying capabilities of XRATE clear and accessible to a broad audience of phylogenetic enthusiasts.
Also, the following paper describes this material in greater detail:
Westesson & Holmes: Developing and applying heterogeneous phylogenetic models with XRate. PLoS ONE 2012;7:e36898.
Creating a JukesCantor model by hand
Let's take a look at the familiar JukesCantor model. This is possibly the simplest continuoustime Markov chain (CTMC) modeling the evolution of nucleotides: it states that all changes happen at equal rates. There is one rate parameter, u. We'll need a few more ingredients besides the rate parameter to define our grammar: production rules, an alphabet, and an equilibrium distribution.
Production rules define how alignment columns are created by the grammar: which CTMCs generate each column, and how are these correlated?. In this case, we're not (yet) modeling any sitetosite variation (e.g. the columns of our alignment are IID), so we'll use the simple production rule S > XS* , which simply means that each column is generated by the terminal "X". The other rules are trivial continuation: S*>S and ending: S>() rules that aren't superimportant right now. We'll also define our rate u to take the value .333.
(grammar (transform (from (S)) (to (X S*))) (transform (from (S*)) (to (S)) (prob 1)) (transform (from (S*)) (to ()) (prob 1)) (rate (u .3333))
Note that we start the grammar definition with <verb> (grammar </verb>: the grammar definition is everything which occurs between this and the closing ).
Next we'll define the equilibrium probability distribution over our alphabet characters and the mutation rates between them. The equilibrium (or "initial") distribution is the proportions of the characters we would expect at the root of the tree. If one assumes that observed sequences are taken from the equilibrium distribution, then estimating this is trivially easy  just compute the observed frequencies of the different characters. Here we set each to 0.25 for simplicity, even though this is known not to be the case in real genomes.
Mutation parameters describe the substitution rates between alphabet characters. Here we define each of the rates between different characters to be the value of the parameter u. There is no need to define the rate between a character and itself  XRATE will automatically normalize the rows of the substitution matrix (so that these "self mutations" are set to the negative sum of the remaining possible substitutions).
(chain (terminal (X)) ;; initial probability distribution (initial (state (a)) (prob .25)) (initial (state (c)) (prob .25)) (initial (state (g)) (prob .25)) (initial (state (t)) (prob .25)) ;; mutation rates (mutate (from (a)) (to (c)) (rate u)) (mutate (from (a)) (to (g)) (rate u)) (mutate (from (a)) (to (t)) (rate u)) (mutate (from (c)) (to (g)) (rate u)) (mutate (from (c)) (to (a)) (rate u)) (mutate (from (c)) (to (t)) (rate u)) (mutate (from (g)) (to (c)) (rate u)) (mutate (from (g)) (to (a)) (rate u)) (mutate (from (g)) (to (t)) (rate u)) (mutate (from (t)) (to (c)) (rate u)) (mutate (from (t)) (to (g)) (rate u)) (mutate (from (t)) (to (a)) (rate u)) ) ;; end chain );; end grammar
Lastly, we'll need to specify an alphabet to use. For now, let's use the standard DNA alphabet (ACGT), with the only ambiguous character being "*":
(alphabet (name DNA) (token (a c g t)) (wildcard *)) ;; end alphabet DNA
Using macro loops to create a JukesCantor model
That was pretty straightforward, but it was a bit of work typing in all those character combinations. We could have instead used XRATE's macro syntax to help us define the model. Essentially, the grammar format has two levels of automation/scripting built in  the macro preprocessor and Scheme extensions (both are described in detail at Xrate Format). The former allows for loops, conditionals, basic arithmetic, certain list operations, and more. These capabilities are enough to create an extremely wide range of possible models. To illustrate some of the features of macros and Scheme extensions, we'll build upon the JC69 model. First, let's recast the same model as above using a macro loop:
(&foreachtoken tok1 (&foreachtoken tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u)) ) ) )
This loops through each pair of tokens (the &foreachtoken construct is provided for us as part of the macro language), and if tok1 and tok2 are different, the mutation rate between them is set to u. The resulting grammar (e.g. the likelihood of an alignment under this grammar) would be the same as our initial direct enumeration, but loops allow for a much more compact definition. The full grammar is available at jukescantor.eg
Changing the alphabet from DNA to proteins
Along with compactness, macros enable a certain flexibility in the model. To illustrate this, say we wanted to define the same model (where all mutations are equally likely) but for protein (amino acid) sequences. This would be a serious undertaking if done manually  we'd have to specify rates between almost 200 pairs of characters. However, if we use the macro loop above, the definition stays the same length. Since we used the provided &foreachtoken loop construct, we'll visit whatever characters are defined in the (alphabet ...) clause, whether it be DNA or protein characters.
We will need to redefine the equilibrium distribution, which can either be done manually (since there are only 20 amino acids), or using another (&foreachtokens ...) loop. If we want to do it in a loop, we can combine the &TOKENS variable with the subtraction and division operators to set the equilibrium probability of each character equal to 1/(size of the alphabet) (e.g. a uniform distribution), like this:
(&foreachtoken tok1 (initial (state (tok1)) (prob (&div 1 (& &TOKENS 1)) )))
To change the alphabet to amino acids, we simply replace the DNA alphabet definition with this one:
(alphabet (name Protein) (token (a r n d c q e g h i l k m f p s t w y v)))
Suppose we want to model variation in substitution rate among columns (e.g. nonIID columns). We can do this in XRATE using a Phylo HMM  an HMM that emits phylogenetic alignment columns (the previous example was actually also an HMM, though not a very interesting one since there was only one state). The high level idea is that there is a hidden sequence of states (one for each column) that decides the rate at which each alignment column evolves. In using a model like this, we can either sum over all such state sequences (e.g. to account for this rate heterogeneity), or we may want to infer/predict what that state sequence was (e.g. detect regions evolving at different rates).
XRATE can help us do both of these tasks (plus more). First we'll need to define the ingredients of our model. We'll stick with the JukesCantorstyle model for now, and just build on top of it. Let's say we want to have three rate categories, corresponding to slow, normal, and fast evolutionary rates. We'll define three rate parameters r_1, r_2, r_3 , such that the substitution rates for each state's character evolution model will be scaled by this parameter (e.g. columns generated by state 1 will evolve at rate u*r_1 ).
We define the chain corresponding to state 1 like this: ( the chains corresponding to state 2 and 3 can be defined in the same way, substituting chain_2 and r_2 where appropriate):
(chain (terminal (chain_1)) (&foreachtoken tok1 (&foreachtoken tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*r_1)) ) ) ) ) (chain (terminal (chain_2)) (&foreachtoken tok1 (&foreachtoken tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*r_2)) ) ) ) ) (chain (terminal (chain_3)) (&foreachtoken tok1 (&foreachtoken tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*r_3)) ) ) ) )
Note that this is somewhat redundant  each of the chains has nearly the same definition. We could instead use a loop over integers to define the three chains more compactly. The construct (&foreachinteger VAR (START END) ... ) sets VAR equal to all integers from START to END one after another. We'll also need to make use of the concatenation operator: (&cat A B) results in AB (in order to coordinate the chain names and rate parameters). The same set of definitions as above, making use of loops, looks like this (about 1/3 as long):
(&foreachinteger state (1 3) (chain (terminal ((&cat chain_ state))) (&foreachtoken tok1 (&foreachtoken tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*(&cat r_ state))) ) ) ) )
Next we'll need to define the transitions between the hidden states. For simplicity's sake, we'll set all these transitions equal (making it less of an HMM and more of a mixture model). These could, however, be defined as probability parameters and we could ask XRATE to estimate them for us. We can define all the rules to have probability 1
Since there are 3 + 9 + 3 transitions to be defined (start > state_*, state_* > state_*, and state_* > () for * = 1,2, or 3), we'll again use macro loops over integers to create these definitions.
(&define oneThird 0.3333) ;; Each state has a transition from start (&foreachinteger state (1 3) (transform (from (start)) (to (state1)) (prob oneThird ))) ;; Each state can transition to any other state (&foreachinteger state (1 3) (&foreachinteger state2 (1 3) (transform (from (state)) (to (state2)) (prob oneThird ))))
Next we'll define how the emission terminals are related to the nonterminals. This is a trivial relation  state_1 emits chain_1, state_2 emits chain_2, etc. The way this is defined in XRATE language is like this:
(&foreachinteger state (1 3) ;; Each state emits a chain of the same name (transform (from (state))(to ((&cat chain_ state) (&cat state *)))))
The "*" is an indicator that the nonterminal has just emitted a column. We also need to define transitions to the 'end', or null state. This is so that the alignment generation process can end. In practice, the input alignment defines the possible lengths of nonterminal sequences, so we can leave this as a probability 1 event.
;; Each state can transition to end  we assign this prob 1 ;; since the alignment length directs when this transition occurs (transform (from ((&cat state *))) (to ()) (prob 1))
The full grammar (with a slightly different parametrization) can be found in the dart/grammars directory: The full grammar is available at conservation_phylohmm.eg
Using Scheme function calls to model Gammadistributed rate classes (Phast Cons  animal style)
One question you may have is "where did the rate parameters r_1, r_2, r_3 come from? Well, we could estimate them from data using XRATE's training/parameter estimation machinery. Alternatively, it's common to use a mixture of Gammadistributed rates, since this has been seen to fit a wide range of empirical data reasonably well.
Xrate provides an implementation of the gamma function and associated functions; however, to access these, we must go beyond the simple loopexpansion macros, and call functions from Scheme. (This means xrate must be compiled with GNU Guile.) In the snippet below, we import some necessary modules, and then call the function (discretegammamedians alpha beta k) which returns the k quantiles of the gamma distribution for shape and scale parameters alpha and beta:
;; Begin by defining the number of rate classes and the gamma shape parameter (scale parameter i\ s fixed at 1) (&define CLASSES 4) (&define SHAPE 0.5) (&schemediscard ;; Import Scheme list library (usemodules (srfi srfi1)) ;; Define rate classes ;; Uses the DARTprovided Scheme function (discretegammamedians alpha beta K) (define RATES (discretegammamedians SHAPE SHAPE CLASSES)))
Now lets use this list of rates in our Phast Consstyle HMM. Accessing values in the vector RATES requires some Scheme macro/gymnastics 9e.g. using listref and adjusting the index to zerobased), but basically all we have to do is replace r_1, r_2 and r_3 with the values in RATES . Here's the loop defining the substitution models again, now using the Gammadistributed rates:
(&foreachinteger state (1 3) (chain (terminal ((&cat chain_ state))) (&foreachtoken tok1 (&foreachtoken tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*((&scheme (listref RATES (& CLASS 1)))))) ) ) ) ) )
The sumoverpaths alignment likelihood under this model is analogous to Yang's 'spacetime' model for autocorrelated rate variation (although using a uniform transition distribution as we did does not give any actual correlation. The full grammar is available at autodiscgamma.eg
Here's the Yang paper describing this model:
Yang &: A spacetime process model for the evolution of DNA sequences. Genetics 1995;139:9931005.
A collection of all the above grammars, trained and untrained, small and large test data, together with a Makefile explaining the usage (type 'make help' to see the available commands, or just read the Makefile yourself) can be found here: example_grammars.zip
 Oscar Westesson  26 June 2012