# Xrate Tutorial

## Contents

- 1 XRATE tutorial
- 1.1 Overview
- 1.2 Creating a Jukes-Cantor model by hand
- 1.3 Using macro loops to create a Jukes-Cantor model
- 1.4 Changing the alphabet from DNA to proteins
- 1.5 Introducing rate variation with hidden classes (Phast Cons-like model)
- 1.6 Using Scheme function calls to model Gamma-distributed rate classes (Phast Cons - animal style)

# XRATE tutorial

## Overview

This tutorial describes the basic ingredients of an XRATE grammar, and details how one can go from an ultra-simple ( Jukes-Cantor model, all columns IID, defined by hand ) model to progressively more complex models (loop-defined Jukes-Cantor, Phast Cons-style phyloHMM, Gamma-distributed rates using loops+Scheme calls) with just a couple steps.

It should be noted that all of this is (in theory) covered on the Xrate Software and Xrate Format pages, with more detail provided on each of those pages. It is our (the developers/contributors/users of XRATE and DART in general) hope that this tutorial (and recent paper in submission) will make the model-specifying capabilities of XRATE clear and accessible to a broad audience of phylogenetic enthusiasts.

Also, the following paper describes this material in greater detail:

Westesson & Holmes: Developing and applying heterogeneous phylogenetic models with XRate. *PLoS ONE* 2012;7:e36898.

## Creating a Jukes-Cantor model by hand

Let's take a look at the familiar Jukes-Cantor model. This is possibly the simplest continuous-time Markov chain (CTMC) modeling the evolution of nucleotides: it states that all changes happen at equal rates. There is one rate parameter, u. We'll need a few more ingredients besides the rate parameter to define our grammar: production rules, an alphabet, and an equilibrium distribution.

Production rules define how alignment columns are created by the grammar: which CTMCs generate each column, and how are these correlated?. In this case, we're not (yet) modeling any site-to-site variation (e.g. the columns of our alignment are IID), so we'll use the simple production rule ** S -> XS* **, which simply means that each column is generated by the terminal "X". The other rules are trivial continuation: **S*->S** and ending: **S->()** rules that aren't super-important right now. We'll also define our rate u to take the value .333.

(grammar (transform (from (S)) (to (X S*))) (transform (from (S*)) (to (S)) (prob 1)) (transform (from (S*)) (to ()) (prob 1)) (rate (u .3333))

Note that we start the grammar definition with <verb> (grammar </verb>: the grammar definition is everything which occurs between this and the closing **)**.

Next we'll define the equilibrium probability distribution over our alphabet characters and the mutation rates between them. The equilibrium (or "initial") distribution is the proportions of the characters we would expect at the root of the tree. If one assumes that observed sequences are taken from the equilibrium distribution, then estimating this is trivially easy - just compute the observed frequencies of the different characters. Here we set each to 0.25 for simplicity, even though this is known not to be the case in real genomes.

Mutation parameters describe the substitution rates between alphabet characters. Here we define each of the rates between different characters to be the value of the parameter **u**. There is no need to define the rate between a character and itself - XRATE will automatically normalize the rows of the substitution matrix (so that these "self mutations" are set to the negative sum of the remaining possible substitutions).

(chain (terminal (X)) ;; initial probability distribution (initial (state (a)) (prob .25)) (initial (state (c)) (prob .25)) (initial (state (g)) (prob .25)) (initial (state (t)) (prob .25)) ;; mutation rates (mutate (from (a)) (to (c)) (rate u)) (mutate (from (a)) (to (g)) (rate u)) (mutate (from (a)) (to (t)) (rate u)) (mutate (from (c)) (to (g)) (rate u)) (mutate (from (c)) (to (a)) (rate u)) (mutate (from (c)) (to (t)) (rate u)) (mutate (from (g)) (to (c)) (rate u)) (mutate (from (g)) (to (a)) (rate u)) (mutate (from (g)) (to (t)) (rate u)) (mutate (from (t)) (to (c)) (rate u)) (mutate (from (t)) (to (g)) (rate u)) (mutate (from (t)) (to (a)) (rate u)) ) ;; end chain );; end grammar

Lastly, we'll need to specify an alphabet to use. For now, let's use the standard DNA alphabet (ACGT), with the only ambiguous character being "*":

(alphabet (name DNA) (token (a c g t)) (wildcard *)) ;; end alphabet DNA

## Using macro loops to create a Jukes-Cantor model

That was pretty straightforward, but it was a bit of work typing in all those character combinations. We could have instead used XRATE's macro syntax to help us define the model. Essentially, the grammar format has two levels of automation/scripting built in - the macro preprocessor and Scheme extensions (both are described in detail at Xrate Format). The former allows for loops, conditionals, basic arithmetic, certain list operations, and more. These capabilities are enough to create an extremely wide range of possible models. To illustrate some of the features of macros and Scheme extensions, we'll build upon the JC69 model. First, let's recast the same model as above using a macro loop:

(&foreach-token tok1 (&foreach-token tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u)) ) ) )

This loops through each pair of tokens (the **&foreach-token** construct is provided for us as part of the macro language), and if **tok1** and **tok2** are different, the mutation rate between them is set to **u**. The resulting grammar (e.g. the likelihood of an alignment under this grammar) would be the same as our initial direct enumeration, but loops allow for a much more compact definition. The full grammar is available at jukescantor.eg

## Changing the alphabet from DNA to proteins

Along with compactness, macros enable a certain *flexibility* in the model. To illustrate this, say we wanted to define the same model (where all mutations are equally likely) but for protein (amino acid) sequences. This would be a serious undertaking if done manually - we'd have to specify rates between almost 200 pairs of characters. However, if we use the macro loop above, the definition stays the same length. Since we used the provided **&foreach-token** loop construct, we'll visit whatever characters are defined in the **(alphabet ...)** clause, whether it be DNA or protein characters.

We will need to redefine the equilibrium distribution, which can either be done manually (since there are only 20 amino acids), or using another **(&foreach-tokens ...)** loop. If we want to do it in a loop, we can combine the **&TOKENS** variable with the subtraction and division operators to set the equilibrium probability of each character equal to 1/(size of the alphabet) (e.g. a uniform distribution), like this:

(&foreach-token tok1 (initial (state (tok1)) (prob (&div 1 (&- &TOKENS 1)) )))

To change the alphabet to amino acids, we simply replace the DNA alphabet definition with this one:

(alphabet (name Protein) (token (a r n d c q e g h i l k m f p s t w y v)))

Suppose we want to model variation in substitution rate among columns (e.g. non-IID columns). We can do this in XRATE using a Phylo HMM - an HMM that emits phylogenetic alignment columns (the previous example was actually also an HMM, though not a very interesting one since there was only one state). The high level idea is that there is a hidden sequence of states (one for each column) that decides the rate at which each alignment column evolves. In using a model like this, we can either sum over all such state sequences (e.g. to *account* for this rate heterogeneity), or we may want to infer/predict what that state sequence was (e.g. *detect* regions evolving at different rates).

XRATE can help us do both of these tasks (plus more). First we'll need to define the ingredients of our model. We'll stick with the Jukes-Cantor-style model for now, and just build on top of it. Let's say we want to have three rate categories, corresponding to slow, normal, and fast evolutionary rates. We'll define three rate parameters ** r_1, r_2, r_3 **, such that the substitution rates for each state's character evolution model will be scaled by this parameter (e.g. columns generated by state 1 will evolve at rate ** u*r_1 **).

We define the chain corresponding to state 1 like this: ( the chains corresponding to state 2 and 3 can be defined in the same way, substituting **chain_2** and **r_2** where appropriate):

(chain (terminal (chain_1)) (&foreach-token tok1 (&foreach-token tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*r_1)) ) ) ) ) (chain (terminal (chain_2)) (&foreach-token tok1 (&foreach-token tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*r_2)) ) ) ) ) (chain (terminal (chain_3)) (&foreach-token tok1 (&foreach-token tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*r_3)) ) ) ) )

Note that this is somewhat redundant - each of the chains has nearly the same definition. We could instead use a loop over integers to define the three chains more compactly. The construct ** (&foreach-integer VAR (START END) ... ) ** sets ** VAR ** equal to all integers from ** START ** to ** END ** one after another. We'll also need to make use of the concatenation operator: ** (&cat A B) ** results in ** AB ** (in order to co-ordinate the chain names and rate parameters). The same set of definitions as above, making use of loops, looks like this (about 1/3 as long):

(&foreach-integer state (1 3) (chain (terminal ((&cat chain_ state))) (&foreach-token tok1 (&foreach-token tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*(&cat r_ state))) ) ) ) )

Next we'll need to define the transitions between the hidden states. For simplicity's sake, we'll set all these transitions equal (making it less of an HMM and more of a mixture model). These could, however, be defined as probability parameters and we could ask XRATE to estimate them for us. We can define all the rules to have probability 1

Since there are 3 + 9 + 3 transitions to be defined **(start -> state_*, state_* -> state_***, and **state_* -> ()** for * = 1,2, or 3), we'll again use macro loops over integers to create these definitions.

(&define oneThird 0.3333) ;; Each state has a transition from start (&foreach-integer state (1 3) (transform (from (start)) (to (state1)) (prob oneThird ))) ;; Each state can transition to any other state (&foreach-integer state (1 3) (&foreach-integer state2 (1 3) (transform (from (state)) (to (state2)) (prob oneThird ))))

Next we'll define how the emission terminals are related to the nonterminals. This is a trivial relation - **state_1** emits **chain_1**, **state_2** emits **chain_2**, etc. The way this is defined in XRATE language is like this:

(&foreach-integer state (1 3) ;; Each state emits a chain of the same name (transform (from (state))(to ((&cat chain_ state) (&cat state *)))))

The "*" is an indicator that the nonterminal has just emitted a column. We also need to define transitions to the 'end', or null state. This is so that the alignment generation process can end. In practice, the input alignment defines the possible lengths of nonterminal sequences, so we can leave this as a probability 1 event.

;; Each state can transition to end - we assign this prob 1 ;; since the alignment length directs when this transition occurs (transform (from ((&cat state *))) (to ()) (prob 1))

The full grammar (with a slightly different parametrization) can be found in the dart/grammars directory: The full grammar is available at conservation_phylohmm.eg

## Using Scheme function calls to model Gamma-distributed rate classes (Phast Cons - animal style)

One question you may have is "where did the rate parameters ** r_1, r_2, r_3 ** come from? Well, we could estimate them from data using XRATE's training/parameter estimation machinery. Alternatively, it's common to use a mixture of Gamma-distributed rates, since this has been seen to fit a wide range of empirical data reasonably well.

Xrate provides an implementation of the gamma function and associated functions; however, to access these, we must go beyond the simple loop-expansion macros, and call functions from Scheme. (This means xrate must be compiled with GNU Guile.) In the snippet below, we import some necessary modules, and then call the function ** (discrete-gamma-medians alpha beta k) ** which returns the k quantiles of the gamma distribution for shape and scale parameters alpha and beta:

;; Begin by defining the number of rate classes and the gamma shape parameter (scale parameter i\ s fixed at 1) (&define CLASSES 4) (&define SHAPE 0.5) (&scheme-discard ;; Import Scheme list library (use-modules (srfi srfi-1)) ;; Define rate classes ;; Uses the DART-provided Scheme function (discrete-gamma-medians alpha beta K) (define RATES (discrete-gamma-medians SHAPE SHAPE CLASSES)))

Now lets use this list of rates in our Phast Cons-style HMM. Accessing values in the vector ** RATES ** requires some Scheme macro/gymnastics 9e.g. using list-ref and adjusting the index to zero-based), but basically all we have to do is replace ** r_1, r_2** and **r_3 ** with the values in ** RATES **. Here's the loop defining the substitution models again, now using the Gamma-distributed rates:

(&foreach-integer state (1 3) (chain (terminal ((&cat chain_ state))) (&foreach-token tok1 (&foreach-token tok2 (&if (&eq tok1 tok2) () ;; XRate will normalize each matrix row (mutate (from (tok1)) (to (tok2)) (rate u*((&scheme (list-ref RATES (&- CLASS 1)))))) ) ) ) ) )

The sum-over-paths alignment likelihood under this model is analogous to Yang's 'space-time' model for autocorrelated rate variation (although using a uniform transition distribution as we did does not give any actual correlation. The full grammar is available at autodiscgamma.eg

Here's the Yang paper describing this model:

Yang &: A space-time process model for the evolution of DNA sequences. *Genetics* 1995;139:993-1005.

A collection of all the above grammars, trained and untrained, small and large test data, together with a Makefile explaining the usage (type 'make help' to see the available commands, or just read the Makefile yourself) can be found here: example_grammars.zip

-- Oscar Westesson - 26 June 2012