xrate file format
Introduction
The Participle, with an Article before it, and the Preposition of after it, becomes a Substantive, expressing the action itself which the Verb signifies: as, "These are the Rules of Grammar, by the observing of which you may avoid mistakes."
This document describes the grammar file format for the xrate phylogrammar alignment analysis package. Please see the xrate software page for more info on the program itself (and e.g. here for more info on phylogrammars).
Example xrate grammar files
Here are a few examples of working xrate grammar files. The techniques illustrated here can be mixed and matched. Some of the grammars use xrate macros, which form a tiny Lisp-like dialect for specifying repetitively structured grammars.
Point substitution models
Grammars that implement point substitution models have two (almost trivial) rules:
These grammar files, then, effectively just illustrate the file format for the substitution rate matrix & the notational principle of tying rate matrices to grammars using pseudoterminals:
The above xrate files illustrate the idea of a basic point substitution model. The following xrate files combine several such models, using a grammar to describe how different substitution models are used for different alignment columns.
Feature predictors
Lineagespecific evolutionary grammars
Sitespecific models
Grammars that use the Scheme interpreter
Overall structure of grammar files
The XrateFormat for specifying evolutionary context-free grammars uses LispSExpressions and consists of several parts:
These parts are described in more detail below.
Grammar file syntax
The complete syntax of xrate's S-expression-based format (as checked by xrate's built-in syntax validator) can be printed by specifying the command-line option
Alphabets
Alphabet declarations must contain
Example: DNA
(alphabet (name DNA) (token (a c g t)) (complement (t g c a)))
Example: proteins
(alphabet (name Protein) (token (a r n d c q e g h i l k m f p s t w y v)))
Example: DNA with IUPAC degeneracies
(alphabet
 (name DNA)
 (token (a c g t))
 (complement (t g c a))
 (extend (to u) (from t))
 (extend (to r) (from a) (from g))
 (extend (to y) (from c) (from t))
 (extend (to m) (from a) (from c))
 (extend (to k) (from g) (from t))
 (extend (to s) (from c) (from g))
 (extend (to w) (from a) (from t))
 (extend (to h) (from a) (from c) (from t))
 (extend (to b) (from c) (from g) (from t))
 (extend (to v) (from a) (from c) (from g))
 (extend (to d) (from a) (from g) (from t))
 (extend (to n) (from a) (from c) (from g) (from t))
 (extend (to x) (from a) (from c) (from g) (from t))
 (wildcard *))
Gap characters
If the alphabet includes characters that would normally be interpreted as gaps (i.e. This is useful if gaps are being modeled as a "fifth nucleotide", "twenty-first amino acid" etc.
Grammars
Name
(grammar
 (name Kimura2)
 ;; rest of grammar goes here
)
The
Metainformation
(grammar (meta (briefdescription This is the Kimura twoparameter model (Kimura, 1980) .) (graphvizlayoutstyle neato) (readmeurl http://biowiki.org/XrateFormat) ;; other meta information goes here ) ;; rest of grammar goes here )
The
It's up to helper applications to define their own keywords for structuring information inside
Parametric grammars
A very common tag is
Update directives
If a grammar is not designated "parametric", it follows that its de facto parametric structure is just that specified by the structure of the xrate-format file: namely, sets of rule probabilities and rate parameters for the various substitution models. This is syntactically simpler, but less expressive. In this simpler case, grammars can include several tags indicating how the grammar should be "trained".
This sort of control over the EM training algorithm is effectively provided for parametric models
via designation of some parameters as
Parameters
An xrate grammar file specifies many rates (for substitution events) and probabilities (for Markov chain equilibria, length distributions & transformation rules). These rates and probabilities can be declared "parametric". Instead of being treated as free variables, specified as numerical constants, that are independently updated during EM updates (as is the default behavior), "parametric" rates (or probabilities) are specified as algebraic functions on a set of parameters associated with the grammar. This allows model designers to richly constrain the parameter space of models, so as to avoid overtraining and similar sparse-data problems.
There are two kinds of parameters: rate parameters, and probability parameters.
They can be enclosed in a block named
The
This is somewhat sloppy, and a more rigorous parameter typing can be enforced
by using the
Note that it is YOUR responsibility to ensure that the parameters in
There are almost no restrictions on parameter names.
A parameter name cannot contain characters that are reserved for the S-expression format (i.e. whitespace, parentheses or semicolons),
nor can it be exactly the same as an arithmetic function or macro (
Rate parameters
Rate parameters are considered free to vary independently of all other parameters (they must remain nonnegative, but that's the only constraint). They appear as a name-value pair, enclosed by single or double brackets (both are valid).
Example: Kimura's two parameters
;; The parameters for Kimura's transition-transversion model.
(rate
 ((alpha 4))  ;; transition rate
 ((beta 1)))  ;; transversion rate
This could equally well be written with single brackets around each rate parameter:
(rate (alpha 4) ;; transition rate (beta 1)) ;; transversion rate
The
Probability parameters
Probability parameters satisfy the following constraints: (i) they are nonnegative; (ii) all probability parameters in a mutually exclusive set must sum to 1. The syntax for declaring and assigning values to probability parameters is similar to that for rate parameters: each parameter-value pair is enclosed by parentheses. Sets of mutually exclusive parameters are declared together as a list, enclosed by parentheses.
Example: HKY85's six parameters
This example includes both probability parameters (the base frequencies) and rate parameters (the transition and transversion rates). The base frequencies are a mutually exclusive set.
;; Parameters for the HKY85 model
(pgroup ((piA .25) (piC .25) (piG .25) (piT .25)))  ;; base frequencies
(rate
 ((alpha 1))  ;; transition rate
 ((beta 1)))  ;; transversion rate
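The two constraints on probability parameters stated above can be checked mechanically. The following Python sketch is only an illustration and is not part of xrate; the function name `check_pgroup` and the dictionary representation of a parameter group are assumptions made for this example.

```python
import math

def check_pgroup(pgroup, tol=1e-9):
    """Return True if every value is nonnegative and the values sum to 1.

    `pgroup` is a hypothetical dict mapping parameter names to values,
    standing in for one mutually exclusive set of probability parameters.
    """
    values = list(pgroup.values())
    nonnegative = all(v >= 0 for v in values)
    normalized = math.isclose(sum(values), 1.0, abs_tol=tol)
    return nonnegative and normalized

# The HKY85 base frequencies from the example above form one such set.
hky85_freqs = {"piA": 0.25, "piC": 0.25, "piG": 0.25, "piT": 0.25}
# Rate parameters are only required to be nonnegative.
hky85_rates = {"alpha": 1.0, "beta": 1.0}

print(check_pgroup(hky85_freqs))
print(all(r >= 0 for r in hky85_rates.values()))
```

A set that fails either test (a negative entry, or a sum different from 1) would not be a valid mutually exclusive group.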
This could be written (more cryptically) as a single
;; Parameters for the HKY85 model
(params
 ((piA .25) (piC .25) (piG .25) (piT .25))  ;; base frequencies
 ((alpha 1))  ;; transition rate
 ((beta 1)))  ;; transversion rate
Note that, in a
(params ((alpha 1))) ;; transition rate
This confusion between rates and probabilities in
Functions and dimensional constraints
Functions are composed of probability parameters and the binary operators
for multiplication, division and addition (
The caret operator (
To guarantee convergence of the EM algorithm, functions are also required to satisfy dimensional constraints that depend on whether they occur in a rate or probability context, as follows. Probability parameters should be considered dimensionless, and rate parameters have dimensions of 1/time (i.e. "per unit time"). Dimensional analysis should be used to ensure that rate functions and probability functions composed of such parameters have the corresponding dimensionality. (Numerical constants can be interpreted "sloppily", i.e. their dimensions can be changed to fit.)
For example, if p and q are probabilities and R is a rate,
then Dimensional constraints are not strictly enforced by xrate, but the EM algorithm may not converge if dimensionally incorrect functions are used. Furthermore, the EM algorithm is not guaranteed to converge if the exponentiation operator is used.
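The dimensional rules above can be sketched as a small checker. This Python example is an illustration under stated assumptions, not xrate's own validator: the tuple representation `(op, left, right)`, the names `dimension` and `fits`, and the treatment of numerical constants as wildcards are all choices made for this sketch, and only multiplication, division and addition are handled.

```python
RATE, PROB = -1, 0  # exponent of time: rates are 1/time, probabilities are dimensionless

def dimension(expr, kinds):
    """Return the time exponent of expr, or None for a 'sloppy' numerical constant."""
    if isinstance(expr, (int, float)):
        return None                      # constants fit any dimension
    if isinstance(expr, str):
        return kinds[expr]               # a named parameter
    op, left, right = expr
    a, b = dimension(left, kinds), dimension(right, kinds)
    if op == "+":
        if a is None:
            return b
        if b is None:
            return a
        if a != b:
            raise ValueError("cannot add terms of different dimension")
        return a
    if a is None or b is None:
        return None                      # a constant factor can absorb any dimension
    return a + b if op == "*" else a - b  # "*" adds exponents, "/" subtracts them

def fits(expr, kinds, context):
    """True if expr can be used in the given context (RATE or PROB)."""
    d = dimension(expr, kinds)
    return d is None or d == context

kinds = {"p": PROB, "q": PROB, "R": RATE}
print(fits(("*", "p", "R"), kinds, RATE))   # a probability times a rate
print(fits(("*", "p", "q"), kinds, RATE))   # a product of two probabilities
```

Under these assumptions, a probability multiplied by a rate is usable where a rate is expected, while a product of two probabilities is not.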
Probability parameters and unobserved mutations
It's important to note that probability parameters are not just dimensionless multipliers in xrate; they must really correspond to probabilities of events. Any rate expression that uses probability parameters must do so in such a way that the probabilities correspond to legitimate, actual divisions of the event space. Furthermore, every mutually exclusive event must be specified. Sometimes this means including "unobserved" mutations in order to complete the mutually exclusive set.
A good way to think about events whose rate is given by expressions that combine rate and probability parameters is that an event occurs with a certain rate (specified using the rate parameter); there is then a probabilistic decision about what the type of event is (determined by the probability parameters). Some of the outcomes of this decision may be such that the event is unobserved (or "rejected"). In order to correctly count such occurrences, it's necessary to tell xrate that they exist, even though the unobserved events do not result in changes.
The Felsenstein 1981 model may be a useful illustrative example here. In one formulation of this model, "replacement" events occur at a constant rate $R$. In a replacement event, a nucleotide X is replaced by another nucleotide Y, which is chosen with probability $\pi_Y$. In other words, once a replacement event is determined to have occurred at a nucleotide (X), there is a probabilistic decision as to which nucleotide (Y) will replace it. There is, consequently, a probability of $\pi_X$ that X and Y are the same nucleotide, in which case the replacement is unobserved. Thus, the effective substitution rate from X is $R(1-\pi_X)$, while the rate of unobserved replacements is $R\pi_X$.
In order for xrate to accurately estimate probability parameters that are used in rate expressions,
it seems to be necessary to include the unobserved substitutions.
This can be done by specifying them as mutationstoself.
In the above example, the three observed mutations from state
(mutate (from (a)) (to (c)) (rate piC * R)) (mutate (from (a)) (to (g)) (rate piG * R)) (mutate (from (a)) (to (t)) (rate piT * R))
while the fourth unobserved mutation from state
(mutate (from (a)) (to (a)) (rate piA * R))
This fourth mutation, even though it does not affect the likelihood, does affect the EM counts that are computed for Examples of this sort of thing can be found in the parametric JukesCantor, HKY85 and Fels81 grammars distributed with xrate.
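The bookkeeping for the Felsenstein 1981 example can be checked numerically. The Python sketch below is only an illustration: the function name `fels81_rates` and the example values of the base frequencies and of R are assumptions, chosen to show that the observed and unobserved rates out of a state together account for the whole replacement rate.

```python
def fels81_rates(pi, R):
    """Rate of every replacement X -> Y (including X -> X) as pi[Y] * R."""
    return {(x, y): pi[y] * R for x in pi for y in pi}

# Hypothetical base frequencies and replacement rate, for illustration only.
pi = {"a": 0.1, "c": 0.2, "g": 0.3, "t": 0.4}
R = 2.0

rates = fels81_rates(pi, R)

# The three observed mutations out of state a, as in the mutate rules above.
observed_from_a = sum(r for (x, y), r in rates.items() if x == "a" and y != "a")
# The fourth, unobserved mutation-to-self.
unobserved_from_a = rates[("a", "a")]

print(observed_from_a, unobserved_from_a, observed_from_a + unobserved_from_a)
```

Omitting the mutation-to-self would leave part of the event space unaccounted for, which is why the fourth rule is declared even though it does not change the state.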
The const operator: flagging a parameter as temporarily constant
When a parameter name is preceded by the
For example, consider the following two functions. The first function evaluates to x+3, and its derivative w.r.t. x is 1:
(x + 3)
The second function still evaluates to x+3, but xrate thinks its derivative w.r.t. x is zero:
((const x) + 3)
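The difference between the two functions can be mimicked with a toy evaluator. This Python sketch is an assumption-laden illustration and says nothing about how xrate is implemented internally: the tuple encoding and the names `evaluate` and `derivative` are hypothetical, and only addition and multiplication are supported.

```python
def is_const(expr):
    """True for the hypothetical encoding ("const", subexpr)."""
    return isinstance(expr, tuple) and len(expr) == 2 and expr[0] == "const"

def evaluate(expr, env):
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    if is_const(expr):
        return evaluate(expr[1], env)     # const does not change the value
    left, op, right = expr
    a, b = evaluate(left, env), evaluate(right, env)
    return a + b if op == "+" else a * b

def derivative(expr, var, env):
    if isinstance(expr, (int, float)):
        return 0.0
    if isinstance(expr, str):
        return 1.0 if expr == var else 0.0
    if is_const(expr):
        return 0.0                        # const hides the dependence on var
    left, op, right = expr
    da, db = derivative(left, var, env), derivative(right, var, env)
    if op == "+":
        return da + db
    return da * evaluate(right, env) + evaluate(left, env) * db  # product rule

env = {"x": 2.0}
plain = ("x", "+", 3)
flagged = (("const", "x"), "+", 3)
print(evaluate(plain, env), derivative(plain, "x", env))
print(evaluate(flagged, env), derivative(flagged, "x", env))
```

Both expressions evaluate to the same number, but only the first reports a nonzero derivative with respect to x.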
Pray (to your preferred deities, patron saints, Unix daemons or cyberspace loa) that you never need to use this construct.
Using it correctly requires a pretty good knowledge of the way that the EM algorithm functions.
This (clearly) is rather dark magic and is not recommended for the casual practitioner.
Examples can be found in DartGrammar:sn.eg, where the
Expected counts
After training, the grammar file contains an For example, training the HKY85 model on a short alignment (Rfam:RF00155) yields the following counts:
(expectedcounts (piA 36.0195) (piC 29.0076) (piG 37.0184) (piT 35.0149) (alpha 1.01919 1.67501) (beta 2.03806 3.3364)) ;; end observedcounts
The interpretation of this block is that the event whose probability is given by
Pseudocounts
Parameter-specific pseudocounts for training can be specified in the file.
These have the same format as the
These counts are added to the expected counts obtained at the E-step of EM, and the totals are used in the M-step. This is equivalent to specifying a Dirichlet prior for probability parameters, or a gamma prior for rate parameters. An example, for HKY85:
(pseudocounts (piA 36.0195) (piC 29.0076) (piG 37.0184) (piT 35.0149) (alpha 1.01919 1.67501) (beta 2.03806 3.3364))
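How pseudocounts enter the M-step can be sketched in a few lines of Python. This is a simplified illustration rather than xrate's actual update: the function names are hypothetical, and the sketch assumes that the two numbers listed for each rate parameter are an event count followed by an accumulated waiting time, which the text above does not spell out.

```python
def m_step_probs(counts, pseudo):
    """Normalize expected counts plus pseudocounts within one probability group."""
    totals = {k: counts[k] + pseudo.get(k, 0.0) for k in counts}
    norm = sum(totals.values())
    return {k: v / norm for k, v in totals.items()}

def m_step_rate(count_time, pseudo=(0.0, 0.0)):
    """Estimate a rate as (events + pseudo-events) / (time + pseudo-time).

    ASSUMPTION: count_time is an (event count, waiting time) pair.
    """
    events = count_time[0] + pseudo[0]
    time = count_time[1] + pseudo[1]
    return events / time

# Expected counts copied from the HKY85 example earlier in this document.
counts = {"piA": 36.0195, "piC": 29.0076, "piG": 37.0184, "piT": 35.0149}
pseudo = {"piA": 1.0, "piC": 1.0, "piG": 1.0, "piT": 1.0}  # hypothetical pseudocounts

probs = m_step_probs(counts, pseudo)
alpha = m_step_rate((1.01919, 1.67501), pseudo=(1.0, 1.0))
print(probs)
print(alpha)
```

Larger pseudocounts pull the estimates towards the prior; with all pseudocounts at zero the estimates depend on the expected counts alone.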
Markov chains
A
Example: unicycler
This first example is the irreversible "unidirectional cycler":
(chain                            ; declare a Markov chain
 (update-policy irrev)            ; EM update policy
 (terminal (RX))                  ; abstract state label
 (initial (state (a)) (prob 1))   ; initial distribution
 (mutate (from (a)) (to (c)) (rate 1))
 (mutate (from (c)) (to (g)) (rate 1))
 (mutate (from (g)) (to (t)) (rate 1))
 (mutate (from (t)) (to (a)) (rate 1)))
See here for an illustration of this chain's topology.
Example: REV + binary hidden class
This example is a reversible RNA chain with a hidden class variable that can take two states, labeled
(chain
 (update-policy rev)
 (terminal (NUC))
 (hidden-class (row CLASS) (label (1 2)))
 ;; initial probability distribution
 (initial (state (a 1)) (prob 0.0238705))
 (initial (state (c 1)) (prob 0.136706))
 (initial (state (g 1)) (prob 0.0136832))
 (initial (state (u 1)) (prob 0.204652))
 (initial (state (a 2)) (prob 0.305866))
 (initial (state (c 2)) (prob 0.0413438))
 (initial (state (g 2)) (prob 0.170151))
 (initial (state (u 2)) (prob 0.103727))
 ;; mutation rates
 (mutate (from (a 1)) (to (c 1)) (rate 0.157809))
 (mutate (from (a 1)) (to (g 1)) (rate 0.310445))
 (mutate (from (a 1)) (to (u 1)) (rate 0.542682))
 (mutate (from (a 1)) (to (a 2)) (rate 0.0196586))
 (mutate (from (c 1)) (to (a 1)) (rate 0.0275553))
 (mutate (from (c 1)) (to (g 1)) (rate 0.0153761))
 (mutate (from (c 1)) (to (u 1)) (rate 0.107003))
 (mutate (from (c 1)) (to (c 2)) (rate 0.000126849))
 (mutate (from (g 1)) (to (a 1)) (rate 0.541573))
 (mutate (from (g 1)) (to (c 1)) (rate 0.153619))
 (mutate (from (g 1)) (to (u 1)) (rate 0.803682))
 (mutate (from (g 1)) (to (g 2)) (rate 0.0200437))
 (mutate (from (u 1)) (to (a 1)) (rate 0.063298))
 (mutate (from (u 1)) (to (c 1)) (rate 0.0714773))
 (mutate (from (u 1)) (to (g 1)) (rate 0.0537349))
 (mutate (from (u 1)) (to (u 2)) (rate 0.0036554))
 (mutate (from (a 2)) (to (a 1)) (rate 0.0015342))
 (mutate (from (a 2)) (to (c 2)) (rate 0.0411113))
 (mutate (from (a 2)) (to (g 2)) (rate 0.0812237))
 (mutate (from (a 2)) (to (u 2)) (rate 0.165302))
 (mutate (from (c 2)) (to (c 1)) (rate 0.000419434))
 (mutate (from (c 2)) (to (a 2)) (rate 0.304146))
 (mutate (from (c 2)) (to (g 2)) (rate 0.141506))
 (mutate (from (c 2)) (to (u 2)) (rate 0.969275))
 (mutate (from (g 2)) (to (g 1)) (rate 0.00161188))
 (mutate (from (g 2)) (to (a 2)) (rate 0.146009))
 (mutate (from (g 2)) (to (c 2)) (rate 0.0343834))
 (mutate (from (g 2)) (to (u 2)) (rate 0.0920692))
 (mutate (from (u 2)) (to (u 1)) (rate 0.00721206))
 (mutate (from (u 2)) (to (a 2)) (rate 0.487438))
 (mutate (from (u 2)) (to (c 2)) (rate 0.386336))
 (mutate (from (u 2)) (to (g 2)) (rate 0.151028)))  ;; end chain NUC
See here for an illustration of this chain's topology.
Hybrid chains
A
Each component chain must have the same number of pseudoterminals as the hybrid chain. These pseudoterminals must appear in the same order that they do in the original declaration of the component chain. The same is true of hidden classes (if there are any): component chains must have the same number of hidden classes, and the same hidden class labels, in the same order, as the hybrid chain. (If the hybrid chain has no hidden classes, then the component chains are not allowed to either.) These rules are just a long-winded way of saying that the state space for the component chains must exactly match the state space for the hybrid chain. A further requirement is that each component chain must have the "parametric" update policy.
The componentchain selection works as follows.
Every hybrid chain has a
Note that this syntax can also be used to select the component chain for internal branches of the tree. Internal nodes can be named (in NewickFormat) and given "#=GS" labels even though they may not have sequence data in the alignment. If the alignment includes the line "#=GS N R L", then the branch from N's parent to N will use chain C.
Calculation of column likelihoods requires not just a mutation rate matrix but also a set of initial probabilities. Suppose node N is the root node. If the alignment includes the line "#=GS N R L", then the initial probability distribution will be taken from chain C. (A similar rule applies if N is not the root node, but its parent node is specified as being gapped. This is, however, a rather unusual situation, since alignment rows for internal nodes are not usually specified.)
Implicit annotations for hybrid chains
Hybrid chains are sufficiently flexible to allow any lineage-specific parameterization. Typically, however, only a few parameterizations are of interest.
Several
Using a different chain for just one branch
For every pair of tree nodes, one of the following two lines is implicitly defined:
#=GS NODE1 =NODE2 1 (if NODE1 and NODE2 are identical)
#=GS NODE1 =NODE2 0 (if NODE1 and NODE2 are not identical)
Here NODE1 and NODE2 are named tree nodes, as defined in the
Using a different chain for a subtree rooted at a particular node
For every pair of tree nodes, one of the following two lines is implicitly defined:
#=GS NODE1 :NODE2 1 (if NODE1 is descended from NODE2)
#=GS NODE1 :NODE2 0 (if NODE1 is not descended from NODE2)
Here NODE1 and NODE2 are named tree nodes, as defined in the
Using a different chain for every branch
For every tree node, the following line is implicitly defined:
#=GS NODE ? NODE
Here NODE is a named tree node, as defined in the
Example: hybrid gene/pseudogene model
Suppose that
(hybrid-chain
 (terminal (HYB1 HYB2 HYB3))
 (row HLABEL)
 (components
  ;; submodel COD1... selected by "#=GS SeqName HLABEL GENE"
  ;; this first model is also assumed to be the default for unlabeled nodes
  ((label GENE) (terminal (COD1 COD2 COD3)))
  ;; submodel NULL1... selected by "#=GS SeqName HLABEL PSEUDOGENE"
  ((label PSEUDOGENE) (terminal (NULL1 NULL2 NULL3)))))
Production rules
Grammar nonterminal symbols in xrate are classified into certain classes, or state types. These include emit, null and bifurcation nonterminals. The type of a given nonterminal depends on the form of the production rules that can be applied to that nonterminal.
The start nonterminal can be any of these three state types, and is chosen as follows.
If there are any Production rules contain the following elements:
Schematically, the above describes the production rule: . Example:
(transform (from (F DY)) (to (F* DX DY)) (prob 1) (annotate (column DX) (row EMIT_ANNOT) (label D)))
If the grammar was declared as parametric, i.e. it has a
Emit nonterminals
Emit nonterminals, with their associated emit rules, describe the way in which columns of alphabet symbols (co-evolving according to an underlying phylogenetic tree) are generated by the grammar.
An emit nonterminal
A convenient shorthand allows pseudoterminals to be complemented, so that chains can be reused for both forward and reverse strands.
If a pseudoterminal is prefixed by the tilde symbol (
As mentioned above, the RHS pseudoterminal list must exactly match the pseudoterminal list of a chain. Additionally, some of these pseudoterminals may also appear on the left-hand side (in the same order and complementarity that they appear on the RHS). If a pseudoterminal appears on both the LHS and RHS, this indicates that it is not emitted, but rather contextual. This is used to approximate a context-dependent substitution model following the approach of Siepel and Haussler (Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004 Mar;21(3):468-88. Epub 2003 Dec 5.). See the CpG dinucleotide aversion example below for an illustration of this syntax.
The rule probability for an emit rule is always taken to be 1.
Alignment annotations
An emit rule can be accompanied by zero or more annotations.
Loosely, each annotation corresponds to a "hidden label" for one or more of the columns.
An unannotated alignment can be annotated by running
The annotation can also be used for supervised training of the grammar,
by preannotating a StockholmFormat alignment (or database of alignments)
and running
Gaps
Various tags can be used to control the gap behavior of emit rules.
See also the comments on alternative gap characters.
Minimum & maximum subsequence length
The These keywords can alternately be placed in nonterminal modifier blocks.
Prefix, suffix, infix
The
The These keywords can alternately be placed in nonterminal modifier blocks.
Example: codon emit nonterminal, forward strand
Emit three nucleotides (
(transform (from (CODON)) (to (C1 C2 C3 CODON*)))
Example: codon emit nonterminal, reverse strand
A similar codon emission to the previous example, but reverse-complemented
(transform (from (REVCOMP)) (to (~C3 ~C2 ~C1 REVCOMP*)))
Example: RNA basepair
An RNA basepair, emitted from nonterminal
(transform (from (BASEPAIR)) (to (LEFT_BASE BASEPAIR* RIGHT_BASE)) (annotate (column LEFT_BASE) (row SS) (label <)) (annotate (column RIGHT_BASE) (row SS) (label >)) (minlen 5))
Example: RNA basepair with probabilistic annotation
An RNA basepair, emitted from nonterminal
(transform (from (BASEPAIR)) (to (LEFT_BASE BASEPAIR* RIGHT_BASE)) (annotate (row SS) (emit (label (< >)) (prob .9)) (emit (label (_ _)) (prob .1))))
Example: CpG dinucleotide aversion
Emit a single nucleotide (
(transform (from (DX F)) (to (DX DY F*)) (annotate (column DY) (row EMIT_ANNOT) (label D)))
Null nonterminals
Production rules describing transformations from null nonterminals have a single nonterminal on the LHS, and either a single nonterminal or an empty list on the RHS. (An empty list is equivalent to a transition to the "end" nonterminal.) Null rules have no pseudoterminals or annotations.
Example: null-to-end transition
Transition from
(transform (from (S)) (to ()) (prob 0.5))
Example: emit-to-null transition
Transition from nonterminal
(transform (from (REVCOMP*)) (to (S)))
Bifurcation nonterminals
Production rules describing transformations from bifurcation nonterminals have a single nonterminal on the LHS, and two nonterminals on the RHS. Bifurcation rules have no pseudoterminals or annotations. Only one bifurcation rule is allowed for each bifurcation nonterminal. The rule probability for a bifurcation rule is thus always taken to be 1.
Example: bifurcation
Bifurcating transformation from nonterminal
(transform (from (S)) (to (S T)))
Nonterminal modifiers and properties
Nonterminal declarations
(nonterminal (name START)) (nonterminal (name INTERGENIC) (minlen 1) (sumfrom)) (nonterminal (name FWD_PFOLD_S) (infix)) (nonterminal (name FWD_PFOLD_F) (minlen 2)) ...
The The indel model declaration can also be located in this block, if desired.
Note that the first
See above for a description of
sumfrom
The
It's also useful for annotation, e.g. gene-finding.
As a general principle, you want your model to be as detailed and accurate as possible,
so you might want to model the fine structure of a feature (e.g. secondary structure of a noncoding RNA gene).
However, you don't care about the details of this submodel when calling the annotation;
in fact, you want to sum those details out, and simply report an overall probability for this region being "an ncRNA"
(as opposed to "an ncRNA with this particular structure").
The
The following example tells the CYK algorithm to sum, rather than max, over all outgoing transitions from nonterminal
(sumfrom S)
Note that the gff
The
For example:
(gff (nonterminal FWD_RNA_GENE) (strand +) (type ncRNA))
(gff (nonterminal REV_RNA_GENE) (strand -) (type ncRNA))
The nonterminals in this example refer to the ncRnaDualStrand grammar.
The only mandatory argument is
In the output, the GFF
If the Stockholm alignment input to xrate contains a
Posterior probabilities, indicating confidence levels for the annotation, are added to the GFF
The GFF
For convenience, you can use S-expression nesting to specify tag-value pairs and multiple-value tuples in the group field,
as an optional alternative to the usual GFF3 convention of using equals signs, semicolons and commas. That is, this...
(nonterminal (name S)
 (gff (group ID=this Parent=this))
 (gff (type blah2) (group (ID that1) (Parent that2)))
 (gff (source blah3) (group (ID this3) (Parent (this3 this that1)))))
...is exactly equivalent to this...
(nonterminal (name S)
 (gff (group "ID=this;Parent=this"))
 (gff (type blah2) (group "ID=that1;Parent=that2"))
 (gff (source blah3) (group "ID=this3;Parent=this3,this,that1")))
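The correspondence between the nested form and the flat GFF3 string can be sketched in Python. This is an illustration only, not xrate code: the function name `gff3_group` and the list-of-pairs input are assumptions made for the example.

```python
def gff3_group(pairs):
    """Join (tag, value) pairs into a GFF3 attribute string.

    Pairs are separated by semicolons, tag and value by an equals sign,
    and multiple values for one tag by commas.
    """
    fields = []
    for tag, value in pairs:
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        fields.append(f"{tag}={value}")
    return ";".join(fields)

# Mirrors (group (ID this3) (Parent (this3 this that1))) from the example above.
print(gff3_group([("ID", "this3"), ("Parent", ["this3", "this", "that1"])]))
# prints: ID=this3;Parent=this3,this,that1
```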
Annotation
Various forms of annotation can be produced using several grammar directives.
Stockholm #=GC lines
See alignment annotation.
Stockholm
GFF output
See the gff tag.
Wiggle tracks
(wiggle (name Track1) (nonterminal S)) (wiggle (name Track2) (terminal X)) (wiggle (name Track3) (nonterminal CODON) (terminal POS1)) (wiggle (name Track4) (component (nonterminal S) (weight 1)) (component (terminal X) (weight 2)) (component (nonterminal CODON) (terminal POS1) (weight 3)))
The The above construct would generate four wiggle tracks:
If the Stockholm alignment input to xrate contains a
Other grammar fields
foldstringtag
(foldstringtag SS_cons)
In some circumstances, it is desirable to constrain the set of subsequence coordinates that are visited during dynamic programming. For example, when using a secondary structure grammar (such as pfold.eg) for the purpose of ancestral reconstruction on a large alignment, one typically does not want the algorithm to iterate over (or allocate memory for) all L^2 subsequences.
The
The xrate macro preprocessor
Several kinds of macro are automatically expanded by xrate before any training or alignment annotation takes place. Macro expansion is a one-off, irreversible event: if the grammar file is saved after macro substitution has taken place, the original macros will not be recoverable. Preprocessing and parsing take place in the following order:
Including files
(&include ~/dart/grammars/hky85.eg)
The
Printing warnings
(&warn Generating column COLUMN ...)
Prints the atoms following
Simple substitutions
The
For example,
(&define X yellow) curious X (mellow X)
evaluates to
curious yellow (mellow yellow)
Currently, only atomic expressions may be substituted in; so, for example,
Note also that the binding is static. You cannot use
List operations
The following operators fold a list into a single element during macro preprocessing.
Concatenation
The
(&. X Y Z)
(&cat X Y Z)
both evaluate to
XYZ
Summation
The
(&+ 1 10 -5)
(&sum 1 10 -5)
both evaluate to
6
Multiplication
The
(&* 2 3 5)
(&mul 2 3 5)
both evaluate to
30
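The three folding operators behave like ordinary reductions over their argument list. The Python sketch below mimics that behaviour for illustration; it is not the xrate preprocessor, and the function name `expand_fold` and the tuple encoding of a macro call are assumptions.

```python
from functools import reduce
import operator

def expand_fold(expr):
    """Fold the arguments of a list macro, given as (operator, arg1, arg2, ...)."""
    op, *args = expr
    if op in ("&.", "&cat"):
        return "".join(str(a) for a in args)    # concatenation
    if op in ("&+", "&sum"):
        return sum(args)                        # summation
    if op in ("&*", "&mul"):
        return reduce(operator.mul, args, 1)    # multiplication
    raise ValueError(f"unknown list operator: {op}")

print(expand_fold(("&cat", "X", "Y", "Z")))
print(expand_fold(("&sum", 1, 2, 3)))
print(expand_fold(("&mul", 2, 3, 5)))
```

Each macro name and its symbolic alias map to the same reduction, which is why the paired forms above give identical results.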
Binary operations
Division
(&/ X Y) (&div X Y)
Both evaluate to the floating-point division
Modulus
(&% A B) (&mod A B)
Both evaluate to the integer modulus operation
Subtraction
(&- X Y) (&sub X Y)
Both evaluate to the integer subtraction
Iterations
The following macros generate a list of elements from a template during preprocessing.
foreach
(&foreach VAR (LIST) EXPR)
Inserts one copy of EXPR for every element of LIST. Any occurrences of VAR within EXPR will be replaced by the corresponding element of LIST. For example,
(&foreach VAR (1 2 3) (VAR + 1))
evaluates to
(1 + 1) (2 + 1) (3 + 1)
(&foreach VAR (1 2 3) VAR *)
evaluates to
1 * 2 * 3 *
foreach-integer
As
(&foreach-integer VAR (MINVAL MAXVAL) EXPR)
For example,
(&foreach-integer VAR (1 3) VAR *)
evaluates to
1 * 2 * 3 *
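The iteration macros amount to copying a template once per value and substituting the loop variable. The following Python sketch reproduces that behaviour on nested lists of strings; it is an illustration rather than the xrate preprocessor, and the function names and the list-based representation of S-expressions are assumptions.

```python
def substitute(expr, var, value):
    """Replace every occurrence of var in a nested list of atoms."""
    if isinstance(expr, list):
        return [substitute(e, var, value) for e in expr]
    return value if expr == var else expr

def foreach(var, values, template):
    """Splice one copy of the template elements into the output per value."""
    out = []
    for v in values:
        out.extend(substitute(template, var, v))
    return out

def foreach_integer(var, lo, hi, template):
    """As foreach, iterating over the integers lo..hi inclusive."""
    return foreach(var, [str(i) for i in range(lo, hi + 1)], template)

# (VAR + 1) repeated for 1, 2, 3: the variable changes, the constant does not.
print(foreach("VAR", ["1", "2", "3"], [["VAR", "+", "1"]]))
# VAR * repeated for the integers 1 to 3.
print(foreach_integer("VAR", 1, 3, ["VAR", "*"]))
```

Note that the template elements are spliced into the surrounding list rather than wrapped in a new list, matching the flat output shown in the examples above.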
foreach-token
As
(&foreach-token VAR EXPR)
foreach-node, foreach-branch, foreach-leaf, foreach-ancestor
As
(&foreach-node VAR EXPR)
The various forms allow iteration over all named nodes (&foreach-node), all named nodes except the root (&foreach-branch), all named leaf nodes (&foreach-leaf) or all named internal nodes (&foreach-ancestor).
Logic operations
You can do some basic logic in the macro language. For more elaborate computations, use the built-in Scheme interpreter.
Equality
(&= SEXPR1 SEXPR2) (&eq SEXPR1 SEXPR2) (&!= SEXPR1 SEXPR2) (&neq SEXPR1 SEXPR2)
If the two S-expressions,
Arithmetic comparisons
(&> EXPR1 EXPR2) (&gt EXPR1 EXPR2) (&< EXPR1 EXPR2) (&lt EXPR1 EXPR2) (&>= EXPR1 EXPR2) (&geq EXPR1 EXPR2) (&<= EXPR1 EXPR2) (&leq EXPR1 EXPR2)
Arithmetic comparisons between numerical expressions, returning
Conditional operator
(&? TEST_EXPR TRUE_EXPR FALSE_EXPR) (&if TEST_EXPR TRUE_EXPR FALSE_EXPR)
If the integer value of
Boolean operations
(&and X Y) (&or X Y) (&not X)
These do what you'd expect.
(&and X Y Z) (&or A B C D E)
Miscellaneous functions
ASCII character manipulation
(&chr INT) (&ord CHAR)
Numerical functions
(&int EXPR)
Special constants
Some special constants are auto-substituted during macro expansion.
&TOKENS
Evaluates to the number of tokens in the terminal alphabet.
&NODES, &BRANCHES, &LEAVES, &ANCESTORS
Each of these evaluates to the number of tree nodes of a particular class. The respective classes are
As with
&COLUMNS
Evaluates to the number of columns in the alignment. This macro only works if the input alignment database contains exactly one alignment.
Arbitrary Scheme expressions
At some point, the xrate macros may become too limiting for you, at which point you may decide you need to write an actual program to generate your grammar (hey, it happens). If you compiled xrate on a system with the Guile library present, you can evaluate arbitrary Scheme expressions inside a grammar file.
&scheme
The
For example, the following code will be transformed to
(blakes (&scheme (define x 3) (define (y a) (+ a 5))) (&scheme (y 2)))
Evaluation of Scheme expressions is performed after expansion of all other macros. The order of evaluation is a depth-first recursive traversal of the S-expression tree.
If you want to evaluate a Scheme expression and discard the return value (i.e. to change the Scheme environment without adding anything to the grammar),
you can use
Within the Scheme expression, the input alignment is bound to the Scheme symbol These keywords and their behavior are currently documented here: DartSchemeFunctions
For example, the following code is equivalent to the
(&scheme (stockholmcolumncount alignment))
Macro debugging
To dump the macro-expanded grammar to a file after postprocessing, use the
In combination with the
cd dart
xrate src/handel/t/short.stk g grammars/ancestral_gc.eg x expanded.eg noannotate > /dev/null
cat expanded.eg
Note that the 

Main.XrateFormat r193  20150808  01:47:32  IanHolmes  Biowiki content is in the public domain.