- xrate file format
- Overall structure of grammar files
- Alphabets
- Grammars
- The xrate macro preprocessor
The Participle, with an Article before it, and the Preposition of after it, becomes a Substantive, expressing the action itself which the Verb signifies: as, "These are the Rules of Grammar, by the observing of which you may avoid mistakes."-- Bishop Robert Lowth, A Short Introduction to English Grammar, 1775.
This document describes the grammar file format for the xrate phylo-grammar alignment analysis package.
Please see the xrate software page for more info on the program itself (and e.g. here for more info on phylo-grammars).
The dart/grammars
subdirectory includes many example grammars for DNA, protein and RNA sequences.
Here are a few examples of working xrate grammar files. The techniques illustrated here can be mixed and matched. Some of the grammars use xrate macros, which is like a tiny lisp-like dialect for specifying repetitively-structured grammars.
Grammars that implement point substitution models have two (almost trivial) rules:
S -> X S
where S
is a nonterminal and X
is an alignment column,
and S -> End
.
The emitted alignment column is generated on some phylogeny (which can be specified in the input Stockholm format alignment file, or will otherwise be estimated from that alignment)
using some substitution rate matrix (which is specified as part of the grammar).
The symbol X
is called a pseudoterminal in xrate format jargon.
These grammar files, then, effectively just illustrate the file format for the substitution rate matrix & the notational principle of tying rate matrices to grammars using pseudoterminals:
The above xrate files illustrate the idea of a basic point substitution model. The following xrate files combine several such models, using a grammar to describe how different substitution models are used for different alignment columns.
The XrateFormat for specifying evolutionary context-free grammars uses LispSExpressions and consists of several parts:
These parts are described in more detail below.
The complete syntax of xrate's S-expression-based format (as checked by xrate's built-in syntax validator) can be printed by specifying the command-line option -log SEXPR_SYNTAX
when xrate is run.
(This output is, of course, the grammar of the grammar...)
Alphabet declarations must contain
(alphabet (name DNA) (token (a c g t)) (complement (t g c a)))
(alphabet (name Protein) (token (a r n d c q e g h i l k m f p s t w y v)))
(alphabet (name DNA) (token (a c g t)) (complement (t g c a)) (extend (to u) (from t)) (extend (to r) (from a) (from g)) (extend (to y) (from c) (from t)) (extend (to m) (from a) (from c)) (extend (to k) (from g) (from t)) (extend (to s) (from c) (from g)) (extend (to w) (from a) (from t)) (extend (to h) (from a) (from c) (from t)) (extend (to b) (from c) (from g) (from t)) (extend (to v) (from a) (from c) (from g)) (extend (to d) (from a) (from g) (from t)) (extend (to n) (from a) (from c) (from g) (from t)) (extend (to x) (from a) (from c) (from g) (from t)) (wildcard *))
If the alphabet includes characters that would normally be interpreted as gaps (i.e. -
, .
or _
)
then the --gapchar
command-line option should be used to direct xrate not to interpret those characters as gaps,
but to use an alternative symbol (or symbols).
This is useful if gaps are being modeled as a "fifth nucleotide", "twenty-first amino acid" etc.
(grammar (name Kimura2) ;; rest of grammar goes here )
The name
tag is used to specify a name for the grammar.
The name must be a single S-expression atom (i.e. a string containing no whitespace or parentheses).
(grammar (meta (brief-description This is the Kimura two-parameter model (Kimura, 1980) .) (graphviz-layout-style neato) (readme-url http://biowiki.org/XrateFormat) ;; other meta information goes here ) ;; rest of grammar goes here )
The meta
tag is used to pass information about the grammar to xrate helper applications, such as xREI.
All information inside a meta
tag is preserved, but otherwise ignored, by xrate.
It's up to helper applications to define their own keywords for structuring information inside meta
blocks
(e.g. the example above uses brief-description
, but this is not a keyword that is recognized by xrate itself; only xREI).
A very common tag is parametric
.
This is present at the start of parametric grammars; i.e., grammars whose substitution rates and rule probabilities can be expressed as algebraically structured functions of a parameter set.
(parametric)
, if present, indicates that rule probabilities are algebraic functions of the grammar parameters.
If a grammar is not designated "parametric", it follows that its de facto parametric structure is just that specified by the structure of the xrate-format file: namely, sets of rule probabilities and rate parameters for the various substitution models. This is syntactically simpler, but less expressive.
In this simpler case, grammars can include several tags indicating how the grammar should be "trained".
(update-rules X)
, where X is 0 or 1, determines whether rule probabilities will be updated during EM
(update-rates X)
, where X is 0 or 1, determines whether chain rates and probabilities will be updated during EM
This sort of control over the EM training algorithm is effectively provided for parametric models
via designation of some parameters as const-pgroup
or const-rate
.
An xrate grammar file specifies many rates (for substitution events) and probabilities (for Markov chain equilibria, length distributions & transformation rules). These rates and probabilities can be declared "parametric". Instead of being treated as free variables, specified as numerical constants, that are independently updated during EM updates (as is the default behavior), "parametric" rates (or probabilities) are specified as algebraic functions on a set of parameters associated with the grammar. This allows model designers to richly constrain the parameter space of models, so as to avoid overtraining and similar sparse data problems.
There are two kinds of parameters: rate parameters, and probability parameters.
They can be enclosed in a block named params
, in which case they will be updated during EM, or a block named const
, in which case they will not be updated (i.e. they will be treated as constants).
The params
and const
blocks can be used to specify either rate or probability parameters.
The parameter type is automatically inferred by looking at the sub-block containing the parameter name-value assignments.
If there's only a single parameter in this group, then it's assumed to be a rate parameter; otherwise, it's a probability.
This is somewhat sloppy, and a more rigorous parameter typing can be enforced
by using the pgroup
or const-pgroup
keywords (for co-normalized groups of probability parameters)
or the rate
or const-rate
keywords (for rate parameters).
Parameters declared in pgroup
or rate
blocks will be updated during EM,
whereas parameters declared in const-pgroup
or const-rate
blocks will be held constant.
Note that it is YOUR responsibility to ensure that the parameters in pgroup
or const-pgroup
blocks
are properly co-normalized. While xrate will normalize parameters in a pgroup
or const-pgroup
after each round of EM, it will NOT check that they are normalized on input.
So, if you supply un-normalized parameters and don't do any EM, your final "likelihood" score may not, in fact, be probabilistically kosher.
Conversely, if you supply super-normalized parameters and you do try some EM, the "likelihood" may decrease (since the parameters will go from being super-normalized to being normalized), which may stop EM prematurely.
Despite these issues, tolerance of un-normalized parameters is a feature, not a bug:
it allows you to use e.g. odds-ratios (instead of probabilities) as parameters in a pgroup
, if you're sufficiently fluent in probabilistic voudun to brave such an attempt.
Bottom line: Be Careful.
There are almost no restrictions on parameter names.
A parameter name cannot contain characters that are reserved for the S-expression format (i.e. whitespace, parentheses or semicolons),
nor can it be exactly the same as an arithmetic function or macro (+
, *
, /
, .
, ?
, &foreach
, &int
, etc).
It's suggested that you avoid names beginning with an ampersand &
, as these may be used for new macros in future.
Other than this, you pretty much have a free hand:
you can call a parameter by a name like $#*+/-1^-!!~&%
if you wish,
though it's probably not advisable.
Rate parameters are considered free to vary independently of all other parameters (they must remain nonnegative, but that's the only constraint). They appear as a name-value pair, enclosed by single or double brackets (both are valid).
;; The parameters for Kimura's transition-transversion model. (rate ((alpha 4)) ;; transition rate ((beta 1))) ;; transversion rate
This could also be valid written with single brackets around each rate parameter:
(rate (alpha 4) ;; transition rate (beta 1)) ;; transversion rate
The params
keyword could be used in place of the rate
keyword (but it's more cryptic, so avoid it).
Probability parameters satisfy the following constraints: (i) they are nonnegative; (ii) all probability parameters in a mutually exclusive set must sum to 1.
The syntax for declaring and assigning values to probability parameters is similar to that for rate parameters: each parameter-value is a pair enclosed by parentheses. Sets of mutually exclusive parameters are declared together as a list, enclosed by parentheses.
This example includes both probability parameters (the base frequencies) and rate parameters (the transition and transversion rates). The base frequencies are a mutually exclusive set.
;; Parameters for the HKY85 model (pgroup ((piA .25) (piC .25) (piG .25) (piT .25))) ;; base frequencies (rate ((alpha 1)) ;; transition rate ((beta 1))) ;; transversion rate
This could be written (more cryptically) as a single params
block:
;; Parameters for the HKY85 model (params ((piA .25) (piC .25) (piG .25) (piT .25)) ;; base frequencies ((alpha 1)) ;; transition rate ((beta 1))) ;; transversion rate
Note that, in a params
block, parameter lists containing only one parameter are treated as rates, not probabilities. So, for example, in a construct such as this:
(params ((alpha 1))) ;; transition rate
alpha
is a rate parameter, not a probability parameter. Although it
is enclosed in double parentheses, which would otherwise technically
make it a list containing one probability parameter, that interpretation
is not particularly useful: for a single parameter to "sum to one", it would always have to equal one.
Instead, alpha
is parsed as a rate.
This confusion between rates and probabilities in params
blocks is easy to avoid: just use the rate
and pgroup
keywords instead.
Functions are composed of probability parameters and the binary operators
for multiplication, division and addition (*
, /
and +
).
The operators and operands must be separated by whitespace.
The caret operator (^
) denotes exponentiation and can be used to raise a function to a constant power.
Both the function being exponentiated, and the entire expression (function and exponent), must be enclosed by parentheses,
like so: ((a / b) ^ 0.5)
To guarantee convergence of the EM algorithm, functions are also required to satisfy dimensional constraints that depend on whether they occur in a rate or probability context, as follows. Probability parameters should be considered dimensionless, and rate parameters have dimensions of (i.e. "per unit time"). Dimensional analysis should be used to ensure that rate functions and probability functions composed of such parameters have the according dimensionality. (Numerical constants can be interpreted "sloppily", i.e. their dimensions can be changed to fit.)
For example, if p and q are probabilities and R is a rate,
then R
, p * R
, q * R
and (p + q) * R / 3
are all valid rate functions (but not probability functions);
p
, p * q
and p * (q + p)
are all valid probability functions (but not rate functions);
and 0.5
is both a valid rate function and a valid probability function.
Dimensional constraints are not strictly enforced by xrate, but the EM algorithm may not converge if dimensionally incorrect functions are used. Furthermore, the EM algorithm is not guaranteed to converge if the exponentiation operator is used.
It's important to note that probability parameters are not just dimensionless multipliers in xrate; they must really correspond to probabilities of events. Any rate expression that uses probability parameters must do so in such a way that the probabilities correspond to legitimate, actual divisions of the event space.
Furthermore, every mutually-exclusive event must be specified. Sometimes this means including "unobserved" mutations in order to complete the mutually exclusive set.
A good way to think about events whose rate is given by expressions that combine rate and probability parameters is that an event occurs with a certain rate (specified using the rate parameter); there is then a probabilistic decision about what the type of event is (determined by the probability parameters). Some of the outcomes of this decision may be such that the event is unobserved (or "rejected"). In order to correctly count such occurrences, it's necessary to tell xrate that they exist, even though the unobserved events do not result in changes.
The Felsenstein 1981 model may be a useful illustrative example here. In one formulation of this model, "replacement" events occur at a constant rate . In a replacement event, a nucleotide X is replaced by another nucleotide Y, which is chosen with probability . In other words, once a replacement event is determined to have occurred at a nucleotide (X), there is a probabilistic decision as to which nucleotide (Y) will replace it. There is, consequently, a probability of that X and Y are the same nucleotide, in which case the replacement is unobserved. Thus, the effective substitution rate from X is , while the rate of unobserved replacements is .
In order for xrate to accurately estimate probability parameters that are used in rate expressions,
it seems to be necessary to include the unobserved substitutions.
This can be done by specifying them as mutations-to-self.
In the above example, the three observed mutations from state a
are
(mutate (from (a)) (to (c)) (rate piC * R)) (mutate (from (a)) (to (g)) (rate piG * R)) (mutate (from (a)) (to (t)) (rate piT * R))
while the fourth unobserved mutation from state a
is
(mutate (from (a)) (to (a)) (rate piA * R))
This fourth mutation, even though it does not affect the likelihood, does affect the EM counts that are computed for piA
,
and therefore seems to be needed for EM training to work correctly
(particularly in the limit of sparse training data).
Examples of this sort of thing can be found in the parametric Jukes-Cantor, HKY85 and Fels81 grammars distributed with xrate.
When a parameter name is preceded by the const
operator in an expression, then that particular instance of the parameter
is considered to be a constant for the purposes of estimating expected summary statistics and subsequent EM training.
Evaluation of the function proceeds as normal, but the derivative of this expression (with respect to this parameter only) is regarded to be zero.
For example, consider the following two functions. The first function evaluates to x+3, and its derivative w.r.t. x is 1:
(x + 3)
The second function still evaluates to x+3, but xrate thinks its derivative w.r.t. x is zero:
((const x) + 3)
Pray (to your preferred deities, patron saints, Unix daemons or cyberspace loa) that you never need to use this construct.
Using it correctly requires a pretty good knowledge of the way that the EM algorithm functions.
This (clearly) is rather dark magic and is not recommended for the casual practicioner.
Examples can be found in DartGrammar:sn.eg, where the ~k
parameter is never observed directly, but is confounded with parameters that are.
After training, the grammar file contains an expected-counts
block.
The counts in this block indicate the expected counts for each probability and rate parameter,
and the expected wait times for each rate parameter.
For example, training the HKY85 model on a short alignment (Rfam:RF00155) yields the following counts:
(expected-counts (piA 36.0195) (piC 29.0076) (piG 37.0184) (piT 35.0149) (alpha 1.01919 1.67501) (beta 2.03806 3.3364)) ;; end observed-counts
The interpretation of this block is that the event whose probability is given by piA
occurred an expected 36.0195 times
(and piC
an expected 29.0076 times, etc.)
while the event whose rate is given by alpha
occurred an expected 1.01919 times in an expected total available time of 1.67501.
Parameter-specific pseudocounts for training can be specified in the file.
These have the same format as the observed-counts
block, but the tag is pseudocounts
.
These counts are added to the expected counts obtained at the E-step of EM, and the totals are used in the M-step. This is equivalent to specifying a Dirichlet prior for probability parameters, or a gamma prior for rate parameters.
An example, for HKY85:
(pseudocounts (piA 36.0195) (piC 29.0076) (piG 37.0184) (piT 35.0149) (alpha 1.01919 1.67501) (beta 2.03806 3.3364))
A chain
block declares a Markov chain with rate matrix
and initial probability distribution
where the states are -mer words over the grammar's terminal alphabet
(plus an optional hidden "label").
The declaration contains several parts:
irrev
indicating an irreversible chain
rev
indicating a general reversible chain
rind
indicating a reversible chain with the constraint , as in Bill Bruno's "RIND" paper (Bruno WJ. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol Biol Evol. 1996 Dec;13(10):1368-74.).
parametric
indicating a general irreversible chain, the rates and initial probabilities of which are not free variables, but rather algebraic functions of the separately-declared grammar parameters. During EM, the parameters will be updated, but the form of the functions will remain unchanged.
This first example is the irreversible "unidirectional cycler":
(chain ; declare a Markov chain (update-policy irrev) ; EM update policy (terminal (RX)) ; abstract state label (initial (state (a)) (prob 1)) ; initial distribution (mutate (from (a)) (to (c)) (rate 1)) (mutate (from (c)) (to (g)) (rate 1)) (mutate (from (g)) (to (t)) (rate 1)) (mutate (from (t)) (to (a)) (rate 1)))
See here for an illustration of this chain's topology.
This example is a reversible RNA chain with a hidden class variable that can take two states, labeled 1
and 2
.
(chain (update-policy rev) (terminal (NUC)) (hidden-class (row CLASS) (label (1 2))) ;; initial probability distribution (initial (state (a 1)) (prob 0.0238705)) (initial (state (c 1)) (prob 0.136706)) (initial (state (g 1)) (prob 0.0136832)) (initial (state (u 1)) (prob 0.204652)) (initial (state (a 2)) (prob 0.305866)) (initial (state (c 2)) (prob 0.0413438)) (initial (state (g 2)) (prob 0.170151)) (initial (state (u 2)) (prob 0.103727)) ;; mutation rates (mutate (from (a 1)) (to (c 1)) (rate 0.157809)) (mutate (from (a 1)) (to (g 1)) (rate 0.310445)) (mutate (from (a 1)) (to (u 1)) (rate 0.542682)) (mutate (from (a 1)) (to (a 2)) (rate 0.0196586)) (mutate (from (c 1)) (to (a 1)) (rate 0.0275553)) (mutate (from (c 1)) (to (g 1)) (rate 0.0153761)) (mutate (from (c 1)) (to (u 1)) (rate 0.107003)) (mutate (from (c 1)) (to (c 2)) (rate 0.000126849)) (mutate (from (g 1)) (to (a 1)) (rate 0.541573)) (mutate (from (g 1)) (to (c 1)) (rate 0.153619)) (mutate (from (g 1)) (to (u 1)) (rate 0.803682)) (mutate (from (g 1)) (to (g 2)) (rate 0.0200437)) (mutate (from (u 1)) (to (a 1)) (rate 0.063298)) (mutate (from (u 1)) (to (c 1)) (rate 0.0714773)) (mutate (from (u 1)) (to (g 1)) (rate 0.0537349)) (mutate (from (u 1)) (to (u 2)) (rate 0.0036554)) (mutate (from (a 2)) (to (a 1)) (rate 0.0015342)) (mutate (from (a 2)) (to (c 2)) (rate 0.0411113)) (mutate (from (a 2)) (to (g 2)) (rate 0.0812237)) (mutate (from (a 2)) (to (u 2)) (rate 0.165302)) (mutate (from (c 2)) (to (c 1)) (rate 0.000419434)) (mutate (from (c 2)) (to (a 2)) (rate 0.304146)) (mutate (from (c 2)) (to (g 2)) (rate 0.141506)) (mutate (from (c 2)) (to (u 2)) (rate 0.969275)) (mutate (from (g 2)) (to (g 1)) (rate 0.00161188)) (mutate (from (g 2)) (to (a 2)) (rate 0.146009)) (mutate (from (g 2)) (to (c 2)) (rate 0.0343834)) (mutate (from (g 2)) (to (u 2)) (rate 0.0920692)) (mutate (from (u 2)) (to (u 1)) (rate 0.00721206)) (mutate (from (u 2)) (to (a 2)) (rate 0.487438)) (mutate (from (u 2)) (to (c 2)) (rate 0.386336)) (mutate (from (u 2)) (to (g 2)) (rate 0.151028))) ;; end chain NUC
See here for an illustration of this chain's topology.
A hybrid-chain
block describes a mutation process that varies across the phylogenetic tree.
Any one of several component chains may be used on each branch of the tree.
The "#=GS" tag in the Stockholm format alignment, which is used to specify by-sequence annotation,
selects the component chain that will be used on a particular branch.
Each component chain must have the same number of pseudoterminals as the hybrid chain. These pseudoterminals must appear in the same order that they do in the original declaration of the component chain. The same is true of hidden classes (if there are any): component chains must have the same number of hidden classes, and the same hidden class labels, in the same order, as the hybrid chain. (If the hybrid chain has no hidden classes, then the component chains are not allowed to either.) These rules are just a long-winded way of saying that the state space for the component chains must exactly match the state space for the hybrid chain.
A further requirement is that each component chain must have the "parametric" update policy.
The component-chain selection works as follows.
Every hybrid chain has a row
tag and every component chain has a label
tag.
Suppose that the row tag for the hybrid-chain is R and the label tag for a particular component-chain (C) is L.
If the alignment includes the line "#=GS S R L", then the branch from S's parent to S will use chain C.
The component chain that appears first in the hybrid-chain block is assumed to be the default for unlabeled branches.
Note that this syntax can also be used to select the component chain for internal branches of the tree. Internal nodes can be named (in NewickFormat) and given "#=GS" labels even though they may not have sequence data in the alignment. If the alignment includes the line "#=GS N R L", then the branch from N's parent to N will use chain C.
Calculation of column likelihoods requires not just a mutation rate matrix but also a set of initial probabilities. Suppose node N is the root node. If the alignment includes the line "#=GS N R L", then the initial probability distribution will be taken from chain C. (A similiar rule applies if N is not the root node, but its parent node is specified as being gapped. This is however a rather unusual situation, since alignment rows for internal nodes are not usually specified.)
Hybrid chains are sufficiently flexible to allow any lineage-specific parameterization. Typically, however, only a few parameterizations are of interest.
Several #=GS
tags are implicitly defined to facilitate working with such common parameterizations.
These implicit tags can be useful in conjunction with iteration macros.
For every pair of tree nodes, one of the following two lines is implicitly defined:
#=GS NODE1 =NODE2 1 (if NODE1 and NODE2 are identical) #=GS NODE1 =NODE2 0 (if NODE1 and NODE2 are not identical)
Here NODE1 and NODE2 are named tree nodes, as defined in the #=GF NH
field of the Stockholm alignment.
For every pair of tree nodes, one of the following two lines is implicitly defined:
#=GS NODE1 :NODE2 1 (if NODE1 is descended from NODE2) #=GS NODE1 :NODE2 0 (if NODE1 is not descended from NODE2)
Here NODE1 and NODE2 are named tree nodes, as defined in the #=GF NH
field of the Stockholm alignment.
For every tree node, the following line is implicitly defined:
#=GS NODE ? NODE
Here NODE is a named tree node, as defined in the #=GF NH
field of the Stockholm alignment.
Suppose that (COD1 COD2 COD3)
is defined elsewhere as a codon substitution model
(i.e. three nucleotides evolving together as a single codon unit)
and (NULL1 NULL2 NULL3)
is a triplet null model
(three nucleotides evolving independently).
The following hybrid chain block describes a model that switches between these two component models
according to #=GS ... HLABEL ...
lines in the input alignment:
(hybrid-chain (terminal (HYB1 HYB2 HYB3)) (row HLABEL) (components ;; submodel COD1... selected by "#=GS SeqName HLABEL GENE" ;; this first model is also assumed to be the default for unlabeled nodes ((label GENE) (terminal (COD1 COD2 COD3))) ;; submodel NULL1... selected by "#=GS SeqName HLABEL PSEUDOGENE" ((label PSEUDOGENE) (terminal (NULL1 NULL2 NULL3)))))
Grammar nonterminal symbols in xrate are classified into certain classes, or state types. These include emit, null and bifurcation nonterminals. The type of a given nonterminal depends on the form of the production rules that can be applied to that nonterminal.
The start nonterminal can be any of these three state types, and is chosen as follows.
If there are any nonterminal
blocks in the grammar, then the start nonterminal is the first nonterminal that is declared in such a block.
Otherwise, the start nonterminal is the first nonterminal that appears on the left-hand side of a production rule.
Production rules contain the following elements:
Schematically, the above describes the production rule: .
Example:
(transform (from (F DY)) (to (F* DX DY)) (prob 1) (annotate (column DX) (row EMIT_ANNOT) (label D)))
If the grammar was declared as parametric, i.e. it has a (parametric)
tag,
then the prob
expressions of production rules will be interpreted as functions of
the grammar parameters, rather than independent variables.
Emit nonterminals, with their associated emit rules, describe the way in which columns of alphabet symbols (co-evolving according to an underlying phylogenetic tree) are generated by the grammar.
An emit nonterminal X
is always paired with a corresponding null nonterminal
whose identifier is formed by appending an asterisk to the name of the emit nonterminal,
i.e. X*
.
There can be only one production rule having X
on the left-hand side.
This rule must have a single nonterminal, X*
, on the right-hand side,
together with one or more emitted pseudoterminals.
The list of pseudoterminals on the RHS must exactly match the pseudoterminal list of
a declared Markov chain (though the order of the pseudoterminals can be permuted).
A convenient shorthand allows pseudoterminals to be complemented, so that chains can be re-used for both forward and reverse strands.
If a pseudoterminal is prefixed by the tilde symbol (~
),
and the alphabet has the complement
property (i.e. DNA or RNA; see above description of alphabets),
then the emitted alphabet symbol is the complement of the pseudoterminal's actual state.
As mentioned above, the RHS pseudoterminal list must exactly match the pseudoterminal list of a chain. Additionally, some of these pseudoterminals may also appear on the left-hand side (in the same order and complementarity that they appear on the RHS). If a pseudoterminal appears on both the LHS and RHS, this indicates that it is not emitted, but rather contextual. This is used to approximate a context-dependent substitution model following the approach of Siepel and Haussler (Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004 Mar;21(3):468-88. Epub 2003 Dec 5.). See the CpG dinucleotide aversion example below for an illustration of this syntax.
The rule probability for an emit rule is always taken to be 1.
An emit rule can be accompanied by zero or more annotations.
Loosely, each annotation corresponds to a "hidden label" for one or more of the columns.
An unannotated alignment can be annotated by running xrate
in "annotate" mode
(either using the Viterbi algorithm, for HMMs, or Cocke-Younger-Kasami for SCFGs).
Each annotation also has a row label, corresponding to the label of the #=GC
line
in the annotated StockholmFormat alignment.
By default, any "unannotated" columns will be annotated with a period.
The annotation can also be used for supervised training of the grammar,
by pre-annotating a StockholmFormat alignment (or database of alignments)
and running xrate
on this file in "training" mode
(using Forward-Backward for HMMs, or Inside-Outside for SCFGs).
In this case, periods will be treated as wildcards (i.e. missing data).
Various tags can be used to control the gap behavior of emit rules.
gaps-ok
is the default tag, specifying that gaps will be treated as wildcards;
no-gaps
means that gaps are not allowed at all;
strict-gaps
means that gaps are allowed (and are treated as wildcards), but they must be all-or-nothing. Either every position contains a gap or none of them do.
See also the comments on alternative gap characters.
The minlen
and maxlen
keywords can be used within emit rules to constrain the minimum/maximum length of subsequences that can be generated by this particular nonterminal.
These keywords can alternately be placed in nonterminal modifier blocks.
The infix
keyword is used within emit rules to indicate that the maxlen
for a particular nonterminal should automatically be set to the maximum infix-subsequence length specified at the command line.
The prefix
and suffix
keywords can be used within emit rules to indicate to xrate that a particular nonterminal is used to generate prefix or suffix strings (e.g. flanking regions).
The only effect of these keywords is to suppress warnings that usually occur if the maxlen
or minlen
properties are set to larger than the maximum infix-subsequence length.
(These warnings do not apply to prefix or suffix sequences, which are not subject to the maximum infix-subseq-length constraint.)
These keywords can alternately be placed in nonterminal modifier blocks.
Emit three nucleotides (C1 C2 C3
), using a 64-state Markov chain over trinucleotides,
from nonterminal CODON
(transform (from (CODON)) (to (C1 C2 C3 CODON*)))
A similar codon emission to the previous example, but reverse-complemented
(transform (from (REVCOMP)) (to (~C3 ~C2 ~C1 REVCOMP*)))
An RNA basepair, emitted from nonterminal BASEPAIR
, using a Markov chain with pseudoterminals LEFT_BASE
and RIGHT_BASE
, which are respectively annotated "<" and ">" on the "#=GC SS ..." line in the output alignment. The (minlen 5)
clause ensures that the two paired bases must be separated by at least 3 nucleotides.
(transform (from (BASEPAIR)) (to (LEFT_BASE BASEPAIR* RIGHT_BASE)) (annotate (column LEFT_BASE) (row SS) (label <)) (annotate (column RIGHT_BASE) (row SS) (label >)) (minlen 5))
An RNA basepair, emitted from nonterminal BASEPAIR
, using a Markov chain with pseudoterminals LEFT_BASE
and RIGHT_BASE
, which are respectively annotated on the "#=GC SS ..." line in the output alignment as "<" and ">" (with probability .9) or "_" (with probability .1). Such probabilistic annotation allows for cases where the basepair has been omitted in a training alignment. It also allows additional types of information (e.g. transcriptomics data) to be supplied to the model.
(transform (from (BASEPAIR)) (to (LEFT_BASE BASEPAIR* RIGHT_BASE)) (annotate (row SS) (emit (label (< >)) (prob .9)) (emit (label (_ _)) (prob .1))))
Emit a single nucleotide (DY
), given a previous nucleotide of "context" (DX
),
using a 16-state Markov chain over dinucleotides (pseudoterminal labels DX
and DY
),
annotating the emitted column (DY
) with character "D" in the row labeled EMIT_ANNOT
(NB in a StockholmFormat alignment, this annotation row will be labeled #=GC EMIT_ANNOT
):
(transform (from (DX F)) (to (DX DY F*)) (annotate (column DY) (row EMIT_ANNOT) (label D)))
Production rules describing transformations from null nonterminals have a single nonterminal on the LHS, and either a single nonterminal or an empty list on the RHS. (An empty list is equivalent to a transition to the "end" nonterminal.) Null rules have no pseudoterminals or annotations.
Transition from S
to end nonterminal, probability 0.5
(transform (from (S)) (to ()) (prob 0.5))
Transition from nonterminal REVCOMP*
(presumed to be a null nonterminal paired to a corresponding emit nonterminal REVCOMP
)
to nonterminal S
, probability 1
(transform (from (REVCOMP*)) (to (S)))
Production rules describing transformations from bifurcation nonterminals have a single nonterminal on the LHS, and two nonterminals on the RHS. Bifurcation rules have no pseudoterminals or annotations.
Only one bifurcation rule is allowed for each bifurcation nonterminal. The rule probability for a bifurcation rule is thus always taken to be 1.
Bifurcating transformation from nonterminal S
into nonterminals S
and T
(transform (from (S)) (to (S T)))
(nonterminal (name START)) (nonterminal (name INTERGENIC) (minlen 1) (sum-from)) (nonterminal (name FWD_PFOLD_S) (infix)) (nonterminal (name FWD_PFOLD_F) (minlen 2)) ...
The minlen
, maxlen
, prefix
, suffix
, infix
, sum-from
and gff
keywords can be collected together in a nonterminal
block that declares the nonterminal symbol along with any modifiers.
(This is the preferred way to use these modifiers, and currently the only way of specifying minlen
or maxlen
constraints on nonterminals that do not have associated emissions.)
The indel model declaration can also be located in this block, if desired.
Note that the first nonterminal
block, if any such blocks are present, is used to decide the start nonterminal for the grammar.
(If no such blocks are present, the start nonterminal is the nonterminal appearing on the left-hand side of the first transformation rule in the grammar.)
See above for a description of minlen
, maxlen
, prefix
, suffix
& infix
.
The sum-from
modifier is used to denote "sum nonterminals" (aka "sum states").
These are nonterminals for which, during the CYK matrix fill algorithm, all outgoing transition probabilities are summed rather than maxed.
This is crucial for certain constructs (like using several identical states to approximate an arbitrary length distribution over some feature).
It's also useful for annotation, e.g. genefinding.
As a general principle, you want your model to be as detailed and accurate as possible,
so you might want to model the fine structure of a feature (e.g. secondary structure of a noncoding RNA gene).
However, you don't care about the details of this submodel when calling the annotation;
in fact, you want to sum those details out, and simply report an overall probability for this region being "an ncRNA"
(as opposed to "an ncRNA with this particular structure").
The sum-from
modifier allows you to do this.
The following example tells the CYK algorithm to sum, rather than max, over all outgoing transitions from nonterminal S
:
(sum-from S)
Note that the sum-from
modifier can also appear inside a nonterminal
declaration block, in which case the name of the nonterminal is implicit.
The gff
modifier is used to signal that a particular GFF format annotation should be output whenever a given nonterminal is used in the maximum-likelihood (CYK) parse tree.
This provides a way to generate GFF directly from xrate.
For example:
(gff (nonterminal FWD_RNA_GENE) (strand +) (type ncRNA)) (gff (nonterminal REV_RNA_GENE) (strand -) (type ncRNA))
The nonterminals in this example refer to the ncRnaDualStrand grammar.
The only mandatory argument is nonterminal
which specifies the nonterminal that generates this GFF annotation.
The GFF fields source
, type
, strand
& frame
can also (optionally) be specified.
In the output, the GFF score
field contains the base2-log of either
the posterior probability that the parse tree contained the given GFF feature (if the Inside-Outside algorithm was executed),
or the CYK traceback probability for the feature (if Inside-Outside was not executed).
The GFF type
field defaults to the name of the nonterminal, but may be overridden (as in the above example).
The GFF source
field defaults to the grammar name, but may be overridden.
If the Stockholm alignment input to xrate contains a #=GF ID <foo>
line, then the reference sequence (column 1) in xrate's GFF annotation will be <foo>
.
Otherwise, the reference sequence will be the number of the alignment in the input set (i.e. Alignment1
, Alignment2
, ...).
Posterior probabilities, indicating confidence levels for the annotation, are added to the GFF group
field
if the appropriate command-line option was specified.
The GFF group
field is as specified in the gff
block, with two exceptions.
As per the GFF3 specification,
a unique ID
is assigned to every feature.
If the group
field of the gff
block contains ID
or Parent
tags,
then these are substituted for the corresponding autogenerated IDs in the GFF output.
For convenience, you can use S-expression nesting to specify tag-value pairs and multiple-value tuples in the group field,
as an optional alternative to the usual GFF3 convention of using equals signs, semicolons and commas (i.e. tag1=value1;tag2=value2,value3;tag3=value4,value5,value6
).
For example, this....
...is exactly equivalent to this...(nonterminal (name S) (gff (group ID=this Parent=this)) (gff (type blah2) (group (ID that1) (Parent that2))) (gff (source blah3) (group (ID this3) (Parent (this3 this that1)))))
(nonterminal (name S) (gff (group "ID=this;Parent=this")) (gff (type blah2) (group "ID=that1;Parent=that2")) (gff (source blah3) (group "ID=this3;Parent=this3,this,that1")))
Various forms of annotation can be produced using several grammar directives.
See alignment annotation.
Stockholm #=GC
lines are the only form of annotation that can also be used to constrain the training algorithms (i.e. to do partially or completely supervised training).
See the gff tag.
(wiggle (name Track1) (nonterminal S)) (wiggle (name Track2) (terminal X)) (wiggle (name Track3) (nonterminal CODON) (terminal POS1)) (wiggle (name Track4) (component (nonterminal S) (weight 1)) (component (terminal X) (weight 2)) (component (nonterminal CODON) (terminal POS1) (weight 3)))
The wiggle
field directs xrate
to generate a wiggle format track with one data point per alignment column,
using a weighted sum of posterior probabilities that the given alignment column was generated using a particular nonterminal/terminal combination.
The above construct would generate four wiggle tracks:
Track1
for the posterior probability that a column was generated on a transition from nonterminal S
(by any terminal);
Track2
for the posterior probability that a column was generated by terminal X
(on a transition from any nonterminal);
Track3
for the posterior probability that a column was generated by terminal POS1
on a transition from nonterminal CODON;
Track4
for the weighted combination Track1+2*Track2+3*Track3
.
If the Stockholm alignment input to xrate contains a #=GF ID <foo>
line, then the name
field of the wiggle track will be <foo>
.
Otherwise, the name
field will be the number of the alignment in the input set (i.e. Alignment1
, Alignment2
, ...).
(fold-string-tag SS_cons)
In some circumstances, it is desirable to constrain the set of subsequence coordinates that are visited during dynamic programming. For example, when using a secondary structure grammar (such as pfold.eg) for the purpose of ancestral reconstruction on a large alignment, one typically does not want the algorithm to iterate over (or allocate memory for) all L^2 subsequences.
The fold-string-tag
can be used to direct xrate to interpret a particular #=GC
line as a secondary structure,
and restrict the dynamic programming matrix (and iteration) to subsequences consistent with that structure.
Several kinds of macro are automatically expanded by xrate before any training or alignment annotation takes place. Macro expansion is a one-off, irreversible event: if the grammar file is saved after macro substitution has taken place, the original macros will not be recoverable.
Preprocessing and parsing take place in the following order:
&include
directives (everything from here onwards refers to the grammar file);
alphabet
declaration;
&define
, &foreach
and &warn
directives (a preorder tree traversal, with some lookahead);
&scheme
or &scheme-discard
expressions (if compiled with Guile Scheme support);
(&include ~/dart/grammars/hky85.eg)
The &include
directive includes another file (just like #include
in C).
Shell globbing is performed on the filename, so you can use shorthands like tildes and wildcards, as in the above example.
(&warn Generating column COLUMN ...)
Prints the atoms following &warn
to standard error.
Useful to monitor progress during generation of large grammars, e.g. this column-specific model (from whence the example is taken):
DartGrammar:site_specific.eg
The &define
macro is used to indicate that one expression should be substituted for another.
For example,
(&define X yellow) curious X (mellow X)
evaluates to
curious yellow (mellow yellow)
Currently, only atomic expressions may be substituted in; so, for example, (&define X (bright yellow))
is not a legal usage,
since (bright yellow)
, unlike yellow
, is a list rather than an atom.
Note also that the binding is static. You cannot use &define
to create a dynamic function that evaluates to different values at different times.
If this is the sort of thing you need to do, you will be better off using Scheme (perhaps with the experimental Scheme language extensions to xrate).
The following operators fold a list into a single element during macro preprocessing.
The &cat
function (or the shorthand .
) concatenates a list of atoms into a single atom.
For example,
(&. X Y Z) (&cat X Y Z)
both evaluate to
XYZ
The &sum
function (or the shorthand +
) takes the sum of a list of integer values.
For example,
(&+ 1 10 -5) (&sum 1 10 -5)
both evaluate to
6
The &mul
function (or the shorthand *
) takes the product of a list of floating-point values.
For example,
(&* 2 3 5) (&mul 2 3 5)
both evaluate to
30
(&/ X Y) (&div X Y)
Both evaluate to the floating-point division X / Y
.
(&% A B) (&mod A B)
Both evaluate to the integer modulus operation X mod Y
.
(&- X Y) (&sub X Y)
Both evaluate to the integer subtraction X - Y
.
The following macros generate a list of elements from a template during preprocessing.
(&foreach VAR (LIST) EXPR)
Inserts one copy of EXPR for every element of LIST. Any occurrences of VAR within EXPR will be replaced by the corresponding element of LIST. For example,
(&foreach VAR (1 2 3) (VAR + 1))
evaluates to
(1 + 1) (1 + 2) (1 + 3)
EXPR
can include more than one item, e.g.
(&foreach VAR (1 2 3) VAR *)
evaluates to
1 * 2 * 3 *
As foreach
, but the list is specified as an integer range.
This can be useful for specifying arrays of states, c.f. Profile HMMs.
(&foreach VAR (MINVAL MAXVAL) EXPR)
For example,
(&foreach-integer VAR (1 3) VAR *)
evaluates to
1 * 2 * 3 *
As foreach
, but the list is taken to be the set of alphabet tokens.
(&foreach-token VAR EXPR)
As foreach
, but the list is taken to be [some subset of] the node names in the tree.
(NB this macro is data-dependent, and it only works if the input alignment database contains exactly one alignment;
"the tree" refers to the tree specified in the #=GF NH
field of this alignment.)
(&foreach-node VAR EXPR)
The various forms allow iteration over all named nodes (&==&foreach-node==), all named nodes except the root (&==&foreach-branch==), all named leaf nodes (&==&foreach-leaf==) or all named internal nodes (&==&foreach-ancestor==).
You can do some basic logic in the macro language. For more elaborate computations, use the built-in scheme interpreter.
(&= SEXPR1 SEXPR2) (&eq SEXPR1 SEXPR2) (&!= SEXPR1 SEXPR2) (&neq SEXPR1 SEXPR2)
If the two S-expressions, SEXPR1
and SEXPR2
, are equal/inequal, evaluate to 1
; otherwise, evaluate to 0
.
(&> EXPR1 EXPR2) (> EXPR1 EXPR2) (&< EXPR1 EXPR2) (< EXPR1 EXPR2) (&>= EXPR1 EXPR2) (&geq EXPR1 EXPR2) (&<= EXPR1 EXPR2) (&leq EXPR1 EXPR2)
Arithmetic comparisons between numerical expressions, returning 1
for true and 0
for false.
(&? TEST_EXPR TRUE_EXPR FALSE_EXPR) (&if TEST_EXPR TRUE_EXPR FALSE_EXPR)
If the integer value of TEST_EXPR
is nonzero, evaluates to TRUE_EXPR
; otherwise, evaluates to FALSE_EXPR
.
<code> (&and X Y) (&or X Y) (&not X) </code>
These do what you'd expect.
&and
and &or
don't just have to be binary; they can be list operators too, e.g.
<code> (&and X Y Z) (&or A B C D E) </code>
(&chr INT) (&ord CHAR)
&ord
evaluates to the ASCII value of character CHAR
.
&chr
evaluates to the character whose ASCII value is given by INT
.
<code> (&int EXPR) </code>
&int
evaluates to the integer value of EXPR
.
Some special constants are auto-substituted during macro expansion.
Evaluates to the number of tokens in the terminal alphabet.
Each of these evaluates to the number of tree nodes of a particular class. The respective classes are
&NODES
&BRANCHES
&LEAVES
&ANCESTORS
As with foreach-node
, etc.,
these macros only work if the input alignment database contains exactly one alignment.
Evaluates to the number of columns in the alignment.
This macro only works if the input alignment database contains exactly one alignment.
At some point, the xrate macros may become too limiting for you, at which point you may decide you need to write an actual program to generate your grammar (hey, it happens). If you compiled xrate on a system with the Guile library present, you can evaluate arbitrary Scheme expressions inside a grammar file.
The &scheme
macro evaluates a Scheme expression and interpolates the result.
It is approximately equivalent to the unquote-splicing
expression in a Scheme quasiquote
environment.
For example, the following code will be transformed to (blakes 7)
by the Scheme evaluation phase of the macro preprocessor, before your grammar is parsed by the xrate phylogrammar interpreter:
(blakes (&scheme (define x 3) (define (y a) (+ a 5))) (&scheme (y 2)))
Evaluation of Scheme expressions is performed after expansion of all other macros. The order of evaluation is a depth-first recursive traversal of the S-expression tree.
If you want to evaluate a Scheme expression and discard the return value (i.e. to change the Scheme environment without adding anything to the grammar),
you can use &scheme-discard
in place of &scheme
.
This is rarely necessary, as the most common way to change the Scheme environment is with a define
(as in the above example),
which returns an unspecified return value that is automatically discarded anyway.
Within the Scheme expression, the input alignment is bound to the Scheme symbol alignment
, and the Scheme primitives provided by xrate are all available.
These keywords and their behavior are currently documented here: DartSchemeFunctions
For example, the following code is equivalent to the &COLUMNS
macro:
(&scheme (stockholm-column-count alignment))
To dump the macro-expanded, grammar to a file after post-processing, use the -x
option to xrate.
(Note: this will not include any constructs that were not parsed correctly. See below for more on this.)
In combination with the --noannotate
option, which disables annotation (and hence prevents any dynamic programming from taking place),
this can be useful to test correct usage of macros.
For example:
cd dart xrate src/handel/t/short.stk -g grammars/ancestral_gc.eg -x expanded.eg --noannotate > /dev/null cat expanded.eg
Note that the -x
option dumps the grammar after parsing it.
If your macros generated any invalid syntax (i.e. text that was not recognized by xrate's grammar parser), this will not appear in the grammar file dumped by the -x
option.
If you want to debug all macro expansions (including any invalid syntax), use the logging option -log ECFG_MACROS
to dump out the grammar file after macro expansion, but prior to parsing.
ERROR: problems during latex INPUT: \documentclass[fleqn,12pt]{article} \usepackage{amsmath} \usepackage[normal]{xcolor} \setlength{\mathindent}{0cm} \definecolor{teal}{rgb}{0,0.5,0.5} \definecolor{navy}{rgb}{0,0,0.5} \definecolor{aqua}{rgb}{0,1,1} \definecolor{lime}{rgb}{0,1,0} \definecolor{maroon}{rgb}{0.5,0,0} \definecolor{silver}{gray}{0.75} \usepackage{latexsym} \begin{document} \pagestyle{empty} \pagecolor{white} { \color{black} \begin{math}\displaystyle P_X \lambda\end{math} } \clearpage { \color{black} \begin{math}\displaystyle C\end{math} } \clearpage { \color{black} \begin{math}\displaystyle N\end{math} } \clearpage { \color{black} \begin{math}\displaystyle \mbox{LHS} \to \mbox{RHS}\end{math} } \clearpage { \color{black} \begin{math}\displaystyle P_Y\end{math} } \clearpage { \color{black} \begin{math}\displaystyle t^{-1}\end{math} } \clearpage { \color{black} \begin{math}\displaystyle \lambda\end{math} } \clearpage { \color{black} \begin{math}\displaystyle N=2\end{math} } \clearpage { \color{black} \begin{math}\displaystyle A\end{math} } \clearpage { \color{black} \begin{math}\displaystyle (1-P_X)\lambda\end{math} } \clearpage { \color{black} \begin{math}\displaystyle R_{ij}\end{math} } \clearpage { \color{black} \begin{math}\displaystyle R_{ij} \propto \pi_j\end{math} } \clearpage { \color{black} \begin{math}\displaystyle N=3\end{math} } \clearpage { \color{black} \begin{math}\displaystyle {\cal O}((A^N C)^6)\end{math} } \clearpage { \color{black} \begin{math}\displaystyle i,j\end{math} } \clearpage { \color{black} \begin{math}\displaystyle P_X\end{math} } \clearpage { \color{black} \begin{math}\displaystyle N=1\end{math} } \clearpage { \color{black} \begin{math}\displaystyle \pi_i\end{math} } \clearpage \end{document} STDERR: This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian) entering extended mode (/tmp/EmZfkmduyO/bQYFUG9e_G LaTeX2e <2009/09/24> Babeland hyphenation patterns for english, usenglishmax, dumylang, noh yphenation, farsi, arabic, croatian, bulgarian, ukrainian, russian, czech, slov ak, danish, dutch, finnish, french, basque, ngerman, german, german-x-2009-06-1 9, ngerman-x-2009-06-19, ibycus, monogreek, greek, ancientgreek, hungarian, san skrit, italian, latin, latvian, lithuanian, mongolian2a, mongolian, bokmal, nyn orsk, romanian, irish, coptic, serbian, turkish, welsh, esperanto, uppersorbian , estonian, indonesian, interlingua, icelandic, kurmanji, slovenian, polish, po rtuguese, spanish, galician, catalan, swedish, ukenglish, loaded. (/usr/share/texmf-texlive/tex/latex/base/article.cls Document Class: article 2007/10/19 v1.4h Standard LaTeX document class (/usr/share/texmf-texlive/tex/latex/base/fleqn.clo) (/usr/share/texmf-texlive/tex/latex/base/size12.clo)) (/usr/share/texmf-texlive/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texmf-texlive/tex/latex/amsmath/amstext.sty (/usr/share/texmf-texlive/tex/latex/amsmath/amsgen.sty)) (/usr/share/texmf-texlive/tex/latex/amsmath/amsbsy.sty) (/usr/share/texmf-texlive/tex/latex/amsmath/amsopn.sty)) (/usr/share/texmf/tex/latex/xcolor/xcolor.sty (/etc/texmf/tex/latex/config/color.cfg) (/usr/share/texmf-texlive/tex/latex/graphics/dvips.def)) (/usr/share/texmf-texlive/tex/latex/base/latexsym.sty) No file bQYFUG9e_G.aux. (/usr/share/texmf-texlive/tex/latex/base/ulasy.fd) [1] [2] [3] ! Text line contains an invalid character. l.32 \begin{math}\displaystyle \mbox{^^[ LHS} \to \mbox{RHS}\end{math} [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] ! Undefined control sequence. l.82 \begin{math}\displaystyle {\ca ^^Al O}((A^N C)^6)\end{math} ! Text line contains an invalid character. l.82 \begin{math}\displaystyle {\ca^^A l O}((A^N C)^6)\end{math} [14] [15] [16] [17] [18] (./bQYFUG9e_G.aux) ) (see the transcript file for additional information) Output written on bQYFUG9e_G.dvi (18 pages, 4340 bytes). Transcript written on bQYFUG9e_G.log.