Xrate Grammars

XRate grammar files

This page describes the repository of XRATE grammar files included in the DART software package.

See xrate format for a description of the file format.

Example xrate grammar files

The dart/grammars subdirectory includes many example grammars for DNA, protein and RNA sequences.

Here are a few examples of working xrate grammar files. The techniques illustrated here can be mixed and matched. Some of the grammars use xrate macros, a small Lisp-like dialect for specifying repetitively structured grammars.

Point substitution models

Grammars that implement point substitution models have two (almost trivial) rules: S -> X S, where S is a nonterminal and X is an alignment column, and S -> End. Each emitted alignment column evolves on a phylogeny, which can be specified in the input Stockholm-format alignment file (otherwise it is estimated from that alignment), under a substitution rate matrix that is specified as part of the grammar. In xrate format jargon, the symbol X is called a pseudoterminal.
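To make the role of the pseudoterminal concrete, here is a minimal Python sketch (not xrate code, and not how xrate is implemented) of what the two rules amount to: each column X is scored independently on the tree, and the alignment score is the product over columns. It assumes a Jukes-Cantor rate matrix, gap-free columns, a hypothetical nested-list tree representation, and it ignores any probabilities attached to the two rules themselves. The function names are made up for illustration.

```python
import math

BASES = "acgt"

def jc_prob(i, j, t):
    """Jukes-Cantor probability that base i becomes base j over a branch
    of length t (in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def partial(node, column):
    """Felsenstein pruning: for each base b, the probability of the observed
    leaves below `node` given that `node` carries b.
    A leaf is a name (str); an internal node is a list of (child, branch_length)."""
    if isinstance(node, str):
        observed = column[node]
        return [1.0 if b == observed else 0.0 for b in BASES]
    result = [1.0] * len(BASES)
    for child, t in node:
        child_partial = partial(child, column)
        for k, parent_base in enumerate(BASES):
            result[k] *= sum(
                jc_prob(parent_base, child_base, t) * p
                for child_base, p in zip(BASES, child_partial)
            )
    return result

def column_likelihood(tree, column):
    """Likelihood of one alignment column (one emission of X), with a uniform root."""
    return sum(0.25 * p for p in partial(tree, column))

def alignment_log_likelihood(tree, columns):
    """S -> X S applied once per column, then S -> End: columns are independent."""
    return sum(math.log(column_likelihood(tree, col)) for col in columns)

# Hypothetical three-species tree and a two-column alignment.
tree = [("human", 0.1), ([("mouse", 0.2), ("rat", 0.2)], 0.3)]
columns = [
    {"human": "a", "mouse": "a", "rat": "a"},
    {"human": "c", "mouse": "t", "rat": "t"},
]
print(alignment_log_likelihood(tree, columns))
```

Swapping `jc_prob` for the transition probabilities of another rate matrix changes the substitution model without touching the grammar, which is the separation the pseudoterminal notation expresses.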

These grammar files, then, effectively just illustrate the file format for the substitution rate matrix and the notational principle of tying rate matrices to grammars using pseudoterminals:

  • Classic low-dimensional models of point substitution
    • jukescantor.eg -- Jukes and Cantor's 1969 model (uniform base frequencies, single substitution rate)
    • kimura2.eg -- Kimura's 1980 two-parameter model (transition/transversion bias)
    • fels81.eg -- Felsenstein's 1981 model (non-uniform base frequencies)
    • hky85.eg -- The HKY85 model (transition/transversion bias and non-uniform base frequencies)
    • rev.eg -- General reversible model (DNA bases)
    • irrev.eg -- General irreversible model (DNA bases)
    • nullprot.eg -- General reversible model (amino acids)
    • sn.eg -- Rough approximation to CodeML's F3x4 model (codon model with site-specific nucleotide frequencies, transition/transversion ratio and synonymous/nonsynonymous rates)
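The four classic DNA models at the top of this list are nested, which a short Python sketch can show. This is an illustration of the standard HKY85 parameterization, not a transcription of the .eg files; the function name `hky85_rates` and the dictionary layout are assumptions made for the example.

```python
BASES = "acgt"
PURINES = {"a", "g"}

def is_transition(i, j):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (i in PURINES) == (j in PURINES)

def hky85_rates(freqs, kappa):
    """Build the HKY85 rate matrix Q[i][j] from equilibrium frequencies and
    the transition/transversion bias kappa, scaled so that one unit of time
    corresponds to one expected substitution per site."""
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                q[i][j] = freqs[j] * (kappa if is_transition(i, j) else 1.0)
        # Diagonal entry makes the row sum to zero.
        q[i][i] = -sum(q[i].values())
    scale = -sum(freqs[i] * q[i][i] for i in BASES)
    return {i: {j: q[i][j] / scale for j in BASES} for i in BASES}

uniform = {b: 0.25 for b in BASES}
skewed = {"a": 0.1, "c": 0.2, "g": 0.3, "t": 0.4}

jukes_cantor = hky85_rates(uniform, kappa=1.0)  # uniform frequencies, no bias
kimura2 = hky85_rates(uniform, kappa=2.0)       # transition/transversion bias only
fels81 = hky85_rates(skewed, kappa=1.0)         # non-uniform frequencies only
hky85 = hky85_rates(skewed, kappa=2.0)          # both
```

The general reversible model relaxes this further by giving each unordered pair of bases its own exchangeability, and the irreversible model drops the detailed-balance constraint altogether.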

The above xrate files illustrate the idea of a basic point substitution model. The following xrate files combine several such models, using a grammar to describe how different substitution models are used for different alignment columns.
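As a rough picture of what "combining several models" means in the simplest case, consider a phylo-HMM: a regular grammar in which each hidden state has its own substitution model, and the forward algorithm sums over all assignments of states to columns. The Python sketch below assumes the per-column likelihoods under each model have already been computed on the tree; the function name, the argument layout, and the omission of an explicit end state are simplifications for illustration, not a description of xrate internals.

```python
def forward_likelihood(emissions, initial, transitions):
    """Forward algorithm over alignment columns.

    emissions[c][s]   -- likelihood of column c under state s's substitution model
    initial[s]        -- probability of starting in state s
    transitions[r][s] -- probability of moving from state r to state s
    Returns the alignment likelihood summed over all state paths.
    """
    n_states = len(initial)
    forward = [initial[s] * emissions[0][s] for s in range(n_states)]
    for column in emissions[1:]:
        forward = [
            column[s] * sum(forward[r] * transitions[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(forward)

# Hypothetical numbers: two states ("slow" and "fast" columns), three columns.
emissions = [[0.020, 0.004], [0.001, 0.006], [0.015, 0.005]]
initial = [0.5, 0.5]
transitions = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(emissions, initial, transitions))
```

For long alignments a practical version would work in log space or rescale the forward vector at each column to avoid underflow. Context-free grammars, such as those used for RNA structure, need the Inside algorithm in place of the forward recursion.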

Feature predictors

  • Protein grammars
    • nullprot.eg -- the general reversible model for amino acids
    • prot3.eg -- 3-state protein phylo-HMM a la Thorne, Goldman & Jones
  • RNA folding grammars (following Hein, Knudsen et al)
  • RNA gene prediction grammars (following Jakob Skou Pedersen, Irmtraud Meyer et al)

Lineage-specific evolutionary grammars

  • Lineage-specific phylo-grammars, following Adam Siepel, David Haussler et al

Site-specific models

  • Column-by-column substitution models, following e.g. Bruno & Halpern, Eisen & Moses, etc.
    • site_specific_protein.eg -- site-specific frequencies for protein substitution models using the iteration macros. Inspired by RIND
    • site_specific.eg -- site-specific frequencies for DNA substitution models (only difference is the alphabet)

Grammars that use the Scheme interpreter

-- Ian Holmes - 18 Mar 2009