List of features to be added to xgram software,
loosely ranked by priority (a murky tradeoff between expected benefit, ease of implementation, user demand & fun level).
For features specific to the xgram file format, please see this page: XgramFormatWishList.
 Migrate this entire list to the xratedev queue on the RequestTracker (done; IH 9/20/2007). RT tickets are linked below.
 Ticket:100 Multiple, probabilistic annotation tracks (i.e. allow probability distributions over annotation chars)
 Ticket:101 Inference of ancestral sequences (e.g. as posterior weight matrices)
 Ticket:102 A new gap model: explicitly specify which sequences must be ungapped. This allows explicit indel phylo-HMMs in the manner of Blanchette et al.
 Ticket:104 Allow any grammar to be used for tree estimation, not just null models
 This will require updating two distinct parts of the code:
 distance matrix calculations for neighbor-joining (requires computing pairwise distance for each pair of sequences);
 branch-length EM (requires computing expected counts for every time-dependent part of the likelihood, i.e. all chains & indel models).
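 The first of these two parts (a pairwise distance for every pair of sequences) can be sketched as follows. This is only an illustration: it swaps in the Jukes-Cantor correction as the substitution model, whereas the point of this ticket is to use an arbitrary grammar instead. The function names are hypothetical and are not part of xgram.

```python
import math
from itertools import combinations

GAP_CHARS = "-."

def jukes_cantor_distance(row_a, row_b):
    """Jukes-Cantor distance between two aligned nucleotide rows.

    Columns where either row has a gap are skipped.
    Jukes-Cantor is a stand-in model, assumed for illustration only.
    """
    pairs = [(a, b) for a, b in zip(row_a.upper(), row_b.upper())
             if a not in GAP_CHARS and b not in GAP_CHARS]
    if not pairs:
        return float("inf")  # no shared ungapped columns
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return float("inf")  # saturated: correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(alignment):
    """alignment: dict of name -> aligned row.

    Returns a symmetric dict keyed by (name_i, name_j), suitable as
    input to a neighbor-joining routine.
    """
    dist = {}
    for a, b in combinations(sorted(alignment), 2):
        d = jukes_cantor_distance(alignment[a], alignment[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist
```

 Replacing jukes_cantor_distance with a distance derived from the full grammar is the part that this ticket actually asks for.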
 Stochastic modes of operation
 Ticket:105 Sampled traceback
 Ticket:106 MCMC parameter sampling: Dirichlet/gamma proposal distributions from EM counts
 Ticket:107 Expected counts could be calculated in parallel, then summed, using a framework like MapReduce. The observedcounts clause already provides a format for representing this (for parametric grammars)... in fact, this could be achieved hackily by wrapping xrate in several Perl scripts... perhaps it is premature to think about doing this without a concrete framework (e.g. Google knocking at our door offering to train phylogrammars for us on their compute farm)
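 The reason this parallelizes well is that counts are additive across alignments. A minimal sketch of the map and reduce steps, assuming a hypothetical per-alignment function that returns counts keyed by event name; here plain residue counting stands in for the real E-step, which would come from xrate itself:

```python
from collections import Counter

def map_counts(alignment_rows):
    """'Map' step for one alignment.

    Stand-in for the real E-step: it simply counts residues, so that the
    example is runnable. Real expected counts would be fractional.
    """
    counts = Counter()
    for row in alignment_rows:
        counts.update(ch for ch in row.upper() if ch not in "-.")
    return counts

def reduce_counts(per_alignment_counts):
    """'Reduce' step: sum the per-alignment counts key by key."""
    total = Counter()
    for counts in per_alignment_counts:
        total.update(counts)  # Counter.update adds to existing values
    return total

def pooled_counts(alignments, mapper=map):
    """Run the map step with any map-like callable, then reduce.

    'mapper' defaults to the builtin map; a parallel map with the same
    calling convention could be passed in instead.
    """
    return reduce_counts(mapper(map_counts, alignments))
```

 Because the reduce step is a plain sum, the order in which partial results arrive does not matter.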
 Ticket:108 Report equilibrium frequencies in grammar file, as well as initial probs (would be useful for thinking about evolutionary irreversibility, cf. Koonin et al. 2005)
 Ticket:109 Length distributions over transitions into states (in the manner of a generalized HMM)
 If not an explicit functional form, a variable fixing the maximum number of pseudoterminals that can be emitted by a particular state (e.g. for faster parsing of SCFGs when the approximate length of the covarying region is known beforehand)
 Rendered less necessary by sum-over-states traceback (sumfrom) added 4/30/2006? -- IanHolmes
 Sumstates can approximate a smoothed length distribution quite well, but using them is kind of involved
 Ticket:110 Take advantage of Stockholm format "tree number" (#=GF TN)
 Tree number selected in the grammar file
 Different trees for different states
 interesting when you have different tree topologies in the same alignment
 topology search: structural EM algorithm of Friedman et al
 Ticket:111 Sparse chain state occupancy for features with strict consensus, e.g. splice sites
 Some infrastructure for this in DartSrc:ecfg/ecfg.h, need more (4/22/2006)
 By this do you mean an implementation of the sparse DP algorithm we discussed earlier? (anonymous commenter)
 No, this is about minimizing the cardinality of the state space. E.g. 99.98% of mammalian splice junctions are GT-AG, GC-AG or AT-AC, so we should only really need 3 states (not 4^4=256) to model this in a phylogrammar. Since the cost of some of the training algorithms grows with the number of states, this matters! (Actually, StrataSplice makes me wonder if GC-content-binned PWM tracks wouldn't be better in this particular case, but the underlying point holds...) -- IanHolmes
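 To make the state-space argument concrete, here is a small sketch contrasting the full 4^4 state space with the three-state sparse one. The names are hypothetical and this is not how DartSrc:ecfg/ecfg.h represents chains; it only illustrates the bookkeeping.

```python
from itertools import product

ALPHABET = "ACGT"

# Full state space: every (donor dinucleotide, acceptor dinucleotide) combination.
FULL_STATES = ["".join(p) for p in product(ALPHABET, repeat=4)]

# Sparse state space: only the consensus junctions named above.
SPARSE_STATES = ["GTAG", "GCAG", "ATAC"]

def to_sparse_state(donor, acceptor):
    """Index of the junction in SPARSE_STATES, or None if it is not modeled."""
    key = (donor + acceptor).upper()
    return SPARSE_STATES.index(key) if key in SPARSE_STATES else None

def rate_matrix_entries(n_states):
    """Number of entries in an n_states x n_states rate matrix."""
    return n_states * n_states
```

 With these definitions the rate matrix shrinks from 256 x 256 = 65536 entries to 3 x 3 = 9.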
Completed wishlist items

 gff block improvements: (implemented 9/4/07 - IH)
 Allow gff blocks to be associated with null & bifurcation nonterminals, as well as emissions
 Allow reporting of Inside or CYK scores for multiple nonterminals in one gff block (not just the main nonterminal that triggers the GFF output); this will make odds-ratios of the form P(model)/P(null) easier to calculate
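 Once scores for several nonterminals are reported together, the odds-ratio is a subtraction in log space. A minimal sketch, assuming the scores have already been read into a dict of log-likelihoods keyed by nonterminal name; the nonterminal names and the dict layout are hypothetical, not the actual GFF attribute format:

```python
def log_odds(scores, model, null):
    """log P(model) - log P(null), in the same log base as the input scores.

    scores: dict mapping nonterminal name -> log-likelihood
            (Inside or CYK score) for one reported region.
    """
    return scores[model] - scores[null]

def best_model(scores, null):
    """Name of the non-null nonterminal with the highest log-odds vs. null."""
    candidates = [name for name in scores if name != null]
    return max(candidates, key=lambda name: log_odds(scores, name, null))
```

 For example, with scores {"GENE": -10.0, "NULL": -12.0} the log-odds of GENE against NULL is 2.0.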
 Generation of simulated data from phylogrammars (Done - IH, 7/11/2007)
 Posterior expected counts
 counts and wait times in xgram format output file (now there as a "latent" feature, 4/22/2006) (now implemented via observedcounts, 2/11/2006)
 posterior probabilities for parse tree nodes in Stockholm format output
 CYK traceback only (currently implemented, 4/22/2006)
 all cells (implemented 7/13/2006)
 Option to ignore excessively gappy columns, as in Pfold
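 The gappy-column option amounts to a column filter applied before the alignment is parsed. A minimal sketch; the threshold value and the function name are assumptions made for illustration, not the values used by Pfold or by xgram:

```python
def drop_gappy_columns(rows, max_gap_fraction=0.75, gap_chars="-."):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    rows: list of equal-length aligned sequences.
    Returns a new list of rows containing only the kept columns.
    """
    if not rows:
        return []
    n_rows = len(rows)
    keep = [col for col in range(len(rows[0]))
            if sum(row[col] in gap_chars for row in rows) / n_rows
            <= max_gap_fraction]
    return ["".join(row[col] for col in keep) for row in rows]
```

 A real implementation would also need to map any per-column annotation back onto the original column coordinates.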
 (Done 2/19/2007 - IH) Lineage-dependent parameterisations (cf. "local parameters" in HyPhy)
 (IH 2/12/2007) In preparation for this, various source files in dart/src/hsm/ and dart/src/ecfg/ have been marked up with comments indicating the changes that need to be made; grep text for "NB lineagedependent models"
 current planned way of implementing these models is to introduce a new hybridchain tag which links together several "standard" chains (defined elsewhere), with different chains selected on different branches of the tree using the Stockholm format #=GS syntax, e.g.
(hybridchain
  (terminal (HYB1 HYB2 HYB3))
  (row HLABEL)
  ;; submodel COD1... selected by "#=GS SeqName HLABEL GENE"
  ;; this first model is also assumed to be the default for unlabeled nodes
  ((label GENE) (terminal (COD1 COD2 COD3)))
  ;; submodel NULL1... selected by "#=GS SeqName HLABEL PSEUDOGENE"
  ((label PSEUDOGENE) (terminal (NULL1 NULL2 NULL3)))
)
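 The selection rule in the example above (read the #=GS annotation for each sequence, fall back to the first submodel for unlabeled nodes) can be sketched as follows. The function names and the lookup table are hypothetical; they mirror the example grammar fragment rather than any code in dart/src.

```python
# Label -> terminals of the submodel selected on that branch,
# mirroring the hybridchain example.
SUBMODELS = {
    "GENE": ("COD1", "COD2", "COD3"),
    "PSEUDOGENE": ("NULL1", "NULL2", "NULL3"),
}
DEFAULT_LABEL = "GENE"  # the first submodel is the default for unlabeled nodes

def read_gs_labels(stockholm_lines, row="HLABEL"):
    """Collect '#=GS <SeqName> <row> <label>' annotations into a dict."""
    labels = {}
    for line in stockholm_lines:
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[0] == "#=GS" and fields[2] == row:
            labels[fields[1]] = fields[3].strip()
    return labels

def terminals_for(node, labels):
    """Terminals of the chain used on the branch leading to 'node'."""
    return SUBMODELS[labels.get(node, DEFAULT_LABEL)]
```

 Lines with other #=GS features are ignored, so existing annotations in the alignment do not interfere with the chain selection.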
 Priors (at least pseudocounts and wait times) for parameters
 There is now at least a crude facility to add pseudocounts to EM chain update statistics - IH, 5/20/2006
 This has now been upgraded to "proper", parameter-specific pseudocounts & pseudotimes - IH, 2/11/2007
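 One common way for pseudocounts and pseudotimes to enter a rate estimate is as additive terms in the numerator and denominator of the usual count / wait-time ratio. The sketch below shows that form; whether xrate's update has exactly this shape is an assumption, and the function name is hypothetical.

```python
def regularized_rate(count, wait_time, pseudocount=0.0, pseudotime=0.0):
    """Rate estimate (count + pseudocount) / (wait_time + pseudotime).

    count:     expected number of events from the E-step
    wait_time: expected time spent in the source state
    With a Gamma(shape=pseudocount, rate=pseudotime) prior on the rate,
    this is the posterior mean.
    """
    denominator = wait_time + pseudotime
    if denominator <= 0.0:
        raise ValueError("wait_time + pseudotime must be positive")
    return (count + pseudocount) / denominator
```

 With no data at all (count and wait_time both zero) the estimate falls back to pseudocount / pseudotime, which is why the two are specified per parameter.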
 xrate should be a near xgram-clone like xprot & xfold (based on ECFG_main)
 Done 5/25/06 IH (the wrapper script "old_xrate.pl" in dart/perl exists for backward compatibility)
 Make behavior of tree-estimating code more intuitive by very simply patching a couple of preset grammars: again, a no-brainer
 Explanation: xfold/xgram/xprot currently have counterintuitive behavior when trying to estimate a tree. For example, in xfold, if a preset grammar is used, then the HKY85 substitution model is used to estimate pairwise distances (and hence branch lengths). This occurs even if the HKY85 model is not explicitly present in the preset grammar. (For xprot, read "PAM" instead of "HKY85".) However, if a grammar is loaded from a file, then the substitution model for tree estimation is taken directly from that grammar file. This means that if you save a grammar to a file, then reload it, you get a different estimated tree. Very bad. Needs to be fixed.
 Alternatives:
 My favorite cos it's quick and easy - IH (now done, 5/25/06)
 remove undocumented matrix preset behavior of xprot/xfold
 add PAM and Kimura2 as dummy chains at the beginning of xprot/xfold grammars (actually this part has not been done, which was even quicker and easier than doing it)
 (not HKY85 since quick-fitting equilibrium frequencies is too dodgy)
 document the new rule: branch lengths can be fit once, initially
 eventually a more integrated EM schedule? branch-length priors?
 I've grown less fond of these (principle of minimum effort) - IH
 hive off all tree-estimating functionality into a separate program, xtree
 retain current tree-estimating functionality, but allow substitution model for tree EM to be explicitly specified in grammar file
 upgrade tree-estimating code to use entire phylogrammar, not just a single chain (re-promoted to main wishlist, 3/31/2007)
 -- IanHolmes - 22 Apr 2006
 Posterior probability output compatible with Mathematica/gnuplot formats would be nice. -- RobertBradley - 25 Apr 2006 22:20:38
 Converting the current format into the gnuplot data file format seems straightforward using a Perl one-liner or similar -- IanHolmes - 25 Apr 2006 22:54:14
 How about a command line parameter to score an alignment and its reverse complement, like "bothstrands" in RNAz? -- YuriBendana - 05 Oct 2006 23:25:31
 The new DartPerl:windowlicker.pl script can do this. Because xrate grammars can be dual-stranded (and the grammar includes explicit syntax for reverse complementing emissions), I think this is functionality that belongs either in the grammar itself, or in an external program (like windowlicker) -- IanHolmes - 01 Apr 2007 01:15:46
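 For the external-program route, scoring both strands only needs a reverse-complement of the alignment plus two calls to whatever scoring routine is in use. A minimal sketch for DNA alignments; the score argument is a hypothetical callable standing in for a run of xrate, and this is not a description of how windowlicker.pl works.

```python
# DNA complement table; gap characters are left unchanged by translate().
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement_alignment(rows):
    """Reverse-complement every row, which also reverses the column order."""
    return [row.translate(COMPLEMENT)[::-1] for row in rows]

def score_both_strands(rows, score):
    """Apply 'score' (any callable taking a list of rows) to both strands."""
    return {
        "forward": score(rows),
        "reverse": score(reverse_complement_alignment(rows)),
    }
```

 Applying reverse_complement_alignment twice returns the original rows, which is a convenient sanity check.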
Topic revision: r117 - 2007-09-20 - IanHolmes