List of features to be added to xgram software, loosely ranked by priority (a murky tradeoff between expected benefit, ease of implementation, user demand & funlevel).
For features specific to the xgram file format, please see this page: XgramFormatWishList.
- Migrate this entire list to the xrate-dev queue on the RequestTracker -- done; IH 9/20/2007. RT tickets are linked below.
- Ticket:100 Multiple, probabilistic annotation tracks (i.e. allow probability distributions over annotation chars)
- Ticket:101 Inference of ancestral sequences (e.g. as posterior weight matrices)
- Ticket:102 A new gap model: explicitly specify which sequences must be ungapped. This allows explicit indel phylo-HMMs in the manner of Blanchette et al
- Ticket:104 Allow any grammar to be used for tree estimation, not just null models
- This will require updating two distinct parts of the code:
- distance matrix calculations for neighbor-joining (requires computing pairwise distance for each pair of sequences);
- branch-length EM (requires computing expected counts for every time-dependent part of the likelihood, i.e. all chains & indel models).
- Stochastic modes of operation
- Ticket:105 Sampled traceback
- Ticket:106 MCMC parameter sampling: Dirichlet/gamma proposal distributions from EM counts
- Ticket:107 Expected counts could be calculated in parallel, then summed, using a framework like MapReduce. The observed-counts clause already provides a format for representing this (for parametric grammars)... in fact, this could be achieved hackily by wrapping xrate in several Perl scripts... perhaps it is premature to think about doing this without a concrete framework (e.g. Google knocking at our door offering to train phylogrammars for us on their compute farm)
- Ticket:108 Report equilibrium frequencies in grammar file, as well as initial probs (would be useful for thinking about evolutionary irreversibility c.f. Koonin et al 2005)
- Ticket:109 Length distributions over transitions into states (in manner of Generalized HMM)
- If not an explicit functional form, a variable fixing the maximum number of pseudoterminals that can be emitted by a particular state (e.g. for faster parsing of SCFGs when the approximate length of the covarying region is known beforehand)
- Rendered less necessary by sum-over-states traceback (sum-from) added 4/30/2006? -- IanHolmes
- Sum-states can approximate a smoothed length distribution quite well, but using them is kind of involved
- Ticket:110 Take advantage of Stockholm format "tree number" (
- Tree number selected in the grammar file
- Different trees for different states
- interesting when you have different tree topologies in the same alignment
- topology search: structural EM algorithm of Friedman et al
- Ticket:111 Sparse chain state occupancy for features with strict consensus, e.g. splice sites
- Some infrastructure for this in DartSrc:ecfg/ecfg.h, need more (4/22/2006)
- By this do you mean an implementation of the sparse DP algorithm we discussed earlier? (anonymous commenter)
- No, this is about minimizing the cardinality of the state space. E.g. 99.98% of mammalian splice junctions are GT-AG, GC-AG or AT-AC, so we should only really need 3 states (not 4^4=256) to model this in a phylogrammar. Since some of the training algorithms are in the number of states, this matters! (Actually, StrataSplice makes me wonder if GC content-binned PWM tracks wouldn't be better in this particular case, but the underlying point holds...) --IanHolmes
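The cardinality argument in the splice-site discussion above can be illustrated with a small Python sketch. It is not part of xrate or DartSrc:ecfg/ecfg.h; the names full_state_space, sparse_state_index and encode are hypothetical, and the three junction strings are simply the GT-AG, GC-AG and AT-AC pairs written as 4-mers.

```python
from itertools import product

ALPHABET = "ACGT"
# Donor/acceptor dinucleotide pairs named in the discussion, written as 4-mers
CONSENSUS_JUNCTIONS = ("GTAG", "GCAG", "ATAC")


def full_state_space(width=4):
    """Enumerate every nucleotide tuple of the given width (4^width states)."""
    return ["".join(p) for p in product(ALPHABET, repeat=width)]


def sparse_state_index(junctions=CONSENSUS_JUNCTIONS):
    """Map each allowed junction to a compact state id (hypothetical encoding)."""
    return {junction: i for i, junction in enumerate(junctions)}


def encode(junction, index):
    """Return the sparse state id, or None if the tuple is outside the allowed set."""
    return index.get(junction.upper())


if __name__ == "__main__":
    index = sparse_state_index()
    print(len(full_state_space()), "states in the full chain")
    print(len(index), "states in the sparse chain")
    print("GTAG ->", encode("GTAG", index))
```

Tuples outside the allowed set map to None here; how a real sparse chain should treat the rare non-consensus junctions is left open by the discussion above.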
Completed wishlist items
- gff block improvements: (implemented 9/4/07 - IH)
- gff blocks to be associated with null & bifurcation nonterminals, as well as emissions
- Allow reporting of Inside or CYK scores for multiple nonterminals in one gff block (not just the main nonterminal that triggers the GFF output) -- this will make odds-ratios of the form P(model)/P(null) easier to calculate
- Generation of simulated data from phylo-grammars (Done -- IH, 7/11/2007)
- Posterior expected counts
- counts and wait times in xgram format output file (now there as a "latent" feature, 4/22/2006) (now implemented via
- posterior probabilities for parse tree nodes in Stockholm format output
- CYK traceback only (currently implemented, 4/22/2006)
- all cells (implemented 7/13/2006)
- Option to ignore excessively gappy columns, as in Pfold
- Andrew has developed a Perl wrapper script for this, DartPerl:drop-gappy-columns.pl, which is now part of Dart (see DartPerlScripts).
- (Done 2/19/2007 - IH) Lineage-dependent parameterisations (c.f. "local parameters" in HyPhy)
- (IH 2/12/2007) In preparation for this, various source files in dart/src/ecfg/ have been marked up with comments indicating the changes that need to be made; grep text for "NB lineage-dependent models"
- current planned way of implementing these models is to introduce a new hybrid-chain tag which links together several "standard" chains (defined elsewhere), with different chains selected on different branches of the tree using the Stockholm format #=GS syntax, e.g.
(terminal (HYB1 HYB2 HYB3))
;; submodel COD1... selected by "#=GS SeqName HLABEL GENE"
;; this first model is also assumed to be the default for unlabeled nodes
((label GENE) (terminal (COD1 COD2 COD3)))
;; submodel NULL1... selected by "#=GS SeqName HLABEL PSEUDOGENE"
((label PSEUDOGENE) (terminal (NULL1 NULL2 NULL3)))
- Priors (at least pseudocounts and wait times) for parameters
- There is now at least a crude facility to add pseudocounts to EM chain update statistics - IH, 5/20/2006
- This has now been upgraded to "proper", parameter-specific pseudocounts & pseudotimes - IH, 2/11/2007
- xrate should be a near xgram-clone like xprot & xfold (based on ECFG_main)
- Done 5/25/06 IH (the wrapper script "old_xrate.pl" in dart/perl exists for backward compatibility)
-- IanHolmes - 22 Apr 2006
- Make behavior of tree-estimating code more intuitive by very simply patching a couple of preset grammars: again, a no-brainer
- Explanation: xfold/xgram/xprot currently have counter-intuitive behavior when trying to estimate a tree. For example, in xfold, if a preset grammar is used, then the HKY85 substitution model is used to estimate pairwise distances (and hence branch lengths). This occurs even if the HKY85 model is not explicitly present in the preset grammar. (For xprot, read "PAM" instead of "HKY85".) However, if a grammar is loaded from a file, then the substitution model for tree-estimation is taken directly from that grammar file. This means that if you save a grammar to a file, then re-load it, you get a different estimated tree. Very bad. Needs to be fixed.
- My favorite cos it's quick and easy - IH (now done, 5/25/06)
- remove undocumented matrix preset behavior of xprot/xfold
- add PAM and Kimura2 as dummy chains at the beginning of xprot/xfold grammars (actually this part has not been done, which was even quicker and easier than doing it)
- (not HKY85 since quick-fitting eqm freqs is too dodgy)
- document the new rule: branch lengths can be fit once, initially
- eventually a more integrated EM schedule? branch-length priors?
- I've grown less fond of these - principle of minimum effort - IH
- hive off all tree-estimating functionality into a separate program, xtree
- retain current tree-estimating functionality, but allow substitution model for tree EM to be explicitly specified in grammar file
- upgrade tree-estimating code to use entire phylo-grammar, not just a single chain (re-promoted to main wishlist -- 3/31/2007)
- Posterior probability output compatible with Mathematica/gnuplot formats would be nice. - RobertBradley - 25 Apr 2006 22:20:38
- Converting the current format into the gnuplot data file format seems straightforward using a Perl one-liner or similar - IanHolmes - 25 Apr 2006 22:54:14
- How about a command line parameter to score an alignment and its reverse complement, like "--both-strands" in RNAz? - YuriBendana - 05 Oct 2006 23:25:31
- The new DartPerl:windowlicker.pl script can do this. Because xrate grammars can be dual-stranded (and the grammar includes explicit syntax for reverse complementing emissions), I think this is functionality that belongs either in the grammar itself, or in an external program (like windowlicker) - IanHolmes - 01 Apr 2007 01:15:46
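As an illustration of the external-program approach to the both-strands request above, a wrapper only needs to reverse-complement each row of the alignment and score both versions. The Python sketch below is not how windowlicker.pl or RNAz work; reverse_complement_alignment, score_both_strands and the score callback are hypothetical names, and a DNA alphabet is assumed (gap characters and any other symbols pass through unchanged).

```python
# Complement table for DNA; characters not listed (gaps, N, ...) are left as-is
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(row):
    """Reverse one aligned row and complement its nucleotides."""
    return row[::-1].translate(_COMPLEMENT)


def reverse_complement_alignment(alignment):
    """Apply reverse_complement to every row; columns stay aligned
    because every row is reversed the same way."""
    return {name: reverse_complement(row) for name, row in alignment.items()}


def score_both_strands(alignment, score):
    """Score the alignment and its reverse complement.

    `score` is a caller-supplied function (hypothetical here) that takes a
    {sequence name: aligned row} dict and returns a number.
    """
    return {
        "forward": score(alignment),
        "reverse": score(reverse_complement_alignment(alignment)),
    }


if __name__ == "__main__":
    toy = {"s1": "AAC-G", "s2": "AT--G"}
    # Toy score: number of A residues, standing in for a real grammar score
    count_a = lambda aln: sum(row.count("A") for row in aln.values())
    print(reverse_complement_alignment(toy))
    print(score_both_strands(toy, count_a))
```

A dual-stranded grammar, as suggested in the reply above, would make this wrapper unnecessary, since the reverse-complemented emissions would be scored inside the grammar itself.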