Xgram Wish List
List of features to be added to the xgram software, loosely ranked by priority (a murky tradeoff between expected benefit, ease of implementation, user demand & fun level).
- Migrate this entire list to the xrate-dev queue on the Request Tracker -- done; IH 9/20/2007. RT tickets are linked below.
- Ticket:100 Multiple, probabilistic annotation tracks (i.e. allow probability distributions over annotation chars)
- Ticket:101 Inference of ancestral sequences (e.g. as posterior weight matrices)
- Ticket:102 A new gap model: explicitly specify which sequences must be ungapped. This allows explicit indel phylo-HMMs in the manner of Diallo et al.: Exact and heuristic algorithms for the Indel Maximum Likelihood Problem. J. Comput. Biol. 2007;14:446-61.
- Ticket:104 Allow any grammar to be used for tree estimation, not just null models
- This will require updating two distinct parts of the code:
- distance matrix calculations for neighbor-joining (requires computing pairwise distance for each pair of sequences);
- branch-length EM (requires computing expected counts for every time-dependent part of the likelihood, i.e. all chains & indel models).
- Stochastic modes of operation
- Ticket:107 Expected counts could be calculated in parallel, then summed, using a framework like MapReduce. The observed-counts clause already provides a format for representing this (for parametric grammars). In fact, this could be achieved hackily by wrapping xrate in several Perl scripts. It is perhaps premature to think about doing this without a concrete framework (e.g. Google knocking at our door offering to train phylogrammars for us on their compute farm).
- Ticket:108 Report equilibrium frequencies in grammar file, as well as initial probs (would be useful for thinking about evolutionary irreversibility c.f. Jordan et al.: A universal trend of amino acid gain and loss in protein evolution. Nature 2005;433:633-8.)
- Ticket:109 Length distributions over transitions into states (in manner of Kulp et al.: A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol 1996;4:134-42.)
- If not an explicit functional form, a variable fixing the maximum number of pseudoterminals that can be emitted by a particular state (e.g. for faster parsing of SCFGs when the approximate length of the covarying region is known beforehand)
- Rendered less necessary by sum-over-states traceback (sum-from) added 4/30/2006? -- Ian Holmes
- Sum-states can approximate a smoothed length distribution quite well, but using them is kind of involved
- Ticket:110 Take advantage of Stockholm format "tree number" (#=GF TN)
- Tree number selected in the grammar file
- moved back from xgram format WishList page: involves new functionality
- Different trees for different states
- interesting when you have different tree topologies in the same alignment
- Tree number selected in the grammar file
* e.g. due to recombination (c.f. Husmeier & Wright: Detection of recombination in DNA multiple alignments with hidden Markov models. J. Comput. Biol. 2001;8:401-27. & others) or HGT
- topology search: structural EM algorithm of Friedman et al.: A structural EM algorithm for phylogenetic inference. J. Comput. Biol. 2002;9:331-53.
- Ticket:111 Sparse chain state occupancy for features with strict consensus, e.g. splice sites
- Some infrastructure for this in DartSrc:ecfg/ecfg.h, need more (4/22/2006)
- By this do you mean an implementation of the sparse DP algorithm we discussed earlier? (anonymous commenter)
- No, this is about minimizing the cardinality of the state space. E.g. almost all mammalian splice junctions are GT-AG, GC-AG or AT-AC (Burset et al.: Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res. 2000;28:4364-75.), so we should only really need 3 states (not 4^4=256) to model this in a phylogrammar. Since the running time of some of the training algorithms grows with the number of states, this matters! (Actually, StrataSplice makes me wonder if GC content-binned PWM tracks wouldn't be better in this particular case, but the underlying point holds...) --IanHolmes
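The parallel expected-counts idea in Ticket:107 can be illustrated with a minimal map-then-sum sketch. This is not xrate code: `toy_counts` is a hypothetical stand-in for the real E-step (it merely tallies residues), and `sum_counts` and `pooled_counts` are names invented here. The point is only that per-alignment counts are additive, so they can be computed independently and summed afterwards.

```python
from collections import Counter
from functools import reduce


def toy_counts(alignment_rows):
    """Hypothetical 'map' step: stand-in for computing expected counts.

    A real E-step would return expected counts per grammar parameter;
    here we only tally non-gap residues so the example is runnable.
    """
    counts = Counter()
    for row in alignment_rows:
        for ch in row:
            if ch not in "-.":
                counts[ch] += 1.0
    return counts


def sum_counts(total, part):
    """'Reduce' step: add one alignment's counts to the running total."""
    merged = Counter(total)
    merged.update(part)  # Counter.update adds values rather than replacing
    return merged


def pooled_counts(alignments, mapper=toy_counts):
    """Apply the mapper to each alignment, then sum the results."""
    return reduce(sum_counts, map(mapper, alignments), Counter())


alignments = [["ACGU", "AC-U"], ["GGCA", "GGCA"]]
totals = pooled_counts(alignments)
```

Because the reduce step is plain addition, the built-in `map` could in principle be swapped for a process pool or a cluster job without changing `sum_counts`.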
Completed wishlist items
- gff block improvements: (implemented 9/4/07 - IH)
- Allow gff blocks to be associated with null & bifurcation nonterminals, as well as emissions
- Allow reporting of Inside or CYK scores for multiple nonterminals in one gff block (not just the main nonterminal that triggers the GFF output) -- this will make odds-ratios of the form P(model)/P(null) easier to calculate
- Generation of simulated data from phylo-grammars (Done -- IH, 7/11/2007)
- Posterior expected counts
- counts and wait times in xgram format output file (now there as a "latent" feature, 4/22/2006) (now implemented via observed-counts, 2/11/2006)
- posterior probabilities for parse tree nodes in Stockholm format output
- CYK traceback only (currently implemented, 4/22/2006)
- all cells (implemented 7/13/2006)
- Option to ignore excessively gappy columns, as in Pfold
- (Done 2/19/2007 - IH) Lineage-dependent parameterisations (c.f. "local parameters" in HyPhy)
- (IH 2/12/2007) In preparation for this, various source files in dart/src/hsm/ and dart/src/ecfg/ have been marked up with comments indicating the changes that need to be made; grep text for "NB lineage-dependent models"
- current planned way of implementing these models is to introduce a new hybrid-chain tag which links together several "standard" chains (defined elsewhere), with different chains selected on different branches of the tree using the Stockholm format #=GS <tag> <value> syntax, e.g.
(hybrid-chain
 (terminal (HYB1 HYB2 HYB3))
 (row HLABEL)
 ;; submodel COD1... selected by "#=GS [[Seq Name]] HLABEL GENE"
 ;; this first model is also assumed to be the default for unlabeled nodes
 ((label GENE) (terminal (COD1 COD2 COD3)))
 ;; submodel NULL1... selected by "#=GS [[Seq Name]] HLABEL PSEUDOGENE"
 ((label PSEUDOGENE) (terminal (NULL1 NULL2 NULL3)))
)
- Priors (at least pseudocounts and wait times) for parameters
- There is now at least a crude facility to add pseudocounts to EM chain update statistics - IH, 5/20/2006
- This has now been upgraded to "proper", parameter-specific pseudocounts & pseudotimes - IH, 2/11/2007
- xrate should be a near xgram-clone like xprot & xfold (based on ECFG_main)
- Done 5/25/06 IH (the wrapper script "old_xrate.pl" in dart/perl exists for backward compatibility)
- Make behavior of tree-estimating code more intuitive by very simply patching a couple of preset grammars: again, a no-brainer
- Explanation: xfold/xgram/xprot currently have counter-intuitive behavior when trying to estimate a tree. For example, in xfold, if a preset grammar is used, then the HKY85 substitution model is used to estimate pairwise distances (and hence branch lengths). This occurs even if the HKY85 model is not explicitly present in the preset grammar. (For xprot, read "PAM" instead of "HKY85".) However, if a grammar is loaded from a file, then the substitution model for tree-estimation is taken directly from that grammar file. This means that if you save a grammar to a file, then re-load it, you get a different estimated tree. Very bad. Needs to be fixed.
- My favorite cos it's quick and easy - IH (now done, 5/25/06)
1. remove undocumented matrix preset behavior of xprot/xfold
2. add PAM and Kimura2 as dummy chains at the beginning of xprot/xfold grammars (actually this part has not been done, which was even quicker and easier than doing it)
* (not HKY85 since quick-fitting eqm freqs is too dodgy)
3. document the new rule: branch lengths can be fit once, initially
* eventually a more integrated EM schedule? branch-length priors?
- I've grown less fond of these - principle of minimum effort - IH
- hive off all tree-estimating functionality into a separate program, xtree
- retain current tree-estimating functionality, but allow substitution model for tree EM to be explicitly specified in grammar file
- upgrade tree-estimating code to use entire phylo-grammar, not just a single chain (re-promoted to main wishlist -- 3/31/2007)
-- Ian Holmes - 22 Apr 2006
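The "ignore excessively gappy columns" item in the completed list above amounts to a column filter on the alignment. The sketch below is an illustration only, not the xgram or Pfold implementation: the function name `drop_gappy_columns`, the 0.5 default threshold and the choice of gap characters are all assumptions made for the example.

```python
def drop_gappy_columns(rows, max_gap_fraction=0.5, gap_chars="-."):
    """Return the alignment with overly gappy columns removed.

    rows: equal-length strings, one per sequence.
    A column is kept when its fraction of gap characters is at most
    max_gap_fraction (an arbitrary illustrative threshold).
    """
    if not rows:
        return []
    n_rows = len(rows)
    keep = [
        col
        for col in range(len(rows[0]))
        if sum(row[col] in gap_chars for row in rows) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in rows]


rows = ["AC-GU", "A--GU", "AC-GA"]
filtered = drop_gappy_columns(rows, max_gap_fraction=0.5)
```

Here the third column is all gaps and is dropped, while the second column (one gap in three rows) is kept.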
- Posterior probability output compatible with Mathematica/gnuplot formats would be nice. - Robert Bradley - 25 Apr 2006 22:20:38
- Converting the current format into the gnuplot data file format seems straightforward using a Perl one-liner or similar - Ian Holmes - 25 Apr 2006 22:54:14
- How about a command line parameter to score an alignment and its reverse complement, like "--both-strands" in RNAz? - Yuri Bendana - 05 Oct 2006 23:25:31
- The new DartPerl:windowlicker.pl script can do this. Because xrate grammars can be dual-stranded (and the grammar includes explicit syntax for reverse complementing emissions), I think this is functionality that belongs either in the grammar itself, or in an external program (like windowlicker) - Ian Holmes - 01 Apr 2007 01:15:46
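For the "external program" route, scoring both strands reduces to reverse-complementing every row of the alignment and scoring twice. The following is a rough sketch under stated assumptions: it handles a plain DNA alphabet only, leaves gaps and any other characters untouched, and takes the scoring function as a caller-supplied argument. It does not reproduce windowlicker.pl or any xrate interface, and `revcomp_alignment` and `score_both_strands` are names made up for the example.

```python
# Complement table for a plain DNA alphabet (assumption: no IUPAC codes).
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp_alignment(rows):
    """Reverse-complement each row; gap characters are left as they are."""
    return [row.translate(_DNA_COMPLEMENT)[::-1] for row in rows]


def score_both_strands(rows, score):
    """Return (forward score, reverse-strand score).

    score: any callable taking a list of rows; a placeholder for
    whatever program actually scores the alignment.
    """
    return score(rows), score(revcomp_alignment(rows))


rows = ["ACG-T", "AAGCT"]
flipped = revcomp_alignment(rows)

# Toy scorer for demonstration only: counts the letter A.
both = score_both_strands(rows, lambda r: sum(row.count("A") for row in r))
```

Since every row is reversed the same way, columns stay aligned, and applying `revcomp_alignment` twice returns the original rows.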