Dart Release Notes
Newsletter, September 7, 2005
....DART news.... ....Berkeley.... ....September 2005.... (Scroll down for the DART website at the foot of this email.)
Hello there, fellow evolutionary hackers, RNA tinkerers and phylo-enthusiasts! This is the latest missive on dart-announce, the low-volume mailing list describing developments with the DART package, your FAVOURITE software for evolutionary bioinformatics... er, OK, my British upbringing demands more disclaimers in that sentence. How about: your favourite software for FAST evolutionary bioinformatics... counting only open source programs WRITTEN IN C++...... in Berkeley........ by me. yeah, that shouldn't offend anyone. (WE RULE!!!!) (ahem)
If nothing else, dart SURELY must be your favourite GPL'd software that comes with an opportunity to SCORE some FREE BOOZE[*]! (Let's see if that makes it past the spam filters.) That's right:-- the 2004 bug hunt, despite being prolonged well into 2005, has finally reached its inevitable conclusion. Carolin Kosiol, working with Nick Goldman's group, found some cool sparse data glitches while experimenting with "xrate" on 61-state codon models. Yes, that means rate matrices with an almighty 61*61 parameters! (Give or take a factor of 2 due to reversibility constraints.)
Hopefully the problems disclosed by Carolin et al have now been fixed: both in the sense that we have patched the code (adding pseudocounts and various checks for robustness), and in the sense that we have effectively SILENCED Carolin and crew by sending them a crate of 12 bottles of some of the finest wines produced in my old stomping ground, the Yarra valley in Victoria, Australia. (Remember that old Monty Python sketch about Australian wines: "they really open up the sluice-gates at both ends".) So that should be the last we hear from Carolin and Nick for a while (well, that's the theory anyway). Anyone else who reckons they're hard enough, come and have a go! Best bug found by summer 2005 wins the discoverer a similarly punishing alcoholic stupor.
(A shout out is due to Sam Griffiths-Jones and Alex Bateman, of RFAM, who came a close second with their heavy-duty testing of the StemLoc program; much valuable feedback, thanks guys. Maybe next year the booze will be yours...)
BTW, lest it escape anyone's attention, the point of doing this kind of bughunting prize (aka "pandering to beta testers") is that, as a result, we have VERY FEW BUGS. For example: to my knowledge, dart programs never segfault. Ever. Some of the more bloated RNA alignment algorithms have been known to abort when they run out of memory, but that's par for the course. Dart is very robust software and we intend to keep it that way (and bribe off any dissenters with fine wines, muahahahaa).
Moving on: XRate, the abovementioned software for estimating instantaneous rate matrices, continues to be our most popular tool (but closely followed by StemLoc; see below). This year, Pete Klosterman has worked to adapt XRate to estimate irreversible rate matrices, using a generalisation of the eigenspace-EM algorithm that powers the reversible version. (Much credit should also go to Gerton Lunter, Robert Davies and others who were kind enough to contribute asymmetric linear algebra codes.)
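For anyone who wants the reversible/irreversible distinction spelled out, here is a minimal Python sketch (not DART code, and not the eigenspace-EM algorithm). It checks the detailed-balance condition pi[i]*Q[i][j] == pi[j]*Q[j][i] that defines a reversible rate matrix, and counts free parameters for a 61-state codon model. The function names and the toy 3-state matrices are made up purely for illustration.

```python
def free_rate_parameters(n_states, reversible):
    """Count free parameters of an n-state continuous-time Markov rate matrix."""
    off_diagonal = n_states * (n_states - 1)
    if reversible:
        # Symmetric exchangeabilities, plus equilibrium frequencies that sum to 1.
        return off_diagonal // 2 + (n_states - 1)
    # Every off-diagonal rate is free; the diagonal is fixed by rows summing to zero.
    return off_diagonal


def is_reversible(Q, pi, tol=1e-9):
    """Check detailed balance: pi[i] * Q[i][j] == pi[j] * Q[j][i] for all i < j."""
    n = len(Q)
    return all(
        abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


def reversible_matrix(S, pi):
    """Build Q[i][j] = S[i][j] * pi[j] from symmetric exchangeabilities S."""
    n = len(pi)
    Q = [[S[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])  # diagonal is still 0.0 here, so this is the row's off-diagonal sum
    return Q


# Toy 3-state examples (hypothetical numbers).
pi = [0.5, 0.3, 0.2]
S = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
Q_rev = reversible_matrix(S, pi)

# A one-way cycle 0 -> 1 -> 2 -> 0: uniform equilibrium, but no detailed balance.
Q_cycle = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [1.0, 0.0, -1.0]]
pi_uniform = [1 / 3, 1 / 3, 1 / 3]

print(is_reversible(Q_rev, pi), is_reversible(Q_cycle, pi_uniform))
print(free_rate_parameters(61, reversible=False), free_rate_parameters(61, reversible=True))
```

The counts treat every off-diagonal rate as free, which is the worst case; real codon models usually tie many of these rates together.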
XRate has also been adapted to incorporate arbitrary stochastic grammars (including algorithms like Thorne-Goldman-Jones for predicting secondary structure of proteins, and Knudsen-Hein for that of RNA). We have also implemented the basic Siepel-Haussler technique for context-sensitive substitutions (e.g. incorporating CpG effects, or basepair stacking in RNA; irreversibility can be an important consideration in this sort of model). As a result, we now have spin-off programs for working with evolutionary or "phylo"-HMMs/SCFGs (XFold/XProt, for RNA/protein, respectively).
XRate (and its spinoffs XFold and XProt) has been applied in a number of "big genomics" projects this year, including ENCODE and analysis of the 12 fruitfly genomes. We hope to expand this.
In other developments, the Handel program (MCMC sampling of multiple sequence alignments) can now be used together with elaborate phylo-grammar programs (like XFold and XProt) to sample from the likelihood function of such phylo-grammars, using a Metropolis-like accept/reject scheme. That is, you can pipe the sampled alignments from Handel directly into XFold, and thereby explore XFold's alignment space using the Handel sampling engine. We're still testing this capability but expect it to be very powerful. As always, you can download the code from SourceForge (advance warning: we may move the CVS repository to our own machines in the near future; anonymous CVS access will still be offered).
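For the curious, the general flavour of a "Metropolis-like accept/reject scheme" driven by an external proposal engine can be sketched as a Metropolis-Hastings independence sampler. This is a toy illustration in Python, not the Handel/XFold code: the three "alignments", their probabilities and all function names are hypothetical stand-ins, and the real scheme may differ in its details.

```python
import random


def independence_sampler(propose, proposal_prob, target_prob, n_steps, rng):
    """Metropolis-Hastings where each proposal is drawn independently of the current state."""
    current = propose(rng)
    samples = []
    for _ in range(n_steps):
        candidate = propose(rng)
        # Importance weight = target / proposal; accept with the ratio of the two weights.
        w_new = target_prob(candidate) / proposal_prob(candidate)
        w_old = target_prob(current) / proposal_prob(current)
        if rng.random() < min(1.0, w_new / w_old):
            current = candidate
        samples.append(current)
    return samples


# Toy stand-ins: three candidate "alignments".
ALIGNMENTS = ["A", "B", "C"]
PROPOSAL = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}  # plays the role of the proposal engine
TARGET = {"A": 0.7, "B": 0.2, "C": 0.1}          # plays the role of the phylo-grammar likelihood

rng = random.Random(0)
samples = independence_sampler(
    propose=lambda r: r.choice(ALIGNMENTS),
    proposal_prob=PROPOSAL.get,
    target_prob=TARGET.get,
    n_steps=20000,
    rng=rng,
)
freqs = {a: samples.count(a) / len(samples) for a in ALIGNMENTS}
print(freqs)  # should land close to TARGET
```

Only the ratio of target values enters the acceptance test, so the target needs to be known only up to a constant factor. That is what makes an unnormalised likelihood usable here.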
Our RNA-oriented tools are also still going strong. Development continues on "evoldoer" (our RNA evolutionary aligner) and "StemLoc" (for RNA multiple alignment). Our current favoured way to proceed with RNA multiple alignment is an MCMC-type approach. We're about to invest in a machine with 32GB of RAM (currently we're limited to 8GB) specifically so we can play around with some of these high-memory RNA algorithms. Keep watching this space.
Anyway, enough rambling from me. Keep hacking away. See you all at ISMB in Brazil, or Benasque'06, or some such cool compbio venue.
Ian Holmes, ihh at berkeley dot edu
UC Berkeley, Dept of Bioengineering
September 7, 2005
DART website: http://biowiki.org/dart
Dart release notes, Release 0.2, March 2005
(1) A Wiki-based website for DART has been set up here -- please take a look, and add your comments. Since it's a Wiki website, anyone can edit it and add information on bugs, wishlist feature requests, general comments and flames. This will be the primary resource for finding the latest information on DART, as well as miscellaneous genomics gossip. So please drop by.
(2) StemLoc, the comparative RNA structure-finder, is now much improved, with numerous additional features including better grammars (and hence better alignments), multiple alignment, optional structure constraints (a la RSEARCH) and/or alignment constraints (a la QRNA), and dotplots;
(3) the evoldoer program, which does evolutionary RNA modeling & alignment, is now bundled with dart (Holmes I: A probabilistic model for the evolution of RNA structure. BMC Bioinformatics 2004;5:166.);
(4) Handel, the MCMC alignment sampler, can now do importance sampling, so that it can be used to explore the posterior alignment distribution implied by any alignment likelihood function;
(5) There have been several improvements to the xrate program, including an implementation of neighbor-joining for people who don't want to make their own trees;
(6) A number of small utility programs are included with the release;
(7) The code has been upgraded to work with the latest release of the gcc compiler, 3.3 (on Apple OSX platforms) and 3.4.2 (on Linux).
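On item (4): the idea behind importance sampling can be sketched in a few lines of Python. This is a generic self-normalised importance-sampling estimator, not Handel's implementation; the toy "alignments", the likelihood values and the function names are all hypothetical.

```python
import random


def importance_estimate(samples, proposal_prob, target_weight, f):
    """Self-normalised importance sampling estimate of the target expectation of f."""
    weights = [target_weight(x) / proposal_prob(x) for x in samples]
    total = sum(weights)
    return sum(w * f(x) for w, x in zip(weights, samples)) / total


# Toy stand-ins: four candidate "alignments", each with a column count.
COLUMNS = {"w": 10, "x": 12, "y": 12, "z": 15}
PROPOSAL = {"w": 0.25, "x": 0.25, "y": 0.25, "z": 0.25}  # what the sampler actually draws from
LIKELIHOOD = {"w": 4.0, "x": 3.0, "y": 2.0, "z": 1.0}    # unnormalised target weights

rng = random.Random(1)
draws = [rng.choice(sorted(COLUMNS)) for _ in range(20000)]

# Posterior mean number of columns under the target, using draws from the proposal.
estimate = importance_estimate(draws, PROPOSAL.get, LIKELIHOOD.get, COLUMNS.get)

# Exact answer for comparison, computed by brute force over the four states.
exact = sum(LIKELIHOOD[a] * COLUMNS[a] for a in COLUMNS) / sum(LIKELIHOOD.values())
print(estimate, exact)
```

Each draw is reweighted by target/proposal, so any alignment likelihood function can be swapped in as `target_weight` without changing how the draws are produced.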
Newsletter, July 2004
I promised this would be a low-volume mailing list, but two messages per year might be a bit slack... so to compensate, here is a newsletter.
GCC 3.4 compatibility
Release 0.2 of Dart is on its way, but I haven't yet found time to make a tarball (or update the tutorial), and I keep adding little things.
The code currently in the repository works with the latest version of gcc (3.4.1) and also is backward-compatible with gcc 3.3.* (as used in the pseudo-parallel universe of Mac OS X).
RNA multiple alignment
More excitingly, stemloc does RNA multiple alignment! How cool is that? As you, discerning User, have come to expect, the code is blindingly fast. It errs on the side of speed (& low memory) versus sensitivity. You can make it more sensitive by playing with the "-nf" switch. HOWEVER:
It's quite easy to max out the memory on your machine. This means you need MORE RAM! There may be cooler hackier ways to constrain the Sankoff algorithm than the ones DART currently knows about. But, ultimately, none of us can shirk the duty of buying a 64-bit box ;-)
Today's DART top tip - logging
One quick-and-dirty way to get more info about what's going on inside DART is to access the built-in logging diagnostics -- pretty much every debugging output method I've ever written is accessible by using the right "-log ..." directive. Type "xrate -logtags" or "xrate -logtaginfo" to get a list (substitute e.g. "stemloc" for "xrate" depending on what program you're using).
As an example, the relevant logtags for "xrate" begin with "RATE_EM" & you can grep for them in the file dart/src/hsm/em_matrix.cc