Newsletter, September 7, 2005
....DART news.... ....Berkeley.... ....September 2005....
(Scroll down for the DART website at the foot of this email.)
Hello there, fellow evolutionary hackers, RNA tinkerers and
phylo-enthusiasts! This is the latest missive on dart-announce, the
low-volume mailing list describing developments with the DART package,
your FAVOURITE software for evolutionary bioinformatics... er, OK, my
British upbringing demands more disclaimers in that sentence. How about:
your favourite software for FAST evolutionary bioinformatics... counting
only open source programs WRITTEN IN C++...... in Berkeley........ by me.
Yeah, that shouldn't offend anyone. (WE RULE!!!!) (ahem)
If nothing else, dart SURELY must be your favourite GPL'd software that
comes with an opportunity to SCORE some FREE BOOZE[*]! (Let's see if that
makes it past the spam filters.) That's right:-- the 2004 bug hunt,
despite being prolonged well into 2005, has finally reached its inevitable
conclusion.
Carolin Kosiol, working with Nick Goldman's group, found some
cool sparse data glitches while experimenting with "xrate" on 61-state
codon models. Yes, that means rate matrices with an almighty 61*61
parameters! (Give or take a factor of 2 due to reversibility constraints.)
Hopefully the problems disclosed by Carolin et al. have now been fixed:
both in the sense that we have patched the code (adding pseudocounts and
various checks for robustness), and in the sense that we have effectively
SILENCED Carolin and crew by sending them a crate of 12 bottles of some of
the finest wines produced in my old stomping ground, the Yarra valley in
Victoria, Australia. (Remember that old Monty Python sketch about
Australian wines: "they really open up the sluice-gates at both ends".) So
that should be the last we hear from Carolin and Nick for a while (well,
that's the theory anyway). Anyone else who reckons they're hard enough,
come and have a go! Best bug found by summer 2005 wins the discoverer a
similarly punishing alcoholic stupor.
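
For a sense of scale on those 61-state codon models, here is a small
parameter-counting sketch. It is plain arithmetic, not xrate code: the
function name is made up for illustration, and how xrate itself
parameterises its matrices is not described in this newsletter. The sketch
assumes the usual convention that each row of a rate matrix sums to zero,
so the diagonal entries are not free parameters.

```python
def rate_matrix_parameter_counts(n_states):
    """Count entries and free parameters of an n-state substitution rate matrix.

    Hypothetical helper for illustration only (not part of DART).
    """
    # Every cell of the square matrix, diagonal included.
    entries = n_states * n_states
    # General (irreversible) model: all off-diagonal rates are free;
    # the diagonal is fixed by the rows summing to zero.
    irreversible = n_states * (n_states - 1)
    # Reversible model: one symmetric exchangeability per unordered pair
    # of states, plus equilibrium frequencies that must sum to one.
    exchangeabilities = n_states * (n_states - 1) // 2
    free_frequencies = n_states - 1
    return {
        "entries": entries,
        "irreversible": irreversible,
        "reversible": exchangeabilities + free_frequencies,
    }


counts = rate_matrix_parameter_counts(61)
print(counts)
```

The reversible count is roughly half the irreversible one, which is the
"factor of 2" mentioned above.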
(A shout out is due to Sam Griffiths-Jones and Alex Bateman, of RFAM, who
came a close second with their heavy-duty testing of the StemLoc program;
much valuable feedback, thanks guys. Maybe next year the booze will be
yours...)
[*] I believe that Ewan Birney has stopped doling out champagne as a bug
prize for his GeneWise programs, though I'm prepared to be corrected.
Ewan?
BTW, lest it escape anyone's attention, the point of doing this kind of
bughunting prize (aka "pandering to beta testers") is that, as a result,
we have VERY FEW BUGS. For example: to my knowledge, dart programs never
segfault. Ever. Some of the more bloated RNA alignment algorithms have
been known to abort when they run out of memory, but that's par for the
course. Dart is very robust software and we intend to keep it that way
(and bribe off any dissenters with fine wines, muahahahaa).
Moving on: XRate, the abovementioned software for estimating
instantaneous rate matrices, continues to be our most popular tool (but
closely followed by StemLoc; see below). This year, Pete Klosterman has
worked to adapt XRate to estimate irreversible rate matrices, using a
generalisation of the eigenspace-EM algorithm that powers the reversible
version. (Much credit should also go to Gerton Lunter, Robert Davies and
others who were kind enough to contribute asymmetric linear algebra
codes.)
XRate has also been adapted to incorporate arbitrary stochastic grammars
(including algorithms like Thorne-Goldman-Jones for predicting secondary
structure of proteins, and Knudsen-Hein for that of RNA). We have also
implemented the basic Siepel-Haussler technique for context-sensitive
substitutions (e.g. incorporating CpG effects, or basepair stacking in
RNA; irreversibility can be an important consideration in this sort of
model). As a result, we now have spin-off programs for working with
evolutionary or "phylo"-HMMs/SCFGs (XFold/XProt, for RNA/protein,
respectively).
XRate and its spin-offs XFold and XProt have been applied in a number
of "big genomics" projects this year, including ENCODE and analysis of
the 12 fruitfly genomes. We hope to expand this.
In other developments, the Handel program (MCMC sampling of multiple
sequence alignments) can now be used together with elaborate
phylo-grammar programs (like XFold and XProt) to sample from the
likelihood function of such phylo-grammars, using a Metropolis-like
accept/reject scheme. That is, you can pipe the sampled alignments from
Handel directly into XFold, and thereby explore XFold's alignment space
using the Handel sampling engine. We're still testing this capability
but expect it to be very powerful. As always, you can download the code
from SourceForge (advance warning: we may move the CVS repository to our
own machines in the near future; anonymous CVS access will still be
offered).
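
The accept/reject idea above can be sketched as an independence sampler:
proposals are drawn from one distribution (standing in for the Handel
alignment sampler) and accepted or rejected so that the chain ends up
sampling from another (standing in for a phylo-grammar likelihood). This is
a toy over three labelled states, not DART code. Every name and number in
it is hypothetical, and the scheme Handel actually uses may differ in its
details.

```python
import math
import random
from collections import Counter

# Toy stand-ins (hypothetical numbers): q plays the role of the proposal
# sampler, target the role of the distribution we actually want to sample.
q = {"A": 0.5, "B": 0.3, "C": 0.2}
target = {"A": 0.2, "B": 0.3, "C": 0.5}


def propose(rng):
    """Draw one state from the proposal distribution q."""
    states = list(q)
    return rng.choices(states, weights=[q[s] for s in states])[0]


def log_weight(state):
    """Log of target/q: how much the proposal under-represents this state."""
    return math.log(target[state]) - math.log(q[state])


def independence_sampler(n_steps, rng):
    """Metropolis-Hastings with proposals that ignore the current state."""
    current = propose(rng)
    samples = []
    for _ in range(n_steps):
        candidate = propose(rng)
        # Accept with probability min(1, w(candidate) / w(current)).
        log_ratio = log_weight(candidate) - log_weight(current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = candidate
        samples.append(current)
    return samples


samples = independence_sampler(50000, random.Random(0))
freqs = {s: n / len(samples) for s, n in Counter(samples).items()}
print(freqs)  # should be close to target, not to q
```

The point of the design is that the proposal only has to be easy to sample
from; the accept/reject step corrects for the mismatch with the target.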
Our RNA-oriented tools are also still going strong. Development continues
on "evoldoer" (our RNA evolutionary aligner) and "StemLoc" (for RNA
multiple alignment). Our current favoured way to proceed with RNA multiple
alignment is an MCMC-type approach. We're about to invest in a machine
with 32GB of RAM (currently we're limited to 8GB) specifically so we can
play around with some of these high-memory RNA algorithms. Keep watching
this space.
Anyway, enough rambling from me. Keep hacking away. See you all at
ISMB in Brazil, or Benasque'06, or some such cool compbio venue.
Ian Holmes, ihh at berkeley dot edu
UC Berkeley, Dept of Bioengineering
September 7, 2005
DART website:
http://biowiki.org/dart
Dart release notes, Release 0.2, March 2005
(1) A Wiki-based website for DART has been set up -- please take a look,
and add your comments.
Since it's a Wiki website, anyone can edit it and add information on
bugs, wishlist feature requests, general comments and flames. This will be
the primary resource for finding the latest information on DART, as well
as miscellaneous genomics gossip. So please drop by.
(2) StemLoc, the comparative RNA structure-finder, is now much improved,
with numerous additional features including better grammars (and hence
better alignments), multiple alignment, optional structure constraints (a
la RSEARCH) and/or alignment constraints (a la QRNA), and dotplots;
(3) the EvolDoer program, which does evolutionary RNA modeling & alignment,
is now bundled with dart (Holmes I. A probabilistic model for the evolution
of RNA structure. BMC Bioinformatics. 2004 Oct 26;5:166.);
(4) Handel, the MCMC alignment sampler, can now do importance sampling, so
that it can be used to explore the posterior alignment distribution
implied by any alignment likelihood function;
(5) There have been several improvements to XRate, including an
implementation of neighbor-joining for people who don't want to make their
own trees;
(6) A number of small utility programs are included with the release;
(7) The code has been upgraded to work with the latest release of the gcc
compiler, 3.3 (on Apple OSX platforms) and 3.4.2 (on Linux).
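
Item (4) above mentions importance sampling. As a rough illustration of
the general technique (not of Handel's implementation), the sketch below
does self-normalised importance sampling: draw from an easy proposal q,
weight each draw by target/q, and use the normalised weights to estimate
expectations under the target. The target only needs to be known up to a
constant, which is the usual situation with a likelihood function. All
names and numbers are hypothetical.

```python
import random

# Toy proposal distribution (hypothetical numbers).
q = {"A": 0.5, "B": 0.3, "C": 0.2}
# Target known only up to a constant; normalised it would be 0.2/0.3/0.5.
unnormalised_target = {"A": 2.0, "B": 3.0, "C": 5.0}


def importance_estimate(f, n_draws, rng):
    """Estimate the expectation of f under the target using draws from q."""
    states = list(q)
    draws = rng.choices(states, weights=[q[s] for s in states], k=n_draws)
    # Importance weight: how much more the target likes x than q does.
    weights = [unnormalised_target[x] / q[x] for x in draws]
    # Dividing by the weight total cancels the unknown normalising constant.
    return sum(w * f(x) for w, x in zip(weights, draws)) / sum(weights)


prob_c = importance_estimate(
    lambda x: 1.0 if x == "C" else 0.0, 50000, random.Random(1)
)
print(prob_c)  # estimate of the target probability of state "C"
```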
Newsletter, July 2004
Hi all,
I promised this would be a low-volume mailing list, but two messages per
year might be a bit slack... so to compensate, here is a newsletter.
GCC 3.4 compatibility
Release 0.2 of Dart is on its way, but I haven't yet found time to make a
tarball (or update the tutorial), and I keep adding little things.
The code currently in the repository works with the latest version of gcc
(3.4.1) and also is backward-compatible with gcc 3.3.* (as used in the
pseudo-parallel universe of Mac OS X).
RNA multiple alignment
More excitingly, stemloc does RNA multiple alignment! How cool is that?
As you, discerning User, have come to expect, the code is blindingly fast.
It errs on the side of speed (& low memory) versus sensitivity. You can
make it more sensitive by playing with the "-nf" switch. HOWEVER:
It's quite easy to max out the memory on your machine. This means you need
MORE RAM! There may be cooler hackier ways to constrain the Sankoff
algorithm than the ones DART currently knows about. But, ultimately, none
of us can shirk the duty of buying a 64-bit box.
Today's DART top tip - logging
One quick-and-dirty way to get more info about what's going on inside
DART is to access the built-in logging diagnostics -- pretty much every
debugging output method I've ever written is accessible by using the right
"-log ..." directive. Type "xrate -logtags" or "xrate -logtaginfo" to get
a list (substitute e.g. "stemloc" for "xrate" depending on what program
you're using).
As an example, the relevant logtags for "xrate" begin with "RATE_EM" &
you can grep for them in the file
dart/src/hsm/em_matrix.cc