
: This tutorial was originally written in 2004. The stemloc grammar has changed
slightly since then. The tutorial has been updated through Section 3, but log-likelihood scores,
alignments, etc. from tutorial examples in the other sections
may be slightly different from what you obtain now. Please contact
IanHolmes or
RobertBradley if you notice any discrepancies between the tutorial and the behavior you experience.
Contents
1. Introduction
StemLoc is a program for doing pairwise alignment of RNA sequences. It
attempts to predict the secondary structure at the same time, using a
variant of an algorithm first proposed by Sankoff
[1], with some
heuristics formally known as "envelopes"
[2] and informally as
"go-faster stripes".
This tutorial describes how to use stemloc by example. It assumes that
you have the
DART package installed.
DART can be downloaded from
SourceForge.
You'll then need to build it (requires gcc 3.2 or later; see INSTALL file in top-level directory).
The examples in this tutorial are all found in the subdirectory
'dart/src/stemloc', so 'cd' into that directory once
DART is built.
We will start with a quick tour of some of stemloc's features, using
two Asp-coding tRNA sequences distributed with the DYNALIGN package.
To try a local alignment, type the following
stemloc --local dynalign.trna
You should see the following output
# STOCKHOLM 1.0
#=GR RD0260/26-67 SS ..<<<<<.......>>>>>.....(<<<<.......>>>>).
RD0260/26-67 UACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUC
#=GC PS_cons UAC..CCCUGUCACGG..G.GA..G.GGGUUC.AAUCCC..C
RD0500/26-66 UACGACCCUGUCACGGUCGUGA-CGCGGGUUCgAAUCCCGCC
#=GR RD0500/26-66 SS ..<<<<<.......>>>>>...-.<<<<<.......>>>>>.
#=GC SS_cons ..<<<<<.......>>>>>.....<<<<<.......>>>>>.
#=GF SC 31.872
//
This output is in
Stockholm format, the format
used by Rfam. It shows the sequence names, the co-ordinates of the
matches, the alignment, the consensus primary sequence,
the secondary structure of each sequence, the
consensus secondary structure, and the
log-odds score of the alignment in bits. The "//" line is used to
separate alignments (here there is only one alignment, so it indicates the
end of the file). Stockholm format is described in full
here.
[To convert Stockholm format into CT format (as used by NAVIEW in the
MFOLD package, and various other programs), a Perl script called
'stockholm2ct.pl' is provided in
DART subdirectory 'dart/perl'. See
the
Stockholm tools wiki page for this and more
(e.g. to convert to and from FASTA format).]
The above alignment is a
local alignment. Since we know that these two particular sequences
are entire tRNA genes, it's reasonable to ask for a global alignment.
To do this, type
stemloc --global dynalign.trna
(actually, stemloc performs global alignment by default,
so 'stemloc dynalign.trna' is the same as the above command),
which should yield the following result
# STOCKHOLM 1.0
#=GR RD0260/1-77 SS <<<<<<<..................-------..<<<<<.......>>>>>.....(<<<<.......>>>>)>>>>>>>....
RD0260/1-77 GCGACCGGGGCUGGCUUGGUAAUGG-------UACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
#=GC PS_cons GC....G.GG........GUA.UGG.......UAC..CCCUGUCACGG..G.GA..G.GGGUUC.AAUCCC..C....GCGCCA
RD0500/1-76 GCCCGGGUGGU-------GUAGUGGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUUCgAAUCCCGCCUCGGGCGCCA
#=GR RD0500/1-76 SS <<<<<(<....-------................<<<<<.......>>>>>...-.<<<<<.......>>>>>>)>>>>>....
#=GC SS_cons <<<<<<<...........................<<<<<.......>>>>>.....<<<<<.......>>>>>>>>>>>>....
#=GF SC 27.99
//
This is better; we've got three out of the four cloverleaf stems.
The "correct" structural alignment according to the
Bayreuth tRNA database is:
RD0260 GCGACCGGGGCUGGCUUGGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
#=GR RD0260 SS <<<<<<<..<<<<........-.>>>>.<<<<<.......>>>>>.....<<<<<.......>>>>>>>>>>>>....
RD0500 GCCCGGGUGGUGUAGU-GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUUCGAAUCCCGCCUCGGGCGCCA
#=GR RD0500 SS <<<<<<<..<<<....-.......>>><<<<<<.......>>>>>>..-.<<<<<.......>>>>>>>>>>>>....
The grammar does not pick up the fourth stem (the
D arm): it prefers to place two gaps to align the common "GUA.UGG".
We can ask stemloc to score the true alignment and structure:
% stemloc dynalign.trna.aligned.folded
# STOCKHOLM 1.0
#=GR RD0260/1-77 SS <<<<<<<..(<((........-.))>)(<<<<<.......>>>>>)....(<<<<.......>>>>)>>>>>>>....
RD0260/1-77 GCGACCGGGGCUGGCUUGGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
#=GC PS_cons GC....G.GG.....U.GG...AU..UAC..CCCUGUCACGG..G.GA..G.GGGUUC.AAUCCC..C....GCGCCA
RD0500/1-76 GCCCGGGUGGUGUAGU-GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUUCGAAUCCCGCCUCGGGCGCCA
#=GR RD0500/1-76 SS <<<<<(<..(<<(...-......)>>)<<<<<<.......>>>>>>..-.<<<<<.......>>>>>>)>>>>>....
#=GC SS_cons <<<<<<<..<<<<..........>>>><<<<<<.......>>>>>>....<<<<<.......>>>>>>>>>>>>....
#=GF SC 15.895
//
'dynalign.trna.aligned.folded' contains the true alignment as well as the Bayreuth structural information; stemloc recognizes this and uses it as constraints.
We can also ask for the best structure consistent with the true alignment:
% stemloc dynalign.trna.aligned
# STOCKHOLM 1.0
#=GR RD0260/1-77 SS <<<<<<<..............-......<<<<<.......>>>>>.....(<<<<.......>>>>)>>>>>>>....
RD0260/1-77 GCGACCGGGGCUGGCUUGGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
#=GC PS_cons GC....G.GG.....U.GG...AU..UAC..CCCUGUCACGG..G.GA..G.GGGUUC.AAUCCC..C....GCGCCA
RD0500/1-76 GCCCGGGUGGUGUAGU-GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUUCGAAUCCCGCCUCGGGCGCCA
#=GR RD0500/1-76 SS <<<<<(<.........-...........<<<<<.......>>>>>...-.<<<<<.......>>>>>>)>>>>>....
#=GC SS_cons <<<<<<<.....................<<<<<.......>>>>>.....<<<<<.......>>>>>>>>>>>>....
#=GF SC 19.703
//
'dynalign.trna.aligned' contains the true alignment but no structural information. In this particular case, the fourth stem doesn't appear to score well under the grammar.
The above examples use several command-line options that will be
explored in this tutorial. To get a full list of available options,
type
stemloc --help
or just
stemloc -h
Stemloc has a comprehensive logging system. To get a flavor, try
stemloc -log CFGDP dynalign.trna
You may notice that this takes slightly longer. The program should
print a series of log messages something like this
CYK: finished 13398 subseqs (14.6%) and 6018180 bifurcations (7.5%)
CYK: finished 16883 subseqs (18.4%) and 9031271 bifurcations (11.3%)
[...]
CYK: finished 91590 subseqs (99.8%) and 79010219 bifurcations (99.4%)
CYK: finished 91751 subseqs (100%) and 79450928 bifurcations (100%)
To get more information about logging options, type
stemloc -loghelp
2. The algorithm
The technology underpinning stemloc is called the "Pair Stochastic
Context-Free Grammar" or just "Pair SCFG". Chances are, you've heard
of SCFGs (statistical language models, originally invented by the
legendary Noam Chomsky, imported from Natural Language Processing by
David Haussler, Sean Eddy and others). A Pair SCFG is simply an SCFG
for two sequences, which is what you need to do pairwise alignment.
Algorithmically, multiple-sequence SCFGs are related to Sankoff's
dynamic programming algorithm for simultaneously folding and aligning
multiple RNA sequences
[1] (in fact, they can be viewed as a scoring
scheme for this algorithm). Sankoff's algorithm considers all possible
folds and all possible alignments of the two sequences, so it is
thorough, but it is slow: for N sequences, each of length L, Sankoff's
algorithm takes time O(L^(3N)) and memory O(L^(2N)), which basically
puts it out of reach of anyone except the NSA. In 2003, you can
probably do the above example (pairwise folding of tRNA) on a home
computer, but not very quickly. The time and memory complexity of
multiple-sequence SCFG algorithms are the same as for Sankoff's
algorithm. So, some acceleration (and compactification) is called for,
and that's what the stemloc algorithm does.
Rather than considering
all folds and
all alignments of the two
sequences (which, since the number of sequences is N=2, takes time
O(L^6), using a Pair SCFG), stemloc allows you to limit this, in a
flexible and user-defined way. Specifically, you can
pre-fold each
sequence individually (using a Single-sequence SCFG, which takes time
O(L^3)), and you can
pre-align the two sequences, ignoring the
secondary structure and focussing only on primary sequence homology
(using a Pair HMM, which takes time O(L^2)).
The first of these steps -- pre-folding -- gives you a set of
candidate structures for each sequence, while the second step --
pre-alignment -- gives you a set of candidate alignments.
StemLoc then
uses these candidate folds & alignments to
constrain the Pair
SCFG. Specifically, any secondary structure or alignment that is
inconsistent with the candidate folds or alignments is not even
considered while doing the Pair SCFG alignment step. Therefore,
stemloc gives you a tradeoff of speed against completeness.
Furthermore, this tradeoff is highly tuneable, as will be seen below.
3. Visualising the algorithm
In stemloc terminology, the set of candidate folds is referred to as
the "fold envelope", while the set of candidate alignments is referred
to as the "alignment envelope" (or sometimes the "pair envelope"). It
is possible to visualise both these envelopes as stemloc runs.
Try typing the following
stemloc nanos-tiny.rna -fast -log DOTPLOT
The '-fast' flag tells stemloc to consider just the 100 best RNA structures
rather than sampling all folds.
If the output scrolls past too fast to see, try the following (this
just pipes standard error through 'less' with control characters
preserved, in case you're interested)
stemloc nanos-tiny.rna -fast -log DOTPLOT |& less -r
The sequences used here are very short (stem II from the TCE
(Translational Control Element) in the 3'UTR of the nanos gene in
fruitflies); they are just intended to illustrate the algorithm. In
the output, you should see six ASCII art-like dotplots rendered in
ANSI terminal color. The titles of these dotplots are
Banded folding pre-envelope for 'DVU24695_3utr'
Sampled fold envelope for 'DVU24695_3utr'
Banded folding pre-envelope for 'DRONANOS_3utr'
Sampled fold envelope for 'DRONANOS_3utr'
Banded alignment pre-envelope for 'DVU24695_3utr' vs 'DRONANOS_3utr'
Sampled alignment envelope for 'DVU24695_3utr' vs 'DRONANOS_3utr'
Here are screenshots of these images (PNG format):
 |
 |
| 1. Fold pre-envelope (Drosophila virilis) |
2. Fold envelope (Drosophila virilis) |
 |
 |
| 3. Fold pre-envelope (Drosophila melanogaster) |
4. Fold envelope (Drosophila melanogaster) |
 |
 |
5. Alignment pre-envelope (both species) |
6. Alignment envelope (both species) |
The first dotplot, "Banded folding pre-envelope for 'DVU24695_3utr'",
shows
all possible basepairs for the first of the two sequences
(called 'DVU24695_3utr'), while the second dotplot shows the basepairs
that have made it into the fold envelope. Likewise, the third dotplot
shows all possible basepairs for 'DRONANOS_3utr', while the fourth
dotplot shows those in the fold envelope. Finally, the fifth dotplot
shows all possible residue-level alignments between a residue from
'DVU24695_3utr' and a residue from 'DRONANOS_3utr', while the sixth
dotplot shows residue-level alignments that have made it into the
envelope. Additional information is provided in the form of color: the
"brighter" colors (from red to white, as indicated by the color
code at the bottom of the dotplot) represent cells that made it into
the envelope earlier; in other words, if the threshold for creating
the envelope was raised, the red cells would be the first to be
dropped from the envelope.
Here is another example for the two tRNA sequences from the introduction,
which are slightly larger
 |
 |
 |
| Fold envelope for RD0260 from Phage T5 |
Fold envelope for RD0500 from Haloferax volcanii |
Alignment envelope for RD0260 and RD0500 |
(obtained with 'stemloc dynalign.trna -fast --local -log DOTPLOT').
The dotplots can be quite informative. For example, it is clear that
the above alignment envelope (the third & final image above) is local, not
global, since the 5' ends of the tRNAs are not aligned. (See below for
the global alignment envelope for these sequences.)
The command-line option "-log DOTPLOT" is an example of a
DART logfile
directive. It's possible to get quite a lot of information out of a
DART program via logging directives. You can specify individual
logfile "tags" (as in the above example "DOTPLOT") or just specify a
log level (the lower the log level, the more information you get). You
can also ask for logging directives from specific parts of the source
code, though it's unlikely you'll want to use this.
Logging directives can be used in combination. For example, try
stemloc nanos-tiny.rna -log DOTPLOT -log 6 -log CFGDP
This displays all logging messages with tags "DOTPLOT" and "CFGDP", as
well as all log messages of level 6 or above. (Roughly speaking, level
9 is silent; level 6 is chatty; level 3 is verbose; and level 0 will
fill your hard drive. You can also use all the levels in between.)
Logging messages are printed to standard error, unless the '-logfile LOG'
option is specified, in which case they are sent to a logfile called
LOG, or whatever you specify. (They're also wrapped in XML; see below.)
The above example introduced a particular logging tag: '-log CFGDP'.
This displays log messages like the following
CYK: finished 39 subseqs (100%) and 928 bifurcations (100%)
These messages indicate the progress of the dynamic programming (DP)
algorithm, which can be reassuring for long runs. "CYK" is the name of
a particular DP algorithm; "subseqs" and "bifurcations" are different ways
of indicating the progress of the algorithm (subseqs is a memory-related
progress indicator, while bifurcations is time-related). For more detailed
explanations of these terms, refer to the SCFG literature, e.g.
[2] or
[3].
Some more log tags include
-
ALLOC
This displays a short message before attempting to allocate large
amounts of memory. In times gone by, this was the only way of anticipating an
out-of-memory error, which (unsatisfyingly) can sometimes cause Unix
to hang for ages before admitting defeat.
Nowadays, stemloc includes an option (-mb) to truncate the alignment envelope so as to limit the maximum number of megabytes that will be allocated during the simultaneous-alignment-and-folding step.
-
STEMLOC_FOLDENV, STEMLOC_PREFOLD, STEMLOC_BANDED_FOLDENV, STEMLOC_SAMPLED_FOLDENV
STEMLOC_PAIRENV, STEMLOC_PREALIGN, STEMLOC_BANDED_PAIRENV, STEMLOC_SAMPLED_PAIRENV
These are more specific versions of 'DOTPLOT', allowing you to print
individual dotplots, e.g. only the fold envelope dotplots
(STEMLOC_FOLDENV), or only the dotplot corresponding to the final
sampled fold envelope (STEMLOC_SAMPLED_FOLDENV), or only the dotplot
corresponding to alignments pre-specified as in section [4](b) below
(STEMLOC_PREALIGN), etc.
Other log tags are introduced throughout the tutorial.
To wrap up the topic of logging, it was mentioned above that log
messages saved in the logfile are stored in a simple XML format. This
format encodes the date, time and source-code location of all logfile
messages, along with the corresponding log level and tags. There is a
Perl script in the 'dart/perl' directory, called 'dartlog.pl', that
unpicks this format and renders the (optionally filtered) log messages
(together with the extra information) in ANSI color, piping the output
through 'less -rF'. You can also get the XML tags to appear on the
standard error version of the logfile, using the '-lvb' option (short
for '--logverbose'). If you use these options, you definitely count as
a power user.

Screen captures and examples in the preceding sections of the tutorial have been updated to reflect
changes made in the current version of stemloc. You may discover that the fold envelopes, bit scores, etc.
you produce may differ from those given in the following sections. For example, stemloc now considers
all folds by default ('--nfold -1') rather than limiting the search to the best 100 structures.
4. Tuning the envelopes
The performance of stemloc is often quite sensitive to the envelope
cutoff threshold; set the threshold too high, and you may miss the
crucial fold or alignment. Conversely, if the threshold is too low,
the envelopes will be too large, so that the Pair SCFG may require too
much time or memory to be practical.
You can tune the threshold using the options '-nf' and '-na', which
set the number of folds and alignments (respectively) that are sampled
to create the envelopes. In each case, you can set the respective
parameter to -1 to unlimit the number of folds/alignments sampled.
For example, the following command
stemloc nanos-tiny.rna -nf -1 -na -1
runs the full Sankoff algorithm on the dataset 'nanos-tiny.rna'.
The following images illustrate the effect on the fold envelope
of increasing the '-nf' parameter for the two example tRNAs RD0260 (top) and RD0500 (bottom)
 |
 |
 |
 |
| '-nf 1' |
'-nf 5' |
'-nf 20' |
'-nf 200' |
 |
 |
 |
 |
And here's what happens to the global alignment envelope
if you gradually increase the '-na' parameter for these two sequences
 |
 |
 |
 |
| '-na 1' |
'-na 5' |
'-na 20' |
'-na 200' |
You can also use the
-mb option to truncate the number of alignments added to the alignment envelope so as to limit the maximum number of megabytes that will be allocated during the simultaneous-alignment-and-folding step.
Most alignments do not contain huge indels, so that the alignment
generally does not wander far from the main diagonal of the DP matrix
(or alignment dotplot). You can restrict the alignment envelope to be
within a certain distance of the diagonal using the '-band' option.
Similarly, if you are looking for (say) short regulatory elements in a
long 3'UTR, you may not want to consider RNA structures where very
distant bases are paired. You can restrict the maximum stem length
allowed in the fold envelope using the '-len' option.
Here is what dotplots of banded fold and alignment pre-envelopes look
like for the prokaryotic tRNA examples from the introduction,
using the banding options '-len 30' and '-band 10'
|
|
| Fold pre-envelope for RD0260, option '-len 30' |
Alignment pre-envelope for RD0260 vs RD0500, option '-band 10' |
Putting these together, the following is a more realistic attempt to
find the TCE in the nanos 3'UTR
stemloc Nanos-Drosophilae-virilis-melanogaster.fa -len 100 -band 50 -log CFGDP -log 6
You can also pre-specify the fold or alignment envelopes, simply by
passing in the data as an (optionally annotated) Stockholm alignment,
instead of as a FASTA file. Effectively, you have three options:
(a) you can specify the structure of a sequence, but not its
alignment, by putting the sequence in its own alignment and
annotating it with a "#=GR
SS" line (see Stockholm format
documentation for details). For example,
DVU24695_3utr gaagcucuggcagcuuu
#=GR DVU24695_3utr SS <<<<<<.....>>>>>>
//
DRONANOS_3utr gaggcucuggcagcuuu
#=GR DRONANOS_3utr SS <<<<<<.....>>>>>>
//
Alternatively, you could specify the structure of just one of the
sequences, and leave all the others un-annotated (or pass them in
separately in a FASTA file; stemloc can take multiple input
files). In this mode, and with the alignment envelope unlimited
('-na -1'), stemloc is behaving a little like the
RSEARCH program (though it lacks certain features of RSEARCH,
such as the assignment of statistical significance by fitting an
Extreme Value Distribution to the search results).
(b) you can specify the alignment of two (or more) sequences, but not
their structure. For example,
DVU24695_3utr gaagcucuggcagcuuu
DRONANOS_3utr gaggcucuggcagcuuu
//
In this mode, and with the fold envelope unlimited ('-nf -1'),
stemloc is behaving a little like the QRNA program
(though, again, it lacks many features of QRNA, such as the
three-way model comparison that discriminates RNA genes from
coding DNA sequences and neutral homology).
(c) you can specify both the alignment and the structure of two (or
more) sequences, as follows
DVU24695_3utr gaagcucuggcagcuuu
#=GR DVU24695_3utr SS <<<<<<.....>>>>>>
DRONANOS_3utr gaggcucuggcagcuuu
#=GR DRONANOS_3utr SS <<<<<<.....>>>>>>
//
Note that it is not, in general, sufficient to specify a consensus
fold for the entire alignment (via a Stockholm "#=GC SS_cons" line); you
must specify individual folds for each sequence. (A program to
propagate consensus folds to individual sequence folds, called
"gc2gr-ss", is distributed as part of DART.)
Options (b) and (c) can be useful for debugging, or to understand why
stemloc is giving you the results that it does. For example, to get an
idea of why stemloc only finds stem II of the TCE, and not the full
three-way stem junction, try the following
stemloc Nanos-Drosophilae-virilis-melanogaster.folded -g -na -1
or
stemloc Nanos-Drosophilae-virilis-melanogaster.aligned -g
The score is quite negative, probably because the structure is so full
of indels. One way to approach this kind of problem is to try training
the program on gappier data. More on this later.
By default, the fold (and alignment) envelopes are creating by
sampling the best N folds (alignments) for each sequence (pair), where
N is given by the '-nf' option (or '-na'). You can change this
behaviour, so that stemloc samples a random set of folds and/or
alignments, using the '-rf' and '-ra' options. This may be quicker in
some cases (about twice as fast; the technical reason is that it
doesn't require an Outside pass through the DP matrix), but it can
be less predictable. To ensure that the stochastically sampled
envelopes are, at least, reproducible, you can use the '-rndseed'
option to seed the random number generator with a particular
value. (By default, the random number generator is seeded on the
current clock time, so consecutive runs with sampled envelopes may
give different results each time.)
Here are examples of a stochastic fold envelope and a stochastic
alignment envelope for the two prokaryotic tRNAs
|
|
| Stochastic fold envelope for RD0260 |
Stochastic alignment envelope for RD0260 vs RD0500 |
As mentioned in the introduction, you can force global alignment using
the '-g' option. Depending on whether you selected global or local
alignment, the pre-alignment step is also global or local.
Here is the deterministic global alignment envelope for the two tRNAs (together with
the local alignment envelope again, for comparison)
|
|
| Global alignment envelope |
Local alignment envelope |
You can ask for more than one alignment for each sequence pair using
the '-mh' ("max hits") option, which limits the number of hits
returned for each pair (by default, this is 1). This does a
Waterman-Eggert-style masking after each reported hit. You can also
ask for only hits above a minimum score threshold, using the '-ms'
("min score") option.
Occasionally, you may get an error message because the score for a
particular sequence pair is -infinity. This can happen if the
envelopes are too restrictive. For example, if you ask for only one
alignment and one fold of each sequence
stemloc -na 1 -nf 1 dynalign.trna
you get the following result
Warning: alignment score for 'RD0260' and 'RD0500' is -infinity; skipping.
(this usually means there is no valid alignment of the sequences, which often means that the precomputed structures or alignments are too restrictive;
try setting -nf or -na higher?)
You can see exactly why this is by setting the POSTENV log tag, which
displays each sampled fold & alignment
stemloc -na 1 -nf 1 dynalign.trna -log POSTENV
Finding 1 best RNA secondary structures for fold envelope
Sequence GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
Fold 1 <<<<<<<(.........................................................)>>>>>>>.... (score -15.343 bits)
Finding 1 best RNA secondary structures for fold envelope
Sequence GCCCGGGUGGUGUAGUGGCCCAUCAUACGACCCUGUCACGGUCGUGACGCGGGUUCgAAUCCCGCCUCGGGCGCCA
Fold 1 <<<<<<<(<<((<(((((<<<(((((<<<<<<(<<.>>)>>>>>>)))))>>>)))))>))>>)>>.>>>>>.... (score -12.656 bits)
Finding 1 best pairwise alignments for alignment envelope
Adding state path to alignment envelope
Alignment 1 (score 23.247 bits):
RD0260 gcgaccggggcuggcuugguaaugguacuc------------------------------cccugucacgggagagaauguggguucaaaucccaucggucgcgcca
RD0500 ------------------------------gcccgggugguguaguggcccaucauacgacccugucacggucgugacgc-ggguucgaaucccgccucgggcgcca
The single fold predicted for each sequence is incompatible with the
single best alignment, since the 3' ends of the stems are aligned but
the 5' ends are not. Hence, the fold and alignment envelopes are
incompatible, so the returned score is -infinity.
5. Training the parameters
One of the most powerful features of probabilistic models is the
ability to parameterise them automatically from data by maximising
the likelihood of a "training set". In the case of SCFGs (and other
models), this can be done quickly and efficiently using the
Inside-Outside algorithm, a special case of the Expectation
Maximisation (EM) algorithm.
Why would you want to train parameters yourself, when the default set
of parameters hat comes with stemloc is probably adequate? The answer
is that it may not be adequate for all cases. The default parameters
for stemloc were trained on a selection of pairwise alignments of
betweem 30% and 40% sequence identity from release 5.0 of Rfam. We
have already seen one case where these parameters are only partially
effective (the nanos TCE, above). Another case where they are quite
hopeless is the regulatory stem-loop in the K-10 3'UTR from
Drosophila. Try
stemloc k10-3utr.dna -log 6
and, for the "true" alignment,
stemloc k10-3utr.aligned
You can get a breakdown of the score for the true alignment as follows
stemloc k10-3utr.aligned -log CYK_EXPANDED_TRACE
Because stemloc was trained on low-identity alignments, it rewards
some covariant matches (e.g. UA-AU, or AU-GC) better than exact
identities (UA-UA). Also, it's not too fond of bulges, and it likes to
see conservation in bulges and loops. These are parameterisation
issues, and may (or may not) be fixable by training. Sometimes a
signal is just buried, and there's not much you can do about that, but
you can often get extra mileage by training on a dataset that more
closely reflects your own data.
StemLoc uses three different stochastic grammars. These are
- [F] a single-sequence SCFG for predicting the fold envelope
- [A] a pair HMM for predicting the alignment envelope
- [C] a pair SCFG for the full structural alignment comparison
Each of these grammars has its own parameterisation and can be
independently trained. In fact, you can do the following tasks with
each grammar:
- [L] Load the parameters from a file
- [S] Save the parameters to a file
- [T] Train the parameters from a Stockholm multiple alignment database
See the help text ('stemloc -h') for details on how to do these things.
When training, it is important not to overtrain; that is, to use a
dataset that contains insufficient data to train all the
parameters. For example, the fold grammar has 298 parameters: the bulk
of these come from the 256 = 16*16 different probabilities for all
possible base-pair stackings in an RNA double helix. To avoid
overtraining this grammar, you need a dataset that contains at least
one example of all 256 different stackings.
You can get an idea of how many parameters are required for each
grammar by saving the default parameters to a file (see 'stemloc -h' for details).
The parameters are stored in a machine-readable notation for SCFGs.
The training set must be a fully annotated Stockholm database like this
#=GF WT 1.0
DVU24695_3utr gaagcucuggcagcuuu
#=GR DVU24695_3utr SS <<<<<<~~~~~>>>>>>
DRONANOS_3utr gaggcucuggcagcuuu
#=GR DRONANOS_3utr SS <<<<<<~~~~~>>>>>>
//
#=GF WT 0.5
k10_dmel_utr3 CUUGAUUGUAUUUUUAAAUUAAUUCUUAAAAACUACAAAUUAAG
#=GR k10_dmel_utr3 SS <<<<<<<<<<<<<<<<<~~~~~~~~>>>>>>>.>>>>.>>>>>>
k10_dpse_utr3 CUUGAUUGUAUUUUUAGGUUCACUCUUAGAAAUCACAAAUUAAG
#=GR k10_dpse_utr3 SS <<<<<<<<<<<<<<<<<~~~~~~~~>>>>>>>.>>>>.>>>>>>
//
Here, the "#=GF WT" lines specify the weight that each alignment
contributes to the training. In the above example, the first alignment
contributes twice as much as the second alignment, exactly as if it
had been included twice.
Note the tilde (~) characters in the SS secondary structure lines.
These indicate to stemloc that the region is single-stranded AND un-bifurcated.
This is in contrast to a period (.) character, which indicates that a region is single-stranded but may still be bifurcated.
(A bifurcation is a special type of production rule in a context-free grammar that allows the parse tree
to fork, e.g. at a multiloop in an RNA structure, where the structure splits into two sub-structures.)
Bifurcations are very expensive and can really slow down training, especially in sequences like rRNA where you have long stretches of single-stranded sequence.
So if you want to force stemloc NOT to make bifurcations in a particular single-stranded region,
annotate that region with tilde characters in the SS line.
Although you are free to train your parameters every time you want to
do an alignment, typically you would want to train the parameters
once, save them to a file, then re-load them every time. To train (and optionally
load/save) the parameters without doing any alignment, use the '-to'
command-line option.
6. Some common questions
I want to align multiple RNA sequences
StemLoc now does this, in a variety of ways.
See pages for progressive mode, local mode, liberal mode, etc.
Unfortunately there is no tutorial for this functionality yet (write one!)
I want to predict the structure of a multiple RNA alignment
Stemloc can take pairwise alignments (in Stockholm format as input) and annotate them with secondary structure.
It does this quite well.
Stemloc will also attempt to annotate multiple alignments with structure, but it's less good at that.
Our xfold program, which uses an evolutionary model of basepair covariation, is a much better tool for this job.
I want to align/cluster RNA structures
You can do this with stemloc! Just supply it with a bunch of sequences that are annotated with RNA structures.
How do I do this, you ask? Put the sequences in a Stockholm format file, using one alignment per sequence.
(That is, each alignment just contains a single ungapped sequence.)
You can then use Stockholm's "#=GR SS" notation to annotate each sequence with a structure.
Note that you can also use this to align a bunch of unstructured sequences to a single reference structure.
Why does the program abort?
Probably because you've run out of memory.
There are several ways to avoid out-of-memory errors:
- Use the
-mb option to truncate the alignment envelope so as to limit the maximum number of megabytes that will be allocated during the simultaneous-alignment-and-folding step.
- Specify
--nocachefold to prevent stemloc from caching the fold envelopes in-memory. This can save a lot of memory for large datasets.
- Use progressive alignment instead of sequence annealing. Progressive is less accurate, but more memory-efficient.
How do I reference stemloc?
Cite [4] and [2], i.e.
References.
[1] Sankoff D. and Kruskal J.B., eds. 1983.
Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison.
Addison-Wesley, Reading, MA.
[2] Holmes I. and Rubin G.M. 2002.
Pairwise RNA structure comparison using stochastic context-free grammars.
Proceedings, Pacific Symposium on Biocomputing.
[3] Durbin R., Eddy S., Krogh A. and Mitchison G. 1998.
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Cambridge University Press, Cambridge, UK.
[4] Holmes I. 2005.
Accelerated probabilistic inference of RNA structure evolution.
BMC Bioinformatics.