Calculating amino acid substitution rates
Mon Sep 11, 2006 11:46 PM
Toying around with XGram. Selected a multiple alignment
from the full species set (cluster 5052_1) and extract the
sequences:
perl ~/gpipe/extract_fasta.pl <(grep 5052_1 ~/projects/flies/release1v5/orthology_malis/full_species.map) \
< ~/projects/flies/release1v5/orthology_malis/step1.dir/cluster_5052.dir/cluster_5052.bl_mali > input.fasta
Then, translating everything to amino acids, removing gene ids
and converting to Stockholm format:
python ~/t/sequence2sequence.py --method=translate < input.fasta |\
perl -p -e "s/[|]\S+//" |\
python ~/t/mali2mali.py -i fasta -o stockholm -v 0 > input.stk
Seems that I have picked an alignment with plenty of polyQs. Might as
well....Also, the sequences are highly similar, more than 84% identity.
Adding the tree later, will first see about grammar.
Wed Sep 13, 2006 4:09 PM
Found the problem with gcc4.1 and
DART:
- extra qualifiers
- injected friend declarations
- non-template member functions in non-template classes inheriting from a template class.
The following page has all the info:
http://womble.decadentplace.org.uk/c++/syntax-errors.html
Thu Sep 14, 2006 11:40 AM
Running nullprot.eq on input.stk. I input the median ks tree, the output is
a new grammar with changed rates and an unchanged tree.
I played a bit around with the parameters
XGram usage
Uniform rates
I then ran uniformprot.eq on input.stk with all rates set to 0.00001 in the
chain.
../dart/bin/xgram --train uniform.trained --grammar uniform.eq -log 5 input.stk >& uniform.out
Nullprot iterations are:
EM iteration #1: log-likelihood = -15542.8 bits
EM iteration #2: log-likelihood = -9831.44 bits
EM iteration #3: log-likelihood = -9145.23 bits
EM iteration #4: log-likelihood = -9070.67 bits
EM iteration #5: log-likelihood = -9065.26 bits
while uniform iterations are:
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9056.21 bits
EM iteration #3: log-likelihood = -9044.68 bits
EM iteration #4: log-likelihood = -9042.95 bits
So, same data, but slightly different likelihoods.
Using visualizeRates.pl to look at the matrices (
BubblePlots).
Note that visualizeRates.pl checks the suffix of the grammar file -
it has to be `eg`!
The results look virtually identical, as they should. I am quite amazed
at the few iterations the algorithm performs.
More iterations and pseudocounts
The improvement threshold seems to be in the significant decimals and
not the absolute threshold. Making the threshold very small, I get the
following result:
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9056.21 bits
EM iteration #3: log-likelihood = -9044.68 bits
EM iteration #4: log-likelihood = -9042.95 bits
EM iteration #5: log-likelihood = -9042.49 bits
EM iteration #6: log-likelihood = -9042.28 bits
EM iteration #7: log-likelihood = -9042.19 bits
Warning: attempt to find equilibrium distribution directly from rate matrix
yielded zero or negative probabilities. Setting initial distribution from
observed counts instead. (This may have been caused by a sparse matrix;
consider using pseudocounts to avoid problems like this)
EM iteration #8: log-likelihood = -9046.28 bits
Warning: log-likelihood dropped from -9042.19 to -9046.28 bits during EM
I added one pseudocount to the initial distribution:
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001 -pi 1 >/dev/null
but I get the same result. Adding 10 pseudo-counts does not change
anything either.
But if I add also 1 pseudocount for the mutations:
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001 -pi 1 -pm 1 > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9405.24 bits
EM iteration #3: log-likelihood = -9418.79 bits
Warning: log-likelihood dropped from -9405.24 to -9418.79 bits during EM
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -9419.32 bits
Best log-likelihood: -9405.24
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
The pseudocounts show up quite clearly in the resultant fitted rate matrix.
Trying to use fractional pseudocounts:
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001 -pi 0.1 -pm 0.1 > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9086.61 bits
EM iteration #3: log-likelihood = -9077.04 bits
EM iteration #4: log-likelihood = -9075.59 bits
EM iteration #5: log-likelihood = -9075.18 bits
EM iteration #6: log-likelihood = -9075.02 bits
EM iteration #7: log-likelihood = -9074.98 bits
EM iteration #8: log-likelihood = -9074.96 bits
EM iteration #9: log-likelihood = -9074.94 bits
EM iteration #10: log-likelihood = -9074.94 bits
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -9074.94 bits
Best log-likelihood: -9074.94
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
Again, there are many populated transitions, that were negligible before.
In particular, transitions from W and from C are highly populated. The reason:
there are probably hardly and W and C present in this multiple alignment.
Counting (I know this is clumsy):
for x in A C D E F G H I K L M N P Q R S T V W Y; do printf "%s\t%i\n" ${x} `grep "^dpse_vs" input.stk | perl -p -e "s/^\S+\s+//; s/[^${x}]//g" | wc -c`; done
A 112
C 13
D 43
E 53
F 34
G 58
H 30
I 43
K 51
L 56
M 21
N 58
P 81
Q 153
R 42
S 95
T 69
V 47
W 12
Y 35
and indeed, that is the case. Pseudocounts are dangerous, as they will affect the outcome.
Tree branch lengths
Set every branch length to 1 (tree is thus not clock-like).
../code/dart/bin/xgram --train uniform_uniform.eg --grammar uniform.eg -log 6 input_uniform_branches.stk > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -16225.4 bits
EM iteration #2: log-likelihood = -9374.27 bits
EM iteration #3: log-likelihood = -9363.39 bits
EM iteration #4: log-likelihood = -9361.05 bits
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -9360.57 bits
Best log-likelihood: -9360.57
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
The rates are smaller, but if I scale the plot by a factor of 2, the matrices are
again indistinguishable. Again, the absolute scale of a matrix is non-informative.
I would have thought that different relative branch lengths would result in a different
matrix.
Checking out code
Buffy is the CVS server, so to get code, do
cvs -d:ext:aheger@buffy:/usr/local/cvs co transposon
projects of interest are:
- transposon: contains the visualizeRates.pl
- perl
- mercator-perl
- gfftools
Fri Sep 15, 2006 10:11 AM
No rules
The uniprot grammar updates the rules starting from initial probabilities of 0.5
for chain continuation and termination. The result is that continuation is higly probable, while
termination is rare. When I set this flag to 0,
../code/dart/bin/xgram --train uniform_norules_trained.eg --grammar uniform_norules.eg -log 6 input.stk > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -10176.9 bits
EM iteration #3: log-likelihood = -10165.4 bits
EM iteration #4: log-likelihood = -10163.7 bits
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -10163.2 bits
Best log-likelihood: -10163.2
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
The likelihood is higher (no surprise here) and the fitted chains between uniform_norules_trained.eg and
uniform_trained.eg are identical as they should be.
The big run
I am using the average ks tree generated from the low-bias estimates:
python ~/t/trees2tree.py -v 0 --method=mean < ../../orthology_phylogeny/codeml_kaks_low_bias/ks.trees > input.tree
Getting and parsing the malis takes longer than the
actual rate estimation!
Note that this tree does not contain dper, dsec and dwil. Removing those sequences from
the multiple alignments.
Parsing grammar files
I thought I had a look at some automatic parser generators for python.
This is one, with a lisp style expression grammar to build on:
http://theory.stanford.edu/~amitp/Yapps/
The grammar:
parser Lisp:
ignore: '[ \t\n\r]+'
token NUM: '[0-9]+'
token ID: '[-+*/!@%^&=.a-zA-Z0-9_]+'
token STR: '"\\([^\\"]+\|\\\\.\\)*"'
rule expr: ID -> << ('id',ID) >>
| STR -> << ('str',eval(STR)) >>
| NUM -> << ('num',atoi(NUM)) >>
| list -> << list >>
rule list: "(" seq ")" -> << seq >>
rule seq: -> << [] >>
| expr seq -> << [expr] + seq >>
within << .. >> are the conversion operations to transform parts
of the grammar.
However, as Biopython is installed and has Martel, I will try to figure out
using Martel. Hmm, are "recursive" grammars actually supported by Martel? It
seems to be all LR and regular expression based? If you print the parser,
you get a single regular expression.
See
http://www.nedbatchelder.com/text/python-parsers.html
for an overview of python parsers.
Then I looked at GOLD. This has a nice GUI for writing and testing
the grammar. I had some trouble to make the Python engine run.
Had to add in
GrammarReader.py:68 the following line:
if (char == '') or (char == '\x00\x00'):
break
===> char = char.decode('utf-16')
result.append(char)
After this, the example in ./demo worked. However, on parsing LISP like
grammars, the python engine (pygold) fails. This is true for both the default LIPS grammar
and the XGram grammar that I defined. Both compile fine in GOLD and
all grammars in the dart data directory were parsed without errors.
This looks like an implementation problem in pygold. It has not been updated in
more than a year, so I abandoned it.
Instead I used Ply now. That works fine, but is a python specific solution.
Tue Sep 19, 2006 10:03 AM
Now that I have my parser, I can run it over the matrices to extract them. The
only problem is that fgu205 seems to be down.
CodonML mode for XGram
Looking at a ks grammar. How do you estimate ks? There is no branch-specific
estimation possible yet in xgram, thus only a pairwise method will work.
Ks is the number of neutral substitutions per site. Conceptually, what kind of
ks will xgram estimate?
- What is a site? Physical site or mutational opportunity?
How do you use XGram to estimate ks?
- supply a fixed tree of distance 1 for a pair of sequences.
- grammar is simply a match-gap grammar.
- estimate a rate matrix between codons
- count differences between codons
The alternative is to assume a fixed mutational process and estimate the branch
length of the tree?
CodonML uses three different models to estimate equilibrium codon frequencies:
-
- Model 0: each codon has 1/61 frequency.A
- Model 1: F3X4: codon frequency is given by the average observed nucleotide frequencies in each position
- Model 2: codon frequencies are given by the observed codon frequencies.
I will implement a script for estimating ka/ks with xgram
- calculate nucleotide frequencies at each position
- calculate codon equilibrium frequencies
- dump out a grammar for xgram
- run xgram
- post-process to find estimations
--
AndreasHeger - 11 Sep 2006