# Andreas Heger: XRate On Flies Orthologs


# Calculating amino acid substitution rates

## Mon Sep 11, 2006 11:46 PM

Toying around with XGram. I selected a multiple alignment from the full species set (cluster 5052_1) and extracted the sequences:

```
perl ~/gpipe/extract_fasta.pl <(grep 5052_1 ~/projects/flies/release1v5/orthology_malis/full_species.map) \
    < ~/projects/flies/release1v5/orthology_malis/step1.dir/cluster_5052.dir/cluster_5052.bl_mali > input.fasta
```

Then, translating everything to amino acids, removing gene ids and converting to Stockholm format:

```
python ~/t/sequence2sequence.py --method=translate < input.fasta |\
    perl -p -e "s/[|]\S+//" |\
    python ~/t/mali2mali.py -i fasta -o stockholm -v 0 > input.stk
```

It seems I have picked an alignment with plenty of polyQs. Might as well. Also, the sequences are highly similar, at more than 84% identity.
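For reference, an identity figure like this can be recomputed from the alignment with a few lines of Python (a minimal sketch; the function name and the convention of ignoring columns where either sequence is gapped are my own choices):

```python
def percent_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap ('-') are ignored.
    """
    assert len(a) == len(b)
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

print(percent_identity("AAAA", "AAAT"))  # 75.0
```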

I will add the tree later; first I will look into the grammar.

## Wed Sep 13, 2006 4:09 PM

Found the problems with gcc 4.1 and DART:

- extra qualifiers
- injected friend declarations
- non-template member functions in non-template classes inheriting from a template class.

The following page has all the info: http://womble.decadentplace.org.uk/c++/syntax-errors.html

## Thu Sep 14, 2006 11:40 AM

Running nullprot.eq on input.stk. I input the median ks tree; the output is a new grammar with changed rates and an unchanged tree.

I played around with the parameters a bit.

### XGram usage

#### Uniform rates

I then ran uniformprot.eq on input.stk with all rates set to 0.00001 in the chain.

```
../dart/bin/xgram --train uniform.trained --grammar uniform.eq -log 5 input.stk >& uniform.out
```

Nullprot iterations are:

```
EM iteration #1: log-likelihood = -15542.8 bits
EM iteration #2: log-likelihood = -9831.44 bits
EM iteration #3: log-likelihood = -9145.23 bits
EM iteration #4: log-likelihood = -9070.67 bits
EM iteration #5: log-likelihood = -9065.26 bits
```

while uniform iterations are:

```
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9056.21 bits
EM iteration #3: log-likelihood = -9044.68 bits
EM iteration #4: log-likelihood = -9042.95 bits
```

So, same data, but slightly different likelihoods.

Using visualizeRates.pl to look at the matrices (Bubble Plots). Note that visualizeRates.pl checks the suffix of the grammar file - it has to be `eg`!

The results look virtually identical, as they should. I am quite amazed at how few iterations the algorithm needs.

#### More iterations and pseudocounts

The improvement threshold seems to apply to the significant decimals of the log-likelihood rather than to its absolute change. Making the threshold very small, I get the following result:

```
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9056.21 bits
EM iteration #3: log-likelihood = -9044.68 bits
EM iteration #4: log-likelihood = -9042.95 bits
EM iteration #5: log-likelihood = -9042.49 bits
EM iteration #6: log-likelihood = -9042.28 bits
EM iteration #7: log-likelihood = -9042.19 bits
Warning: attempt to find equilibrium distribution directly from rate matrix yielded
zero or negative probabilities. Setting initial distribution from observed counts
instead. (This may have been caused by a sparse matrix; consider using pseudocounts
to avoid problems like this)
EM iteration #8: log-likelihood = -9046.28 bits
Warning: log-likelihood dropped from -9042.19 to -9046.28 bits during EM
```
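The relative nature of the stopping criterion can be illustrated with a toy convergence check (a sketch only; the exact criterion xgram applies internally for `-mi` is an assumption here):

```python
def em_converged(ll_old, ll_new, min_inc=1e-6):
    """Convergence test on the relative log-likelihood improvement.

    Illustrates a relative threshold, as opposed to an absolute one:
    the improvement is scaled by the magnitude of the log-likelihood.
    How xgram's -mi option maps onto this is an assumption.
    """
    return (ll_new - ll_old) / abs(ll_old) < min_inc

# An absolute improvement of 0.09 bits around -9042 is a relative
# improvement of about 1e-5:
print(em_converged(-9042.28, -9042.19, min_inc=1e-6))  # False: keep iterating
print(em_converged(-9042.28, -9042.19, min_inc=1e-4))  # True: stop
```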

I added one pseudocount to the initial distribution:

```
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001 -pi 1 >/dev/null
```

but I get the same result. Adding 10 pseudocounts does not change anything either.

But if I also add one pseudocount for the mutations:

```
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001 -pi 1 -pm 1 > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9405.24 bits
EM iteration #3: log-likelihood = -9418.79 bits
Warning: log-likelihood dropped from -9405.24 to -9418.79 bits during EM
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -9419.32 bits
Best log-likelihood: -9405.24
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
```

The pseudocounts show up quite clearly in the resultant fitted rate matrix.

Trying to use fractional pseudocounts:

```
../code/dart/bin/xgram --train uniform_manyiterations.eq --grammar uniform.eq -log 6 input.stk -mi 0.000001 -pi 0.1 -pm 0.1 > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -9086.61 bits
EM iteration #3: log-likelihood = -9077.04 bits
EM iteration #4: log-likelihood = -9075.59 bits
EM iteration #5: log-likelihood = -9075.18 bits
EM iteration #6: log-likelihood = -9075.02 bits
EM iteration #7: log-likelihood = -9074.98 bits
EM iteration #8: log-likelihood = -9074.96 bits
EM iteration #9: log-likelihood = -9074.94 bits
EM iteration #10: log-likelihood = -9074.94 bits
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -9074.94 bits
Best log-likelihood: -9074.94
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
```

Again, many transitions that were negligible before are now populated. In particular, transitions from W and from C are highly populated. The reason: there are probably hardly any W and C residues present in this multiple alignment.

Counting (I know this is clumsy):

```
for x in A C D E F G H I K L M N P Q R S T V W Y; do
    printf "%s\t%i\n" ${x} `grep "^dpse_vs" input.stk | perl -p -e "s/^\S+\s+//; s/[^${x}]//g" | wc -c`;
done
A  112
C  13
D  43
E  53
F  34
G  58
H  30
I  43
K  51
L  56
M  21
N  58
P  81
Q  153
R  42
S  95
T  69
V  47
W  12
Y  35
```

and indeed, that is the case. Pseudocounts are dangerous, as they will affect the outcome.
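The clumsy counting loop above is simpler in Python (a sketch; it assumes Stockholm body lines of the form `<name> <aligned sequence>`, filtered by a row-name prefix such as the `dpse_vs` used above):

```python
from collections import Counter

def count_residues(lines, prefix):
    """Count amino acid residues in alignment rows whose name starts with prefix.

    Expects Stockholm-style body lines: '<name> <aligned sequence>'.
    Gaps and any non-residue characters are dropped.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith(prefix):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        counts.update(c for c in fields[1].upper()
                      if c in "ACDEFGHIKLMNPQRSTVWY")
    return counts

counts = count_residues(["dpse_vs1  MQ-QW", "dmel_vs1  MQQQW"], "dpse_vs")
print(counts["Q"])  # 2: only the dpse row is counted, the gap is ignored
```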

#### Tree branch lengths

Set every branch length to 1 (the tree is thus not clock-like).

```
../code/dart/bin/xgram --train uniform_uniform.eg --grammar uniform.eg -log 6 input_uniform_branches.stk > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -16225.4 bits
EM iteration #2: log-likelihood = -9374.27 bits
EM iteration #3: log-likelihood = -9363.39 bits
EM iteration #4: log-likelihood = -9361.05 bits
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -9360.57 bits
Best log-likelihood: -9360.57
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
```

The rates are smaller, but if I scale the plot by a factor of 2, the matrices are again indistinguishable. Again, the absolute scale of a matrix is non-informative.

I would have thought that different relative branch lengths would result in a different matrix.
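The scale invariance noted above can be checked directly: multiplying all rates by a constant c while dividing the branch length by c leaves the transition probabilities P(t) = exp(Qt) unchanged. A two-state chain keeps the algebra explicit (pure-Python sketch using the closed-form matrix exponential for the 2x2 case):

```python
import math

def p01(a, b, t):
    """P(0 -> 1) after time t for a two-state chain with rates a (0->1), b (1->0).

    Closed form of exp(Qt) for Q = [[-a, a], [b, -b]].
    """
    return a / (a + b) * (1.0 - math.exp(-(a + b) * t))

a, b, t, c = 0.3, 0.7, 2.0, 5.0
# Scaling the rates by c and the branch length by 1/c changes nothing:
print(abs(p01(a, b, t) - p01(c * a, c * b, t / c)) < 1e-12)  # True
```

Only the product of rate scale and branch length is identifiable, which is why the absolute scale of the fitted matrix carries no information.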

### Checking out code

Buffy is the CVS server; to check out code, do:

```
cvs -d:ext:aheger@buffy:/usr/local/cvs co transposon
```

Projects of interest are:

- transposon: contains visualizeRates.pl
- perl
- mercator-perl
- gfftools

## Fri Sep 15, 2006 10:11 AM

### No rules

The uniprot grammar updates the rules starting from initial probabilities of 0.5 for chain continuation and termination. The result is that continuation is highly probable, while termination is rare. When I set this flag to 0:

```
../code/dart/bin/xgram --train uniform_norules_trained.eg --grammar uniform_norules.eg -log 6 input.stk > /dev/null
Warning -- used random number generator during initialisation
EM iteration #1: log-likelihood = -17096 bits
EM iteration #2: log-likelihood = -10176.9 bits
EM iteration #3: log-likelihood = -10165.4 bits
EM iteration #4: log-likelihood = -10163.7 bits
Failed EM improvement threshold for the 1th time; stopping
Checking post-iteration log-likelihood
Post-iteration log-likelihood = -10163.2 bits
Best log-likelihood: -10163.2
Processing alignment Alignment1 (1 of 1): 1131 columns
Annotating using grammar 'nullprot'
```

The likelihood is higher (no surprise here), and the fitted chains in uniform_norules_trained.eg and uniform_trained.eg are identical, as they should be.

### The big run

I am using the average ks tree generated from the low-bias estimates:

```
python ~/t/trees2tree.py -v 0 --method=mean < ../../orthology_phylogeny/codeml_kaks_low_bias/ks.trees > input.tree
```

Getting and parsing the malis takes longer than the actual rate estimation!

Note that this tree does not contain dper, dsec and dwil. Removing those sequences from the multiple alignments.

### Parsing grammar files

I thought I would have a look at some automatic parser generators for Python.

This is one, with a Lisp-style expression grammar to build on:

http://theory.stanford.edu/~amitp/Yapps/

The grammar:

```
parser Lisp:
    ignore:    '[ \t\n\r]+'
    token NUM: '[0-9]+'
    token ID:  '[-+*/!@%^&=.a-zA-Z0-9_]+'
    token STR: '"\\([^\\"]+\|\\\\.\\)*"'

    rule expr: ID   -> << ('id', ID) >>
             | STR  -> << ('str', eval(STR)) >>
             | NUM  -> << ('num', atoi(NUM)) >>
             | list -> << list >>
    rule list: "(" seq ")" -> << seq >>
    rule seq:              -> << [] >>
             | expr seq    -> << [expr] + seq >>
```

Within `<< .. >>` are the conversion operations that transform parts of the parse.

However, as Biopython is installed and includes Martel, I will try to figure out how to use Martel. Hmm, are "recursive" grammars actually supported by Martel? It seems to be all LR and regular-expression based; if you print the parser, you get a single regular expression.

See http://www.nedbatchelder.com/text/python-parsers.html for an overview of python parsers.

Then I looked at GOLD. It has a nice GUI for writing and testing the grammar, but I had some trouble getting the Python engine to run.

Had to add the marked line in Grammar Reader.py:68:

```python
if (char == '') or (char == '\x00\x00'): break
char = char.decode('utf-16')   # <== added line
result.append(char)
```

After this, the example in ./demo worked. However, the Python engine (pygold) fails on parsing LISP-like grammars. This is true for both the default LISP grammar and the XGram grammar that I defined. Both compile fine in GOLD, and all grammars in the dart data directory were parsed without errors.

This looks like an implementation problem in pygold. It has not been updated in more than a year, so I abandoned it.

Instead, I am using Ply now. That works fine, but it is a Python-specific solution.
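For reference, the recursion that Martel could not express is easy to write by hand: a minimal recursive-descent parser for S-expressions (pure Python, no Ply; the tokenizer regex and the nested-list representation are my own choices, and the example grammar fragment is hypothetical):

```python
import re

# Tokens: parentheses, double-quoted strings, or runs of non-space atoms.
TOKEN = re.compile(r'[()]|"[^"]*"|[^\s()]+')

def parse_sexpr(text):
    """Parse a single S-expression into nested Python lists of atom strings."""
    tokens = TOKEN.findall(text)

    def parse(pos):
        if tokens[pos] == '(':
            result, pos = [], pos + 1
            while tokens[pos] != ')':
                node, pos = parse(pos)
                result.append(node)
            return result, pos + 1      # skip the closing ')'
        return tokens[pos], pos + 1     # bare atom

    tree, _ = parse(0)
    return tree

print(parse_sexpr("(alphabet (name Protein))"))
# ['alphabet', ['name', 'Protein']]
```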

## Tue Sep 19, 2006 10:03 AM

Now that I have my parser, I can run it over the matrices to extract them. The only problem is that fgu205 seems to be down.

### Codon ML mode for XGram

Looking at a ks grammar. How do you estimate ks? There is no branch-specific estimation possible yet in xgram, thus only a pairwise method will work.

Ks is the number of neutral substitutions per site. Conceptually, what kind of ks will xgram estimate?

- What is a site? Physical site or mutational opportunity?

How do you use XGram to estimate ks?

- supply a fixed tree of distance 1 for a pair of sequences.
- grammar is simply a match-gap grammar.
- estimate a rate matrix between codons
- count differences between codons
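The "count differences between codons" step can be sketched as a simple nucleotide-difference count per codon pair (assigning differences to synonymous versus non-synonymous sites, as codeml does, additionally needs the genetic code and is not shown):

```python
def codon_differences(seq_a, seq_b):
    """Count codon pairs by their number of nucleotide differences (0-3).

    Both sequences must be aligned, in frame, and of equal length.
    """
    assert len(seq_a) == len(seq_b) and len(seq_a) % 3 == 0
    counts = [0, 0, 0, 0]
    for i in range(0, len(seq_a), 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        counts[sum(1 for x, y in zip(ca, cb) if x != y)] += 1
    return counts

print(codon_differences("ATGAAA", "ATGAAG"))  # [1, 1, 0, 0]
```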

The alternative would be to assume a fixed mutational process and estimate the branch lengths of the tree.

Codon ML uses three different models to estimate equilibrium codon frequencies:

- Model 0: each codon has frequency 1/61.
- Model 1 (F3X4): codon frequencies are given by the average observed nucleotide frequencies at each codon position.
- Model 2: codon frequencies are given by the observed codon frequencies.

I will implement a script for estimating ka/ks with xgram:

- calculate nucleotide frequencies at each position
- calculate codon equilibrium frequencies
- dump out a grammar for xgram
- run xgram
- post-process to extract the estimates
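The second step of this plan, computing codon equilibrium frequencies under F3X4, can be sketched as follows: the frequency of a codon is proportional to the product of its per-position nucleotide frequencies, renormalised over the 61 sense codons (stop codons TAA, TAG, TGA excluded). A minimal sketch under those assumptions:

```python
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}

def f3x4(pos_freqs):
    """F3X4 equilibrium codon frequencies.

    pos_freqs: three dicts mapping 'ACGT' -> frequency, one per codon position.
    """
    freqs = {}
    for codon in ("".join(c) for c in product("ACGT", repeat=3)):
        if codon in STOPS:
            continue
        freqs[codon] = (pos_freqs[0][codon[0]]
                        * pos_freqs[1][codon[1]]
                        * pos_freqs[2][codon[2]])
    total = sum(freqs.values())
    return {codon: f / total for codon, f in freqs.items()}

uniform = {nuc: 0.25 for nuc in "ACGT"}
freqs = f3x4([uniform, uniform, uniform])
print(len(freqs))  # 61 sense codons; each gets 1/61 under uniform frequencies
```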

-- Andreas Heger - 11 Sep 2006