# Cryptanalysis Exercises

More-than-slightly contrived exercises in cryptanalysis with bioinformatics equivalents.

Aimed at an introductory-level class in probability & information theory for compbio.

## Information content of a sequence

• Blog.NumbersStations -- artificial voices reading numbers over shortwave radio, containing coded spy messages from Cuba -- does it get any better?

## Compositional analysis

• Single-character, di-character analysis
• Embed a signal in noise. Detect by compositional analysis
• More cunning embedding: try and match the letter frequencies. Countermeasure: use adjacency freqs
• Bring Bayesian scoring schemes into it somehow...

## MCMC

• This is a generative model-based approach, following the classic 1949 paper of Claude Shannon: Wikipedia:Communication_Theory_of_Secrecy_Systems
• General idea, illustrated for a Wikipedia:transposition_cipher:
• Train an order-N HMM on English (see "compositional analysis" above)
• Create an MCMC sampler over cipherkeys
• Let K=cipherkey, C=ciphertext, T=plaintext=T(K,C)
• You are then sampling from P(T,K)=P(C)*P(T(K,C)) where P(C) = 1/26! which disappears from Hastings ratio
• Options for guessing the key:
• Simplest is just to do full MCMC and take the mode
• Can also do a simulated annealing version (gradually lower the temperature)
• Lots of cheap & cheerful cryptanalysis can be done this way
• No direct bioinformatics analogy (I think), but there ARE lots of MCMC algorithms in sequence analysis, so...

## Pairwise sequence comparison

• Pairwise homology detection: try to match plaintext with ciphertext encrypted using a running key cipher with a tabula recta
• Both key and plaintext have a known frequency distribution (English), effectively yielding a substitution matrix
• Variation: try to match two ciphertexts encrypted from the same plaintext with different running keys
• Empirical demonstration: separation of signal and noise score distributions
• In some cryptanalytic scenarios, evidence of homology may be all you need (can then apply rubber hose cryptanalysis)
• Another pairwise homology example (better analogy, but not cryptanalysis per se - more like signals/traffic analysis): try to find the pair (or N/2 pairs) of matching transmissions in N noisily-intercepted transmissions
• the source of randomness here is imperfect eavesdropping by Eve (e.g. line noise), rather than some cipher that can be treated as pseudo-random
• for example, could have a "substitution matrix for Morse code": dot vs dash errors are like nucleotide substitutions
• can introduce indels in similar ways
• can make noise dependent on "length" of wire and thereby attempt to reconstruct topology of enemy network (c.f. phylogeny)... this may be stretching things a bit far, though...
• Local alignment (ungapped): padded version of pairwise homology detection

## Bayesian decision theory

Would be nice to have an example where you use posterior probabilities to maximize your expected reward (i.e. decision theory, specifically expected utility).

Perhaps an example involving industrial espionage?

---

## Notes

### Resources

Historical codes & ciphers; resources

• Admiralty Secret Cypher S.P. 02105. British Cypher No. 5. (1943) "Recyphering is obligatory when using this cypher."
• It's interesting how they abandoned the 5-letter alphabetic codewords for the 4-digit numerics, e.g. 9968 instead of ZUROF (meaning: "agitators"). Could this be because the consonant-vowel-consonant-vowel-consonant pattern of the 5-letter words was a giveaway that some sort of code was in use?
• Historical codes (in the proper sense of the word, i.e. not ciphers)

Journals

Some crypto blogs

#### Pages on Biowiki

• Teaching.InformationBiology for general links between biology & information theory.
• Teaching.RelativeEntropy for some waffle about D(p||q) and data compression
• Teaching.BrokenTelephoneTree
• Teaching.EditDistance

### Required message lengths

Some perl code to figure out how long X and Y need to be before we can reliably deduce that they are related (plaintext & ciphertext) using a running key cipher with a tabula recta & assuming English plaintext & key, vs a null hypothesis that X is English and Y is perfectly random.

The formulae for the mean & SD of the score are $\mu_q = \langle S_{ij} \rangle_q$ and $\sigma_q = \langle S^2_{ij} \rangle_q - \mu^2$ where $S_{ij} = \log_2 \frac{q_{ij}}{p_{ij}}$ is the log-odds substitution matrix. (Note that the mean score equals the Kullback-Leibler divergence (relative entropy): $\mu_q = D(q||p)$. Thus the expected score for related symbols is positive.)

On the other hand, if X and Y are in fact unrelated, then their symbols are drawn from p rather than q, so the mean score per symbol is $\mu_p$ and the sd of this is $\sigma_p$. (Note that $\mu_p = -D(p||q)$ and so the average score for unrelated symbols is negative.)

The mean difference in score between a "hit" (of length L) and a random sequence-pair (also of length L) is $(\mu_q-\mu_p) L$ and the s.d. of this difference is $\sqrt{L(\sigma_p^2+\sigma_q^2)}$. So a message length of $L \sim 4 (\sigma_p^2+\sigma_q^2) / (\mu_q-\mu_p)^2$ is required for clear signal-noise separation.

Now let's compute $p_{ij}$ and $q_{ij}$. Let $i-j$ be cyclic subtraction (modulo the alphabet size $A$) and let $b_i$ be the background distribution. Then $q_{ij} = b_i b_{i-j}$ and $p_{ij} = b_i/A$.

curl http://starbase.trincoll.edu/~crypto/resources/LetFreq.html | grep TR | perl -pe 's/<\S+>//g' >lfreqs

cat lfreqs | perl -e 'while(<>){@f=split;if(@f==4){$t+=$p{$f[0]}=$f[1]}}
@a=sort keys%p;for$a(@a){$p{$a}/=$t}$re=0;for$i(0..@a-1){for$j(0..@a-1){$d=$j-$i;$d+=@a if$d<0;($ai,$aj,$ad)=map($a[$_],$i,$j,$d);
$q=$p{$ai}*$p{$ad};$p=$p{$ai}/@a;
$qsc=log($q/$p)/log(2);$q1+=$q*$qsc;$q2+=$q*$qsc*$qsc;
$p1+=$p*$qsc;$p2+=$p*$qsc*$qsc}}$qvar=$q2-$q1*$q1;$pvar=$p2-$p1*$p1;$len=4*($pvar+$qvar)/(($q1-$p1)**2);
print"$len\n"  Result: about 10 characters are required. This next one treats the case when X and Y are both ciphertexts of the same plaintext, encrypted using different running keys. Plaintext and both keys are English. Null hypothesis: X and Y both random. So $q_{ij} = \sum_k b_k b_{i-k} b_{j-k}$ and $p_{ij} = 1/A^2$.  cat lfreqs | perl -e 'while(<>){@f=split;if(@f==4){$t+=$p{$f[0]}=$f[1]}} @a=sort keys%p;for$a(@a){$p{$a}/=$t} for$i(0..@a-1){for$j(0..@a-1){ {$q=0;for$k(0..@a-1){$di=$i-$k;$di+=@a if$di<0;$dj=$j-$k;$dj+=@a if$dj<0; ($ai,$aj,$ak,$adi,$adj)=map($a[$_],$i,$j,$k,$di,$dj);$q+=$p{$ak}*$p{$adi}*$p{$adj}}
$p=1/(@a*@a)}$qsc=log($q/$p)/log(2);$q1+=$q*$qsc;$q2+=$q*$qsc*$qsc;$p1+=$p*$qsc;$p2+=$p*$qsc*$qsc}}
$qvar=$q2-$q1*$q1;$pvar=$p2-$p1*$p1;
$len=4*($pvar+$qvar)/(($q1-$p1)**2); print"$len\n"


Result: about 66 chars are required. Note that the only difference between these code snippets is the calculation of p and q. The code to compute L from these is identical in both cases.

---

-- Ian Holmes - 14 Dec 2006