# Cryptanalysis Exercises


More-than-slightly contrived exercises in cryptanalysis with bioinformatics equivalents.

Aimed at an introductory-level class in probability & information theory for compbio.

## Information content of a sequence

- Sliding-window entropy scan: find a substitution-enciphered message that has been steganographically padded with random letters
- cf. Teaching.BioE131 Fall 2006 midterm, Q18 (p8)

- Blog.NumbersStations -- artificial voices reading numbers over shortwave radio, containing coded spy messages from Cuba -- does it get any better?
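A minimal sketch of the sliding-window scan, assuming a 26-letter alphabet; the window size, padding lengths and embedded "message" below are arbitrary illustrative choices, not part of the exercise spec:

```python
import math
import random
import string

def shannon_entropy(s):
    """Shannon entropy (bits per character) of the letter distribution in s."""
    counts = {}
    for c in s:
        counts[c] = counts.get(c, 0) + 1
    n = len(s)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def entropy_scan(text, window):
    """Entropy of every length-`window` substring; low values flag a
    non-random (e.g. substitution-enciphered English) region."""
    return [shannon_entropy(text[i:i + window])
            for i in range(len(text) - window + 1)]

# Demo: hide a low-entropy "message" inside uniform random padding.
rng = random.Random(0)
pad = lambda n: "".join(rng.choice(string.ascii_lowercase) for _ in range(n))
message = "thecatsatonthematthecatsatonthemat"  # few distinct letters
text = pad(100) + message + pad(100)

scores = entropy_scan(text, window=30)
start = min(range(len(scores)), key=scores.__getitem__)  # lowest-entropy window
```

With a real substitution-enciphered message the dip is subtler than in this toy, since the cipher preserves English's skewed letter frequencies (roughly 4.1 bits/char vs 4.7 for uniform random), so window size matters.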

## Data compression

See Data Compression Exercise.

## Broken telephone tree

See Teaching.BrokenTelephoneTree.

## Compositional analysis

- Single-character, di-character analysis
- Embed a signal in noise. Detect by compositional analysis
- More cunning embedding: try to match the letter frequencies. Countermeasure: use adjacency frequencies
- Bring Bayesian scoring schemes into it somehow...

## MCMC

- This is a generative model-based approach, following the classic 1949 paper of Claude Shannon: Wikipedia:Communication_Theory_of_Secrecy_Systems
- General idea, illustrated for a Wikipedia:transposition_cipher:
- Train an order-N HMM on English (see "compositional analysis" above)
- Create an MCMC sampler over cipherkeys
- Let K=cipherkey, C=ciphertext, T=plaintext=T(K,C)
- You are then sampling keys from P(K|C) ∝ P(K) P(T(K,C)), where the uniform key prior (e.g. P(K) = 1/26! for a simple substitution cipher) cancels out of the Hastings ratio

- Options for guessing the key:
- Simplest is just to do full MCMC and take the mode
- Can also do a simulated annealing version (gradually lower the temperature)

- Lots of cheap & cheerful cryptanalysis can be done this way
- e.g. it can be used to crack the Wikipedia:Playfair_cipher

- No direct bioinformatics analogy (I think), but there ARE lots of MCMC algorithms in sequence analysis, so...
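A minimal sketch of the scheme above, illustrated for a simple substitution cipher rather than a transposition; the training text, swap proposal and iteration count are all arbitrary choices, and the tiny training corpus means the mode found here won't reliably be the true key:

```python
import math
import random
import string

ALPHA = string.ascii_lowercase

def bigram_logprobs(text, pseudocount=1.0):
    """Train an order-1 (bigram) model: log P(next letter | current letter)."""
    counts = {a: {b: pseudocount for b in ALPHA} for a in ALPHA}
    clean = [c for c in text.lower() if c in counts]
    for x, y in zip(clean, clean[1:]):
        counts[x][y] += 1
    return {a: {b: math.log(counts[a][b] / sum(counts[a].values()))
                for b in ALPHA}
            for a in ALPHA}

def score(key, ciphertext, lp):
    """Log-likelihood of the decryption T(K,C) under the bigram model."""
    table = dict(zip(ALPHA, key))
    plain = [table[c] for c in ciphertext if c in table]
    return sum(lp[x][y] for x, y in zip(plain, plain[1:]))

def mcmc_crack(ciphertext, lp, iters=5000, seed=0):
    """Metropolis sampling over substitution keys, tracking the mode."""
    rng = random.Random(seed)
    key = list(ALPHA)                    # start from the identity key
    cur = best = score(key, ciphertext, lp)
    best_key = key[:]
    for _ in range(iters):
        i, j = rng.randrange(26), rng.randrange(26)
        key[i], key[j] = key[j], key[i]  # propose: swap two key letters
        new = score(key, ciphertext, lp)
        # Symmetric proposal + uniform key prior, so the Hastings ratio
        # reduces to the likelihood ratio.
        if new >= cur or rng.random() < math.exp(new - cur):
            cur = new
            if cur > best:
                best, best_key = cur, key[:]
        else:
            key[i], key[j] = key[j], key[i]  # reject: undo the swap
    return "".join(best_key)

# Demo: train on a scrap of English, encrypt with a Caesar shift, sample.
TRAIN = ("the quick brown fox jumps over the lazy dog " * 3
         + "information theory supplies the score for every proposed key")
lp = bigram_logprobs(TRAIN)
enc = {a: ALPHA[(i + 3) % 26] for i, a in enumerate(ALPHA)}
ciphertext = "".join(enc.get(c, c) for c in
                     "the quick brown fox jumps over the lazy dog")
guess = mcmc_crack(ciphertext, lp)
```

The simulated-annealing variant is the same loop with the acceptance ratio tempered as `math.exp((new - cur) / temperature)` and the temperature gradually lowered.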

## Pairwise sequence comparison

- Pairwise homology detection: try to match plaintext with ciphertext encrypted using a running key cipher with a tabula recta
- Both key and plaintext have a known frequency distribution (English), effectively yielding a substitution matrix
- Variation: try to match two ciphertexts encrypted from the same plaintext with different running keys
- Empirical demonstration: separation of signal and noise score distributions
- In some cryptanalytic scenarios, evidence of homology may be all you need (can then apply rubber hose cryptanalysis)
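A sketch of the related-vs-null scoring the exercise implies; the sample text used to estimate letter frequencies, and the deliberately common-letter running key, are artificial choices made so this toy gives a clearly positive score:

```python
import math
import string

ALPHA = string.ascii_lowercase
N = len(ALPHA)

def letter_freqs(text):
    """Background letter distribution p, with add-one smoothing."""
    counts = {a: 1 for a in ALPHA}
    for c in text.lower():
        if c in counts:
            counts[c] += 1
    total = sum(counts.values())
    return {a: counts[a] / total for a in ALPHA}

def running_key_encrypt(plaintext, key):
    """Tabula recta: ciphertext[i] = plaintext[i] + key[i] (mod 26)."""
    return "".join(ALPHA[(ALPHA.index(t) + ALPHA.index(k)) % N]
                   for t, k in zip(plaintext, key))

def logodds_score(x, y, p):
    """Score candidate plaintext x against ciphertext y.
    Related model: q(x,y) = p(x) * p(y - x mod 26)  (the implied key letter
    is English); null model: p0(x,y) = p(x) / 26  (y uniformly random).
    Per-symbol log-odds score: log2(26 * p(y - x mod 26))."""
    s = 0.0
    for a, b in zip(x, y):
        k = ALPHA[(ALPHA.index(b) - ALPHA.index(a)) % N]
        s += math.log2(N * p[k])
    return s

# Demo with a key built from very common letters, so the signal is strong.
SAMPLE = ("it is interesting to note that the letters e and t are "
          "the most frequent letters in english text")
p = letter_freqs(SAMPLE)
plaintext = "meetmeatthebridgeatmidnight"
key = ("et" * len(plaintext))[:len(plaintext)]
ciphertext = running_key_encrypt(plaintext, key)
related_score = logodds_score(plaintext, ciphertext, p)
```

The implied-key-letter trick is what turns the known frequency distributions into an effective substitution matrix; the "Required message lengths" notes below work out how long the sequences must be for this score to separate signal from noise.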

- Another pairwise homology example (a better analogy, though not cryptanalysis *per se* -- more like signals/traffic analysis): try to find the pair (or N/2 pairs) of matching transmissions among N noisily-intercepted transmissions
- The source of randomness here is imperfect eavesdropping by Eve (e.g. line noise), rather than some cipher that can be treated as pseudo-random
- For example, one could have a "substitution matrix for Morse code": dot *vs* dash errors are like nucleotide substitutions, and indels can be introduced in similar ways

- Could make the noise dependent on the "length" of wire, and thereby attempt to reconstruct the topology of the enemy network (cf. phylogeny)... this may be stretching things a bit far, though...

- Local alignment (ungapped): padded version of pairwise homology detection
- Opportunity for empirical demonstration of extreme value distribution (always useful)
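One way to set up that empirical demonstration; the +1/−1 scoring, 4-letter alphabet and sequence lengths are arbitrary choices:

```python
import random

def best_ungapped_score(x, y, match=1, mismatch=-1):
    """Best score over all ungapped segment pairs, scanning each diagonal
    with a local (reset-at-zero) running sum."""
    best = 0
    for offset in range(-(len(x) - 1), len(y)):
        run = 0
        for i in range(len(x)):
            j = i + offset
            if 0 <= j < len(y):
                run = max(0, run + (match if x[i] == y[j] else mismatch))
                best = max(best, run)
    return best

# Maximum local scores between unrelated random sequences: a histogram of
# `maxima` should look like an extreme value (Gumbel) distribution.
rng = random.Random(1)
randseq = lambda n: "".join(rng.choice("acgt") for _ in range(n))
maxima = [best_ungapped_score(randseq(50), randseq(50)) for _ in range(200)]
```

Because each trial reports a maximum over many roughly independent segment scores, the right-skewed Gumbel shape falls out of the simulation directly.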

- Another scenario for this kind of algorithm: trying to identify repeated patterns in text message traffic (SMS)
- See for example the 2005 Cronulla race riots in Sydney, Australia, and the role of SMS messages and emails in organizing the violence

## Motifs

- Motif-finding: cracking the Vigenère/autokey cipher by Kasiski examination
- A bit of a misfit, this one: Kasiski examination involves periodicity, which doesn't fit the biological motif-finding analogy, really.
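Misfit or not, the Kasiski step itself is easy to sketch; the key ("lemon"), the repeated phrase, its positions and the filler text below are contrived so that the repeat distances share the key length as a factor:

```python
import string
from functools import reduce
from math import gcd

ALPHA = string.ascii_lowercase

def vigenere_encrypt(plaintext, key):
    """Repeating-key Vigenère over lowercase letters."""
    return "".join(
        ALPHA[(ALPHA.index(c) + ALPHA.index(key[i % len(key)])) % 26]
        for i, c in enumerate(plaintext))

def repeat_distances(ciphertext, n=3):
    """Kasiski examination: distances between repeated length-n substrings."""
    seen, dists = {}, []
    for i in range(len(ciphertext) - n + 1):
        gram = ciphertext[i:i + n]
        dists.extend(i - j for j in seen.get(gram, []))
        seen.setdefault(gram, []).append(i)
    return dists

def estimate_key_length(ciphertext):
    """gcd of repeat distances; with enough genuine repeats this tends to
    reveal the key length (accidental repeats can shrink it)."""
    d = repeat_distances(ciphertext)
    return reduce(gcd, d) if d else None

# "attack" appears at positions 0, 15 and 40 -- all congruent mod the key
# length (5) -- so its encryptions repeat at distances 15, 25 and 40.
plaintext = "attack" + "bcdfghjkl" + "attack" + "mnopqrsuvwxyzmrpnsq" + "attack"
ciphertext = vigenere_encrypt(plaintext, "lemon")
dists = repeat_distances(ciphertext)
```

Repeated plaintext n-grams encrypt identically only when their positions are congruent modulo the key length, which is why the distances betray the period.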

- A (possibly more realistic) motif-finding example: use profiling (Gibbs sampling, MEME, hmmer, etc.) to recognize opening keyphrases
- e.g. stylized openings, formal titles: "I have the ((very) great) honor to inform Your (Gracious) ((Most) Exalted) Excellency that..."

## Bayesian decision theory

Would be nice to have an example where you use posterior probabilities to maximize your expected reward (i.e. decision theory, specifically expected utility).

Perhaps an example involving industrial espionage?
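Pending a proper scenario, the mechanics might look like this; the hypotheses, posterior and payoff numbers are invented placeholders:

```python
# All numbers below are invented placeholders for illustration.
posterior = {"genuine": 0.7, "decoy": 0.3}   # e.g. from a Bayesian cryptanalysis step

# Reward for each (action, hypothesis) pair:
utility = {
    "act":    {"genuine": 100.0, "decoy": -500.0},
    "ignore": {"genuine": -10.0, "decoy":    0.0},
}

def expected_utility(action):
    return sum(posterior[h] * utility[action][h] for h in posterior)

# Decision rule: pick the action maximizing expected utility.
best_action = max(utility, key=expected_utility)
```

The point of the exercise: "genuine" is the posterior mode, yet the asymmetric penalty for acting on a decoy makes ignoring the message the higher-expected-utility choice.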

---

## Notes

### Resources

Historical codes & ciphers: resources

- Admiralty Secret Cypher S.P. 02105. British Cypher No. 5. (1943) "Recyphering is obligatory when using this cypher."
- It's interesting how they abandoned the 5-letter alphabetic codewords for the 4-digit numerics, e.g. 9968 instead of ZUROF (meaning: "agitators"). Could this be because the consonant-vowel-consonant-vowel-consonant pattern of the 5-letter words was a giveaway that some sort of code was in use?

- "The Literature of Intelligence" site is extremely comprehensive
- The SIGINT & Cryptography page is fantastic & has links to lots of resources

- David Kahn's homepage & books:

- Historical codes (in the proper sense of the word, i.e. not ciphers)

Journals

- Intelligence & National Security
- the sample issue contains some tantalising glimpses into a nasty world, like the Four Square Laundry story

- International Journal of Intelligence and Counter Intelligence
- have to go through some rigmarole to get the sample issue

- C0de & C1pher -- Certicom's educational newsletter

Some crypto blogs

- http://www.schneier.com/blog/ -- Bruce Schneier's blog
- http://www.lightbluetouchpaper.org/ -- Ross Anderson's blog
- https://www.financialcryptography.com/

#### Pages on Biowiki

- Teaching.InformationBiology for general links between biology & information theory.
- Teaching.RelativeEntropy for some waffle about D(p||q) and data compression
- Teaching.BrokenTelephoneTree
- Teaching.EditDistance

### Required message lengths

Some Perl code to figure out how long X and Y need to be before we can reliably deduce that they are related (as plaintext & ciphertext under a running key cipher with a tabula recta, assuming English plaintext & key), vs a null hypothesis that X is English and Y is perfectly random.

The formulae for the mean & SD of the per-symbol score are $\mu_q = \sum_{x,y} q(x,y)\,S(x,y)$ and $\sigma_q = \sqrt{\sum_{x,y} q(x,y)\,S(x,y)^2 - \mu_q^2}$, where $S(x,y) = \log_2 \frac{q(x,y)}{p_0(x,y)}$ is the log-odds substitution matrix, $q(x,y)$ is the joint distribution of related symbol pairs and $p_0(x,y)$ the null distribution for unrelated pairs. (Note that the mean score equals the Kullback-Leibler divergence (relative entropy): $\mu_q = D(q \| p_0) \ge 0$. Thus the expected score for related symbols is positive.)

On the other hand, if X and Y are in fact unrelated, then their symbols are drawn from $p_0$ rather than $q$, so the mean score per symbol is $\mu_p = \sum_{x,y} p_0(x,y)\,S(x,y)$ and the sd of this is $\sigma_p = \sqrt{\sum_{x,y} p_0(x,y)\,S(x,y)^2 - \mu_p^2}$. (Note that $\mu_p = -D(p_0 \| q) \le 0$, and so the average score for unrelated symbols is negative.)

The mean difference in score between a "hit" (of length L) and a random sequence-pair (also of length L) is $L(\mu_q - \mu_p)$ and the s.d. of this difference is $\sqrt{L(\sigma_q^2 + \sigma_p^2)}$. So a message length of $L > 4(\sigma_p^2 + \sigma_q^2)/(\mu_q - \mu_p)^2$ (two standard deviations of separation) is required for clear signal-noise separation.

Now let's compute $q$ and $p_0$. Let $\ominus$ be cyclic subtraction (modulo the alphabet size $N = 26$) and let $p$ be the background (English) letter distribution. Then $q(x,y) = p(x)\,p(y \ominus x)$ (the implied key letter $y \ominus x$ is English) and $p_0(x,y) = p(x)/N$ (the ciphertext letter is uniformly random).

```
curl http://starbase.trincoll.edu/~crypto/resources/LetFreq.html \
  | grep TR | perl -pe 's/<\S+>//g' > lfreqs

cat lfreqs | perl -e '
  while(<>){@f=split;if(@f==4){$t+=$p{$f[0]}=$f[1]}}
  @a=sort keys%p;for$a(@a){$p{$a}/=$t}
  for$i(0..@a-1){for$j(0..@a-1){
    $d=$j-$i;$d+=@a if$d<0;($ai,$aj,$ad)=map($a[$_],$i,$j,$d);
    $q=$p{$ai}*$p{$ad};$p=$p{$ai}/@a;
    $qsc=log($q/$p)/log(2);$q1+=$q*$qsc;$q2+=$q*$qsc*$qsc;
    $p1+=$p*$qsc;$p2+=$p*$qsc*$qsc}}
  $qvar=$q2-$q1*$q1;$pvar=$p2-$p1*$p1;
  $len=4*($pvar+$qvar)/(($q1-$p1)**2);
  print"$len\n"'
```

Result: about 10 characters are required.

This next one treats the case when X and Y are both ciphertexts of the same plaintext, encrypted using different running keys.

Plaintext and both keys are English. Null hypothesis: X and Y are both perfectly random. So $q(x,y) = \sum_t p(t)\,p(x \ominus t)\,p(y \ominus t)$ (summing over the unknown plaintext letter $t$, whose two implied key letters are both English) and $p_0(x,y) = 1/N^2$.

```
cat lfreqs | perl -e '
  while(<>){@f=split;if(@f==4){$t+=$p{$f[0]}=$f[1]}}
  @a=sort keys%p;for$a(@a){$p{$a}/=$t}
  for$i(0..@a-1){for$j(0..@a-1){
    $q=0;
    for$k(0..@a-1){
      $di=$i-$k;$di+=@a if$di<0;$dj=$j-$k;$dj+=@a if$dj<0;
      ($ai,$aj,$ak,$adi,$adj)=map($a[$_],$i,$j,$k,$di,$dj);
      $q+=$p{$ak}*$p{$adi}*$p{$adj}}
    $p=1/(@a*@a);
    $qsc=log($q/$p)/log(2);$q1+=$q*$qsc;$q2+=$q*$qsc*$qsc;
    $p1+=$p*$qsc;$p2+=$p*$qsc*$qsc}}
  $qvar=$q2-$q1*$q1;$pvar=$p2-$p1*$p1;
  $len=4*($pvar+$qvar)/(($q1-$p1)**2);
  print"$len\n"'
```

Result: about 66 chars are required. Note that the only difference between these code snippets is the calculation of p and q. The code to compute L from these is identical in both cases.

---

-- Ian Holmes - 14 Dec 2006