|
| Cryptanalysis Exercises
More-than-slightly contrived exercises in cryptanalysis with bioinformatics equivalents.
Aimed at an introductory-level class in probability & information theory for compbio.
Information content of a sequence
- NumbersStations -- artificial voices reading numbers over shortwave radio, containing coded spy messages from Cuba -- does it get any better?
Data compression
See Data Compression Exercise.
Broken telephone tree
See Broken Telephone Tree.
Compositional analysis
- Single-character, di-character analysis
- Embed a signal in noise. Detect by compositional analysis
- More cunning embedding: try and match the letter frequencies. Countermeasure: use adjacency freqs
- Bring Bayesian scoring schemes into it somehow...
MCMC
- This is a generative model-based approach, following the classic 1949 paper of Claude Shannon: Communication Theory of Secrecy Systems (Wikipedia)
- General idea, illustrated for a transposition cipher (Wikipedia) :
- Train an order-N HMM on English (see "compositional analysis" above)
- Create an MCMC sampler over cipherkeys
- Let K=cipherkey, C=ciphertext, T=plaintext=T(K,C)
- You are then sampling from P(T,K)=P(C)*P(T(K,C)) where P(C) = 1/26! which disappears from Hastings ratio
- Options for guessing the key:
- Simplest is just to do full MCMC and take the mode
- Can also do a simulated annealing version (gradually lower the temperature)
- Lots of cheap & cheerful cryptanalysis can be done this way
- e.g. it can be used to crack the Playfair cipher (Wikipedia)
- No direct bioinformatics analogy (I think), but there ARE lots of MCMC algorithms in sequence analysis, so...
Pairwise sequence comparison
- Pairwise homology detection: try to match plaintext with ciphertext encrypted using a running key cipher with a tabula recta
- Both key and plaintext have a known frequency distribution (English), effectively yielding a substitution matrix
- Variation: try to match two ciphertexts encrypted from the same plaintext with different running keys
- Empirical demonstration: separation of signal and noise score distributions
- In some cryptanalytic scenarios, evidence of homology may be all you need (can then apply rubber hose cryptanalysis)
- Another pairwise homology example (better analogy, but not cryptanalysis per se - more like signals/traffic analysis): try to find the pair (or N/2 pairs) of matching transmissions in N noisily-intercepted transmissions
- the source of randomness here is imperfect eavesdropping by Eve (e.g. line noise), rather than some cipher that can be treated as pseudo-random
- for example, could have a "substitution matrix for Morse code": dot vs dash errors are like nucleotide substitutions
- can introduce indels in similar ways
- can make noise dependent on "length" of wire and thereby attempt to reconstruct topology of enemy network (c.f. phylogeny)... this may be stretching things a bit far, though...
- Local alignment (ungapped): padded version of pairwise homology detection
- Another scenario for this kind of algorithm: trying to identify repeated patterns in text message traffic (SMS)
Motifs
- A (possibly more realistic) motif-finding example: use profiling (Gibbs sampling, MEME, hmmer, etc.) to recognize opening keyphrases
Bayesian decision theory
Would be nice to have an example where you use posterior probabilities to maximize your expected reward
(i.e. decision theory, specifically expected utility).
Perhaps an example involving industrial espionage?
Notes
Resources
Historical codes & ciphers; resources
- Admiralty Secret Cypher S.P. 02105. British Cypher No. 5. (1943) "Recyphering is obligatory when using this cypher."
- It's interesting how they abandoned the 5-letter alphabetic codewords for the 4-digit numerics, e.g. 9968 instead of ZUROF (meaning: "agitators"). Could this be because the consonant-vowel-consonant-vowel-consonant pattern of the 5-letter words was a giveaway that some sort of code was in use?
- Historical codes (in the proper sense of the word, i.e. not ciphers)
Journals
Some crypto blogs
Pages on Biowiki
Required message lengths
Some perl code to figure out how long X and Y need to be before we can reliably deduce that they are related (plaintext & ciphertext) using a running key cipher with a tabula recta & assuming English plaintext & key, vs a null hypothesis that X is English and Y is perfectly random.
The formulae for the mean & SD of the score are and
where is the log-odds substitution matrix.
(Note that the mean score equals the Kullback-Leibler divergence (relative entropy): . Thus the expected score for related symbols is positive.)
On the other hand, if X and Y are in fact unrelated, then their symbols are drawn from p rather than q, so the mean score per symbol is and the sd of this is .
(Note that and so the average score for unrelated symbols is negative.)
The mean difference in score between a "hit" (of length L) and a random sequence-pair (also of length L) is and the s.d. of this difference is .
So a message length of is required for clear signal-noise separation.
Now let's compute and .
Let be cyclic subtraction (modulo the alphabet size ) and let be the background distribution.
Then and .
curl http://starbase.trincoll.edu/~crypto/resources/LetFreq.html | grep TR | perl -pe 's/<\S+>//g' >lfreqs
cat lfreqs | perl -e 'while(<>){@f=split;if(@f==4){$t+=$p{$f[0]}=$f[1]}}
@a=sort keys%p;for$a(@a){$p{$a}/=$t}$re=0;for$i(0..@a-1){for$j(0..@a-1){
$d=$j-$i;$d+=@a if$d<0;($ai,$aj,$ad)=map($a[$_],$i,$j,$d);
$q=$p{$ai}*$p{$ad};$p=$p{$ai}/@a;
$qsc=log($q/$p)/log(2);$q1+=$q*$qsc;$q2+=$q*$qsc*$qsc;
$p1+=$p*$qsc;$p2+=$p*$qsc*$qsc}}
$qvar=$q2-$q1*$q1;$pvar=$p2-$p1*$p1;
$len=4*($pvar+$qvar)/(($q1-$p1)**2);
print"$len\n"
Result: about 10 characters are required.
This next one treats the case when X and Y are both ciphertexts of the same plaintext,
encrypted using different running keys.
Plaintext and both keys are English.
Null hypothesis: X and Y both random.
So and .
cat lfreqs | perl -e 'while(<>){@f=split;if(@f==4){$t+=$p{$f[0]}=$f[1]}}
@a=sort keys%p;for$a(@a){$p{$a}/=$t}
for$i(0..@a-1){for$j(0..@a-1){
{$q=0;for$k(0..@a-1){$di=$i-$k;$di+=@a if$di<0;$dj=$j-$k;$dj+=@a if$dj<0;
($ai,$aj,$ak,$adi,$adj)=map($a[$_],$i,$j,$k,$di,$dj);$q+=$p{$ak}*$p{$adi}*$p{$adj}}
$p=1/(@a*@a)}
$qsc=log($q/$p)/log(2);$q1+=$q*$qsc;$q2+=$q*$qsc*$qsc;
$p1+=$p*$qsc;$p2+=$p*$qsc*$qsc}}
$qvar=$q2-$q1*$q1;$pvar=$p2-$p1*$p1;
$len=4*($pvar+$qvar)/(($q1-$p1)**2);
print"$len\n"
Result: about 66 chars are required.
Note that the only difference between these code snippets is the calculation of p and q.
The code to compute L from these is identical in both cases.
-- IanHolmes - 14 Dec 2006 |