Solving substitution ciphers with Markov chains in Python

July 28, 2019

A simple substitution cipher like a Caesar cipher or ROT13 substitutes each letter in the original message with a specific letter, e.g. replacing all A's in the original message with N's. So a message like:

TO BE OR NOT TO BE

becomes:

LW UO WQ PWL LW UO
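
As a concrete case, ROT13 can be written as a fixed 26-letter key. The sketch below is only an illustration: the helper name rot13_key is made up for this example, and the ciphertext shown above comes from a different, arbitrary key rather than ROT13.

```python
import string

def rot13_key():
    # Map each uppercase letter to the one 13 places further along,
    #   wrapping around at the end of the alphabet
    letters = string.ascii_uppercase
    shifted = letters[13:] + letters[:13]
    return dict(zip(letters, shifted))

message = "TO BE OR NOT TO BE"
table = str.maketrans(rot13_key())
encrypted = message.translate(table)
print(encrypted)
# ROT13 is its own inverse, so applying it twice restores the message
print(encrypted.translate(table))
```

Characters that aren't in the key, like spaces, pass through unchanged.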

We can break these ciphers using some basic natural language processing, exploiting statistical properties of language. If you know some basic cryptography you might be familiar with the idea of using letter frequencies for this kind of code-breaking, e.g. the fact that 'e' is the most common letter in English. We'll use letter frequencies here, but we'll also add bigram frequencies (pairs of letters), as used in Chen and Rosenthal (2012).

By using bigram frequencies to evaluate the likelihood that text belongs to English, we're assuming that it resembles a Markov process - when c1 and c2 appear together, we only condition on c1 to get an estimate of how likely c2 was to appear, rather than the entire text up to that point. Since we're only considering one previous letter, it's a first-order Markov process.
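
To make the first-order assumption concrete, here is a minimal sketch of how the log probability of a string factorises under such a model. The function name markov_log_prob and the toy probabilities are invented for illustration; the scoring function used later in this post mixes letter and bigram probabilities rather than using this exact form.

```python
import math

def markov_log_prob(text, start_probs, transition_probs):
    # log P(text) = log p(c_1) + sum over i of log p(c_i | c_{i-1})
    log_prob = math.log(start_probs[text[0]])
    for prev, cur in zip(text[:-1], text[1:]):
        log_prob += math.log(transition_probs[prev][cur])
    return log_prob

# Toy two-letter "language" with made-up probabilities
start_probs = {'a': 0.5, 'b': 0.5}
transition_probs = {
    'a': {'a': 0.5, 'b': 0.5},
    'b': {'a': 1.0},  # in this toy model 'b' is always followed by 'a'
}
result = markov_log_prob('aba', start_probs, transition_probs)
print(result)
```

Each letter's contribution depends only on the letter immediately before it, which is all that "first-order" means here.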

Getting some text (or maybe a lot)

We'll start by setting up some sources of text, both to encrypt/decrypt and to estimate the letter and bigram frequencies from. For both uses, it helps to have reasonably large amounts of text (while testing this method I found it didn't always work well on short cipher texts). We'll use some books chosen from the most popular downloads on Project Gutenberg.

First let's get some basic libraries and functions set up:

import collections
from itertools import islice
import math
import random
import string

import pandas as pd
import matplotlib.pyplot as plt

# Helper function we'll use to preview dictionaries
def take(n, iterable):
    return list(islice(iterable, n))

# Get all the text from a file, converting to lowercase
#   and removing everything except letters and spaces
def get_simple_text(filename):
    with open(filename) as file_in:
        all_text = file_in.read()
    # Remove all punctuation + numbers
    out = []
    for c in all_text.lower():
        # Make sure we don't add double spaces
        if c in (' ', '\n', '\t', '\r'):
            if out and out[-1] != ' ':
                out.append(' ')
        if c in string.ascii_lowercase:
            out.append(c)
    return ''.join(out)

Now we'll read in our two books: Alice's Adventures in Wonderland and Frankenstein.

alice = get_simple_text('alice.txt')
frank = get_simple_text('frank.txt')
print(alice[1012:1105])
print(frank[3000:3100])

the hot day made her feel very sleepy and stupid whether the pleasure of making a daisychain 
rks in a little boat with his holiday mates on an expedition of discovery up his native river but su

After that, it's time to set up an encryption key. We'll map each letter in the alphabet randomly to another letter:

def make_random_key():
    out_letters = list(string.ascii_lowercase)
    random.shuffle(out_letters)
    key = dict(zip(string.ascii_lowercase, out_letters))
    return key

encrypt_key = make_random_key()
take(5, encrypt_key.items())

[('a', 'b'), ('b', 'a'), ('c', 'c'), ('d', 'f'), ('e', 'u')]

And we'll encrypt Alice's Adventures in Wonderland:

# encrypting and decrypting both work the same
#   way except the key is reversed, so we won't
#   define a separate encrypt method
def decrypt(code, key):
    trans = str.maketrans(key)
    return code.translate(trans)

alice_encrypted = decrypt(alice, encrypt_key)
alice_encrypted[1012:1105]

'wlu lpw fbn ibfu luy juuh suyn xhuutn bvf xwztof kluwluy wlu thubxzyu pj ibeovq b fboxnclbov '
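
As a quick sanity check on the idea that decryption is just encryption with the key reversed, here is a small self-contained sketch. The names apply_key and invert_key are made up for this example.

```python
import random
import string

def make_random_key():
    out_letters = list(string.ascii_lowercase)
    random.shuffle(out_letters)
    return dict(zip(string.ascii_lowercase, out_letters))

def apply_key(text, key):
    # Works for both directions: pass the key to encrypt,
    #   or the inverted key to decrypt
    return text.translate(str.maketrans(key))

def invert_key(key):
    return {c2: c1 for c1, c2 in key.items()}

key = make_random_key()
plain = 'to be or not to be'
cipher = apply_key(plain, key)
round_trip = apply_key(cipher, invert_key(key))
print(cipher)
print(round_trip)
```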

The probability of a string

Now we'll use Frankenstein to estimate English letter and bigram probabilities.

For letter probabilities, all we have to do is count the number of times each letter appears, and divide by the total to convert into a probability.

For bigram probabilities, for each letter c1 we count each letter c2 that immediately follows it, and then convert the counts to a conditional probability p(c2 | c1) = count(c1, c2) / total(c1), where total(c1) is the number of bigrams that start with c1.

def make_letter_probs(text):
    counts = collections.Counter(text)
    total = sum(counts.values())
    
    probs = {}
    for c in counts:
        # Ignore space (its count still contributes to total,
        #   so the letter probabilities sum to less than 1)
        if c == ' ':
            continue
        probs[c] = counts[c] / total
    return probs

def make_bigram_probs(text):
    freqs = collections.defaultdict(collections.Counter)
    for c1, c2 in zip(text[:-1], text[1:]):
        freqs[c1][c2] += 1
    
    prob_table = collections.defaultdict(dict)
    for c1, c1_counts in freqs.items():
        total = sum(c1_counts.values())
        for c2, freq in c1_counts.items():
            prob_table[c1][c2] = freq / total
    return prob_table

letter_probs = make_letter_probs(frank)
take(5, letter_probs.items())

[('p', 0.014430786931051081),
 ('r', 0.048974512497211756),
 ('o', 0.059235257516524024),
 ('j', 0.0011833902722502025),
 ('e', 0.1081557660925815)]

bigram_probs = make_bigram_probs(frank)
take(5, bigram_probs['a'].items())

[('n', 0.21939004335476156),
 ('r', 0.09900583046793243),
 ('f', 0.009904320526237105),
 ('t', 0.14654656899387053),
 ('l', 0.06701300642846464)]

We can use a quick plot to check how these bigram probabilities look. You can see some basic patterns, like how 'q' is almost always followed by 'u', and letters like 'v' and 'j' are mostly followed by vowels:

bi_df = pd.DataFrame.from_dict(bigram_probs)
# Sort both rows and columns so they line up with the tick labels below
bi_df = bi_df.sort_index(axis='index').sort_index(axis='columns')

fig, ax = plt.subplots()
im = ax.imshow(bi_df, cmap=plt.cm.viridis, vmin=0, vmax=1)
fig.colorbar(im)
ax.set_title("Bigram probabilities")
ax.xaxis.tick_top()
ax.set_xlabel('First letter')
ax.set_ylabel('Second letter')
ax.xaxis.set_label_position('top')
ax.set_xticks(range(27))
ax.set_yticks(range(27))
ax.set_xticklabels(' ' + string.ascii_lowercase)
ax.set_yticklabels(' ' + string.ascii_lowercase)
fig.set_size_inches((9, 7))

We'll use our letter and bigram probabilities to score the likelihood of a piece of text being English (at least, English as it appears in Frankenstein). For each letter in the text we're scoring, we'll add the letter and bigram probability together (with optional weights). The overall likelihood would be the product of all these probabilities, which would get incredibly small very quickly, so instead we use the sum of the log probabilities.

The actual likelihood value we see is pretty arbitrary; the important thing is that the bigger it is (that is, the closer to zero, since log probabilities are negative), the more likely the text is to be English.
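
To see why the raw product is impractical, here's a small sketch using a made-up probability of 0.05 for each of 2000 characters. The numbers are illustrative only and aren't taken from the texts above.

```python
import math

probs = [0.05] * 2000

# Multiplying directly underflows: the true value is far smaller
#   than the smallest positive float
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the same information in a usable range
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The product collapses to zero, so every candidate decryption would look equally unlikely, while the sum of logs still distinguishes between them.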

def score_text(text, letter_probs, bigram_probs,
               letter_weight=1.0,
               bigram_weight=1.0):
    # Normalise weights to sum to 1
    total_weight = letter_weight + bigram_weight
    letter_weight = letter_weight / total_weight
    bigram_weight = bigram_weight / total_weight
    
    total_logprob = 0
    for c1, c2 in zip(text[:-1], text[1:]):
        # Use a default of 1 for letter prob, basically
        #   ignore spaces
        letter_prob = letter_probs.get(c1, 1)
        bigram_prob = bigram_probs[c1].get(c2, 0.001)
        total_logprob += math.log(
            letter_weight * letter_prob +
            bigram_weight * bigram_prob
        )
            
    return total_logprob

score_text(alice_encrypted, letter_probs, bigram_probs)

-510545.2674804321

Solving the cipher

With all these pieces in place, we can start trying to decrypt the text. We'll start with a random decryption key, and randomly swap letters around to see if we get an improvement in the decrypted text's score.

As we go we'll print out the current decryption of the text to see our progress. Even when we don't have the right answer, you should be able to see the text becoming more "English-y" as we go:

letter_weight = 1.0
bigram_weight = 1.0
iterations = int(1e4)
print_every = 1000

decrypt_key = make_random_key()
best_decrypt = decrypt(alice_encrypted, decrypt_key)
best_score = score_text(best_decrypt, letter_probs, bigram_probs,
                        letter_weight = letter_weight,
                        bigram_weight = bigram_weight)

for iter_num in range(iterations):
    # Pick two distinct letters and swap them in the key
    a, b = random.sample(string.ascii_lowercase, 2)
    decrypt_key[a], decrypt_key[b] = decrypt_key[b], decrypt_key[a]
    current_decrypt = decrypt(alice_encrypted, decrypt_key)
    new_score = score_text(current_decrypt, letter_probs, bigram_probs,
                           letter_weight = letter_weight,
                           bigram_weight = bigram_weight)
    if new_score > best_score:
        # Keep the swap
        best_score = new_score
        best_decrypt = current_decrypt
    else:
        # Swap back
        decrypt_key[a], decrypt_key[b] = decrypt_key[b], decrypt_key[a]
    # Check progress. Note this previews the candidate we just tried,
    #   which may be a swap we rejected and undid above
    if iter_num % print_every == 0:
        print('{n}: {d}'.format(n=iter_num,
                                d=current_decrypt[1012:1105]))
# Likewise, this is the last candidate tried; best_decrypt holds
#   the best decryption found
print(current_decrypt[1012:1105])

0: phu hwp fay jafu hub suuo euby couuxy azf cpkxif dhuphub phu xouackbu ws janizg a faicythaiz 
1000: the hot day zade her lees very fseepy and ftupid whether the pseafure ol zaking a daifychain 
2000: the hot day made her feew very sweepy and stupid lhether the pweasure of making a daisychain 
3000: the hot day made her leef very sfeepy and stupid whether the pfeasure ol making a daisychain 
4000: ths hot day mads hsr fssl vsry elsspy and etupid whsthsr ths plsaeurs of making a daieychain 
5000: the hot day made hej feel vejy sleepy and stupid whethej the pleasuje of making a daisychain 
6000: thc hot day madc hcr fccl vcry slccpy and stupid whcthcr thc plcasurc of making a daisyehain 
7000: the hot dak made her feel verk sleepk and stupid whether the pleasure of maying a daiskchain 
8000: the hot day made her feel very sleepy and stupid whether the pleasure of making a daisychain 
9000: the hot day nade her feel very sleepy amd stupid whether the pleasure of nakimg a daisychaim 
the hot day made her feel very sleepy and stupcd whether the pleasure of makcng a dacsyihacn

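And that's it! This is a random process, so it may not find the correct answer every time: we only ever keep swaps that improve the score, so the search can get stuck on a key that's close but not quite right. Note that the previews above show the candidate decryption tried at each iteration, including swaps that were rejected and undone, which is why the 9000 line and the final line look worse than the 8000 line even though the key itself never gets worse. Still, it should be clear that it generally moves in the direction of "more like English". We can check our final answer against the original encryption key we used: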

def reverse_key(key):
    return {c2: c1 for c1, c2 in key.items()}

true_decrypt_key = reverse_key(encrypt_key)
decrypt_key == true_decrypt_key

True
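
When the comparison comes back False, a partial measure is more informative than a single boolean. Here's a minimal sketch, assuming keys are dictionaries like the ones above; key_accuracy is a name invented for this example, and the tiny four-letter keys are made up.

```python
def key_accuracy(candidate_key, true_key):
    # Fraction of cipher letters the candidate maps to the right plaintext letter
    matches = sum(1 for c in true_key if candidate_key.get(c) == true_key[c])
    return matches / len(true_key)

true_key = {'a': 'x', 'b': 'y', 'c': 'z', 'd': 'w'}
candidate = {'a': 'x', 'b': 'y', 'c': 'w', 'd': 'z'}
accuracy = key_accuracy(candidate, true_key)
print(accuracy)
```

With the real keys this would be called as key_accuracy(decrypt_key, true_decrypt_key).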

Because this method is so simple, it's easy to see multiple different ways we can tweak it to try for better accuracy and efficiency:

Varying the weights on letters vs. bigrams.
Using multiple texts to build up our letter and bigram probabilities to avoid the quirks of any one text.
Extending to include trigrams or even longer sequences and whole words.
Testing exactly how much text is needed to get good performance: shorter cipher texts are quicker to decrypt and score, but may not give as good accuracy since their score will have greater variance.
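
As a starting point for the trigram idea, here is a rough sketch of a trigram table built the same way as the bigram table. The function make_trigram_probs is hypothetical and isn't used anywhere above; the scoring function would also need an extra term and weight to make use of it.

```python
import collections

def make_trigram_probs(text):
    # Count how often each letter c3 follows each two-letter context c1 + c2
    freqs = collections.defaultdict(collections.Counter)
    for c1, c2, c3 in zip(text[:-2], text[1:-1], text[2:]):
        freqs[c1 + c2][c3] += 1

    # Convert counts to conditional probabilities p(c3 | c1, c2)
    prob_table = {}
    for context, counts in freqs.items():
        total = sum(counts.values())
        prob_table[context] = {c3: n / total for c3, n in counts.items()}
    return prob_table

trigram_probs = make_trigram_probs('the then them')
print(trigram_probs['th'])
print(trigram_probs['he'])
```

One trade-off to keep in mind: there are far more possible two-letter contexts than single letters, so a trigram table likely needs more training text before its estimates are reliable.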