SouthParkMe

Hi.

This site tends to hold geeky things.

Attempting to analyse Vigenère

Attempting to analyse Vigenère

Previously, we looked at the Vigenère cipher and I gave some code to produce it.

I used the code to produce the following ciphertext:

LRPBB HKQBG WIYSA DEURQ LVFOT YEZKV
ELAIG DXADC TRSHB PBBZN TRTWF NSZRH
NXFCC ZVFVB DEXZS ZVABR ZRQTB CEXZG
SEFWF ZYDAB EXAWF TXZCG LRPMR EWMWQ
ASDHU ZWTCY OSGHL ZYDVN YHMBQ DAQOE
NVUSQ LXTCF LRPOE LQUGN ESZQR ZZQFP
ZQQPL PBMAC WISFH XFXWA RXAVV XWQZS
YIHSE ELQZR DWBCE ELAGF EVQHP SIPCH
ELUGU LRPOA OXTSS ZYDTE TIZRF CIBSN
EIPKV ELABR GSUQR ELQTB CQGZN OMOHN
EIPPL OEDHN RRMBN WPRCE ZRQCA PJAFN
WPFVN EWISY WRAKY PXGGR GIDMB YIDSG
TVQHB SMECJ YLAAR DEURQ LVFOT YEZOF
TJTSU LHPCA PRAHU TRSPH EGAAZ LRPOY
WLUGY TJQOA OEFHR YXUCA QSDTE ZQFVV
DQAAR YXISN CIMHS PYPKV ELFVR NEDRV
YEX

How can we begin to crack it?

The first thing we might do is look at the statistics, how many times does each letter appear?

I took some English text, and counted up all the letters to get an idea of what a monoalphabet would give.

For plain English, this is the sort of plot we might expect.

A is letter ZERO, so E is letter 4, Z is letter 25.

Note the big spikes for A, E, (H, I, N), O, (R, S) and T. A caesar cipher would see the same pattern of peaks, but shifted along a bit. There would be minor variations in the sizes of the peaks due to variations in the underlying text.

A monoalphabetic cipher would have similar peaks, but in different places - everything is swapped around.

To illustrate this, I took some text and applied a Caesar shift. I got the following:

DQGQR ZJHQW OHPHQ VDLGG DUWDJ QDQZL
WKRXW VWRSS LQJWR HASOD LQKLV FRQGX
FWWRS RUWKR VDOOI RURQH RQHIR UDOOW
KDWLV RXUPR WWRLV LWQRW DQGBH WVDLG
SRUWK RVKRO GRXWB RXUKD QGDQG VZHDU
FULHG DWKRV DQGDU DPLVD WRQFH RYHUF
RPHEB HADPS OHJUX PEOLQ JWRKL PVHOI
QHYHU WKHOH VVSRU WKRVV WUHWF KHGRX
WKLVK DQGDQ GWKHI RXUIU LHQGV UHSHD
WHGZL WKRQH YRLFH WKHIR UPXOD GLFWD
WHGEB GDUWD JQDQD OOIRU RQHRQ HIRUD
OOWKD WVZHO OQRZO HWXVH YHUBR QHUHW
LUHWR KLVRZ QKRPH VDLGG DUWDJ QDQDV
LIKHK DGGRQ HQRWK LQJEX WFRPP DQGDO
OKLVO LIHDQ GDWWH QWLRQ IRUIU RPWKL
VPRPH QWZHD UHDWI HXGZL WKWKH FDUGL
QDO

I then counted the letters - I got this graph. The dotted line is the expected distribution for English, the blue line is my caesar shifted text. There are lots of commonalities, but there are differences (the caesar text was shorter, so you get statistical variations - I scaled the graphs to be similar)

The pattern of peaks is not identical, but it is similar in nature.

I wrote some very dirty code to ‘slide’ these two graphs across each other, and calculate the error between them (which I defined as the root of the square of the differences). I plotted this error against the shift.

Looking at the error in a caesar shifted text

There is a VERY clear signal in Caesar shifted text - and indeed, it was a shift of 3 spaces.

If we shift the graph by that amount, we can see the match is not bad

A close match

What if we try this with text that was enciphered using the Vigenère cipher?

Analysing Vigènere text

When we look at the statistics, the peaks seem all ‘washed out’ - there is no real match with English.

If we apply the technique of shifting, it is a dead end. This graph looks like there might be some options, but nothing is a close match.

Now, if we shift and look at the errors, we get the graph below. Nothing really stands out, the biggest discrepancy is about 25%, compared to a dip of around 80% before. This is clearly not a match.

Note the axis - nothing really stands out

Where to go from here?

That is a story for another day - but the first step is to establish the key length. That will be the next post on the topic. The code I used is posted below

Other posts on this topic

More posts on Cryptography

Code

This is the Python 3 code I used to look at the stats of the text:


# I'm going to plot a nice graph
import matplotlib.pyplot as plt
import numpy as np
from math import sqrt

englishdata=[18374,3418,
             4678,11083,
             30430,4396,
             4571,17068,
             13877,223,
             1971,10235,
             5056,15131,
             17579,3353,
             245,12919,
             14424,21889,
             5826,1798,
             6737,225,
             4305,93]

# grab the ciphertext
filetimetable = open("vigciphertext.txt","r")
lines=filetimetable.readlines()
filetimetable.close()

# let's make it all one string
working=""
for line in lines:
    for char in line:
        asc=ord(char)
        if (64 < asc < 91) or ( 96 < asc < 123):
            working+=char.upper()

# now let's see what we have:
x=np.arange(26)            
data=np.full(26,0)

for char in working:
    lett=ord(char)-65
    data[lett]+=1

# lets prep the data for plotting

# first let's normalise the english data
tot=0
tot2=0
for i in range(26):
    tot+=data[i]
    tot2+=englishdata[i]
for i in range(26):
    data[i]=data[i]*tot2/tot

plt.plot(x,data)
plt.plot(x,englishdata, marker='', color='olive', linewidth=2, linestyle='dashed', label="English")
        
# How frequent are x-axis gridlines?
stepsize=2
plt.xticks(np.arange(0, 25, step=stepsize))
plt.xlabel('Letter')
plt.ylabel('Matches')
plt.title('Compare to English')
plt.grid(b=True, which='major', color='#666666', linestyle='-')
# show the graph
plt.show()

error=[]
threshold=0
for shift in range(26):
    err=0
    for i in range(26):
        err+=sqrt((data[i]-englishdata[(i+shift)%26])**2)
    error.append(err)
    threshold+=err

threshold=threshold/52 #26 letters, and halfway

for i in range(26):
    if error[i]<threshold:
        plt.text(2,error[i],'shift '+str(i)+' or '+str(26-i))
        print(i,error[i])

plt.plot(x,error)        
plt.show()
Bennu Dry Run

Bennu Dry Run

Arithmetic Tricks - Squaring

Arithmetic Tricks - Squaring