Attempting to analyse Vigenère
Previously, we looked at the Vigenère cipher and I gave some code to produce it.
I used the code to produce the following ciphertext:
LRPBB HKQBG WIYSA DEURQ LVFOT YEZKV
ELAIG DXADC TRSHB PBBZN TRTWF NSZRH
NXFCC ZVFVB DEXZS ZVABR ZRQTB CEXZG
SEFWF ZYDAB EXAWF TXZCG LRPMR EWMWQ
ASDHU ZWTCY OSGHL ZYDVN YHMBQ DAQOE
NVUSQ LXTCF LRPOE LQUGN ESZQR ZZQFP
ZQQPL PBMAC WISFH XFXWA RXAVV XWQZS
YIHSE ELQZR DWBCE ELAGF EVQHP SIPCH
ELUGU LRPOA OXTSS ZYDTE TIZRF CIBSN
EIPKV ELABR GSUQR ELQTB CQGZN OMOHN
EIPPL OEDHN RRMBN WPRCE ZRQCA PJAFN
WPFVN EWISY WRAKY PXGGR GIDMB YIDSG
TVQHB SMECJ YLAAR DEURQ LVFOT YEZOF
TJTSU LHPCA PRAHU TRSPH EGAAZ LRPOY
WLUGY TJQOA OEFHR YXUCA QSDTE ZQFVV
DQAAR YXISN CIMHS PYPKV ELFVR NEDRV
YEX
How can we begin to crack it?
The first thing we might do is look at the statistics, how many times does each letter appear?
I took some English text, and counted up all the letters to get an idea of what a monoalphabet would give.
For plain English, this is the sort of plot we might expect.
Note the big spikes for A, E, (H, I, N), O, (R, S) and T. A caesar cipher would see the same pattern of peaks, but shifted along a bit. There would be minor variations in the sizes of the peaks due to variations in the underlying text.
A monoalphabetic cipher would have similar peaks, but in different places - everything is swapped around.
To illustrate this, I took some text and applied a Caesar shift. I got the following:
DQGQR ZJHQW OHPHQ VDLGG DUWDJ QDQZL
WKRXW VWRSS LQJWR HASOD LQKLV FRQGX
FWWRS RUWKR VDOOI RURQH RQHIR UDOOW
KDWLV RXUPR WWRLV LWQRW DQGBH WVDLG
SRUWK RVKRO GRXWB RXUKD QGDQG VZHDU
FULHG DWKRV DQGDU DPLVD WRQFH RYHUF
RPHEB HADPS OHJUX PEOLQ JWRKL PVHOI
QHYHU WKHOH VVSRU WKRVV WUHWF KHGRX
WKLVK DQGDQ GWKHI RXUIU LHQGV UHSHD
WHGZL WKRQH YRLFH WKHIR UPXOD GLFWD
WHGEB GDUWD JQDQD OOIRU RQHRQ HIRUD
OOWKD WVZHO OQRZO HWXVH YHUBR QHUHW
LUHWR KLVRZ QKRPH VDLGG DUWDJ QDQDV
LIKHK DGGRQ HQRWK LQJEX WFRPP DQGDO
OKLVO LIHDQ GDWWH QWLRQ IRUIU RPWKL
VPRPH QWZHD UHDWI HXGZL WKWKH FDUGL
QDO
I then counted the letters - I got this graph. The dotted line is the expected distribution for English, the blue line is my caesar shifted text. There are lots of commonalities, but there are differences (the caesar text was shorter, so you get statistical variations - I scaled the graphs to be similar)
The pattern of peaks is not identical, but it is similar in nature.
I wrote some very dirty code to ‘slide’ these two graphs across each other, and calculate the error between them (which I defined as the root of the square of the differences). I plotted this error against the shift.
There is a VERY clear signal in Caesar shifted text - and indeed, it was a shift of 3 spaces.
If we shift the graph by that amount, we can see the match is not bad
What if we try this with text that was enciphered using the Vigenère cipher?
When we look at the statistics, the peaks seem all ‘washed out’ - there is no real match with English.
If we apply the technique of shifting, it is a dead end. This graph looks like there might be some options, but nothing is a close match.
Now, if we shift and look at the errors, we get the graph below. Nothing really stands out, the biggest discrepancy is about 25%, compared to a dip of around 80% before. This is clearly not a match.
Where to go from here?
That is a story for another day - but the first step is to establish the key length. That will be the next post on the topic. The code I used is posted below
Other posts on this topic
More posts on Cryptography
Code
This is the Python 3 code I used to look at the stats of the text:
# I'm going to plot a nice graph
import matplotlib.pyplot as plt
import numpy as np
from math import sqrt
englishdata=[18374,3418,
4678,11083,
30430,4396,
4571,17068,
13877,223,
1971,10235,
5056,15131,
17579,3353,
245,12919,
14424,21889,
5826,1798,
6737,225,
4305,93]
# grab the ciphertext
filetimetable = open("vigciphertext.txt","r")
lines=filetimetable.readlines()
filetimetable.close()
# let's make it all one string
working=""
for line in lines:
for char in line:
asc=ord(char)
if (64 < asc < 91) or ( 96 < asc < 123):
working+=char.upper()
# now let's see what we have:
x=np.arange(26)
data=np.full(26,0)
for char in working:
lett=ord(char)-65
data[lett]+=1
# lets prep the data for plotting
# first let's normalise the english data
tot=0
tot2=0
for i in range(26):
tot+=data[i]
tot2+=englishdata[i]
for i in range(26):
data[i]=data[i]*tot2/tot
plt.plot(x,data)
plt.plot(x,englishdata, marker='', color='olive', linewidth=2, linestyle='dashed', label="English")
# How frequent are x-axis gridlines?
stepsize=2
plt.xticks(np.arange(0, 25, step=stepsize))
plt.xlabel('Letter')
plt.ylabel('Matches')
plt.title('Compare to English')
plt.grid(b=True, which='major', color='#666666', linestyle='-')
# show the graph
plt.show()
error=[]
threshold=0
for shift in range(26):
err=0
for i in range(26):
err+=sqrt((data[i]-englishdata[(i+shift)%26])**2)
error.append(err)
threshold+=err
threshold=threshold/52 #26 letters, and halfway
for i in range(26):
if error[i]<threshold:
plt.text(2,error[i],'shift '+str(i)+' or '+str(26-i))
print(i,error[i])
plt.plot(x,error)
plt.show()