A MATHEMATICAL / STATISTICAL TECHNIQUE

FOR VOWEL ISOLATION

IN SIMPLE-SUBSTITUTION CIPHERTEXTS

Donald R. Burleson, Ph.D.

Copyright © 2007 by Donald R. Burleson. All rights reserved.



In 1989 I discovered a new algorithm for isolating vowels in a simple-substitution cryptogram. As a member of the American Cryptogram Association (ACA), I published an article ("The ‘MACC’—A Statistical Technique for Vowel Isolation," bylined as CTHULHU, my member "nom" in the ACA) in the March-April 1989 issue of the official ACA journal, The Cryptogram. In the intervening years I have tested the method extensively and found it quite satisfactory, but until now (January 2007) I have not written about the algorithm for a broader audience.

To illustrate the method here, I will use a cryptogram (number A-20) from the September-October 2006 issue of The Cryptogram, on which my vowel isolation algorithm turns out to work perfectly. The technique does not always work so cleanly, but in general it works well enough to assist greatly in solving difficult cryptograms such as this one, which was composed by ACA member PETROUSHKA and is reprinted here by permission of the American Cryptogram Association:

AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN.

To apply my technique, which I call MACC (the "Mean Associated Contact Count"), I first do a frequency count of the ciphertext letters and list them from highest to lowest frequency of occurrence:

(10) K; (9) Z; (8) X; (7) A, N; (6) Q, W; (5) B, I, P, U, Y; (4) L, S; (3) G, M, T; (2) J, V.
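The frequency count is easy to automate. The following is a minimal Python sketch, not the program used for the original article; the names `CIPHERTEXT`, `letter_frequencies`, `freq`, and `ordered` are chosen here purely for illustration.

```python
from collections import Counter

CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG "
    "KUIBWPZ TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def letter_frequencies(text):
    """Count each alphabetic character, ignoring spaces and punctuation."""
    return Counter(ch for ch in text.upper() if ch.isalpha())


freq = letter_frequencies(CIPHERTEXT)

# Highest frequency first; letters with equal counts are listed alphabetically.
ordered = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
for letter, count in ordered:
    print(f"({count}) {letter}")
```

The alphabetical tie-break is a convenience of the sketch; the method itself does not depend on how equal frequencies are ordered.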

Next, for each ciphertext letter, I list the distinct letters that it contacts (stands adjacent to within a word), and I record the number of such letters as the given letter’s "variety of contact count," or VCC. This gives VCC values of:

K(12), Z(13), X(11), A(8), N(12), Q(6), W(6), B(7), I(7), P(7), U(5), Y(6), L(6), S(6), G(2), M(3), T(4), J(4), V(2).
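The contact lists can also be tallied by machine. The sketch below is an illustration under stated assumptions rather than a definitive implementation: it treats any run of letters as a word, counts contacts only within words, and uses the hypothetical names `contact_sets`, `contacts`, and `vcc`.

```python
import re
from collections import defaultdict

CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG "
    "KUIBWPZ TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def contact_sets(text):
    """Map each letter to the set of distinct letters adjacent to it within a word."""
    contacts = defaultdict(set)
    for word in re.findall(r"[A-Z]+", text.upper()):
        # Each adjacent pair contributes a contact in both directions.
        # A doubled letter would therefore count as its own contact.
        for left, right in zip(word, word[1:]):
            contacts[left].add(right)
            contacts[right].add(left)
    return contacts


contacts = contact_sets(CIPHERTEXT)
vcc = {letter: len(neighbors) for letter, neighbors in contacts.items()}

for letter in sorted(vcc, key=lambda c: (-vcc[c], c)):
    print(f"{letter}({vcc[letter]}): {' '.join(sorted(contacts[letter]))}")
```

Run on the ciphertext above, this sketch agrees with the listed VCC values with one exception: it counts seven contacts for W rather than six, because the second W in VNWKQZWG stands next to Z.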

Then, in each letter’s adjacent-letter list, I record each adjacent letter’s own VCC, add these VCCs, and divide the total by the root letter’s own VCC. The result is an average, the Mean Associated Contact Count (MACC): the average number of contacts that a given letter’s contacts themselves have. I compute it for each letter separately:

For K: P(7), W(6), Q(6), Y(6), X(11), I(7), U(5), A(8), Z(13), M(3), L(6), N(12). TOTAL 90. MACC = 90 / 12 = 7.500.

For Z: V(2), I(7), A(8), T(4), Q(6), S(6), M(3), W(6), P(7), U(5), K(12), X(11), Y(6). TOTAL 83. MACC = 83 / 13 = 6.385.

For X: I(7), U(5), T(4), K(12), J(4), Y(6), W(6), L(6), Q(6), Z(13), A(8). TOTAL 77. MACC = 77 / 11 = 7.000.

For A: Q(6), B(7), Z(13), J(4), K(12), N(12), X(11), Y(6). TOTAL 71. MACC = 71 / 8 = 8.875.

For N: Q(6), I(7), S(6), B(7), P(7), V(2), W(6), J(4), L(6), A(8), Y(6), K(12). TOTAL 77. MACC = 77 / 12 = 6.417.

For Q: A(8), N(12), U(5), K(12), Z(13), X(11). TOTAL 61. MACC = 61 / 6 = 10.167.

For W: K(12), G(2), N(12), B(7), P(7), X(11). TOTAL 51. MACC = 51 / 6 = 8.500.

For B: N(12), A(8), L(6), S(6), I(7), W(6), T(4). TOTAL 49. MACC = 49 / 7 = 7.000.

For I: N(12), P(7), Z(13), X(11), K(12), U(5), B(7). TOTAL 67. MACC = 67 / 7 = 9.571.

For P: I(7), K(12), N(12), W(6), Z(13), S(6), L(6). TOTAL 62. MACC = 62 / 7 = 8.857.

For U: X(11), Q(6), I(7), K(12), Z(13). TOTAL 49. MACC = 49 / 5 = 9.800.

For Y: K(12), S(6), X(11), N(12), A(8), Z(13). TOTAL 62. MACC = 62 / 6 = 10.333.

For L: B(7), T(4), N(12), P(7), X(11), K(12). TOTAL 53. MACC = 53 / 6 = 8.833.

For S: N(12), B(7), Y(6), Z(13), P(7), J(4). TOTAL 49. MACC = 49 / 6 = 8.167.

For G: W(6), M(3). TOTAL 9. MACC = 9 / 2 = 4.500.

For M: Z(13), K(12), G(2). TOTAL 27. MACC = 27 / 3 = 9.000.

For T: Z(13), X(11), L(6), B(7). TOTAL 37. MACC = 37 / 4 = 9.250.

For J: A(8), X(11), S(6), N(12). TOTAL 37. MACC = 37 / 4 = 9.250.

For V: Z(13), N(12). TOTAL 25. MACC = 25 / 2 = 12.500.
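The calculation carried out for each letter can be stated compactly. The symbol C(x), standing for the set of distinct letters adjacent to the letter x, is introduced here only for convenience and does not appear in the original article:

```latex
\mathrm{VCC}(x) = \lvert C(x) \rvert, \qquad
\mathrm{MACC}(x) = \frac{1}{\mathrm{VCC}(x)} \sum_{y \in C(x)} \mathrm{VCC}(y)
```

For K, for instance, the sum over its twelve contacts is 90, giving 90 / 12 = 7.500.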

 

Now I rank the ciphertext letters by listing them from lowest to highest corresponding MACC values:

G Z N X B K S W L P A M T J I U Q Y V
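The whole procedure, from contact lists to ranking, fits in a few lines of code. What follows is a hedged sketch, not the author's original program: the function name `macc_table` and the alphabetical tie-break are assumptions made for illustration. Because the contact tallies are recomputed from the text, individual values may differ slightly from a hand count.

```python
import re
from collections import defaultdict

CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG "
    "KUIBWPZ TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def macc_table(text):
    """Return {letter: MACC}, counting contacts only within words."""
    contacts = defaultdict(set)
    for word in re.findall(r"[A-Z]+", text.upper()):
        for left, right in zip(word, word[1:]):
            contacts[left].add(right)
            contacts[right].add(left)
    # VCC: number of distinct letters each letter stands next to.
    vcc = {letter: len(neighbors) for letter, neighbors in contacts.items()}
    # MACC: mean VCC of a letter's contacts.
    return {
        letter: sum(vcc[n] for n in neighbors) / vcc[letter]
        for letter, neighbors in contacts.items()
    }


macc = macc_table(CIPHERTEXT)

# Lowest MACC first; equal values are ordered alphabetically (an assumption).
ranking = sorted(macc, key=lambda c: (macc[c], c))
for letter in ranking:
    print(f"{letter} {macc[letter]:.3f}")
```

For this ciphertext the six leading letters of the computed ranking are G, Z, N, X, B, K, and V comes last. A letter that occurs only as a one-letter word has no contacts and is left out of the table by this sketch.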

The idea behind the algorithm is that the vowels should strongly tend to "float" to the top, or near the top, of the ranking. This holds at least in theory, and practice bears it out significantly: I have solved hundreds of difficult cryptograms this way. In this case, the leading symbols G, Z, N, X, B, K, … are statistically the most likely to stand for vowels.

The reasons for this are fairly elementary. Vowels contact a richer variety of letters than consonants do; consonants, by contrast, typically contact relatively few different letters. For each ciphertext letter, two things can give the Mean Associated Contact Count (MACC) a low numerical value, so that the letter floats to a position near the top of the ranking: (1) the denominator in the fraction TOTAL / VCC being large, denoting a high number of different contact letters; and (2) the numerator being small, because many of the contact letters have few contacts themselves, suggesting that they are most likely consonants. Both phenomena point toward the "root" letter (the one for which the contact list has been made) being a vowel.
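The figures from the example illustrate both effects. Z has the largest denominator of any letter (13 contacts). G and V share the same small denominator (2), but G's contacts W and M have few contacts of their own, while V's contacts Z and N have many:

```latex
\mathrm{MACC}(Z) = \frac{83}{13} \approx 6.385, \qquad
\mathrm{MACC}(G) = \frac{6 + 3}{2} = 4.500, \qquad
\mathrm{MACC}(V) = \frac{13 + 12}{2} = 12.500
```

Z and G therefore land near the top of the ranking, and V at the very bottom.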

One now takes note of the ciphertext letters leading off the MACC ranking, i.e. the letters most likely to stand for vowels; one examines the positions of such letters in the ciphertext words; and fortified by these observations one uses one’s knowledge of word structure and sentence structure to proceed with the processes of solution in the usual ways familiar to the cryptanalyst. In this case the solution turns out to be:

STODGILY PEDANTIC HOUSEMAID, GOURMET HUSBAND CHEF POLITELY INDULGE MUSICAL NEIGHBOR, GRATIFY SOCIAL FEAST SCENARIO.

This solution would have been a great deal more difficult without the vowel isolation procedure described. If we list the ciphertext letters and give their ultimate plaintext equivalents as known with the solution complete, we see that the vowels have indeed floated to the top of the distribution:

 

Cipher / Plain:
G / Y
Z / E
N / O
X / A
B / U
K / I
S / H
W / L
L / R
P / G
A / S
M / F
T / M
J / B
I / D
U / N
Q / T
Y / C
V / P
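The table can be checked mechanically. The short sketch below copies the ranking and the plaintext equivalents from the table, applies them as a substitution key to the ciphertext, and reports where the vowels fall. The variable names are illustrative, and counting Y as a vowel is an assumption of the sketch.

```python
CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG "
    "KUIBWPZ TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)

# MACC ranking and plaintext equivalents, copied from the table above.
RANKING = "GZNXBKSWLPAMTJIUQYV"
PLAIN = "YEOAUIHLRGSFMBDNTCP"

key = dict(zip(RANKING, PLAIN))

# Spaces and punctuation are not in the key, so translate() leaves them alone.
plaintext = CIPHERTEXT.translate(str.maketrans(key))

VOWELS = set("AEIOUY")  # Y is counted as a vowel here (an assumption)
vowel_positions = [
    rank for rank, cipher in enumerate(RANKING, start=1) if key[cipher] in VOWELS
]

print(plaintext)
print(vowel_positions)
```

In this example the vowels occupy exactly the first six places of the ranking.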

Sometimes the vowel isolation algorithm produces anomalies. For example, the "liquid" consonants L and R are notorious vowel imitators, and they may "float up" among the vowels. In the example given, the plaintext letters L and R do end up fairly close to the top of the ranking, which is typical. Other letters, for one reason or another, may occasionally "float up" as well.

However, experience (my own and that of other cryptanalysts who have used the method, in some cases writing computer programs to run it) has shown that the method always works well enough to help identify most of the vowels, leading to a speedier solution of the whole problem. (As a rough rule of thumb, among the top seven or so letters in the MACC ranking, at least four can usually be assumed to be vowels.) My own statistical analysis of large bodies of text has shown that, not surprisingly, the longer the ciphertext, the more successfully the MACC method isolates all the vowels, or at least A, E, I, O, and U (Y being a semi-consonant and thus not quite so reliable). My studies of large amounts of English-language text suggest that overall the MACC ranking tends to run approximately as follows:

E O A I U L R N C P S T D G B H Y X M K W F V Q J Z

(It is understood that not all twenty-six letters may be present in short ciphertexts.)

Thus for sufficiently long texts, the five major vowels float to the top of the distribution closely followed by the liquids L and R, the behavior of the rest of the alphabet being somewhat less describable. Even in shorter texts, which can be viewed as statistical samples of the workings of the language as a whole, the tendency of the vowels to move to the top is reliable enough (as statistical sample behavior reflecting the tendencies of the language "population" at large) to be exceedingly helpful. And as vowels are the morphological sites at which words "breathe," their identification can only facilitate cryptanalytic solution in general.
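The rule of thumb about the top seven letters can be tried on any text whose plaintext is known. Because the MACC depends only on which letters stand next to which, it can be computed on the plaintext directly. The sketch below is an illustration only: the names `macc_ranking` and `vowels_in_top` are hypothetical, and the sample is the solution to the cryptogram above, not a large body of text.

```python
import re
from collections import defaultdict


def macc_ranking(text):
    """Return the letters of text ordered from lowest to highest MACC."""
    contacts = defaultdict(set)
    # Any run of letters is treated as a word, so an apostrophe splits a word.
    for word in re.findall(r"[A-Z]+", text.upper()):
        for left, right in zip(word, word[1:]):
            contacts[left].add(right)
            contacts[right].add(left)
    vcc = {c: len(neighbors) for c, neighbors in contacts.items()}
    macc = {c: sum(vcc[n] for n in contacts[c]) / vcc[c] for c in contacts}
    return sorted(macc, key=lambda c: (macc[c], c))


def vowels_in_top(text, k=7, vowels="AEIOU"):
    """Count how many of the k leading letters in the MACC ranking are vowels."""
    return sum(1 for c in macc_ranking(text)[:k] if c in vowels)


SAMPLE = (
    "STODGILY PEDANTIC HOUSEMAID, GOURMET HUSBAND CHEF POLITELY "
    "INDULGE MUSICAL NEIGHBOR, GRATIFY SOCIAL FEAST SCENARIO."
)

hits = vowels_in_top(SAMPLE)
print(f"{hits} of the top 7 letters are vowels")
```

Feeding the function texts of different lengths is one way to explore how the reliability of the ranking changes with the amount of text available.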