CHAPTER I

RESULTS OF SOME COMPUTER ANALYSES OF THE VALIDITY OF THE BEALE CIPHERS

One of the great outstanding challenges in cryptanalysis for many years has been to crack the Beale ciphers. These ciphers are supposed to have been enciphered by Thomas Jefferson Beale about 1822 and are supposed to contain messages describing the location of several million dollars' worth of gold, silver and jewels buried in Bedford County, Virginia. There are three ciphers, numbered as Cipher 1, Cipher 2 and Cipher 3. Cipher 2 has been deciphered and is given in deciphered form in Attachment 4 hereto. Cipher 1 is supposed to give the precise location of the treasure and Cipher 3 is supposed to give directions on how to divide the treasure among the heirs of Beale and his partners.

Cipher 1 is 520 enciphered characters in length and includes 298 different enciphered characters. The greatest frequency of occurrence is for enciphered character "18" with 8 occurrences. Cipher 2 is 763 enciphered characters in length and includes 183 different enciphered characters. The greatest frequency of occurrence is for enciphered character "818" representing the letter "v" with 18 occurrences. Cipher 3 is 618 enciphered characters in length and includes 263 different enciphered characters. The greatest frequency of occurrence is for enciphered character "96" with 13 occurrences.

William F. Friedman proposed the use of an index of coincidence K (for kappa) in which the probability of the coincidence of identical characters at certain positions in an enciphered message would be expressed as a percentage. He noted that in an average plaintext message in English, kappa is 6.67%.

For the purposes of the present study, it is also useful to note that, because of the occurrence of certain highly probable digraphs, trigraphs and common words in English, some letters are more likely to occur before or after a given letter than are other letters. This means that for positions adjacent to matching characters, the index of coincidence in those positions is higher than 6.67%. For reasons which are apparent upon reflection, the values of the index of coincidence are symmetrical for positions equidistant left and right from identical central characters.

A computer analysis for about 4000 plaintext English characters (actually four groups of about 1000 characters each) shows that KT+4 (kappa for plain text in positions four characters away from central characters) is about 7.1%, KT+3 is about 7.8%, KT+2 is about 9.2% and KT+1 (kappa for plain text immediately adjacent to identical characters) is about 13.2%, as compared to 6.67% for remote characters in a long text which would be predicted by Mr. Friedman's kappa.
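The positional counts described above can be sketched in Python. This is only an illustration, not the program used for the study: the function name `offset_coincidence` and the convention of skipping pairs whose offset position falls past the end of the text are assumptions.

```python
from itertools import combinations

def offset_coincidence(symbols, offset):
    """Return (hits, tries) at `offset` places to the right of matching pairs.

    A "try" is a pair of positions holding identical symbols for which both
    offset positions exist; a "hit" is a try whose offset symbols also match.
    """
    hits = tries = 0
    n = len(symbols)
    for i, j in combinations(range(n), 2):
        if symbols[i] != symbols[j]:
            continue  # not a pair of identical central characters
        if j + offset >= n:
            continue  # assumption: skip pairs that run off the end
        tries += 1
        if symbols[i + offset] == symbols[j + offset]:
            hits += 1
    return hits, tries

# Tiny example; a real run would pass the full plaintext or cipher sequence.
hits, tries = offset_coincidence("abcabc", 1)
print(hits, tries)
```

Dividing hits by tries gives the positional kappa for that offset. The same function works on a list of cipher numerals as well as on a string of letters.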

The computer was also called upon to calculate the index of coincidence K1, K2 and K3 for Ciphers 1, 2 and 3 respectively. Cipher 1, when every possible comparison is made between the 520 enciphered characters, has 418 possible matches (called "hits") in 134,940 possible comparisons (called "tries"), whereby K1 = 0.310%. For every possible comparison between the 763 enciphered characters of Cipher 2, there are 2373 possible hits in 290,703 possible tries, whereby K2 = 0.816%. For every possible comparison between the 618 enciphered characters of Cipher 3, there are 894 possible hits in 190,653 possible tries, whereby K3 = 0.469%. Although the initial computation of these values was done by brute force of computer power, counting every possible comparison, Dr. Carl Hammer later pointed out to me that the same results could be achieved using the Gaussian equation h = n(n-1)/2 to calculate possible hits for each repeated enciphered character and to calculate possible tries for the total number of characters in each cipher.
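Dr. Hammer's shortcut can be sketched as follows. The function names are illustrative assumptions; the hit count of 418 and the cipher lengths are the ones quoted in the text, and the cipher sequences themselves are not reproduced here.

```python
from collections import Counter

def pair_count(n):
    """Number of distinct pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2

def coincidence_counts(symbols):
    """Return (hits, tries) over every possible pair of positions."""
    tries = pair_count(len(symbols))
    # Each symbol occurring c times contributes c(c-1)/2 matching pairs.
    hits = sum(pair_count(c) for c in Counter(symbols).values())
    return hits, tries

# Tries for the three cipher lengths quoted in the text.
for length in (520, 763, 618):
    print(length, pair_count(length))

# Kappa for Cipher 1, in percent, from the hit count quoted in the text.
print(round(100 * 418 / pair_count(520), 3))
```

The three pair counts agree with the tries quoted above, so the shortcut replaces the brute-force comparison of every pair with a single pass over the frequency counts.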

The values of K1, K2 and K3 are equivalent in their respective ciphers to K = 6.67% in plain text. From the ratios of the adjacent plaintext kappas KT+1, etc., to the general kappa of 6.67%, adjacent kappas can be calculated for each of the three ciphers by maintaining the same ratios with K1, K2 and K3. Thus, the adjacent kappas for the ciphers are approximately as follows:

For Cipher 1
K1+4 = 0.330% K1+3 = 0.363% K1+2 = 0.428% K1+1 = 0.613%
For Cipher 2
K2+4 = 0.869% K2+3 = 0.954% K2+2 = 1.126% K2+1 = 1.615%
For Cipher 3
K3+4 = 0.499% K3+3 = 0.548% K3+2 = 0.647% K3+1 = 0.928%
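The scaling behind these figures can be checked with a short sketch. The plaintext value for offset 3 is taken here as 7.8%, which is the figure that reproduces the tabulated cipher values; treat that, and the names used, as assumptions of this illustration.

```python
K_GENERAL = 6.67  # percent, plaintext kappa for remote positions

# Plaintext kappas by offset, in percent. The offset-3 value of 7.8 is an
# assumption: it is the figure that reproduces the tabulated cipher values.
K_PLAIN = {4: 7.1, 3: 7.8, 2: 9.2, 1: 13.2}

# General kappas for Ciphers 1, 2 and 3, in percent, as computed in the text.
K_CIPHER = {1: 0.310, 2: 0.816, 3: 0.469}

def adjacent_kappa(cipher, offset):
    """Scale a cipher's general kappa by the plaintext ratio for `offset`."""
    return K_CIPHER[cipher] * K_PLAIN[offset] / K_GENERAL

for cipher in (1, 2, 3):
    row = [f"K{cipher}+{k} = {adjacent_kappa(cipher, k):.3f}%" for k in (4, 3, 2, 1)]
    print("  ".join(row))
```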

Justification for such calculations is as follows: Assuming an enciphered English text, the existence of matching letter values in positions adjacent to identical central characters in the plaintext is a function only of the plaintext, not of the cipher. Thus it is a function only of the statistical probability of matching letter values in those adjacent positions in the plaintext. Then, assuming matching letter values, the existence of matching enciphered characters in the enciphered text is a function only of the statistical probability that identical plaintext characters will be represented by the same enciphered characters. Since these are independent statistical variables, their product represents the probability that matching adjacent-position plaintext values will occur and be represented by the same enciphered characters. To show that (K3+1)/K3 = (KT+1)/KT is an accurate statement, it must be shown that the product of two probabilities is represented by the result. It is clear that KT+1 represents the statistical probability of the occurrence of matching letters in immediately adjacent positions in the plaintext. Thus the product of two probabilities becomes K3+1 = (K3/KT)x(KT+1), and K3/KT must be shown to represent the probability that identical plaintext characters will be enciphered by the same enciphered characters. Since it is assumed that identical enciphered characters will always represent identical plaintext characters, all of the enciphered characters predicted by K3 must represent all of the plaintext characters predicted by KT. Thus, it must be true that K3/KT represents the stated probability and that the first statement given is accurate.

If Ciphers 1 and 3 are enciphered versions of English-language texts in the same fashion as is Cipher 2, one would expect that their indices of coincidence for positions adjacent to identical characters would approximate those calculated above. Determination of the probable deviations from these indices will be treated later, and in fact, that determination formed the major part of these computer analyses.

For the present, since there are 418 possible comparisons which can be made with identical central enciphered characters in Cipher 1, 2373 possible comparisons in Cipher 2, and 894 possible comparisons in Cipher 3, one would expect, on average, to obtain hits in an enciphered text as follows. Here, for example, H3+4 indicates hits in Cipher 3 at positions four away from identical central characters, and H3 indicates hits in Cipher 3 at positions remote from central characters, i.e., corresponding to K3. By example, H3+4 = 894 x K3+4.

For assumed average Ciphers 1, 2, and 3 (Ideal)
For Cipher 1
H1 = 1.3 H1+4 = 1.4 H1+3 = 1.5 H1+2 = 1.8 H1+1 = 2.6
For Cipher 2
H2 = 19.4 H2+4 = 20.5 H2+3 = 22.5 H2+2 = 26.6 H2+1 = 38.1
For Cipher 3
H3 = 4.2 H3+4 = 4.4 H3+3 = 4.9 H3+2 = 5.8 H3+1 = 8.3

Using the same notation, the following is obtained from the Beale ciphers as given. This indicates the number of hits in various positions.

For Ciphers 1, 2 and 3 as given in Attachments 1, 2 and 3 respectively
For Cipher 1
H1+4 = 2 H1+3 = 1 H1+2 = 2 H1+1 = 1
For Cipher 2
H2+4 = 20 H2+3 = 19 H2+2 = 27 H2+1 = 60
For Cipher 3
H3+4 = 6 H3+3 = 4 H3+2 = 5 H3+1 = 3

The most appropriate way to compare the expected hits in four adjacent positions for an enciphered text with the hits actually obtained would appear to be by calculation of a coefficient of correlation. The method of calculation of the coefficient of correlation is shown in Attachment 5. It is noted, however, that the coefficient of correlation compares only the shapes of the curves formed by the two sets of data and ignores the amplitude differences of the curves. To some extent, this may be an advantage in that, if the text underlying a cipher has different adjacent kappa values from the 4000-character sample which provided the values used herein, and if those different adjacent kappa values are approximately proportional to those used herein, no large error should result.

When the average or ideal number of hits for Cipher 1 is compared with the actual (called "original") Cipher 1, the coefficient of correlation is found to be -0.478. The coefficient of correlation for Cipher 2 is +0.984 and for Cipher 3 is -0.805. It is obvious that the degree of correlation for Cipher 2 is high and the degrees of correlation for Ciphers 1 and 3 are low. However, there is no apparent way to calculate directly the degree of variation in the coefficients of correlation which should be expected for different enciphered texts.
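The calculation in Attachment 5 is not reproduced here. As a sketch, assuming it is the standard Pearson product-moment coefficient, the comparison of expected and observed hits can be written as follows, using the hit values tabulated above in the order +4, +3, +2, +1.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Expected ("ideal") and observed hits from the tables in the text.
expected = {1: [1.4, 1.5, 1.8, 2.6], 2: [20.5, 22.5, 26.6, 38.1], 3: [4.4, 4.9, 5.8, 8.3]}
observed = {1: [2, 1, 2, 1], 2: [20, 19, 27, 60], 3: [6, 4, 5, 3]}

for cipher in (1, 2, 3):
    print(cipher, round(pearson(expected[cipher], observed[cipher]), 3))
```

Under that assumption the sketch agrees with the three coefficients quoted above to within rounding.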

A solution to this problem, involving hundreds of hours on a microcomputer, was to produce thirty purely random rearrangements of the cipher numerals of each cipher and an additional thirty rearrangements of each cipher which were random except that each cipher was constrained to encipher a text. The purpose of these random and constrained random ciphers was to provide samples upon which to test indirectly that which apparently cannot be tested directly -- the probable deviations of the number of hits obtained in texts which have been enciphered using the sets of numerals used in the three ciphers and in randomly shuffled sets of those numerals. The purpose of these tests was to make it possible to determine whether it is more probable that the numbers and positional arrangement of hits (as measured by the coefficients of correlation) actually obtained in the three cipher texts would have occurred in a random arrangement of the set of cipher numerals or in a set of those cipher numerals used to encipher an actual English-language plaintext.

In any constrained random version, if cipher numeral "818" represents "R" in any occurrence of the numeral, it represents "R" for all occurrences of the numeral throughout the version. Since a text was needed for the constrained random version, the Beale-message text from Attachment 4 (or the first 520 or 618 characters of it) was used as the text which the constrained random versions were constrained to encipher. The computer was instructed to assign cipher numerals at random to letters in the text, starting with the most frequently occurring numeral and assigning it to a randomly chosen text letter which occurs with at least the same frequency, assigning the rest of the occurrences of that numeral to randomly chosen occurrences of the same text letter, removing the text letters thus enciphered from the list of those still available for enciphering, then proceeding to the next most frequent numeral or a randomly chosen numeral of the same frequency, and repeating the process of assigning numerals to letters until all numerals with a frequency of at least two have been assigned to letters. For speed of computation and because the result is not affected, numerals with only one occurrence in the cipher are assigned to the remaining letters in the most expeditious non-random fashion.
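The assignment procedure just described might be sketched as below. This is a hypothetical reconstruction, not the original program: the function name, the use of Python's `random.Random`, and the decision to raise an error when no letter has enough free occurrences are all assumptions.

```python
import random
from collections import Counter, defaultdict

def constrained_shuffle(numerals, text, rng):
    """Rearrange `numerals` so each repeated numeral always falls on one letter."""
    if len(numerals) != len(text):
        raise ValueError("cipher and text must have the same length")
    free = defaultdict(list)  # unassigned text positions, grouped by letter
    for pos, letter in enumerate(text):
        free[letter].append(pos)
    result = [None] * len(text)
    counts = Counter(numerals)
    # Most frequent numerals first; ties are broken at random.
    repeated = [n for n, c in counts.items() if c >= 2]
    rng.shuffle(repeated)
    repeated.sort(key=lambda n: -counts[n])  # stable sort keeps the random tie order
    for numeral in repeated:
        need = counts[numeral]
        candidates = [l for l, ps in free.items() if len(ps) >= need]
        if not candidates:
            # Assumption: the caller simply retries with fresh random choices.
            raise RuntimeError("no letter has enough free occurrences")
        letter = rng.choice(candidates)
        chosen = set(rng.sample(free[letter], need))
        for pos in chosen:
            result[pos] = numeral
        free[letter] = [p for p in free[letter] if p not in chosen]
    # Single-occurrence numerals fill the remaining positions in order.
    singles = [n for n, c in counts.items() if c == 1]
    remaining = sorted(p for ps in free.values() for p in ps)
    for pos, numeral in zip(remaining, singles):
        result[pos] = numeral
    return result

demo = constrained_shuffle([7, 7, 7, 9, 9, 4, 4, 1, 2, 3, 5, 6], "aaaabbbbccdd", random.Random(0))
print(demo)
```

The result uses exactly the same numerals as the input, and every repeated numeral lands on a single letter of the text, which is the constraint the test versions were required to satisfy.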

When these rearranged versions were derived, the numbers of hits in each of the four adjacent positions were calculated and the coefficients of correlation were obtained. The results for Ciphers 1, 2 and 3 for thirty runs of random cipher and thirty runs of Beale-text constrained random cipher are given respectively in Attachment 6, Attachment 7, and Attachment 8. The results for Ciphers 1, 2 and 3 are plotted in ascending order of coefficients of correlation respectively in Attachment 9, Attachment 10 and Attachment 11.

Examination of the graph for Cipher 2 in Attachment 10 shows that the coefficient of correlation for original Cipher 2 falls well within the range of the constrained random versions. This is as expected, since original Cipher 2 does in fact encipher a text.

The graph in Attachment 9 shows that the -0.478 coefficient of correlation for original Cipher 1, while on the low end, occurs well within the range of the constrained random versions of Cipher 1. It is impossible to determine from this test whether or not Cipher 1 enciphers a text.

However, it is a different situation with the graph in Attachment 11 for Cipher 3. The -0.805 coefficient of correlation for original Cipher 3 is well outside the range of coefficients of correlation obtained with Beale-text constrained random versions of Cipher 3. It can be said from this graph that it is improbable that any cipher arrangement using the same numerical values as Cipher 3 can encipher a text and still have a coefficient of correlation as low as does original Cipher 3 (assuming the order given in Attachment 3 and the general enciphering scheme used with Cipher 2).

Before attempting to determine how improbable it is that Cipher 3 enciphers a text, it seemed desirable to run thirty more tests for each of three additional texts, to reduce the chance that some anomaly in the Beale text was responsible for the range of coefficients of correlation obtained in the earlier tests. The three additional texts chosen were the Gettysburg Address, the list of signers of the Declaration of Independence, and selected members of my graduating class from the University of Virginia (selected for names and addresses that could have occurred in 1822).

Attachment 12 is a graph of the coefficients for the 120 constrained random versions of Cipher 3, plotted in ascending order of coefficients of correlation. The lowest coefficient of correlation of the 120 versions was -0.773, which is still not as low as the -0.805 in original Cipher 3. It thus appears that there is something less than one chance in one hundred of achieving a coefficient of correlation as low as -0.805 using the numerical values provided in Cipher 3 to encipher an actual text. It appears to be highly improbable that Cipher 3 as given enciphers a text in the same manner as Cipher 2 does (i.e., that Cipher 3 is to be read in the order given and that each numerical value in the enciphered text, regardless of how many times it occurs, enciphers the same alphabetical character at each occurrence).

From the results of the tests on Cipher 1, it is impossible to state whether Cipher 1 enciphers a text. However, considering the stated purposes of Ciphers 1 and 3, where Cipher 1 was to tell a trustee where to find a buried treasure and Cipher 3 was to tell the trustee how to distribute the treasure, it appears inherently unlikely that Beale left a message in Cipher 1 telling how to find the treasure and neglected to leave a message in Cipher 3 telling how to distribute it.

It has been suggested that the ciphers are to be read in some other order than the one given in order to decipher them. For example, one might read every other numeral beginning with the first numeral to the end of the cipher, then read every other numeral beginning with the second numeral to the end of the cipher. But one set of facts seems to suggest that this is not a correct course of action.

Considering the care which was taken by Beale (or some person using that name) to use many different numerical values to represent the same letters, it is unlikely that Beale would use the same numerical value in two immediately adjacent positions to represent the same letter, because to do so would make the cipher easier to break. An examination of Cipher 2 shows that no numerical value is repeated in an immediately adjacent position or in a position having only one intervening position (called "semi-adjacent").

Likewise, examination of Ciphers 1 and 3 shows that, in the order in which the ciphers were originally written, there are no numerical values repeated in immediately adjacent positions or in semi-adjacent positions. If the ciphers are intended to be read in some other order in which they encipher a message, then the "original" ciphers would have to be a substantially random rearrangement of the message order. Tests of a large number of pure random versions of Ciphers 1 and 3 show that there is about a 20% chance of achieving a random rearrangement of Cipher 1 having no numerical values repeated in immediately adjacent positions, an independent 20% chance of achieving a random rearrangement of Cipher 1 having no numerical values repeated in semi-adjacent positions, and an approximately 7% chance of each of those accomplishments with Cipher 3. Thus, assuming that the existing situation of having no repeated numerical values in immediately adjacent or semi-adjacent positions in Ciphers 1 and 3 is a desired end, there is only about one chance in 5000 (0.20 x 0.20 x 0.07 x 0.07) that this desired end was achieved by rearranging enciphered messages which were enciphered in other orders. It thus appears unlikely that Ciphers 1 and 3 are intended to be read in some order other than that in which they are generally written.
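A test of this kind, and the arithmetic behind the combined odds, can be sketched as follows. The function names and the trial loop are assumptions of this illustration; the four rates are the ones quoted in the text and are treated as independent, as the text does.

```python
import random

def has_repeat_at(numerals, gap):
    """True if any numeral recurs exactly `gap` positions later.

    gap=1 tests immediately adjacent positions, gap=2 semi-adjacent ones.
    """
    return any(a == b for a, b in zip(numerals, numerals[gap:]))

def fraction_without_repeats(numerals, gap, trials, rng):
    """Estimate how often a pure random shuffle shows no repeat at `gap`."""
    shuffled = list(numerals)
    clean = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if not has_repeat_at(shuffled, gap):
            clean += 1
    return clean / trials

# Combined chance from the four quoted rates, treated as independent.
combined = 0.20 * 0.20 * 0.07 * 0.07
print(combined, round(1 / combined))
```

Calling `fraction_without_repeats` on a cipher's numerals with `gap=1` and `gap=2` would give estimates comparable to the 20% and 7% rates reported above.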

For those who are not daunted by 5000 to one odds, there are three rearrangements of Cipher 3 which offer no immediately adjacent repetitions of the same cipher numerals (there are semi-adjacent repetitions) and which offer relatively high coefficients of correlation. The first method of rearrangement is to write the numerals left to right in two columns, then read down the columns successively (which is equivalent to reading every other numeral beginning with the first through to the end, then reading every other numeral beginning with the second through to the end). This gives hits of 4 3 6 5, for a coefficient of correlation of +0.492. The second and third methods of rearrangement involve writing the cipher in eleven columns, then reading down the columns. In the second method, all spaces in the otherwise rectangular arrangement are left in the last row, and in the third method, all spaces are left in the last column. The second method gives hits of 3 2 3 3, for a coefficient of correlation of +0.365, and the third method gives hits of 1 1 2 2, for a coefficient of correlation of +0.800. Although there are several rearrangements of Cipher 1 which offer no immediately adjacent repetitions of cipher numerals, the two which appear to have the best coefficients of correlation available are the two-column rearrangement, giving hits of 0 1 2 2 and a coefficient of correlation of +0.752, and a sixteen-column rearrangement with all spaces in the last column, giving hits of 1 1 1 3 and a coefficient of correlation of +0.950. These rearrangements would appear to provide the best chance of solving the ciphers, assuming that there is something there to solve.
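The column rearrangements can be sketched as follows. The function names are hypothetical. The placement of the short rows at the bottom of the grid in the "last column" variant is one reading of the description and may not match the arrangement actually used.

```python
from math import ceil

def read_columns_gaps_in_last_row(numerals, columns):
    """Write row by row into `columns` columns, then read down each column."""
    return [x for c in range(columns) for x in numerals[c::columns]]

def read_columns_gaps_in_last_column(numerals, columns):
    """As above, but all empty cells are pushed into the last column.

    Assumption: the bottom rows are written one cell short, so that only
    the last column ends up shorter than the others.
    """
    n = len(numerals)
    rows = ceil(n / columns)
    gaps = rows * columns - n
    if gaps > rows:
        raise ValueError("too many gaps to fit in a single column")
    grid, start = [], 0
    for r in range(rows):
        width = columns if r < rows - gaps else columns - 1
        grid.append(numerals[start:start + width])
        start += width
    return [row[c] for c in range(columns) for row in grid if c < len(row)]

sample = list(range(1, 8))
print(read_columns_gaps_in_last_row(sample, 2))
print(read_columns_gaps_in_last_column(sample, 3))
```

With two columns, the first function reads every other numeral beginning with the first and then every other numeral beginning with the second, matching the equivalence stated above.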

My personal conclusion, based upon the statistical values set forth above, is that there is no content to these ciphers and that someone, perhaps in the nature of a practical joke, chose the values of Ciphers 1 and 3 substantially at random, limited only by the restriction that no numerical value would be repeated immediately adjacent or semi-adjacent to a similar value.
