Ads on New York’s and London’s subway systems for training in shorthand writing used to promise that “if u cn rd ths u cn gt a gd job.” Shorthand depends on the inherent redundancy of conventional writing, and to quantify this redundancy the founder of information theory, Claude Shannon, devised an experiment.1 He took a sentence at random from a Raymond Chandler novel and had an assistant guess each letter in turn. The assistant was told when the guess was correct and was told the correct letter when the guess was wrong. After the first few stabs in the dark, the assistant had the benefit of knowing all of the letters prior to the one being guessed. As we might expect, although the first letters of words were frequently wrong—who knows what word Chandler might use next?—once a letter or two were in place the remainder of each word was often easily guessed. Shannon concluded that, overall, English prose is about 75 percent redundant.
In this context, “redundancy” means predictability: after the letter t the letter h is much more likely to follow than x is, and directly after q the appearance of u is almost a certainty. The more likely a particular combination, the less information it carries: the u after a q is almost completely redundant. Shannon developed the mathematics for quantifying the information carried by any message in any coding system, and it works just as well for sequences of words in a sentence as for sequences of letters in a word. Shannon’s work allows us to quantify writers’ preferences for putting particular words in particular orders or, more generally, for placing them in proximity to one another.
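The link between predictability and information can be made concrete with a small sketch. The function below is illustrative only (it is not Shannon's own procedure, and the sample sentence is invented): it estimates, from bigram counts in a text, how many bits of information a letter carries given the letter before it. A continuation that is nearly certain, like u after q, carries close to zero bits.

```python
import math
from collections import Counter

def conditional_surprisal(text, context, letter):
    """Estimate the information (in bits) carried by `letter` when it
    follows `context`, using bigram counts taken from `text`.
    A sketch only: unseen pairs would raise an error here."""
    text = text.lower()
    pairs = Counter(zip(text, text[1:]))
    follows = {b: n for (a, b), n in pairs.items() if a == context}
    total = sum(follows.values())
    p = follows[letter] / total
    return -math.log2(p)  # highly predictable letters carry few bits

sample = "the quick brown fox thought that the queen quit"
# In this sample every q is followed by u, so u after q carries 0 bits,
# while less predictable continuations carry more.
```

Summing such per-letter information over a text and comparing it with the maximum possible (every letter equally likely) is, in outline, how a redundancy figure like Shannon's 75 percent is reached.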
When attributing authors to works of unknown authorship, evidence from favored phrases (particular words in particular orders) and from collocations (particular words appearing near one another) is frequently employed, although just how to isolate these characteristic phrases and collocations is not agreed upon. MacDonald P. Jackson’s approach is to take three- and four-word phrases [End Page 232] from the text to be attributed and to search for them in various parts of the Literature Online (LION) database, typically confining his searches to the “Drama” section’s holdings for plays first performed between 1590 and 1610.2 Phrases frequently found in other writers’ canons Jackson discards, and for the remaining rare phrases he counts the number of occurrences in each canon. After adjusting for the differing sizes of the canons—Shakespeare’s is so large that all other things being equal he would get more hits for that reason alone—Jackson looks for any one writer predominating in the hit list. If one writer has disproportionately more hits than the others, Jackson considers this reasonable evidence that that writer is the author of the text to be attributed. Brian Vickers’s method is essentially the same except that, rather than running every short phrase in his sample text through the search engine by hand, he relies on plagiarism detection software to find the matches between the sample and the large corpus of solidly attributed works. And instead of searching in LION, he uses a private database of electronic texts compiled by Marcus Dahl.3
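The rare-phrase procedure just described can be sketched in miniature. The code below is an illustrative simplification, not Jackson's or Vickers's actual pipeline (it uses toy in-memory "canons" rather than LION or Dahl's database, and the cutoff for rarity is an invented parameter): extract the sample's n-word phrases, discard those spread across many canons, count the survivors' hits per canon, and normalize for canon size.

```python
from collections import Counter

def word_ngrams(text, n):
    """All n-word phrases in the text, as a set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def attribution_scores(sample, canons, n=3, rarity_cutoff=1):
    """Score each candidate canon by its matches for the sample's rare
    n-word phrases, normalized per 1,000 words of canon."""
    phrases = word_ngrams(sample, n)
    # How many canons contain each of the sample's phrases; phrases
    # found in more than `rarity_cutoff` canons are discarded as common.
    spread = Counter(p for text in canons.values()
                     for p in word_ngrams(text, n) if p in phrases)
    rare = {p for p in phrases if spread.get(p, 0) <= rarity_cutoff}
    scores = {}
    for author, text in canons.items():
        hits = sum(1 for p in rare if p in word_ngrams(text, n))
        scores[author] = 1000 * hits / len(text.split())  # size adjustment
    return scores
```

The normalization step corresponds to Jackson's adjustment for differing canon sizes: without it, the largest canon would predominate for that reason alone.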
Rather than searching for relatively rare phrases, it is possible to search automatically for quite common phrases or to compare the rates of occurrence of various rare or common words. Hugh Craig, John Burrows, and Arthur Kinney have had considerable success with the last of these approaches.4 Vickers [End Page 233] has strongly condemned counting the frequencies of individual words on the grounds that
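Counting the rates of common words can also be sketched briefly. The code below is a simplified, Burrows-style Delta distance, offered as an illustration rather than as Craig, Burrows, and Kinney's actual implementation (the marker-word list and the toy texts are invented): each candidate canon is profiled by the relative frequency of a fixed set of very common words, the rates are standardized as z-scores, and the sample is attributed to whichever canon sits closest.

```python
import math
from collections import Counter

# An invented, very short marker list; real studies use far more words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]

def word_rates(text):
    """Relative frequency of each marker word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in FUNCTION_WORDS]

def delta_distances(sample, canons):
    """Simplified Burrows-style Delta: mean absolute difference of
    z-scored common-word rates between the sample and each canon."""
    profiles = {author: word_rates(text) for author, text in canons.items()}
    cols = list(zip(*profiles.values()))
    mu = [sum(col) / len(col) for col in cols]
    sd = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1e-9
          for col, m in zip(cols, mu)]
    def z(rates):
        return [(r - m) / s for r, m, s in zip(rates, mu, sd)]
    zs = z(word_rates(sample))
    return {author: sum(abs(a - b) for a, b in zip(zs, z(p)))
                    / len(FUNCTION_WORDS)
            for author, p in profiles.items()}
```

Because the marker words are so common, every author supplies abundant evidence for each one, which is what makes the method statistically robust; Vickers's objection, quoted next, targets exactly this isolation of single words from their phrasal company.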
words are not independent but interdependent: one word looks for another. A typical noun phrase includes a substantive, a definite or indefinite article, and a modifier, such as an adjective or superlative. . . . Each of these word classes needs the others. To separate them out reduces language to a severely limited lexicon.5
One reason that Craig, Burrows, and Kinney count words rather than phrases...