The genome of every organism is composed of millions of small chemical units called nucleobases, whose arrangement along a strand of DNA provides the recipe for the development of the organism. 1 Bioinformaticists analyze patterns of bases in DNA, and by treating bases as an alphabet and genomes as texts have reinvented a number of techniques originally developed by philologists. 2 But the sheer size of genomic texts has also forced bioinformatics to move beyond traditional philological methods and to use information processing and statistical techniques to analyze patterns that are otherwise too large or too subtle to be noted by the unaided eye. We have adapted some of these computer-aided methods from bioinformatics for the purpose of analyzing medieval texts, and demonstrate in this paper that our "lexomic" methods can be used to detect relationships between, and structures within, poetic texts in the Old English corpus. Specifically, we show that our methods can recognize the relationship between Daniel and Azarias and the divisions between Genesis A and B as well as those between Guthlac A and B and between Christ I, II, and III. We also demonstrate how lexomics can be used to shed light on the possible Cynewulfian affinity of Guthlac B and conclude that certain peculiarities in branching diagrams may be diagnostic of the existence of outside sources for Old English poems.
The term "lexomics" was originally coined to describe the computer-assisted detection of "words" (short sequences of bases) in genomes. 3 When applied to literature, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. The term "lexomics" is perhaps even more applicable to the specific use we make of it here than in its original context, if only because a "word" in a written language is an obvious, well-defined, and relatively uncontroversial category. 4
Our methods are implemented as a series of programs or scripts written in the language Perl, freely available at the project website (http://lexomics.wheatoncollege.edu). The open-source scripts enable a researcher to: (i) sort texts into directories (poetry vs. prose, then by manuscript), (ii) "cut" texts into equal-sized chunks, (iii) count the number of times every word occurs in each text or chunk, (iv) generate statistical summaries of texts or chunks, and (v) prepare counts of words for classification and/or cluster analyses.
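The cutting and counting steps, (ii) and (iii), can be sketched in a few lines. The project's actual tools are Perl scripts available at the website cited above; the Python outline below is only illustrative, and all function names are our own rather than those of the lexomics scripts:

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into words: lowercase strings bounded by whitespace,
    with punctuation stripped (a simplification of the corpus convention)."""
    words = []
    for token in text.lower().split():
        word = re.sub(r"[^\w]", "", token)  # \w keeps letters such as æ, ð, þ
        if word:
            words.append(word)
    return words

def cut_into_chunks(words, chunk_size):
    """Step (ii): divide the token stream into consecutive equal-sized
    chunks (the final chunk may be shorter)."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def count_words(chunk):
    """Step (iii): count how many times each word occurs in a chunk."""
    return Counter(chunk)

# Illustrative use on the opening words of Beowulf:
words = tokenize("Hwæt! We Gardena in geardagum")
# words == ['hwæt', 'we', 'gardena', 'in', 'geardagum']
```

Equal-sized chunks (rather than, say, chunks per manuscript section) keep the word counts directly comparable across a poem, which matters for the cluster analyses in step (v).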
We first identify and count the words in a text. Our project focuses on texts in Old English where, fortunately, a complete corpus of Old English poetry has been assembled by the Dictionary of Old English project, greatly simplifying an otherwise onerous task. 5 The Dictionary of Old English identifies individual words as strings of characters bounded by white space. Our software tabulates all the words in any group of texts and calculates the number of unique words and the number of words that appear only once. 6 The table of words from any group of texts is created as comma-delimited data and can be used in any spreadsheet application (e.g., Microsoft Excel). These data can then be analyzed with various statistical techniques.
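The tabulation just described can be sketched as follows. Again this is an illustrative Python rendering, not the project's Perl implementation, and the text names are hypothetical; it produces a comma-delimited word-frequency table (one row per word, one column per text or chunk) plus the summary counts of unique words and once-occurring words:

```python
import csv
import io
from collections import Counter

def word_table(named_texts):
    """Build a comma-delimited table of word frequencies across a group
    of texts or chunks, suitable for opening in a spreadsheet."""
    counters = {name: Counter(text.lower().split())
                for name, text in named_texts.items()}
    vocab = sorted(set().union(*counters.values()))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["word"] + list(counters))          # header row
    for word in vocab:
        writer.writerow([word] + [counters[name][word]  # 0 if absent
                                  for name in counters])
    return out.getvalue()

def summary(text):
    """Return (total words, unique words, words occurring only once)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    unique = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    return total, unique, once
```

For example, `summary("se cyning se")` reports three words, two of them unique, one of which ("cyning") occurs only once.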
It is important to note some complexities that arise when analyzing any text computationally, as well as some problems unique to the Old English corpus. First, there are the problems associated with using edited or normalized texts rather than diplomatic editions. For example, the Dictionary of Old English corpus uses Arnold Schröer's edition of the Rule of St. Benedict for its electronic version of that text. Schröer collated five manuscripts dating from the end of the tenth to the beginning of the twelfth century, so his edition does not reflect any single extant manuscript. 7 Any conclusions drawn from analysis of the text in the DOE corpus are therefore in part dependent upon the editorial decisions and normalizations made by Schröer. Researchers can...