
The genome of every organism is composed of millions of small chemical units called nucleobases, whose arrangement along a strand of DNA provides the recipe for the development of the organism. 1 Bioinformaticists analyze patterns of bases in DNA, and by treating bases as an alphabet and genomes as texts have reinvented a number of techniques originally developed by philologists. 2 But the sheer size of genomic texts has also forced bioinformatics to move beyond traditional philological methods and to use information processing and statistical techniques to analyze patterns that are otherwise too large or too subtle to be noted by the unaided eye. We have adapted some of these computer-aided methods from bioinformatics for the purpose of analyzing medieval texts, and demonstrate in this paper that our "lexomic" methods can be used to detect relationships between, and structures within, poetic texts in the Old English corpus. Specifically, we show that our methods can recognize the relationship between Daniel and Azarias and the divisions between Genesis A and B as well as those between Guthlac A and B and between Christ I, II, and III. We also demonstrate how lexomics can be used to shed light on the possible Cynewulfian affinity of Guthlac B and conclude that certain peculiarities in branching diagrams may be diagnostic of the existence of outside sources for Old English poems.

Lexomic Methods

The term "lexomics" was originally coined to describe the computer-assisted detection of "words" (short sequences of bases) in genomes. 3 When applied to literature, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. The term "lexomics" is perhaps even more applicable to the specific use we make of it here than in its original context, if only because a "word" in a written language is an obvious, well defined, and relatively uncontroversial category. 4

Our methods are implemented as a series of programs or scripts written in the language Perl, freely available at the project website (http://lexomics.wheatoncollege.edu). The open-source scripts enable a researcher to: (i) sort texts into directories (poetry vs. prose, then by manuscript), (ii) "cut" texts into equal-sized chunks, (iii) count the number of times every word occurs in each text or chunk, (iv) generate statistical summaries of texts or chunks, and (v) prepare counts of words for classification and/or cluster analyses.

We first identify and count the words in a text. Our project focuses on texts in Old English where, fortunately, a complete corpus of Old English poetry has been assembled by the Dictionary of Old English project, greatly simplifying an otherwise onerous task. 5 The Dictionary of Old English identifies individual words as strings of characters bounded by white space. Our software tabulates all the words in any group of texts and calculates the number of unique words and the number of words which appear only once. 6 The table of words from any group of texts is created as comma-delimited data and can be used in any spreadsheet application (e.g., Microsoft Excel). These data can then be analyzed with various statistical techniques.
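
The tabulation step can be illustrated with a minimal Python sketch. It is not the project's Perl code: the function names are hypothetical, and the sample string is an invented toy built from a few words of Azarias with two repetitions added so that the counts are not all one.

```python
import csv
import io
from collections import Counter


def tabulate_words(text):
    """Count words, treating a word as a string bounded by white space."""
    counts = Counter(text.split())
    n_unique = len(counts)                                  # distinct words
    n_once = sum(1 for c in counts.values() if c == 1)      # words seen only once
    return counts, n_unique, n_once


def to_csv(counts):
    """Render the counts as comma-delimited rows, most frequent word first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "count"])
    # Sort by descending count, then alphabetically by code point for ties.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, count])
    return buffer.getvalue()


# Toy input (an assumption for illustration, not a real passage).
sample = "hleoþrede halig þurh hatne lig dreag dædum georn dryhten herede þurh lig"
counts, n_unique, n_once = tabulate_words(sample)
print(n_unique, n_once)
print(to_csv(counts))
```

The comma-delimited output can be opened in a spreadsheet application or passed to a statistics package, as the paragraph above describes.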

Table 1. 10 most frequently occurring words in the poem Azarias.

Corpus-Specific Variations

It is important to note some complexities that must be dealt with when analyzing any texts as well as some problems unique to the Old English corpus. First, there are the problems associated with using edited or normalized texts versus diplomatic editions. For example, the Dictionary of Old English corpus uses Arnold Schröer's edition of the Rule of St. Benedict for its electronic version of that text. Schröer collated five manuscripts dating from the end of the tenth to the beginning of the twelfth century, so his edition does not reflect any single extant manuscript. 7 Any conclusions drawn from analysis of the text in the DOE corpus are therefore in part dependent upon the editorial decisions and normalizations made by Schröer. Researchers can modify the DOE files to make them consistent with any one manuscript and then perform their research on the modified files, and there may be good reasons for using edited texts (for example, the removal of obvious errors), but scholars must determine on a case-by-case basis which form of the text is most suited to answering a given set of questions.

Second is the problem of orthographic or spelling variations within texts. There is some difficulty in determining which variations are meaningful in relation to a given question. For example, Old English uses two different letters, thorn (þ) and eth (ð), to indicate both voiced and unvoiced interdental fricatives. Most scholars of Old English agree that there is not a consistent pattern in the use of thorn and eth, suggesting that for many questions, orthographic variation may be meaningless because there is no phonetic difference between þa and ða, particularly in manuscripts that were copied later in the Anglo-Saxon period. However, since thorn did enter Old English orthography earlier than eth, the difference between þa and ða could be relevant for some studies. A similar issue arises from the representation of Tironian note, 7, which could be expanded in Old English to either and or ond. The proportion of and versus ond in a text can indicate dialect, but there are many texts in which scribes seem to alternate between the two without following a consistent pattern. Therefore, expanding the note as either and or ond changes the data, potentially obscuring or creating different patterns than those present before. No simple solution exists for these issues; researchers must make their own determinations for different questions, and to facilitate this customization of research, we have created programs which allow for, but do not require, consolidating all eths to thorns, collapsing all Tironian notes, ands and onds to and, and eliminating all tagged terms (for example, phrases in languages other than Old English, corrections, and additions) from the texts. Researchers can choose which, if any, of these consolidations to use in any inquiry. 8
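
The optional consolidations just described can be sketched as follows. This is an illustration in Python, not the project's own programs: the function name and flags are hypothetical, the Tironian note is assumed to appear as the token "7" (as it is written above) or as "⁊", and the removal of tagged terms is omitted because the tagging format is not specified here.

```python
def consolidate(words, eth_to_thorn=False, merge_and=False):
    """Return a new word list with the requested consolidations applied.

    Both options default to off, so the unmodified text is the default.
    """
    result = []
    for word in words:
        if eth_to_thorn:
            # Replace eth with thorn in both lower and upper case.
            word = word.replace("ð", "þ").replace("Ð", "Þ")
        if merge_and and word in ("7", "⁊", "ond"):
            # Collapse the Tironian note and "ond" into "and".
            word = "and"
        result.append(word)
    return result


line = ["Ða", "7", "ond", "and", "hleoðrade"]
print(consolidate(line))
print(consolidate(line, eth_to_thorn=True, merge_and=True))
```

Keeping each consolidation behind its own flag mirrors the point made above: the researcher, not the software, decides which variations are meaningful for a given question.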

It is also important to note that the words in the tables created and counted by our program are not lemmatized. Instead, morphological and grammatical variants are treated as different words; for example, cyning, cyninge, and cyninga are all counted separately, as is kyning. There are a variety of reasons for avoiding lemmatization at this time. First, lemmatization introduces problems of tedious mark-up and individual (possibly controversial) judgments that we hope to avoid by the use of information processing tools. Scholars must decide if, for the purpose of a specific inquiry, kyning is functionally equivalent to cyning; perhaps the difference in the spelling of the initial sound is actually relevant data. Embedding lemmatization in the original sorting and counting makes it impossible to account for variations of kyning at a later point in the inquiry. Second, in poetry and perhaps in prose as well, inflected forms of words are almost certainly relevant data: the dative plural, for example, could be used in certain poetic environments and not in others, and so consolidating it to the nominative singular lemma of a word runs the risk of eliminating valuable data. Finally, as we discuss below, the analyses performed with unlemmatized texts yield results consistent with traditional philological analyses, suggesting that for these particular types of problems, lemmatization is not necessary. 9

Measuring and Depicting Similarity and Difference

Computational stylometric analysis is a comparatively new approach to literary analysis, but in the past two decades much important work has been done. For example, John F. Burrows pioneered stylometric methods, employing statistical analyses that use collections of function words to build textual "signatures," which he originally used to attribute authorship to a set of English Restoration poems. 10 Burrows's method takes the most common words in a given corpus of text, counts the words in each text, standardizes each count using means and standard deviations of word-use between texts, and stores the standardized word counts for each text into an array of numbers. He then uses statistical methods to analyze these data. David Hoover has further refined Burrows's methods and applied them to prose in third-person American novels. 11

Because there is a quite limited stock of named Anglo-Saxon authors to whom poems might be attributed, our methods have developed somewhat differently than those of Burrows and Hoover. Rather than using only the most common words, we use all the words in the texts being studied. Similar to Burrows and Hoover, we compute the relative frequency (i.e. proportion) of each word within each text or chunk of text for every word in the entire collection of texts. If there are 1000 words in a text and ond appears 50 times, we record 50/1000 = 0.05 as the relative frequency of ond in that text. Similarly, if weard does not appear in a particular text but does appear somewhere in the collection of texts, then we record 0/1000 = 0 for the relative frequency of weard in that text. The result is an n-dimensional array for each text, where n represents the number of distinct words used in the entire collection of texts being studied. 12
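
The construction of these n-dimensional arrays can be sketched in Python. The function name and the two toy texts are assumptions made for illustration; the toy texts are padded with a placeholder word so that the ond and weard figures match the worked numbers above, and every text is assumed to be non-empty.

```python
from collections import Counter


def relative_frequencies(texts):
    """texts maps a name to a list of words.

    Returns the shared vocabulary (sorted) and, for each text, a vector of
    proportions with one entry per word in that vocabulary.
    """
    vocabulary = sorted({word for words in texts.values() for word in words})
    vectors = {}
    for name, words in texts.items():
        counts = Counter(words)          # missing words count as 0
        total = len(words)
        vectors[name] = [counts[word] / total for word in vocabulary]
    return vocabulary, vectors


# Toy collection: "filler" stands in for the other 950 (or 990) words.
texts = {
    "textA": ["ond"] * 50 + ["filler"] * 950,
    "textB": ["weard"] * 10 + ["filler"] * 990,
}
vocabulary, vectors = relative_frequencies(texts)
position = {word: i for i, word in enumerate(vocabulary)}
print(vectors["textA"][position["ond"]])    # proportion of ond in textA
print(vectors["textA"][position["weard"]])  # weard is absent from textA
```

Because every vector is indexed by the same vocabulary, any two texts can be compared entry by entry, which is what the distance calculation requires.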

At this point any number of statistical analyses can be used, the two most common being principal component analysis and cluster analysis. We use the free implementation of hierarchical, agglomerative cluster analysis 13 within the statistical software package, R, 14 to group the texts and create branching diagrams (dendrograms) of their relationships. This clustering method uses a dissimilarity (or distance) metric between texts for the grouping of texts without prespecifying the number of groups. In the analyses presented in this paper, we employ the most commonly used distance metric, Euclidean distance, a multidimensional extension of Pythagoras's theorem for right triangles. This metric makes use of all n words in a collection of texts to measure the dissimilarity between two texts. 15 The distance (or dissimilarity) measure is computed for each pair of texts among T texts, resulting in T×(T-1)/2 distances, which are then used to create groupings, or clades, 16 of texts by clustering texts that are most similar. If we want to compare four texts, we first list all the words in each text and calculate the relative frequency of each word in each text. We then compute (4×3)/2=6 distances, one for each pair of texts. For each pair, we calculate the difference between the proportion of a word's use in text1 and text2, square that difference, and total the squared differences across all words. The distance, then, is the square root of that total. 17 Then, to measure the distance between clades, we use the Euclidean distance between the multidimensional averages of the two clades. We then use hierarchical agglomerative clustering to order these distances to construct a dendrogram. The dissimilarity between clades is represented by the vertical length of the line connecting the clades. 18 This graphical representation of the distances indicates that texts 3 and 4 in Figure 1 are most similar; text 2 is closer to the clade, γ, which contains both 3 and 4; and text 1 is least like the other texts. In this example, the vertical distance between 3 and 4 is very small, indicating that they are very similar, while the vertical distance between 1 and clade β is much larger, indicating that 1 is quite different from the remaining texts.
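
The distance calculation and the agglomerative grouping can be sketched in plain Python. This is an illustration only, not the R routine used for the analyses: the function names are hypothetical, the four toy vectors are invented so that the merges follow the pattern described for Figure 1, and measuring the distance between clades from their averaged vectors is one reading of the description above, since the specific linkage option is not stated here.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Square root of the summed squared differences, word by word."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(vectors):
    """One distance per pair of texts: T*(T-1)/2 entries for T texts."""
    return {
        (a, b): euclidean(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


def centroid(members, vectors):
    """Multidimensional average of the vectors belonging to one clade."""
    size = len(members)
    dims = len(vectors[members[0]])
    return [sum(vectors[m][i] for m in members) / size for i in range(dims)]


def cluster(vectors):
    """Repeatedly merge the two closest clades; return the merge history."""
    clades = [(name,) for name in sorted(vectors)]
    merges = []
    while len(clades) > 1:
        best = None
        for x, y in combinations(clades, 2):
            d = euclidean(centroid(x, vectors), centroid(y, vectors))
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        clades = [c for c in clades if c not in (x, y)] + [x + y]
        merges.append((x, y, d))   # d is the height at which the clades join
    return merges


# Invented relative-frequency vectors over a three-word vocabulary.
toy = {
    "text1": [0.90, 0.10, 0.00],
    "text2": [0.30, 0.40, 0.30],
    "text3": [0.20, 0.30, 0.50],
    "text4": [0.20, 0.32, 0.48],
}
distances = pairwise_distances(toy)
merges = cluster(toy)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Each recorded height corresponds to the vertical length at which two clades join in a dendrogram: the first merge is the most similar pair, and the last merge attaches the text least like the others.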

Any level of the branching diagram can be identified as a clade, and we label clades from left to right using Greek letters, first labeling all clades at the same level of the hierarchy and then descending to the next level and again labeling left to right. Thus in Figure 1, the text is made up of two major clades, α and β. Clade α contains text 1; clade β contains 2, 3, and 4, and clade γ contains only texts 3 and 4. Because clade α contains only one text, it is said to be single leafed or simplicifolious. 19

For many analyses it is useful to divide large texts into smaller pieces, which can then be compared as if they were stand-alone texts. We call the subsections of a text "chunks," and identify them both with their order in any given text and with the range of the actual words that are included in the chunk. For example, Dan5₁₈₀₁₋₂₂₅₀ is the fifth chunk of Daniel when that poem is cut into nonoverlapping chunks that are 450 words long, and the chunk therefore includes all the words from the 1801st to the 2250th word in the poem (these numbers are given in the subscript).
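
A minimal Python sketch of this cutting and labeling step follows. The function name and the underscore label format are assumptions made for illustration; the 4472-word length of Daniel is the figure reported in the discussion of that poem, and the placeholder words stand in for the real text. Leaving the shorter final chunk as its own piece is also an assumption of the sketch.

```python
def cut_into_chunks(words, size, prefix):
    """Cut a word list into consecutive, nonoverlapping chunks.

    Each chunk is returned with a label giving its order and the 1-based
    range of words it contains, e.g. "Dan5_1801-2250".
    """
    chunks = []
    for index, start in enumerate(range(0, len(words), size), start=1):
        piece = words[start:start + size]
        label = f"{prefix}{index}_{start + 1}-{start + len(piece)}"
        chunks.append((label, piece))
    return chunks


# Placeholder tokens standing in for the 4472 words of Daniel.
daniel = [f"w{i}" for i in range(1, 4473)]
chunks = cut_into_chunks(daniel, 450, "Dan")
for label, piece in chunks:
    print(label, len(piece))
```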

Detecting the Relationship of One Poem to Another: Daniel and Azarias

In order to test the effectiveness of our methods, we applied them to texts in the Old English corpus whose relationships with each other are already known. We reason that if lexomic methods correctly identify the relationships of such texts, we can have reasonable confidence in the relationships suggested by our methods when these are applied to texts whose interrelationship is unknown or controversial. We therefore used lexomic methods to analyze Daniel and Azarias, poems whose relationship, while not understood in every detail, is already known to scholars.

Figure 1. Sample dendrogram showing the results of a cluster analysis using four chunks of text. Clades are labeled with Greek letters.

Daniel is found in the Junius Manuscript, Oxford, Bodleian Library, Junius 11. 20 The manuscript was copied around the year 1000, or perhaps a few decades earlier, by four different scribes. 21 It contains four poems, Genesis, Exodus, Daniel, and Christ and Satan. One portion of Daniel, the "Song of the Three Youths" (ll. 279-361) 22 has a counterpart in another of the major Anglo-Saxon poetic codices, Exeter Cathedral Library 3501, "The Exeter Book" 23 : the third poem in the Exeter Book, Azarias, 24 corresponds roughly with the "Song of Azarias" in Daniel (see Figure 2). Although Gollancz believed that the author of Azarias knew Daniel, 25 current scholarship traces the two poems back to a common antecedent. 26

To a reader of Old English (or even a casual observer), Daniel and Azarias are clearly related to each other: many lines are very similar in word choice and structure even though there are small variations in nearly every line. In order to indicate the extent of the similarity between the two poems, we have in the examples below bolded exact matches, where one word matches another letter for letter, and italicized words which differ only slightly from each other.

Figure 2. Relationship of Azarias to Daniel.

Daniel (l. 279): Ða Azarias ingeþancum

Azarias (l. 1): Him þa 27 Azarias ingeþoncum

Daniel (l. 280): hleoðrade halig þurh hatne lig

Azarias (l. 2): hleoþrede halig þurh hatne lig

Daniel (l. 281): dreag dæda georn, drihten herede

Azarias (l. 3): dreag dædum georn, dryhten herede

Later in the poems there is somewhat more substantive variation:

Daniel (l. 303): fracoð and gefræge folca manegum

Azarias (l. 24): fracuð and gefræge foldbuendum

Daniel (l. 348): se him cwom to frofre and to feorhnere

Azarias (l. 54): cwom him þa to are ond 28 to ealdornere

These variations present no difficulty for a reader of Old English, but they are a challenge for a computer program that counts words literally. For example, our program will count "ingeþancum" separately from "ingeþoncum" and "hleoðrade" separately from "hleoþrede" even though the two pairs are only minor spelling variants of each other. Thus, for the purposes of measuring similarity, it would seem that lemmatization could potentially be useful. A lemmatized text would count "ingeþancum" as being the same as "ingeþoncum," "hleoðrade" as being the same as "hleoþrede" and so forth. However, as noted above, lemmatization is time consuming and not entirely objective, and it is not our goal to have the computer do things poorly that the human mind does well. Therefore, instead of lemmatizing, we use unmodified texts from the DOE corpus. We noted that, as the prevalence of bolded words in the above passages may indicate, there was substantial vocabulary overlap between the two passages even without lemmatization. So we sought to determine if lexomic methods could allow us to identify the section of Daniel (ll. 279-361) that matches Azarias. Daniel is 4472 words long, Azarias is 1064 words, and the portion of Daniel that matches Azarias is 475 words long. We therefore divided Daniel into ten "chunks," nine chunks of 450 words each and one chunk of 422 words. 29 We counted and tabulated the words in each chunk; performed cluster analysis, using the statistical software, R, on the ten Daniel chunks along with Azarias (a total of eleven "texts"); and plotted these relationships on a dendrogram (Figure 3).

As is evident from Figure 3, the chunk of Daniel that corresponds to Azarias, Dan5₁₈₀₁₋₂₂₅₀ (hereafter "DAZ"), is the most similar to Azarias. This chunk consists of words 1801 through 2250, almost exactly the lines of Daniel that are paralleled in Azarias. The dendrogram shows that when clustering the chunks of Daniel and the whole of Azarias, DAZ and Azarias are significantly closer to each other than they are to other chunks of Daniel. Next closest are Dan4₁₃₅₁₋₁₈₀₀ and Dan6₂₂₅₁₋₂₇₀₀, the chunks immediately preceding and immediately following DAZ. These four chunks form one clade, all of which are related to Azarias: DAZ and Azarias by being related to each other, Dan4₁₃₅₁₋₁₈₀₀ and Dan6₂₂₅₁₋₂₇₀₀ because each of those chunks contains a bit of the Azarias material that spills out beyond the boundaries of DAZ. A second clade is formed by the remainder of Daniel: chunks 1-3 form their own sub-clade, as do chunks 7-10. 30

Figure 3. Dendrogram showing the results of a cluster analysis using nine 450-word chunks from Daniel and one (ending) chunk of 422 words, and the entire 1064-word poem Azarias (AZ). The consecutive, non-overlapping 450-word chunks of Daniel are labeled from 1 to 10 where the initial chunk is labeled Dan1. The exact word boundaries of each chunk are labeled under each leaf on the dendrogram; for example, the fourth chunk of Daniel (Dan4) comprises the 450 words of Daniel from word 1351 to word 1800, inclusive.

By measuring the distances between the vocabularies of the chunks of Daniel and Azarias, we can accurately characterize the known relationships of the chunks. DAZ is indeed closest to Azarias, and the chunks that abut DAZ are in turn closer to DAZ/Azarias than to the other chunks of Daniel. These results suggest that the variations in the unlemmatized vocabulary of the two poems are not enough to skew the method's identification of similarity. Even though Daniel has "ingeþancum" and Azarias has "ingeþoncum," the other similarities are sufficient for us to identify correctly the relationship of DAZ to Azarias. We therefore provisionally conclude that a lexomic approach using cluster analysis correctly identifies the relationship between Daniel and Azarias. Despite the many minor variations in spelling and the fewer (but significant) variations in word choice, the lexomic method is able to match Azarias with the correct part of Daniel.

Controls and Variations

Arrangement:

To be certain that the relationships detected in the above analyses were not caused simply by chance, and also to make sure that the entire relationship was not based on a few key words, we randomly permuted the words of Daniel so that the entire text was scrambled. The resultant text contained all the words of Daniel, but they were now in different, random places in the text. As with the original experiment, we then chunked this scrambled poem into nine 450-word sections and one 422-word section and clustered these ten chunks along with Azarias. As is evident from Figure 4, the scrambled text's chunks exhibit little variation from each other, and all are essentially equidistant from Azarias, which is now very different from any of the chunks of the scrambled Daniel. The extreme compression of the vertical scale on this dendrogram indicates that all the chunks of Daniel are now very similar to each other and that no one of them is particularly similar to Azarias. We repeated this experiment several times, randomly scrambling Daniel each time, and found the same basic dendrograms: Azarias sits by itself with the chunks of Daniel in another cluster linked to each other with almost no difference in the vertical scale and in random order.
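
The scrambling control can be sketched in a few lines of Python. The function name, the seed parameter, and the placeholder words are assumptions for illustration; the seed is included only so that a given scramble can be reproduced, and different seeds stand in for the repeated runs described above.

```python
import random
from collections import Counter


def scramble(words, seed=None):
    """Return a randomly permuted copy of the word list.

    The copy contains exactly the same words with the same counts;
    only their positions change. The input list is left untouched.
    """
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Placeholder tokens with deliberate repetition, standing in for a poem.
poem = [f"w{i % 300}" for i in range(4472)]
runs = [scramble(poem, seed=s) for s in range(3)]
for run in runs:
    # Every run keeps the vocabulary and its counts intact.
    print(Counter(run) == Counter(poem), run[:5])
```

Each scrambled list can then be cut into chunks and clustered exactly as the unscrambled text was, so that any difference in the resulting dendrogram reflects the loss of word order alone.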

Figure 4. Dendrogram from a cluster analysis of the scrambled text of Daniel, cut into nine 450-word chunks and one (ending) chunk of 422 words, and the entire 1064-word poem Azarias (AZ). The consecutive, non-overlapping scrambled 450-word chunks of Daniel are labeled S_Dan1 to S_Dan10, where the initial chunk is labeled S_Dan1 to represent the first chunk of the scrambled text of Daniel.

We also ran the experiment with the word "Azarias" removed from both poems, since this word occurs more frequently in DAZ and Azarias than in the rest of Daniel. When we clustered the eleven chunks with the word "Azarias" removed from the texts, the results were unchanged. We therefore conclude that the dendrogram is not shaped by the single word "Azarias"; instead, the methods detect this relationship based on the distances of many (or all) items in the poems' vocabulary.

These experiments act as controls for the method, demonstrating that the relationship between Daniel and Azarias illustrated by Figure 3 above is caused by the particular words that are in each chunk. Furthermore, we are even more confident that the relationship the cluster analysis finds between Azarias and DAZ is derived from a significant number of vocabulary similarities, not just a few words, because if only one or two words were creating the link between Azarias and the DAZ chunk, we would see a dendrogram similar to Figure 3 but with a different chunk close to Azarias for each randomization.

Chunk size:

The original choice of 450-word chunks was driven by the size of Daniel: we began with ten chunks because this was a manageable number and because the DAZ section of Daniel is 475 words. But once we had established that 450-word chunks would give us the correct answer, we began experimenting with other chunk sizes. If a smaller chunk size still identified Azarias with DAZ, we could increase the resolution of the technique, allowing us to examine small poems or sections of poems. Likewise, if a larger chunk size still returned a correct dendrogram, we might more easily investigate larger poems. We therefore took our existing 450-word chunks of Daniel and divided each of them in half, creating twenty 225-word chunks. We also divided Azarias into four 266-word chunks. The resulting dendrogram separated Daniel from Azarias, but the DAZ chunks were no longer matched to Azarias, and non-DAZ material was mixed in a clade with DAZ. We interpret these results as showing that a resolution of 225 words is near or perhaps even below the lower limit of accuracy for these particular texts. When the number of words in any given chunk becomes too small, uncommon words can have a disproportionate effect on the calculation of the distance between chunks. Optimal chunk size is not only problem-specific (if we are searching for a 450-word passage, we probably do not want to use 1500-word chunks) but also affected by the heterogeneity of the vocabulary of the text. 31

To seek an upper bound for effective chunk size, we doubled our 450-word chunks to produce 900-word chunks for Daniel. Because Azarias was 1064 words long, we left it as a single chunk. The dendrogram produced by this experiment was essentially the same as that we constructed for the 450-word chunks of Daniel. Again, Azarias was much closer to DAZ than to any other part of Daniel. The same results held when we divided Daniel into three 1500-word chunks. Then, even though DAZ makes up only two-thirds of the Dan2₁₅₀₁₋₃₀₀₀ chunk, the cluster analysis correctly links it to Azarias. In the dendrogram, however, the vertical difference between DAZ and the rest of Daniel shrank considerably, suggesting that chunk sizes beyond 1500 words, at least when seeking to match a section of closer to 1000 words, are probably less accurate than somewhat smaller chunk sizes (900 and 450 words): the 1500-word DAZ included so much ordinary Daniel material that the relationship of the "Song of the Three Youths" material to Azarias was partially obscured.

Based on these experiments, we conclude that, at least for poetic texts whose vocabulary is approximately as heterogeneous as that of Daniel and Azarias, we are very confident in analyses based upon a chunk size of between 400 and 1500 words, with a preference for chunks between 450 and 900 words. Analyses based on chunks as small as 225 words are likely to be correct in their broad outlines but may suffer from noisier variation.

Reducing the Number of Words Examined:

At different points in our analysis we were concerned with the possibility that our methods are too dependent upon a few specific, rare words and were identifying Azarias with DAZ based only on the word "Azarias" or a few other proper nouns. 32 We showed by deleting "Azarias" from the text that this word was not significantly affecting the overall similarity measure, and we could also have deleted each proper name from the texts individually by hand. But because one of our goals was to minimize individual mark-up, we instead decided to perform the experiments in such a way that no rare word could influence the final calculation. We therefore ran our tests using only those words that appeared in every single chunk that was to be analyzed. 33 Therefore no rare word or words (for example, ones that appeared only in DAZ and Azarias, but nowhere else) could influence the final dendrogram. Figure 5 illustrates the results of this experiment.

Figure 5. Dendrogram showing the results of a cluster analysis when using only those words that appear in all chunks: the nine 450-word chunks of Daniel and one (ending) chunk of 422 words and the entire 1064-word poem Azarias (AZ). The word boundaries of each chunk are the same as those given in Figure 3.

Using this method, DAZ and Azarias both appear in the same clade, separated from the main body of the poem. However, this approach identifies Dan6₂₂₅₁₋₂₇₀₀ as being closer to Azarias than is DAZ, a result not replicated anywhere else in our research and not supported by examination of the texts. The dendrogram was exactly the same when we used words occurring in at least ten of the eleven chunks. These results suggest that using only those words that are found in most or all of the chunks could be useful, but such an approach is still not as accurate as the full-scale analysis. Using only those words which are in most or all of the chunks narrows the range of variation more than is likely desirable for the analysis of relationships. 34 However, the basic agreement of this method with the method that uses all of the words suggests that rare words (rare within these texts) are not particularly skewing the analysis.
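
The restriction to words found in all, or nearly all, of the chunks can be sketched in Python. The function names and the three toy chunks are assumptions for illustration. The sketch only selects and filters the vocabulary; whether proportions are then computed against the full or the reduced chunk length is left open, since that detail is not specified here.

```python
from collections import Counter


def words_in_at_least(chunks, minimum):
    """Return the set of words present in at least `minimum` chunks.

    chunks maps a label to a list of words. A word is counted once per
    chunk, however often it occurs there.
    """
    presence = Counter()
    for words in chunks.values():
        presence.update(set(words))
    return {word for word, n in presence.items() if n >= minimum}


def restrict(chunks, keep):
    """Drop every word that is not in the `keep` set."""
    return {
        label: [word for word in words if word in keep]
        for label, words in chunks.items()
    }


# Invented toy chunks; real chunks would hold hundreds of words each.
toy_chunks = {
    "c1": ["and", "se", "wæs", "and"],
    "c2": ["and", "se", "god"],
    "c3": ["and", "god", "wæs", "azarias"],
}
in_all = words_in_at_least(toy_chunks, len(toy_chunks))
in_most = words_in_at_least(toy_chunks, len(toy_chunks) - 1)
print(sorted(in_all))
print(sorted(in_most))
print(restrict(toy_chunks, in_most))
```

Setting the threshold to the number of chunks reproduces the strictest version of the test, and lowering it by one corresponds to the "at least ten of the eleven chunks" variant.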

From these controls and variations we conclude that an effective method for the comparison of poems to each other is cluster analysis with a chunk size of between 450 and 1500 words and with all words in the text included (no deletions of proper names or of rare words).

Detecting Divisions Within a Poem: Genesis A and B

Above we demonstrated how cluster analysis can be used to identify the section in a long poem that is most similar to a different poem. The success of the lexomic approach in correctly identifying the relationship between Azarias and the DAZ section of Daniel encouraged us to see if we could use a variation of that method to identify different sections of a single poem. The problem of detecting internal divisions in a poem is substantially different from that of identifying an external poem connected to an internal section of another poem. The similarity between Daniel and Azarias results from their being two variations of the same text. Their vocabulary is somewhat shared, and it is shared in a specific order, with the words in the first line of Azarias matching up with the words in line 279 of Daniel, line 2 of Azarias matching line 280 of Daniel, and so forth throughout the parallel passages. Divisions or sections within a single poem, however, can be marked by shared vocabulary that is not necessarily in an ordered fashion. For example, if we imagine an epic poem about Three Little Pigs, the first 2000 words might include many repetitions of "straw," while words 2001-4000 might see no appearances of "straw" but many appearances of "sticks." Because cluster analysis is not significantly influenced by any single word and is instead influenced by the presence of variations in relative frequency of the entire vocabulary, those repetitions of "straw" and "sticks" are very unlikely to be enough by themselves to create similarities and differences yielding the correct relationships among segments. Nevertheless, finding divisions within a poem is a problem at least related to the one we had addressed in the Daniel / Azarias analyses, and we therefore decided to analyze the Anglo-Saxon poem Genesis to see if our lexomic methods distinguish Genesis A from Genesis B.

Like Daniel, Genesis is found in the Junius Manuscript, Oxford, Bodleian Library, Junius 11. 35 The poem is a partial paraphrase in Old English verse of the Biblical book of Genesis, from the Creation to the Sacrifice of Isaac. In 1875, the philologist Eduard Sievers noted that lines 235-851 of Genesis are significantly different in tone and style from the rest of the poem (ll. 1-234 and 852-2936). Using newly developed methods of Germanic philology, Sievers concluded that these lines, now called Genesis B, were a translation into Anglo-Saxon of an Old Saxon original, while the other lines of the poem, now called Genesis A, were a direct Anglo-Saxon translation of a Latin text. 36 Sievers's deduction was confirmed nineteen years later, when Karl Zangemeister discovered in the margin of a manuscript in the Vatican Library a fragment of an Old Saxon poem that matched some of the lines of Genesis B. 37

Figure 6. Dendrogram showing the results of a cluster analysis using ten 1500-word chunks of Genesis and one (ending) chunk of 2094 words.

To search for Genesis B in Genesis, we first divided the poem into eleven chunks: ten of 1500 words each and a final chunk of 2094 words. We then performed cluster analysis as discussed above on those chunks. Figure 6 shows the result. Clade β on the right-hand side of the dendrogram contains three chunks of Genesis, separated quite clearly from the rest of the poem. These three pieces, chunks 2, 3, and 4 of Genesis, correspond, respectively, to lines 262-460, 461-667a, and 667b-890 of the poem. Genesis B runs from line 235 to line 851, so the dendrogram correctly identifies the three 1500-word chunks that consist mostly of Genesis B by linking them to each other and separating them from Genesis A. Thus even though finding an internal division in a poem is a somewhat different task from identifying the section of one poem that corresponds to another, external poem, we were still able to use cluster analysis to find Genesis B.
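
The cutting step can be sketched as follows. This is a hedged illustration in Python rather than the project's Perl scripts. The function name `chunk_ranges` is hypothetical, the total of 17,094 words is inferred from the Figure 6 caption (ten chunks of 1500 words plus one of 2094), and the rule that a short remainder is folded into the final chunk is an assumption that happens to reproduce the chunk counts given in the captions to Figures 6 and 7.

```python
def chunk_ranges(total_words, chunk_size):
    """Return 1-based (first_word, last_word) ranges for consecutive chunks.

    Assumption: any remainder shorter than chunk_size is folded into
    the final chunk instead of forming a chunk of its own.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if total_words <= 0:
        return []
    n_chunks = max(1, total_words // chunk_size)
    ranges = []
    for k in range(n_chunks):
        first = k * chunk_size + 1
        # The last chunk runs to the end of the text.
        last = total_words if k == n_chunks - 1 else (k + 1) * chunk_size
        ranges.append((first, last))
    return ranges


GENESIS_WORDS = 10 * 1500 + 2094  # inferred from the Figure 6 caption

for size in (1500, 1000):
    ranges = chunk_ranges(GENESIS_WORDS, size)
    print(size, len(ranges), ranges[-1])
```

Under these assumptions a chunk size of 1500 yields eleven chunks and a chunk size of 1000 yields seventeen, with the eighth 1000-word chunk covering words 7001-8000. That reading is consistent with a label such as Gen87001-8000 denoting chunk 8, words 7001-8000.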

When we examine Genesis in 1000-word chunks, we can even more clearly see Genesis B (chunks 2-6) as a cluster separate from Genesis A (see Figure 7). Even though approximately half of Gen21001-2000 is made up of Genesis A, the chunk nevertheless clusters with the rest of Genesis B. Note that clade γ, which is composed entirely of Genesis A, clusters together before any other chunks. Then, when clade γ has finished clustering, clade β clusters together all of Genesis B. The eighth chunk of the poem, Gen87001-8000 (ll. 1079-1256), now appears to be different from the rest of the poem; we will discuss this chunk in more detail later in this paper.
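
The order in which clades form can be illustrated with a minimal agglomerative clustering sketch. It is not the authors' implementation: Euclidean distance and average linkage are assumptions, the labels and frequency values are invented, and the result is returned as nested tuples without branch heights.

```python
from itertools import combinations
from math import sqrt


def euclid(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(vectors):
    """Repeatedly merge the two closest clusters (average linkage).

    vectors: non-empty dict mapping a chunk label to a list of
    relative frequencies. Returns nested tuples recording merge order.
    """
    # Each cluster is (tree, list_of_member_labels).
    clusters = [(label, [label]) for label in vectors]

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(euclid(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Invented three-word frequency profiles for four hypothetical chunks.
toy_vectors = {
    "GenA_1": [0.10, 0.00, 0.05],
    "GenA_2": [0.11, 0.01, 0.05],
    "GenB_1": [0.00, 0.10, 0.02],
    "GenB_2": [0.02, 0.12, 0.02],
}
tree = average_linkage(toy_vectors)
print(tree)
```

In the toy data the two "A" chunks are the closest pair and merge first, the two "B" chunks merge next, and the two groups join only at the root, which mirrors the sequence described for clades γ and β.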

Figure 7. Dendrogram showing the results of a cluster analysis using 17 1000-word chunks of Genesis. Chunks on the far right (clade δ) are labeled with the word ranges of those sections of Genesis B. The left-most chunk 8 is labeled with line numbers rather than the word counts.

Analysis with a chunk size of 500 gave us somewhat more complicated results, but we were still essentially able to identify Genesis B within Genesis. There are four upper-level clades in Figure 8, two of which are composed mostly or entirely of Genesis B. The largest clade, γ in the center of the dendrogram, is composed entirely of chunks of Genesis A. If we separated clade γ from the other three major clades, we would have one grouping that was entirely Genesis A and another that was Genesis B with a few additions. Clade δ is made up almost entirely of Genesis B (the second chunk from the left, labeled A in Figure 8, contains 234 words of Genesis B and 266 words of Genesis A). Clade α contains three chunks of Genesis B and one chunk of Genesis A. Clade β, which does not contain any Genesis B at all, separates itself both from Genesis A and from the two Genesis B clades. This clade contains material from the 1000-word chunk, Gen87001-8000, which we will discuss later in the paper. It may be an anomaly or a section of Genesis A that is different from the rest of the poem.

Decreases in chunk size increase variation in vocabulary between texts, and this extra variation can account for the splitting of Genesis B into two chunks, the separation of clade β from the rest of Genesis A, and the inclusion of the extraneous chunk of Genesis A in clade α. Yet despite these anomalies, Figure 8 shows that the lexomic method closely derives the correct boundaries of Genesis B, grouping like with like. These results parallel those for Daniel and Azarias. We hypothesize that the difference between finding an external relationship and an internal division accounts for there being somewhat less resolution near the 500-word chunk size in the Daniel and Azarias results than there is in the Genesis results.

As a control for this analysis, we randomly permuted all the words in Genesis, scrambling the entire poem, and then cut it again into 1500-word chunks. We repeated this process several times and compared the dendrograms: it was clear that there were no repeated patterns in the arrangement of the chunks, which, based on the vertical distances in the dendrogram, were now all extremely similar. We can therefore conclude that our dendrograms for the unscrambled text of Genesis are based on the arrangement of the words within the chunks.
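
A permutation control of this kind can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the procedure actually run: the helper names, the seed, and the toy text are invented, and for simplicity the final chunk is simply whatever remains after the last full chunk.

```python
import random
from collections import Counter


def cut(words, chunk_size):
    """Cut a word list into consecutive chunks; the last may be shorter."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def scrambled_control(words, chunk_size, seed):
    """Shuffle a copy of the text with a seeded generator, then re-cut it."""
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    return cut(shuffled, chunk_size)


# Toy text: "straw" only in the first half, "sticks" only in the second.
text = ["straw", "pig"] * 500 + ["sticks", "pig"] * 500

ordered = cut(text, 1000)
control = scrambled_control(text, 1000, seed=1)

print([Counter(chunk)["straw"] for chunk in ordered])
print([Counter(chunk)["straw"] for chunk in control])
```

In the ordered text all 500 occurrences of "straw" fall in the first chunk. After scrambling they are spread across both chunks, so the chunks' frequency profiles become nearly identical, which is the behavior the control is meant to expose.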

Figure 8. Dendrogram showing the results of a cluster analysis using 500-word chunks of Genesis. Chunks that form sections of Genesis B are labeled as B and those from Genesis A are labeled as A. Two chunks of Genesis B are labeled with their respective line numbers from the poem.

It is important to note that not only the chunk size but also the arrangement of the chunks can be significant when seeking internal divisions. If the chunks do not match the boundaries of the division, it is possible for any given division to be more difficult to detect, as shared elements of the vocabulary are dispersed throughout the disparate material mixed into the section. We are working to develop methods for finding divisions in poems when we do not have the good fortune to have had Eduard Sievers identify the divisions for us. Researchers interested in seeking divisions should experiment with different chunk sizes, looking for patterns of arrangements in the dendrograms that appear consistently.

Our analysis accurately identified Genesis B at differing levels of resolution, in each case clearly separating those lines of the poem derived from an Old Saxon original from those derived from a Latin exemplar. Genesis B stands out clearly from Genesis A, and we therefore conclude that the lexomic method might be used to detect divisions in poems.

Guthlac A and B

To further extend this analysis, we examined the Exeter Book poem Guthlac to see if lexomic methods could detect separate sections in this poem. 38 Guthlac is found on folios 32v-52v of the Exeter Book. Although the hand is the same throughout, at line 819 (folio 44v) a very large capital eth begins a line of large capitals, suggesting a significant division in the poem. There is, additionally, at this point a shift in content: lines 1-818 (Guthlac A) deal with Guthlac's life and deeds, while lines 819-1379 (Guthlac B) treat the Saint's death. Many scholars have also detected a change in style at this point in the poem, 39 and each of the two sections has a somewhat different relationship with Felix of Crowland's Latin Vita Sancti Guthlaci. 40 The relationship of the two sections to each other, therefore, is problematic, with many scholars concluding, along with Krapp and Dobbie, that "it may reasonably be inferred" that Guthlac A and B are not the work of the same poet. 41

To see if cluster analysis could detect the split between Guthlac A and B, we divided the text of Guthlac into eight approximately 1000-word chunks, 42 breaking the poem between lines 818 and 819. We then performed cluster analysis, producing the dendrogram in Figure 9.

Guthlac B is separated very clearly from Guthlac A, with B occupying all of clade ζ. Chunks GuthlacB21001-2000 and GuthlacB32001-3111 are the two chunks most similar to each other, with chunk GuthlacB11-1000 being next most similar to the chunk B2/B3 cluster. Although clearly separated from Guthlac B, Guthlac A does not cluster together as one large clade but instead demonstrates some more complex internal structure. We discuss the anomalous branch in the dendrogram, GuthlacA43001-4000, below and in more detail in another paper, 43 but for the purpose of this part of the argument, it is sufficient to note that lexomic analysis separates Guthlac A from Guthlac B and indicates that Guthlac B is more homogenous in vocabulary than Guthlac A. Both of these conclusions are consistent with traditional critical interpretations.

Figure 9. Dendrogram showing the results of a cluster analysis using five approximately 1000-word chunks of Guthlac A and three approximately 1000-word chunks of Guthlac B.

Christ I, II, and III

We next applied lexomic analysis to the problem of the divisions of the Exeter Book poem Christ, the structure of which has been a long-standing problem in Anglo-Saxon scholarship. Christ appears on folios 8r-32r of the Exeter Book. A decorated initial and line of capital letters on folio 14r appears to indicate one division in the poem, as does a similar decorated
initial and line-in capitals at 20v, thus dividing the poem into three parts: Christ I (ll. 1-439), Christ II (ll. 440-866) and Christ III (ll. 867-1664). Christ I treats the Advent, Christ II the Ascension, and Christ III the Last Judgment. Cynewulf's runic signature appears in lines 797-807a (fol. 19v). 44 At least since 1853, 45 critics have been discussing the problem of the "Unity of the Christ," as Albert S. Cook terms it in his 1900 edition. 46 Sievers was the first to argue that there were three separate poems and that Cynewulf might not be the author of all of them, concluding that Christ I and II came from a different period than Christ III and that if Cynewulf was the author of all three, each must still be regarded as independent of the others. 47 Subsequent critics argued both sides of the question, intertwining stylistic analysis with discussions of authorship. "Current scholarship prefers skepticism to speculation, and few would now grant to Cynewulf any but the poems bearing his runic signature," writes Fulk, 48 and indeed contemporary scholarship appears to have settled on a consensus that there are three Christ poems put together in the Exeter Book, with only Christ II attributable to Cynewulf. However, this conclusion may be based less on a skeptical attitude (pace Fulk), and more on a realization that the ingenuity of scholars can explain nearly any grouping of texts as being artfully, theologically, or politically significant. The linking of the Advent, Ascension, and Last Judgment, then, although it can be justified in a number of ways, does not make a compelling case for unity. Contemporary scholars are more influenced by less subjective criteria, particularly metrical or grammatical analyses. In these terms, Christ II does seem different from Christ I and Christ III and closer to the other runic signature poems.

To see if lexomic analysis would parallel the evidence of the manuscript (the decorated initials and the lines of all capitals), text (the content), and critical analysis (the style of the three parts), we divided all of Christ into 1000-word chunks, slightly adjusting the size of the chunks so that the boundaries of each section were coincident with the boundaries of chunks (we did not want to have a single chunk made up of 50% Christ I and 50% Christ II). We then used cluster analysis to construct the dendrogram given in Figure 10.
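
Cutting each section separately, so that no chunk straddles a boundary, might look like the sketch below. It is illustrative only. The function names and the `min_tail_ratio` threshold are hypothetical, and the section lengths (3111 words for Guthlac B, 4703 for Christ III) are read off the chunk labels GuthlacB32001-3111 and ChristIII54001-4703 on the assumption that those labels give chunk number followed by word range.

```python
def cut_section(words, target, min_tail_ratio=0.5):
    """Cut one section into chunks of `target` words.

    Hypothetical rule: a final remainder shorter than
    min_tail_ratio * target is merged into the preceding chunk;
    a longer remainder is kept as its own, shorter chunk.
    """
    words = list(words)
    chunks = [words[i:i + target] for i in range(0, len(words), target)]
    if len(chunks) > 1 and len(chunks[-1]) < min_tail_ratio * target:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail
    return chunks


def cut_by_section(sections, target, min_tail_ratio=0.5):
    """Chunk each (name, words) section on its own, labeling chunks name1, name2, ..."""
    labelled = []
    for name, words in sections:
        for k, chunk in enumerate(cut_section(words, target, min_tail_ratio), start=1):
            labelled.append((name + str(k), chunk))
    return labelled


# Placeholder words stand in for the real text; only the lengths matter here.
sections = [("GuthlacB", ["w"] * 3111), ("ChristIII", ["w"] * 4703)]
sizes = [(label, len(chunk)) for label, chunk in cut_by_section(sections, 1000)]
print(sizes)
```

With a threshold of one half, the 111-word remainder of Guthlac B is absorbed into its third chunk (1111 words), while the 703-word remainder of Christ III stands as a fifth chunk. Both outcomes agree with the chunk labels quoted in the text, although the actual rule the scripts apply may differ.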

Figure 10. Dendrogram showing the results of a cluster analysis using ten approximately 1000-word chunks of Christ I, II, and III.

This dendrogram is somewhat more complicated than those constructed for the previously discussed texts, but close inspection reveals that, once again, the lexomic methods correctly arranged a set of Anglo-Saxon texts. Despite the complexity of the dendrogram, we see clear separation between three groups of chunks. Clade ζ contains all of Christ I, clade η all of Christ II, and clade ι contains three of the five chunks of Christ III. As indicated by vertical distance from the bottom of the dendrogram, chunks 1 and 2 of Christ I (in clade ζ), chunks 1 and 2 of Christ II (in clade η), and chunks 2 and 3 of Christ III (in clade ι) are very similar within their clade and significantly different from the text in the other clades. The only complications are chunks 5 and 4 of Christ III: chunk 5 is as similar to Christ II as it is to the rest of Christ III (though not particularly similar to either), and chunk 4 is the most dissimilar of any chunk of the poem. Below we discuss possible reasons for the anomalous behavior of chunks 4 and 5 of Christ III, but for the time being it is enough to note that, despite the behavior of these chunks, lexomic analysis correctly separates Christ I and Christ II as well as most of Christ III. The analysis also supports the critical view that Christ II is more unified in style than Christ III. We therefore conclude that lexomic methods can be used to find subdivisions within Old English poems even when these are more contested and problematic than the divisions within Genesis or Guthlac. The dendrogram is not inconsistent with an analysis that Cynewulf is the author of only Christ II, because the two chunks of Christ II are closer to each other than to any other pieces of the poem, but at this stage of our research we cannot yet use this analysis to judge whether or not Cynewulf is responsible for all three parts of the poem or only Christ II: we do not have enough data on lexomic variation within a single author's corpus in Anglo-Saxon. Therefore we cannot yet conclude that only Christ II is by Cynewulf: we could just as easily be measuring differences in poetic performance by a single poet. Nevertheless, the methods do appear to identify correctly the major structural divisions of poems and suggest that conclusions drawn from lexomic methodology are consistent with critical opinions about the internal consistency of each of the Christ poems.

Identifying Affinity: Guthlac B and Cynewulf

The success of the lexomic approach in identifying the relationship of Daniel to Azarias and the divisions of Genesis, Guthlac, and Christ suggests some additional experiments that might shed light on the affinities of different poems. Although contemporary scholarship accepts Cynewulfian authorship only for the four poems with the runic signatures (Christ II, Fates, Elene, and Juliana), critics have for a long time suspected that Guthlac B might also be by Cynewulf. 49 Fulk notes that only Guthlac B, Andreas, and the signed Cynewulfian poems include the phrase "ageaf ondsware," a formula metrically unusual in that it violates Kuhn's first law of sentence particles by demanding an anomalously unstressed verb. He also sees close parallels in vocabulary with the Cynewulf canon and adds that the evidence against a Cynewulfian affinity is "slender." 50 More recently, Andy Orchard has argued that, based on shared formulas and the analysis of parallel passages, Cynewulf was the author of Elene, Juliana, Christ II, Fates, and Guthlac B and that the author of Andreas knew the poems of Cynewulf. 51

Although at this stage of development lexomic analysis is not able to confirm authorship, 52 it may be able to shed some light on possible similarities between Guthlac B and the signed poems of Cynewulf. Our previous analysis shows both that Guthlac B is quite distinct in vocabulary from Guthlac A and that the two chunks of Christ II are closer to each other than they are to the other pieces of Christ. We hypothesized, therefore, that if Guthlac B was by Cynewulf, lexomic analysis would show it to be closer to Christ II than it is to Guthlac A.

To test this hypothesis, we cut all of Guthlac A, Guthlac B, and the signed poems of Cynewulf into approximately 1000-word chunks. We then performed cluster analysis on these chunks to produce the dendrogram in Figure 11. Note first the very unexpected result that the first three chunks of Juliana (corresponding to ll. 1-533) form a clade separate from the rest of the poems; we explain this result at the end of the next section below. Less surprising, given our earlier work on Guthlac, is the outlier of chunk 4 of Guthlac A. 53 If we for the moment set aside these particular clades and examine the rest of the dendrogram, we see that there are two main clades: ε is composed of four chunks of Guthlac A and one of Guthlac B. The second clade, ζ, contains all of Elene, all of Christ II, Fates, the chunk of Juliana that contains the runic signature passage, and the second and third chunks of Guthlac B. Clade ζ, then, contains every passage that is definitely attributed to Cynewulf and no passages, except the last two thirds of Guthlac B, that are not generally thought to be by Cynewulf. This arrangement supports, although not unequivocally, the contention that Guthlac B is part of the Cynewulfian corpus, or at least that Guthlac B has affinities with the Cynewulfian corpus while Guthlac A does not. The presence of the first chunk of Guthlac B in clade ε, while at first glance anomalous, actually supports a recent hypothesis about the Guthlac poems and Cynewulf: Jane Roberts cautiously suggests that the first parts of Guthlac B are similar enough in content and style to Guthlac A that the possibility that the author of Guthlac B "had heard or read Guthlac A cannot be ruled out." 54 Orchard's work supports this conclusion, 55 as do Roy Liuzza's arguments. 
56 If, arguendo, Cynewulf wrote Guthlac B as the ending of Guthlac A and so was more influenced by Guthlac A in the first 1000 words of Guthlac B than he would be later as he developed his own poem, we would expect to see a dendrogram somewhat similar to that in Figure 11, with the latter parts of Guthlac B grouped with those poems (or sections of poems) that include Cynewulf's runic signatures. 57

Figure 11. Dendrogram showing the results of a cluster analysis using 22 approximately 1000-word chunks of Guthlac A and B and the signed poems of Cynewulf (The Fates of the Apostles, Juliana, Elene, and Christ II). The exact word boundaries of each chunk are listed below. The * indicates those chunks that contain a runic signature.

It is important, however, not to overinterpret this one dendrogram. Lexomic methods cannot, at this stage in their development, provide unequivocal evidence for authorship attribution; lexomic analysis only indicates similarity, and we do not yet know how many different types of vocabulary (content words, proportions of function words, frequency of preposition use, among others) are contributing to a single metric. Furthermore, even were the results completely unequivocal, the lexomic method alone cannot justify linking similarity to authorship: that logical step (or leap) must be taken by the critic. 58 So although lexomic analysis supports the notion that Guthlac B is more like Christ II than it is like Guthlac A, even though Guthlac A is about the same subject matter and immediately adjacent to Guthlac B in the Exeter Book, much more work is required before we can take Figure 11 as an illustration of the Cynewulfian corpus. 59 Nevertheless, we can use Figure 11 as important additional evidence (beyond that detailed by Roberts, Orchard, Liuzza, Fulk, and others) to support the inclusion of the second and third chunks of Guthlac B among those poems with strong Cynewulfian affiliation.

Dendrogram Geometry and External Sources

In the above discussion we have noted some anomalous or surprising dendrograms in which single chunks of a text appeared in single-leafed clades to one side of the dendrogram, indicating that these chunks were significantly different from the rest of the texts in the dendrogram. Chunk Dan31801-2700 of Daniel, chunk Gen87001-8000 of Genesis A, and chunk GuthA43001-4000 of Guthlac A all appear in this position, as does chunk ChristIII43001-4000 of Christ III. Chunk ChristIII54001-4703 has an analogous relationship with the remainder of Christ III (although not to all of the Christ poems the way chunk 4 of Christ III does). At first glance these dendrograms appear to be problems for lexomic methods, as they complicate the broader analysis discussed above. But upon closer examination we discovered that the presence of a single-leaf clade separated from the main trunk of a dendrogram was truly the exception that proved the rules discussed above, and we conclude that the presence of a chunk in a simplicifolious clade indicates that the chunk in question is likely to have a different source than the source of the main body of the poem. This dendrogram geometry, then, can be used to search for sections of poems that have outside sources, and it may shed some light on the composition methods of Anglo-Saxon poets.
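
The geometry described here, a lone chunk joined to everything else only at the top of the tree, is simple to test for once a dendrogram is represented as a data structure. The sketch below is hypothetical: it encodes a dendrogram as nested two-element tuples with chunk labels as strings, ignores branch heights, and uses an invented topology for the other four Daniel chunks, since only the isolation of the third chunk is reported in the text.

```python
def leaves(tree):
    """List the chunk labels under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def single_leaf_outlier(tree):
    """Return the label of a chunk that forms a one-leaf clade at the root.

    Returns None if the root does not split into exactly one lone
    chunk on one side and a multi-chunk clade on the other.
    """
    if isinstance(tree, str):
        return None
    left, right = tree
    if isinstance(left, str) and not isinstance(right, str):
        return left
    if isinstance(right, str) and not isinstance(left, str):
        return right
    return None


# Invented topology: only the separation of chunk 3 reflects the text.
daniel_tree = ("Dan3", (("Dan1", "Dan2"), ("Dan4", "Dan5")))
balanced_tree = (("A1", "A2"), ("B1", "B2"))

print(single_leaf_outlier(daniel_tree))
print(single_leaf_outlier(balanced_tree))
print(leaves(daniel_tree))
```

A check like this only flags candidates. Whether a flagged chunk really has an outside source still has to be argued from the sources themselves, as the following discussion does for each case.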

Figure 12. Dendrogram showing the results of a cluster analysis using five approximately 900-word chunks of Daniel.

When we cut Daniel into 900-word chunks and do not include Azarias in the dendrogram (Figure 12), chunk Dan31801-2700, which represents lines 299-455 of the poem, is separated from the rest of Daniel, an indication that this single-leaf clade is less similar to any of the other chunks of Daniel than all the other chunks are to each other. As these lines include most of the portion of Daniel that is parallel with Azarias, this geometry was not immediately surprising until we noted that, given its compositional history, Azarias cannot have influenced Daniel. According to Remley's stemma, Azarias and Daniel have a common hypothetical archetype and only diverge from each other after significant development of Daniel, and although the poet who contributed to the closing lines of Azarias knew Daniel, there is no evidence that the author of Daniel knew Azarias. 60 Therefore something other than Azarias has caused Dan31801-2700 to be lexomically different from the rest of the poem.

For nearly a century many scholars, most significantly George Krapp, followed Hugo Balg in taking lines 279-439 (or 279-361) of Daniel as an interpolation into the main text. 61 This "Daniel B" approach was influential until the work of Robert Farrell, who argued that Daniel was a unified poem based on the Vulgate. 62 At first glance the lexomic analysis would appear to support the "Daniel B" hypothesis, separating as it does the chunk containing most of the putative "Daniel B" from the rest of the poem. However, the more recent work of Paul Remley shows that both the "Prayer of Azarias" (Vulgate, Daniel 3: 26-45) and all of the "Song of the Three Youths" (Vulgate, Daniel 3: 52-90) are derived in part from extra-Biblical sources. Lines 283-332 of Daniel are based not only on the Vulgate text, but on the Oratio Azariae, an independently circulating liturgical canticle that was originally based upon the Biblical text (although it has yet proved impossible to determine whether the particular text that influenced the Anglo-Saxon poem was Old Latin or the Vulgate). 63 Similarly, based on the "complete and precise agreement" of the verses of Daniel with the verses of the Canticum trium puerorum, Remley identifies this liturgical canticle as the source of lines 363-408 of Daniel, arguing that the extra-Biblical material was used by the poet to augment the Vulgate text. 64 In chunk Dan31801-2700 then, the poem has an exemplar or at least an influence in addition to that of the main exemplar (the Vulgate). Thus the single-leaf clade of Daniel has a different source than the main text of the poem.

By itself the connection of the geometry of the Daniel dendrogram with the presence of an outside source for the chunk in question cannot prove our conclusion that a simplicifolious clade indicates an external source, but when we combine the evidence of Daniel with that of Christ III and Guthlac A, a pattern emerges. In Figure 10 above we showed that a dendrogram representing the cluster analysis of 1000-word chunks of the three Christ poems could correctly group Christ I and Christ II as well as most of Christ III. Two chunks, however, were anomalous, with the fourth chunk forming its own clade at the far left of the dendrogram and the fifth chunk being separate from the rest of Christ III (though still closer to the first three chunks of Christ III than to the other poems). Like chunk Dan31801-2700, this section of Christ III has a source different from the source of the rest of the poem. As Albert S. Cook first noted over a century ago, lines 1379-1498 of Christ III are an adaptation of Sermo 57 of Caesarius of Arles. 65 Cook's identification of the source has held up since then, being reaffirmed by Edward Irving and Frederick Biggs. 66 In this section of the poem, God addresses the damned in a long speech that begins at line 1373 and continues all the way to line 1523 (although only ll. 1379-1498 are adapted from Caesarius). A total of 775 words of chunk ChristIII43001-4000, therefore, are based on the Caesarius sermon (the chunk begins at l. 1350 and ends at l. 1510). Chunk ChristIII54001-4703 includes lines 1511-1664 of the poem. Lines 1499-1514 are considered to be based on Matthew 25: 42-45, 67 but Richard Trask noted that the Old English is somewhat different from the Biblical account, a difference that might be best explained by the influence of Caesarius' Sermo 157.5. 68

Our hypothesis that single-leaf clades to one side of the dendrogram are diagnostic of a chunk of the poem with an outside source can explain the positions of both the fourth and the fifth chunk of Christ III. The majority of chunk ChristIII43001-4000 (77%) is composed of material adapted from Caesarius of Arles, making this chunk significantly different from the rest of the poem and thus moving it to the extreme left-most position in the dendrogram. Only 10% of chunk ChristIII54001-4703, however, is composed of material adapted from Caesarius, thus moving that chunk only to the left of the Christ III clade, not the entire dendrogram: the 90% of the fifth chunk that is not made up of adapted material is thus closer to the first three chunks of Christ III, which are not drawn from the Caesarius text.

Again, by itself this evidence is not dispositive, particularly since there are many other passages of Christ III that have external sources, and there is not a known single source for the first three chunks of Christ III, 69 but combined with the evidence of Daniel (discussed above) and Guthlac (discussed below), it supports the hypothesis that when a chunk is distinctly separated from the main trunk of the dendrogram, that chunk is likely to have an external source.

Figure 13. Representation of the relationship between chunks 4 and 5 of Christ III and their sources.

In Figure 9 above we showed that Guthlac separates quite clearly into Guthlac A and Guthlac B with the exception of GuthA43001-4000, which appears at the far left of the dendrogram, separated from all the other chunks of Guthlac. This particular chunk of the text corresponds to lines 499-676 of Guthlac A, the part of the poem where the demons drag Guthlac to the entrance of hell. The relationship between the Guthlac poems and their sources is complicated and somewhat debatable. 70 The Latin prose Vita Sancti Guthlaci was written by the monk Felix, probably soon after the saint's death in 714. 71 From this text derives almost all the other Guthlac
materials that circulated in Anglo-Saxon England, including Guthlac B. 72 There is also an Old English prose translation of Felix's Vita, which survives in a manuscript of the eleventh century. 73 Vercelli Homily 23 is an excerpt of that translation, comprised of chapters 28-32; 74 the content of Vercelli 23 corresponds roughly to the content of Guthlac A, particularly in its inclusion of the scene at the entrance to hell. The relationship of Guthlac A to Felix's Vita has not been firmly established, and critics have disagreed whether or not the poem might have been influenced by the prose Life or if the prose Life has been influenced by Guthlac A. 75 It is therefore of significant interest that the hellmouth episode contained in GuthA43001-4000 also appears in Vercelli homily 23. Although Vercelli 23 itself cannot be identified as the source of GuthA43001-4000, we can conclude that there is additional evidence, beyond the detailed analysis of previous critics, to suggest that this part of Guthlac A has a source different from that of the rest of Guthlac A, perhaps being influenced not only by the Latin Vita by Felix but also by an independently circulating Old English translation. Further detailed analysis is beyond the scope of this paper, and we are currently completing additional research to attempt to clarify the relationship of GuthA43001-4000 to the poem's sources, but for the purpose of the current argument it is enough to note that the anomalous geometry of the Guthlac dendrogram can be explained by our hypothesis that simplicifolious clades separated from the main trunk of the dendrogram indicate external sources for those parts of the text.

In Figure 7 we saw that in addition to the perfect separation of Genesis B from Genesis A, a single chunk, Gen87001-8000, appeared apart from the rest of the poem at the far left of the dendrogram. Based on the discussion above, we should expect this chunk, which is made up of lines 1079-1256 of Genesis A, to have a source different from the main source of Genesis A (the Vulgate), 76 but none of the major editions of Genesis has identified a specific external source for these lines. 77 However, upon closer examination of the lines it becomes evident that they are stylistically significantly different from their neighbors and that there may have been at least some extra-Vulgate influence upon this part of the poem, for the lines contain information from medieval traditions not present in the Vulgate. The passage includes the genealogical lists from Adam to Noah (ll. 1055-1252) giving the lineages of both Cain and Seth. Here the poet attempts to clarify the somewhat confusing Biblical material, which puts Lamech in both lineages and also has Lamech as both the killer of Cain and the father of Noah. That in one role Lamech would be cursed and in the other blessed appears to be a difficulty for the Anglo-Saxon poet, as it was for other medieval theologians. 78 The poet solves this problem by making Lamech into two different people, the first, the killer of Cain and in the line of Cain, is spelled "Lameh" (ll. 1186, 1191); the latter, the father of Noah and in the line of Seth, is spelled "Lamech" (ll. 1225, 1236). If we could find the Latin source from which the Anglo-Saxon poet drew this material, we would have significant supporting evidence for our hypothesis, but, unfortunately, we have not been able to locate the specific source, though efforts are ongoing.
Nevertheless, because this tradition does not appear in the Vulgate but does appear in the poem, we can have some confidence that for this section of the poem the poet used a source different from that of his source for the main text, though without identifying that source, it is difficult to know how much it might have contributed to the text as a whole: the poet may have only drawn the idea of having two different characters named Lamech.

We are not the first to suggest that this section of Genesis A might have an outside source. In 1900, basing his judgement on entirely different criteria, Hans Jovy suggested that lines 1055-1252 (almost identical to our chunk Gen87001-8000) and lines 1601-1701 could not be by the same poet as the rest of Genesis A because the author of the main body of the poem was too skilled to produce such inept poetry as was evident in the genealogies from Adam to Noah and from Noah to Abraham. 79 Jovy based this interpretation on what he saw as metrical and linguistic peculiarities in these sections of the poem, concluding that these two genealogical passages were written by a later reviser. 80 Jovy's conclusions, however, have not found favor with later editors, and neither Krapp nor Doane gives any indication that there is any difference between these lines and the rest of Genesis A. 81 In fact, Doane praises the Sethite genealogy for its "remarkable fidelity of structure and nomenclature" that nevertheless exhibits "considerable variation and formulaic inventiveness." 82

John W. Butcher singles out the same passage as being somewhat different from the rest of the poem, though he does not consider it to be inept or the work of a later reviser. Focusing on lines 1104-1242a, he agrees with Doane as to the quality of the passage and argues that the poet, while remaining faithful to the Vulgate source, "reaches for the lexical variety and spontaneity of the old poetic diction to paraphrase this sacred text." Thus the Anglo-Saxon poet is reacting to the repetitive form of the Biblical source not by mirroring it but instead by developing some variation through the use of both tradition-wide and idiolectical formulas and by using epithets when there are none in the Latin source. The Genesis poet, Butcher asserts, "expanded and sometimes embellished the poem with original ideas and inventive diction not found in the traditional poetic corpus." 83

The stylistic difference of the passage, therefore, may be what is causing the clade to be single-leafed: our methods lump together various kinds of difference, and so we may be simply seeing that this section of the poem is, in an attempt to imitate the style of the Vulgate in this passage, different from the rest of Genesis A. However, most of Butcher's assessment is still perfectly applicable if the text of the passage in question has a different source than the remainder of the poem. The poet's expansion and embellishment do not necessarily have to have come from ideas that were original to the poet; they could just as easily have come from ideas found in an extra-Biblical source known by the poet. Such a source could have given the poet a model for how to handle the otherwise boring and tedious repetition of this part of Genesis, and it could also have given him a model for how to deal with the problem of the lineage of Lamech. Remley concludes his analysis of the Biblical sources of Genesis A by noting the genealogical summaries among the passages that might have Old Latin parallels, 84 and we would add to this only that the passage might have Old Latin parallels or another source or tradition used by the poet to augment his source text. Note that the entire passage cannot have been drawn entirely from the Vulgate, not only for the epithets and formulaic invention discussed by Butcher, but because the two-Lamech solution developed by the Genesis A poet is not found in the Vulgate or any other version of the Bible that might have circulated in the Anglo-Saxon period but belongs instead to extra-Biblical tradition. Lexomic analysis cannot prove that Gen87001-8000 has a different source than the rest of Genesis A, but the lexomic evidence suggests this hypothesis and can be used to justify further research on the question.
If further research can support the hypothesis (particularly if scholars are able to locate a specific source for the section), then our view of the composition of Genesis might be somewhat changed, with the Genesis poet composing somewhat in the manner of the author of Christ III, selecting extra-Biblical materials with which to complement his primary narrative. 85

We can now turn to the seemingly anomalous placement of three of the four chunks of Juliana in our analysis of the possibly Cynewulfian affiliation of Guthlac B. In Figure 11 chunks 1-3 of Juliana appear in clade β, separated from the other texts in the dendrogram. At first glance this result appears to cast doubt on the lexomic methods. By the evidence of the runic signature, Juliana is by Cynewulf. How could three chunks of the poem appear to be less closely related to the rest of the Cynewulfian corpus than even Guthlac A, which no one thinks is by Cynewulf? The explanation is that Cynewulf's poetry is no less susceptible to being perturbed by an external source than are anonymous poems like Daniel, Genesis, and Christ III. The difference in this case is that the Latin source was used for the majority of the poem rather than for a small section.

It has long been known that in writing Juliana Cynewulf used "a Latin text identical with, or similar to," the Life of St. Juliana published in the Acta sanctorum. 86 The poem follows the Vita closely, although it is closer in specific arrangement of scenes, and even sentences, at the beginning of the poem (where the similarity is very close indeed) and diverges more from the Latin text as it goes on. The final 216 words in the poem (ll. 695-731), however, are not a translation from the Vita but are a confessional passage in which the poet recalls his sins, hopes for intercession, and asks the readers of the poem to remember his name. This passage, which includes the runic signature, is Cynewulf's own invention, not based at all on his source. Therefore this chunk of Juliana has, at least for these lines, a different source than the first three chunks: in this case the relationship is the inverse of that of Daniel or Christ III, with the majority of the poem coming from a Latin source and a small passage with a different source appearing in what is a single-leafed clade when Juliana alone is dendrogrammed. By itself, this only tells us that the chunk containing the runic signature passage does not have the same source as the majority of the poem, an obvious point, since the runic signature passage is not in the Vita. However, when we combine Juliana with the other runic signature poems, the chunk that contains the runic signature passage now appears in the very center of the clade of Cynewulfian texts. We interpret this placement to mean that chunk Juliana4 is more "Cynewulfian" than the passages of Juliana derived from the Latin Vita and that the dependence upon a Latin source of chunks 1-3 has so perturbed the dendrogram of those sections that the Cynewulfian affiliation of these chunks is obscured.
If we had only Juliana4, lexomic methods would group it with the other Cynewulfian texts, as the methods do with the short, rune-signed The Fates of the Apostles. 87

That all chunks of Elene and Christ II appear in clade ζ in Figure 11 indicates that these poems are more homogeneously Cynewulfian and less directly dependent upon their Latin sources than is Juliana. Christ II, although it has as its chief source the 29th Homily of Gregory the Great (on the Ascension), is consistent throughout and not tied particularly closely to the Latin phrasing of that text, which is perhaps why the Gregorian homily was not immediately identified as a source by early scholars and why Albert S. Cook could convince scholars that Bede's Hymnum canamus gloriæ was also a source for Christ II: the Anglo-Saxon poem is not obviously dependent in style on the Latin source. 88 Similarly, although Elene is based on some version of the Vita Quiriaci, "the style of Cynewulf's poem is much fuller than that of the Latin legends," suggesting that the poem either is not as significantly influenced by the Latin source as is Juliana or was more thoroughly reworked by Cynewulf. 89

Thus the apparent exception of Juliana further proves the rule that lexomic analysis is able to indicate relationships among poems and to show, at least on some occasions, when one section of a poem has a different source than the others. At this time we do not know the precise mechanism by which an external source perturbs the vocabulary or style of the poem. The influence could come through the direct influence of the style of a specific source (Daniel, Christ III, Guthlac A) or the cognitive demands of integrating an extra-Biblical tradition (Genesis A), or there could be some additional mechanism. But we can conclude with some confidence that the geometry of dendrograms could be used to screen the Old English corpus for sections of poems that have as yet unidentified external sources.

Conclusions and Caveats

The basic lexomic method of tabulating and counting the words in texts and then performing cluster analysis with those data can be used to characterize accurately the relationship between Daniel and Azarias. Similarly, the lexomic method identifies the division between Genesis A and Genesis B, Guthlac A and B, and Christ I, II, and III. Extension of the method to more complicated textual relationships between Guthlac B and the signed Cynewulfian poems produces reasonable results consistent with investigations using traditional methodologies. Seeming anomalies in the dendrograms are explained by the presence of external sources for those chunks that are substantially different from the main text of the poem. This hypothesis is supported by the results of analyses of Daniel, Christ III, and Guthlac A and not contradicted by the analysis of Genesis A (and in fact may account for some of the differing interpretations of the passage in question) and further substantiated by explaining the otherwise anomalous placement of the first three chunks of Juliana in the Cynewulfian dendrogram. We conclude, therefore, that the methods outlined above can be useful in analyzing Old English poems. Chunk sizes between 450 and 1500 words yield reasonable results, with chunks of approximately 1000 words striking a balance between fine resolution and lack of noise. Seeming anomalies in the dendrograms should be examined for the presence of possible external sources.
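The tabulating and counting steps summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's Perl scripts: the function names (cut, proportions, distance_matrix) are hypothetical, and the min_last threshold for merging a short final chunk is an assumption chosen so that the chunk sizes reported in note 42 come out the same. The sketch stops at the matrix of pairwise distances, which is the input a cluster analysis would take.

```python
from collections import Counter
from math import sqrt


def cut(words, size=1000, min_last=500):
    """Split a list of words into consecutive chunks of `size` words.

    A final remainder shorter than `min_last` words is merged into the
    preceding chunk. The threshold is an assumption for illustration.
    """
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_last:
        last = chunks.pop()
        chunks[-1] = chunks[-1] + last
    return chunks


def proportions(chunk):
    """Map each word in a chunk to its share of the chunk's total words."""
    counts = Counter(chunk)
    total = len(chunk)
    return {word: count / total for word, count in counts.items()}


def distance(p, q):
    """Euclidean distance between two word-proportion tables."""
    vocabulary = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocabulary))


def distance_matrix(chunks):
    """Pairwise distances between all chunks, as a list of rows."""
    tables = [proportions(chunk) for chunk in chunks]
    return [[distance(a, b) for b in tables] for a in tables]
```

With these defaults a text of 4823 words is cut into four chunks of 1000 words and one of 823, while a text of 3111 words is cut into two chunks of 1000 words and one of 1111.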

However, much work remains to be done. Most significantly, we do not yet know exactly what information is being captured by our distance metric and cannot tell if we are measuring content, style, or a mixture of the two. We also do not yet know if performing the analyses on lemmatized or otherwise-modified texts will make the results more or less useful. It is possible that modifications will have to be made to the method for prose texts or for poetic texts that are differently constituted. The methods also need to be adapted for screening large texts for smaller sections where there is not some independent evidence of where to look for divisions. For example, if 200 words of a text are from an outside source but are split between two chunks, the 100 words in each chunk may not perturb the dendrogram enough to be visible to analysis. 90 We also need to develop a more mathematically rigorous way of determining minimum and optimum chunk size; probably this will entail calculating the heterogeneity of vocabulary within each text and determining how small a chunk can be before the dendrogram becomes too noisy. Finally, relationships indicated by lexomic analyses need to be examined using traditional philological methods: lexomic evidence on its own is not sufficient to determine relationships.
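The arithmetic behind the 200-word example can be made concrete with a small sketch. The function name foreign_share and the word offsets are hypothetical, chosen only for illustration; the 1000-word chunk size is the one recommended above. The sketch measures how much of each chunk a borrowed span occupies, which is only a rough proxy for how strongly that span could perturb a dendrogram.

```python
def foreign_share(start, length, chunk_size, n_chunks):
    """Fraction of each chunk occupied by a borrowed span of `length` words
    that begins at 0-based word offset `start`."""
    shares = []
    for k in range(n_chunks):
        lo, hi = k * chunk_size, (k + 1) * chunk_size
        overlap = max(0, min(hi, start + length) - max(lo, start))
        shares.append(overlap / chunk_size)
    return shares


# 200 borrowed words lying wholly inside the second chunk
inside = foreign_share(start=1400, length=200, chunk_size=1000, n_chunks=3)

# the same 200 words straddling the boundary between chunks one and two
split = foreign_share(start=900, length=200, chunk_size=1000, n_chunks=3)
```

In the first case a fifth of one chunk is borrowed material; in the second, only a tenth of each of two chunks is, so the signal in any single chunk is halved.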

But despite these problems and caveats, lexomic analysis has shown itself to be a useful new weapon in the philologist's arsenal. Further development by other researchers, the use of more advanced statistical techniques, and additional work with more complex corpora will likely suggest additional modifications and refinements. The more these methods are applied, the more sophisticated they will become, and it has not escaped our notice that lexomics has the potential to address difficult and controversial questions about texts in a wide range of genres, languages, and periods.

Michael D. C. Drout, Michael J. Kahn, Mark D. LeBlanc, and Christina Nelson
Wheaton College, Norton, Mass.

Appendix

In this appendix, we provide the details of the calculation of Euclidean distance.

Suppose that there are $n$ words in the vocabulary we wish to examine. Label these words $w_1, \ldots, w_n$. In text 1, we compute $p_{1,w_1}$ as the proportion of word 1 in text 1, and so on until we compute $p_{1,w_n}$, the proportion of word $n$ in text 1. Similarly, $p_{2,w_1}$ and $p_{2,w_n}$ represent the proportions of word 1 and word $n$ in text 2, respectively. The squared Euclidean distance (dissimilarity between texts), then, is

$$D^2 = (p_{1,w_1} - p_{2,w_1})^2 + \cdots + (p_{1,w_n} - p_{2,w_n})^2,$$

which is the sum of the squares of the differences in the proportion of each word in the two texts. The measure of dissimilarity is then $D = \sqrt{D^2}$.

Suppose we are considering the vocabulary in four texts and there are only three distinct words in all four texts. (n = 3 is boring, but it is easy to compute "by hand.") Consider the proportionate use of these three words in texts 1 and 2:

Word    Text 1    Text 2
w1      .7        0
w2      .3        .5
w3      0         .5

In this case, even though there are three words in all four texts' vocabulary, text 1 only uses words 1 and 2, while text 2 only uses words 2 and 3. Then the dissimilarity, $D$, is found as:

$$D^2 = (.7 - 0)^2 + (.3 - .5)^2 + (0 - .5)^2 = .49 + .04 + .25 = .78$$

and $D = \sqrt{.78} \approx .8832$.

From this, we see how the proportional occurrence of each word in a text contributes to its measure of dissimilarity from another text's word usage.
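The hand calculation can be checked with a short Python sketch. The dictionaries simply restate the proportions tabulated above for texts 1 and 2, and the function name euclidean is chosen here for illustration; the statistical work reported in the paper was carried out in R (note 14), not with this code.

```python
from math import sqrt

# Proportions of the three words in texts 1 and 2, as tabulated above
text1 = {"w1": 0.7, "w2": 0.3, "w3": 0.0}
text2 = {"w1": 0.0, "w2": 0.5, "w3": 0.5}


def euclidean(p, q):
    """Square root of the summed squared differences in word proportions."""
    return sqrt(sum((p[word] - q[word]) ** 2 for word in p))


d = euclidean(text1, text2)
print(round(d ** 2, 2), round(d, 4))  # 0.78 0.8832
```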

Footnotes

The National Endowment for the Humanities supported this research with a Digital Humanities Start-Up Grant (HD-50300-08). Generous contributions from the Mellon Foundation and Wheaton College were instrumental in support of our initial collaboration. We would also like to thank Associate Provost Elita Pastra Landis, Betsey Dexter Dyer, Amos Jones, Neil Kathok, Sarah Downey, Yvette Kisor, and Scott Kleinman for their assistance, support, and encouragement. An anonymous referee for JEGP and Charles D. Wright significantly improved the paper with their questions and suggestions.

1. The bases are adenine, cytosine, thymine, and guanine, abbreviated as A, C, T, and G.

2. The parallels between the disciplines were noted at least as long ago as 1995 by the philosopher Daniel Dennett, Darwin's Dangerous Idea: Evolution and the Meanings of Life (New York: Simon and Schuster, 1995), pp. 136-39. For example, bioinformatics traces descent by shared error, is interested in hapax legomena, and seeks to reconstruct hypothetical original texts. B. D. Dyer, M. J. Kahn, and M. D. LeBlanc, "Classification and Regression Tree (CART) Analyses of Genomic Signatures Reveal Sets of Tetramers that Discriminate Temperature Optima of Archaea and Bacteria," Archaea, 2 (2007), 159-67.

3. The term was originally coined by Betsey Dexter Dyer in 2002 and first appeared in Genome Technology, 1.27 (November 1, 2002). Since this time, "lexomics" has appeared on the internet and in some publications without attribution.

4. There can be complexities. For example, the word "umborwesende" in Beowulf, line 46, could be a single compound or two simplex words "umbor wesende." Fr. Klaeber, Beowulf and the Fight at Finnsburg, 3d ed. (Lexington: D.C. Heath, 1950), p. 2; Joseph Bosworth and T. Northcote Toller, An Anglo-Saxon Dictionary (1898; repr., Oxford: Clarendon Press, 1954), p. 1088. The boundaries of a "word" are more difficult to define in Oral Traditional studies, where the word "word" can mean "utterance" and so is not fixed to what we think of as one word. John Miles Foley, Traditional Oral Epic: The Odyssey, Beowulf, and the Serbo-Croatian Return Song (Berkeley: Univ. of California Press, 1990), pp. 44-50. Such problems, however, are not particularly significant for the analysis we are performing on texts in which there is general and widespread agreement about the boundaries of the vast majority of the words.

5. The Dictionary of Old English can be accessed at http://www.doe.utoronto.ca/index.html; a subscription is required. The tools on the lexomics.wheatoncollege.edu website produce data about the corpus but do not distribute the corpus as a whole or quotations from the corpus. Researchers who download the scripts and wish to modify them or use them differently will need to purchase a copy of the Old English corpus directly from the Dictionary of Old English.

6. See Table 1 for a partial list of the words and associated counts for the poem Azarias from the Exeter Book, DOE #A03.003_Az_T00130.

7. Die Angelsächsischen Prosabearbeitungen der Benediktinerregel, ed. Arnold Schröer, Bibliothek der angelsächsischen Prosa (1885; repr. Darmstadt: Wissenschaftliche Buchgesellschaft, 1964). See also Angus Cameron and Roberta Frank, A Plan for the Dictionary of Old English (Toronto: Univ. of Toronto Press, 1973), pp. 121-22.

8. The analyses below discuss the results of experiments performed on unconsolidated texts, but we tested consolidated versions as well and found, in these particular cases, no differences in results.

9. Work is ongoing to determine the effect of lemmatization on lexomic analysis (see Scott Kleinman, Michael D. C. Drout, Michael Kahn, and Mark D. LeBlanc, "Lemmatization and Lexomic Analysis," forthcoming). It may well be that lexomic analysis of lemmatized texts yields valuable information different from that yielded by lexomic analysis of edited or diplomatic texts.

10. J. F. Burrows, "Questions of Authorship: Attribution and Beyond," Computers and the Humanities, 37 (2003), 5-32.

11. David L. Hoover, "Testing Burrows's Delta," Literary and Linguistic Computing, 19.4 (2004), 453-75.

12. Technically, the Perl scripts use a hash table of arrays. Interested readers are directed to the documented software for specifics.

13. K. Mardia, J. Kent, and J. Bibby, Multivariate Analysis (London: Academic Press, 1980).

14. R Development Core Team, R: A Language and Environment for Statistical Computing (Vienna: R Foundation for Statistical Computing, 2009), http://www.R-project.org.

15. We also investigated use of the Manhattan and Canberra metrics and found no discernable difference in the final clustering results.

16. The terminology is borrowed from evolutionary biology and was developed by Willi Hennig, Phylogenetic Systematics, trans. D. Dwight Davis and Rainer Zangerl (Urbana: Univ. of Illinois Press, 1966).

17. We give a more detailed explanation and a brief example in the appendix.

18. In our lexomic analyses, the number of words is quite large, so it is difficult for any single word to make two texts highly similar or dissimilar. Instead, it takes a great deal of commonality (or difference) in the proportionate use of a wide array of words to make for large similarity (or distance) between two texts.

19. This term is taken from botany and is perhaps most familiar from the many plant species named "simplicifolia."

20. Helmut Gneuss, Handlist of Anglo-Saxon Manuscripts: A List of Manuscripts and Manuscript Fragments Written or Owned in England up to 1100 (Tempe: Arizona Medieval and Renaissance Texts and Studies, 2001), no. 640; N. R. Ker, Catalogue of Manuscripts Containing Anglo-Saxon (Oxford: Clarendon Press, 1957; repr. with supplement, 1990), no. 334.

21. Dated s. x/xi, xi1 by Ker, Catalogue of Manuscripts, pp. 406-8. For a dating 960x990, see Leslie Lockett, "An Integrated Re-examination of the Dating of Oxford, Bodleian Library, Junius 11," Anglo-Saxon England, 31 (2002), 141-74.

22. All quotations of Daniel are taken from The Junius Manuscript, ed. George Philip Krapp, The Anglo-Saxon Poetic Records, 1 (New York: Columbia Univ. Press, 1931). Translations of Old English are our own. See also Daniel and Azarias, ed. R. T. Farrell (London: Methuen, 1974).

23. MS Exeter, Cathedral Library, 3501; Gneuss, no. 257; Ker, no. 116. Scholars assume this manuscript to be "an mycel englisc boc be gehwilcum þingum on leoþwisan geworht" described in Bishop Leofric's donation list. For additional discussion, see Patrick W. Conner, Anglo-Saxon Exeter. A Tenth-Century Cultural History (Woodbridge: Boydell, 1993), pp. 1-20.

24. All quotations of Azarias are taken from The Exeter Book, ed. George Philip Krapp and Elliott Van Kirk Dobbie, The Anglo-Saxon Poetic Records, 3 (New York: Columbia Univ. Press, 1936). Paul Remley proposes renaming the Exeter Book poem "The Three Youths," modifying Bernard Muir's title, "The Canticle of the Three Youths." The bibliographical problems created by these revisions in nomenclature outweigh any benefits of the greater accuracy of the title(s), so we retain the traditional name Azarias in our discussion. Paul G. Remley, "Daniel, the Three Youths Fragment and the Transmission of Old English Verse," Anglo-Saxon England, 31 (2002), 81-140; The Exeter Anthology of Old English Poetry: An Edition of Exeter Dean and Chapter MS 3501, ed. Bernard Muir, 2d ed., 2 vols. (Exeter: Exeter Univ. Press, 2000), I, 157.

25. Israel Gollancz, The Cædmon Manuscript of Anglo-Saxon Biblical Poetry, Junius XI in the Bodleian Library (Oxford: Oxford Univ. Press, 1927), pp. xc-xci.

26. See Krapp, ed., Junius Manuscript, pp. xxxi-xxxiii; Farrell, Daniel and Azarias, pp. 40-45; Kenneth Sisam, "Notes on Old English Poetry," Review of English Studies, 22 (1946-47), 257-68. Remley proposes a complex textual history with as many as eight stages of transmission between the common exemplar of the two poems and the version of Azarias in the Exeter Book. "Daniel, the Three Youths Fragment," p. 140.

27. If we had consolidated the texts, thorn and eth would both appear as thorn, and both appearances of þa would therefore be bolded.

28. If we had consolidated the texts, both words would be forced to and, and both appearances of the word would therefore be bolded.

29. The final chunk is slightly shorter simply because Daniel does not divide evenly by ten. The script, cutter.pl, provides options for dealing with the ending chunk size of a text that does not divide evenly.

30. Performing this experiment with Azarias as two 532-word chunks yielded the same results as the experiment with Azarias as one 1064-word chunk except that Azarias appears as two chunks rather than one; the relationship to the rest of Daniel was the same.

31. In our ongoing research we are working on determining an optimal chunk size based on a measure of the heterogeneity of the vocabulary of a given text.

32. We are also aware that both Burrows and Hoover restrict their analysis to the n-most frequent words in a corpus and often remove pronouns, dialogue, or words in which one subset (comparable to our "chunk") contains more than 70% of the words in the entire corpus. Such modifications of the data set appear to work for the problems addressed by Burrows and Hoover, but as yet we have not found them necessary, although we are performing additional experiments to see where such approaches might be valuable.

33. Although the method described by Treschow et al. is more complicated, it also works to reduce the influence of uncommon words. Michael Treschow, Paramjit Gill, and Tim B. Swartz, "King Alfred's Scholarly Writings and the Authorship of the First Fifty Prose Psalms," The Heroic Age, 12 (2009), http://www.heroicage.org/issues/12/treschowgillswartz.php.

34. For example, a method that reduced the data set by eliminating words not in all or most chunks would not be influenced by words that are hapax legomena. This may not be a relevant factor when dealing with Modern English texts but could be for works in Anglo-Saxon.

35. Quotations from Genesis are taken from The Junius Manuscript, ed. Krapp.

36. Eduard Sievers, Der Heliand und die angelsächsische Genesis (Halle: Lippert, 1875), pp. 6-17. As R. D. Fulk notes, William Conybeare had first noticed the difference between Genesis A and Genesis B, but he had not deduced the Old Saxon source. William D. Conybeare, ed., Illustrations of Anglo-Saxon Poetry (1826; repr., New York: Haskell House, 1964), p. 188; R. D. Fulk, A History of Old English Meter (Philadelphia: Univ. of Pennsylvania Press, 1992), p. 49.

37. There are in total 337 lines of the verse paraphrase of Genesis in the Vatican fragment, some of which correspond to lines 790-820 of Genesis B. The Vatican manuscript also contains 61 lines of the Heliand. Karl Zangemeister and Wilhelm Braune, Bruchstücke der altsächsischen Bibeldichtung aus der Bibliotheca Palatina (Heidelberg: Verlag von G. Koester, 1894).

38. The Exeter Book, ed. Krapp and Dobbie, pp. 49-88. The poem is also edited in The Guthlac Poems of the Exeter Book, ed. Jane Roberts (Oxford: Oxford Univ. Press, 1979).

39. Laurence K. Shook, "The Burial Mound in Guthlac A," Modern Philology, 58 (1960), 1-10; Shook, "The Prologue of the Old English Guthlac A," Mediaeval Studies, 23 (1961), 294-304. For the critical history of the poem, see Roberts, The Guthlac Poems of the Exeter Book, pp. 12-19.

40. It is universally agreed that Guthlac B is based on Felix's Vita. There is no such agreement about the source of Guthlac A or the relationship of that section of the poem to the treatment of the same material given in Vercelli Homily 23. The Latin Life is edited by Bertram Colgrave, Felix's Life of Saint Guthlac: Introduction, Text, Translation and Notes (Cambridge: Cambridge Univ. Press, 1956), pp. 26-44. For discussion of the manuscripts of this text and their relationships, see Jane Roberts, "An Inventory of Early Guthlac Materials," Mediaeval Studies, 32 (1970), 193-233. For Vercelli Homily 23, see The Vercelli Homilies, ed. Donald Scragg, EETS, o.s. 300 (Oxford: Early English Text Society, 1992), pp. 383-92.

41. Krapp and Dobbie, eds., The Exeter Book, p. xxxii.

42. The first four chunks of Guthlac A are 1000 words; the fifth is 823 words. The first two chunks of Guthlac B are 1000 words; the third is 1111 words.

43. Sarah Downey, Michael D. C. Drout, Michael J. Kahn, and Mark D. LeBlanc, "'Books tell us': Lexomic and Traditional Evidence for the Sources of Guthlac A," forthcoming in Modern Philology.

44. First noted by Kemble in 1840. John Mitchell Kemble, "On Anglo-Saxon Runes," Archaeologia, 28 (1840), 327-72.

45. F. Dietrich, "Cynevulfs Christ," Zeitschrift für deutsches Alterthum, 9 (1853), 193-214.

46. Albert S. Cook, The Christ of Cynewulf: A Poem in Three Parts: The Advent, The Ascension and the Last Judgment (Boston: Ginn and Company, 1900), p. xvi.

47. Eduard Sievers, "Zur Rhythmik des germanischen Alliterationsverses," Beiträge zur Geschichte der deutschen Sprache und Literatur, 12 (1885), 454-82.

48. R. D. Fulk, "Cynewulf: Canon, Dialect, and Date," in The Cynewulf Reader, ed. Robert Bjork (New York: Routledge, 2001), p. 5.

49. Franz Charitius, "Über die angelsächsischen Gedichte vom hl. Guthlac," Anglia, 2 (1879), 265-308; Matthias Cremer, Metrische und sprachliche Untersuchung der altenglischen Gedichte Andreas, Guðlac, Phoenix (Elene, Juliana, Crist). Ein Beitrag zur Cynewulffrage (Bonn: Carl Georgi, 1888); Frank J. Mather, "The Cynewulf Question from a Metrical Point of View," Modern Language Notes, 7 (1892), 193-213; Moritz Trautmann, Kynewulf der Bischof und Dichter, Bonner Beiträge zur Anglistik, 1 (Bonn: P. Hansteins Verlag, 1898), pp. 43-70. The nineteenth-century arguments are discussed by Cook, Christ of Cynewulf, pp. lxii-lxiii. For more recent discussion, see Kenneth Sisam, "Dialect Origins of the Earlier Old English Verse," in his Studies in the History of Old English Literature (Oxford: Oxford Univ. Press, 1953), pp. 119-39. See also The Guthlac Poems of the Exeter Book, ed. Roberts, pp. 60-62; and Fulk, "Cynewulf: Canon, Dialect, and Date," pp. 5-6. In 1900 Albert S. Cook wrote: "The Guthlac is perhaps the dullest of Old English poems, or at least of the longer ones, so that it cannot even sustain a comparison with Juliana. For this reason one would be tempted to affirm that Cynewulf could have had nothing to do with it. Yet Kemble, Thorpe, Dietrich, Grein, Riger, Sweet, Ten Brink, Lefèvre, D'Ham and Brook all assign it to him" (The Christ of Cynewulf, p. lxii).

50. Fulk, "Cynewulf: Canon, Dialect, and Date," p. 6; and see also Fulk, A History of Old English Meter.

51. Andy Orchard, "Both Style and Substance: The Case for Cynewulf," in Anglo-Saxon Styles, ed. Catherine E. Karkov and George Hardin Brown (Albany: State Univ. of New York Press, 2003), pp. 294-96.

52. We must also admit that we are wary of being drawn into authorship debates at this stage of our research. Pioneering work in computer-based stylistics has focused rather intently on problems of authorship perhaps to the exclusion of other important issues.

53. Discussed in more detail below.

54. The Guthlac Poems of the Exeter Book, ed. Roberts, p. 41.

55. Orchard, "Both Style and Substance," pp. 294-96.

56. Roy Liuzza, "The Old English Christ and Guthlac: Texts, Manuscripts and Critics," Review of English Studies, 41 (1990), 1-11.

57. Since Guthlac B is acaudate, the lack of a runic signature is not as significant as it would be if it were a complete poem.

58. Our research has been strongly influenced by Burrows's idea that computational stylistics is a "middle game technique" with much work using traditional methodologies needed both before and after the application of computational analysis. See also John Burrows, "Questions of Authorship: Attribution and Beyond," Computers and the Humanities, 37 (2003), 5-32.

59. We are currently engaged in research that may shed additional light on the problem.

60. Remley, "Daniel, the Three Youths Fragment," pp. 89-92, 114-15, 126-28.

61. The critical history is discussed in Paul G. Remley, Old English Biblical Verse: Studies in Genesis, Exodus, and Daniel, Cambridge Studies in Anglo-Saxon England, 16 (Cambridge: Cambridge Univ. Press, 1996), pp. 338-49; Hugo Balg, Der Dichter Cäedmon und seine Werke (Bonn: C. Georgi, 1882), p. 27. Krapp, ed., The Junius Manuscript, p. xxxii.

62. Farrell, Daniel and Azarias, pp. 23-29.

63. Remley, Old English Biblical Verse, pp. 356-70.

64. Remley, Old English Biblical Verse, pp. 392-404.

65. The Christ of Cynewulf, ed. Cook, pp. 210-11. For this sermon of Caesarius of Arles, see Patrologia Latina [PL], 39, 2207, and for a new edition, which in some senses is closer to Christ III, see Sancti Caesarii Episcopi Arelatensis Opera Omnia, ed. Germain Morin, 2 vols. (Bruges: Maretioli, 1937-42), I, 242-43. The relevant Latin passage is quoted by Cook, Irving, and Biggs.

66. Irving notes that the text of Christ III is closer to the Latin text printed by Morin in CCSL than the version cited by Cook and printed by Migne in the PL. Edward B. Irving, "Latin Prose Sources for Old English Verse," JEGP, 56 (1957), 588-95. See also Frederick M. Biggs, The Sources of Christ III: A Revision of Cook's Notes, Old English Newsletter Subsidia, 12 (Binghamton: State Univ. of New York at Binghamton, 1986), pp. 30-31.

67. Franz Dietrich, "Cynevulfs Christ," Zeitschrift für deutsches Altertum und deutsche Literatur, 9 (1853), 193-214.

68. Richard Trask, "The Last Judgment of the Exeter Book: A Critical Edition" (Ph.D. diss., Univ. of Illinois at Urbana-Champaign, 1972). See also Biggs, The Sources of Christ III, p. 33. Sermo 157.5 is printed in PL, 39, 1896 and CCSL, 104, 643.

69. Cook believed that an anonymous Latin hymn cited by Bede, "Apparebit repentina dies magna Domini," was the principal source of the poem; "Cynewulf's Principal Source for the Third Part of Christ," Modern Language Notes, 4 (1889), 171-76. Gustav Grau argued that the hymn could be a source for some of the poet's ideas, but that the parallels Cook adduced were not sufficient to show that the poem depended directly upon the hymn; Gustav Grau, Quellen und Verwandtschaften der älteren germanischen Darstellungen des jüngsten Gerichtes, Studien zur englischen Philologie, 31 (Halle: Max Niemeyer, 1908), pp. 48-52. And see Biggs, The Sources of Christ III, p. 1.

70. For discussions of date and cultural context, see Patrick W. Conner, "Source Studies, the Old English Guthlac A and the English Benedictine Reformation," Revue Bénédictine, 103 (1993), 380-413; Christopher A. Jones, "Envisioning the Cenobium in the Old English Guthlac A," Mediaeval Studies, 57 (1995), 259-91; Sarah Downey, "Too Much of Too Little: Guthlac and the Temptation of Excessive Fasting," Traditio, 63 (2008), 89-127.

71. Colgrave, Felix's Life of Saint Guthlac.

72. Jane Roberts, "An Inventory of Early Guthlac Materials," Mediaeval Studies, 32 (1970), 193-233.

73. Jane Roberts, "The Old English Prose Translation of Felix's Vita sancti Guthlaci," in Studies in Earlier Old English Prose, ed. Paul Szarmach (Albany: State Univ. of New York Press, 1986), pp. 363-79.

74. Scragg, The Vercelli Homilies, p. 381.

75. G. H. Gerould, "The Old English Poems on St. Guthlac and their Latin Source," Modern Language Notes, 32 (1917), 77-89; Roberts, The Guthlac Poems of the Exeter Book, pp. 19-29; Orchard, "Both Style and Substance: The Case for Cynewulf," pp. 294-97.

76. Adolf Ebert, "Zur angelsächsischen Genesis," Anglia, 5 (1882), 124-33. Note that as early as 1885 Hönncher argued that the poem was not merely a straight paraphrase of the Vulgate source: Erwin Hönncher, "Über die Quellen der angelsächsischen Genesis," Anglia, 8 (1885), 41-84.

77. Junius Manuscript, ed. Krapp; A. N. Doane, Genesis A: A New Edition (Madison: Univ. of Wisconsin Press, 1978), pp. 250-53.

78. See Oliver F. Emerson, "Legends of Cain, Especially in Old and Middle English," PMLA, 21 (1906), 874.

79. "So scheint es schon auf den ersten blick unmöglich, dass ein so begabter dichter, wie der verfasser der urbestandteile der Genesis zweifellos war, so geschmacklos sein konnte, uns in mehr als 200 versen die geschlechtertafeln von Adam bis Noah v. 1055-1252 (1285) und wieder von Noah bis Abraham v. 1601-1701 vorzuführen." Hans Jovy, "Untersuchungen zur altenglischen Genesisdichtung," Bonner Beiträge zur Anglistik, 5 (1900), 5.

80. Jovy, "Untersuchungen zur altenglischen Genesisdichtung," pp. 8-9.

81. Krapp mentions Jovy's opinion that Beowulf is older than Genesis A. The Junius Manuscript, p. xxvi.

82. Doane, Genesis A, p. 251.

83. John W. Butcher, "Formulaic Invention in the Genealogies of the Old English Genesis A," in Comparative Research on Oral Traditions: A Memorial for Milman Parry, ed. John Miles Foley (Columbus, OH: Slavica Press, 1987), pp. 73-92. We are grateful to an anonymous referee from JEGP for directing us to Butcher's and Jovy's work.

84. Remley, Old English Biblical Verse, pp. 148-49.

85. Detailed analysis of the relationship of this section of Genesis to the rest of the poem is beyond the scope of this paper, although we are investigating the matter further. We note, however, that in very preliminary research we see similar evidence that one passage (ll. 81-188) of Christ and Satan has an external source. If this were to be substantiated, we would have evidence of external sources for subsections of three of the four poems in the Junius Manuscript: Genesis, Daniel, and Christ and Satan. We have not yet investigated Exodus.

86. Krapp and Dobbie, eds., Exeter Book, pp. xxxvi-xxxvii; Johannes Bollandus et alii, "Acta auctore anonymo ex xi veteribus MSS," Acta Sanctorum Februarius II, Dies 16, 875-79, http://acta.chadwyck.co.uk/; a subscription is required. For the critical history of debate about the particular version of the Vita that Cynewulf used for the poem, see Claes Schaar, Critical Studies in the Cynewulf Group (New York: Haskell House, 1967), pp. 27-31.

87. In 1975, using a very different statistical methodology that was constrained by computing resources at the time, Sandra Harmatiuk concluded that the author of Juliana was not the same individual who composed the other poems with runic signatures. Sandra J. Harmatiuk, "A Statistical Approach to Some Aspects of the Style in the Signed Poems of Cynewulf" (Ph.D. diss., Univ. of Notre Dame, 1975), pp. 162-71. It is possible that her results were influenced by Juliana being more dependent upon its Latin source than the other signed poems.

88. Cook, The Christ of Cynewulf, pp. 115-22; Schaar, Critical Studies, pp. 32-34.

89. Schaar, Critical Studies, pp. 24-25.

90. There are sophisticated statistical techniques that might be employed to detect such perturbations, but they will need to be adapted to these particular problems.

Additional Information

ISSN: 1945-662X
Print ISSN: 0363-6941
Pages: 301-336
Launched on MUSE: 2011-07-23
Open Access: No