A New Algorithm for Extracting Formulas from Poetic Texts and the Formulaic Density of Russian Bylinas
Ever since the promulgation of Milman Parry’s formulaic theory of epic poetry in the works of Francis Magoun (1953) and Albert Lord (1960), the problem of estimating the formulaic density of a text or a group of texts has been frequently addressed in studies of oral and ancient poetry. Different estimates have been proposed for the formulaic densities of Homeric epics and hymns (Edwards 1988), Middle English alliterative poetry (Ward 1986), pre-Islamic Arabic poetry (Monroe 1972), and Russian bylinas (Uxov 1975; Arant 1990), to name but a few lines of enquiry. The early work in this field was characterized by widely divergent results (cf. the 90% ratio of formulas in the Iliad proposed by Lord as compared to the 57.5% ratio of formulas in Homeric texts in Minton 1975), largely due to the methodology used by the scholars. Since they were not able to count the total number of formulas in a text of any considerable length, let alone a corpus of texts, the scholars resorted to sampling procedures. In most cases this meant choosing a representative body of verse lines and then calculating the proportion of those lines that are also found elsewhere in the same corpus. The advent of computer technologies providing easily searchable full-text corpora could have led to a new consensus in this field. However, the statistical study of formulas in poetic texts fell into decline at the end of the 1970s, and the early computerized efforts in this field (Vikis-Freibergs and Freibergs 1978a; Strasser 1984) were not followed up.
In fact, as far as I am aware, Vikis-Freibergs and Freibergs’ 1978 paper on the formulaic density of Latvian folk songs about the sun is the only published attempt at formulating a strict procedure for measuring the formulaic density of a poetic text that is amenable to implementation as a computer program. The scholars maintain that their method, “while developed on one particular corpus . . . can be readily adapted to other computer-accessible [corpora]” (Vikis-Freibergs and Freibergs 1978a:330). This is indeed true, but, as I will show in the following, their algorithm is not entirely satisfactory as far as the extraction of formulas itself is concerned. I instead propose a new algorithm, one that is both more flexible and more powerful, and report the results of its application to a corpus of Russian bylina epics.
Vikis-Freibergs and Freibergs’ Algorithm
The algorithm used by Vikis-Freibergs and Freibergs to estimate the formulaic density of Latvian sun-songs is rather straightforward. All the texts from the Montreal corpus of Latvian Sun-songs (Vikis-Freibergs and Freibergs 1978b) are entered into a computer and stored in memory. Frequency counts are then taken of overlapping word pairs and word triplets from each line (that is to say, n-1 word pairs and n-2 word triplets are taken from a line of n words) and of larger non-overlapping units: single lines and line couplets. The extracted units are then arranged in the form of two printouts: the first presents the units in order of decreasing frequency of occurrence, and the second gives an alphabetical listing.
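The counting step described above can be illustrated with a minimal sketch. It is not the original program of Vikis-Freibergs and Freibergs: the function names, the lower-casing and whitespace tokenization, the pairing of lines into couplets (lines 1-2, 3-4, and so on within each text), and the two-text toy corpus are all assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """Return the overlapping n-word sequences of one line.

    A line of len(words) words yields len(words) - n + 1 sequences,
    i.e. n-1 pairs and n-2 triplets in the notation of the text.
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_units(texts):
    """Count pairs, triplets, lines, and couplets in a corpus.

    `texts` is a list of texts, each given as a list of line strings.
    Tokenization (lower-casing, splitting on whitespace) is an assumption.
    """
    counts = {
        "pair": Counter(),
        "triplet": Counter(),
        "line": Counter(),
        "couplet": Counter(),
    }
    for text in texts:
        lines = [tuple(line.lower().split()) for line in text]
        for words in lines:
            counts["pair"].update(ngrams(words, 2))
            counts["triplet"].update(ngrams(words, 3))
            counts["line"][words] += 1
        # Non-overlapping couplets: lines (1, 2), (3, 4), ... of each text.
        # This particular segmentation is an assumption.
        for i in range(0, len(lines) - 1, 2):
            counts["couplet"][(lines[i], lines[i + 1])] += 1
    return counts


def by_frequency(counter):
    """First printout: units in order of decreasing frequency."""
    return sorted(counter.items(), key=lambda item: (-item[1], item[0]))


def alphabetical(counter):
    """Second printout: units in alphabetical order."""
    return sorted(counter.items())


# Invented toy corpus, for illustration only.
texts = [
    ["the bright sun rises", "over the green hill"],
    ["the bright sun sets", "over the green hill"],
]
counts = count_units(texts)

for unit, freq in by_frequency(counts["pair"]):
    print(freq, " ".join(unit))
```

In this toy corpus the pair "the bright" is counted twice and the line "over the green hill" twice, while each couplet occurs only once.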
The data obtained in this way can be used to estimate the formulaic density of a text, understood as the proportion of units occurring at least n times in the corpus (n here denotes a frequency threshold, not the number of words in a line). Applying the program to the Latvian sun-songs and using 2 as the lower threshold, Vikis-Freibergs and Freibergs obtained the following results: of all the word pairs in the corpus, 63.1% are formulaic, and the same holds for 46.7% of word triplets, 43.5% of lines (the last two evidently constitute a single formulaic level), and 20.8% of couplets.
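The density estimate itself can be sketched as follows. The sketch assumes that the proportion is taken over occurrences (tokens) rather than over distinct units (types), which the description above does not settle. The function name `formulaic_density` and the toy counts are hypothetical and serve only as an illustration.

```python
from collections import Counter


def formulaic_density(counter, threshold=2):
    """Share of unit occurrences whose unit occurs at least `threshold` times.

    `counter` maps each unit (e.g. a word pair) to its corpus frequency.
    Counting by occurrences rather than by distinct units is an assumption.
    """
    total = sum(counter.values())
    if total == 0:
        return 0.0
    formulaic = sum(freq for freq in counter.values() if freq >= threshold)
    return formulaic / total


# Invented frequencies of word pairs, for illustration only.
pair_counts = Counter({
    ("the", "bright"): 2,
    ("bright", "sun"): 2,
    ("sun", "rises"): 1,
    ("sun", "sets"): 1,
})

# Four of the six pair occurrences belong to pairs that occur at least twice.
density = formulaic_density(pair_counts, threshold=2)
print(f"{density:.1%}")
```

Raising the threshold lowers the estimate: with `threshold=3` none of the toy pairs would count as formulaic.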
Unfortunately, the results obtained in this way, though in clear numeric form, are hard to interpret. First, formulaic word pairs and triplets can overlap since, by design, overlapping word pairs and triplets are all counted independently. To avoid over-counting, one must carefully scan all the resulting formulaic units, which moreover must be indexed by their position in the corpus. The data of the last kind are not...