
  • IceMorph: An Automated Morphological Analyzer and English-Language Lookup Tool for Old Icelandic
  • Timothy R. Tangherlini, Aurelijus Vijūnas, Kryztof Urban, and Peter M. Broadwell

Introduction

The advent of inexpensive computing and the creation of large machine-actionable corpora consisting of well-structured digital texts have made it possible to analyze significant amounts of text (> 1,000,000 tokens) and mark it for morphosyntactic features rapidly, automatically, and with a high degree of accuracy (> 80 percent). Although the problem of automatically tagging text with part-of-speech (POS) information has been largely solved for languages with little morphonological complexity, more complex languages, such as Old Icelandic (OIc) and other ancient languages, continue to pose problems for automated systems. Despite these difficulties, rich morphosyntactic markup that includes lemmatization holds great promise for both linguistic and textual scholarship. Accurate markup would enable the development of sophisticated online study environments that allow researchers to perform complex searches, make comparisons across multiple texts, and generate calculations concerning word-use and syntactical patterns. Our work, focusing on Old Icelandic, confirms that even for morphonologically complex Indo-European languages, the information gain from automatic morphosyntactic analysis of texts, measured as the percentage of correctly tagged tokens, sentences, and complete texts over the extant corpus, represents a marked improvement over previously available hand-marked texts (Rögnvaldsson and Helgadóttir 2008, 2011).

Even the most detailed and accurate indexes produced in the past centuries—such as Ordförrådet i de älsta isländska handskrifterna (Larsson 1891), which provides an accurate and exhaustive word-form index for a number of the oldest Old Icelandic manuscripts (ranging from late twelfth- to mid-thirteenth-century manuscripts)—offer only minimal coverage when compared to the very large number of extant Old Icelandic texts. For a researcher interested in the study of the entire Old Icelandic corpus (or a large sub-corpus of Old Icelandic literature), these early handbooks, no matter how accurately compiled, are of limited use. Unfortunately, it is not economically feasible to extend the earlier practice of manual encoding to a greater number of manuscripts; the manual compilation of handbooks is costly and requires tremendous amounts of time, expertise, and energy. The old “paper-and-pen” approach does not, to borrow a term from computer science, “scale” well.

A dream of many researchers in Old Icelandic is to be able to work with a large number of texts (and manuscript witnesses to texts), or even a comprehensive corpus, that include the high level of morphosyntactic detail of the early handbooks mentioned above. Similarly, historical linguists (especially syntacticians) are eager to work with a much larger parsed corpus of Old Icelandic texts than is currently available. Recent work, such as that of the Icelandic Parsed Historical Corpus group (IcePaHC) (Wallenberg et al. 2011), is a major step toward making such resources available, as it provides a considerable number of texts tagged in a semi-supervised fashion and moves us closer to a comprehensive parsed Old Icelandic corpus. Yet it is unlikely that IcePaHC alone will provide adequate coverage for Old Icelandic textual research, in part because it is focused on the historical development of Icelandic up through the present, and in part because it provides limited lemmatization of the texts. As such, IcePaHC diverges from our project, which has as its sole focus the morphosyntactic analysis and lemmatization of Old Icelandic texts. We believe that the computational methods developed by our group can augment those of IcePaHC and others, and have the potential not only to extend the necessarily limited scope of the earlier historical handbooks, but also to increase considerably the number of richly marked texts available to researchers.

Automatic morphosyntactic analysis of Old Icelandic offers an efficient method for accurately tagging millions of tokens in the growing corpus of machine-actionable texts. Rögnvaldsson and Helgadóttir, for instance, estimate the total number of tokens in their target Old Icelandic corpus at ~1.6 million (Rögnvaldsson and Helgadóttir 2011, 67). This estimate is only a fraction of the overall Old Icelandic corpus, as their corpus does not include the poetic corpus, the Kings...
