Book History in the Early Modern OCR Project, or, Bringing Balance to the Force

The Early Modern OCR Project (eMOP), funded by a development grant from the Andrew W. Mellon Foundation to Texas A&M University, is one of the first large-scale digital humanities (DH) projects to use book history as a solution to a digital problem. This project in a sense inverts the typical view of the relationship between book history and the digital humanities, in which DH projects are perceived as providing access to an otherwise hidden or inaccessible past. Ours are not engines that crunch large amounts of bibliographical data; instead, they find their power source in that bibliographical data. In the case of eMOP, as this essay will discuss, the relationship between the digital and the bibliographic is dialogic and reciprocal: while the project's goals, angles of approach, and ethos of interdisciplinarity are all characteristic of DH, it is only through an acknowledged use of book history scholarship and methods that the project's ends will be accomplished. Book history, our corner of eMOP, represents two foundational nodes of the project. First, we are identifying specific, minutely variant typefaces in order to distinguish as best we can between the myriad versions of the standard Roman typeface in early modern books.1 Second, we are studying type founders and foundries to trace the flow of fonts into and through London. Through this research we hope to realize the goal of eMOP: the automation of a process by which trained optical character recognition (OCR) engines might more accurately "read" the images of early modern book pages in, for example, Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO). Ultimately our work will be formalized in a database that serves as the hub of this automated OCR process: the printer and typographical data will act as a traffic cop of sorts, directing the properly trained OCR engine to read the appropriate page images.
In fact, a central tenet of the Early Modern OCR Project is that training OCR engines to recognize the letterforms in specific font sets will improve the accuracy of the OCR output (the resultant text files) when these engines are called upon to scan page images printed in those typefaces.
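
The "traffic cop" role of the database can be illustrated with a short sketch. This is not eMOP's implementation: the printer names, typeface labels, engine names, and the `choose_engine` function are all hypothetical, invented here only to show how printer and typographical records might route a page image to an engine trained on the matching typeface, with a generic fallback when no record applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageImage:
    """A page image with the bibliographic facts used for routing (hypothetical schema)."""
    image_id: str
    printer: str
    year: int


# Hypothetical records: the typeface a printer is known to have used in a span of years.
TYPEFACE_BY_PRINTER = {
    "Printer A": [(1600, 1620, "roman-variant-1"), (1621, 1640, "roman-variant-2")],
    "Printer B": [(1610, 1650, "blackletter-1")],
}

# Hypothetical mapping from a typeface to the OCR engine trained on its letterforms.
ENGINE_BY_TYPEFACE = {
    "roman-variant-1": "engine-roman-variant-1",
    "roman-variant-2": "engine-roman-variant-2",
    "blackletter-1": "engine-blackletter-1",
}

# Used when the records cannot identify a typeface for the page.
DEFAULT_ENGINE = "engine-generic"


def choose_engine(page: PageImage) -> str:
    """Return the name of the engine that should read this page image."""
    for start, end, typeface in TYPEFACE_BY_PRINTER.get(page.printer, []):
        if start <= page.year <= end:
            return ENGINE_BY_TYPEFACE.get(typeface, DEFAULT_ENGINE)
    return DEFAULT_ENGINE


if __name__ == "__main__":
    page = PageImage(image_id="page-0001", printer="Printer A", year=1625)
    print(page.image_id, "->", choose_engine(page))
```

The point of the sketch is only that the routing decision rests on bibliographical knowledge (who printed a page, when, and with what type) rather than on anything the OCR engine discovers for itself.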

While this is only a cursory sketch of eMOP, it suggests that book history is central to solving the digital problem of using OCR software on early modern books. Before printing became more regularized by technological advances in the nineteenth century, and before English typography approached something of a standard, national identity in the early eighteenth century, the typefaces and their settings on the printed page were highly variable. For this reason one might expect the most advanced OCR engines of our day to be rather effective, as they pull from their expansive font libraries to recognize different characters. However, part of the complication is that a large number of characters found in early English printing (including the ubiquitous long s, scribal abbreviations borrowed from the Latin, and even characters derived from Anglo-Saxon, such as the thorn) are unfamiliar to non-expert readers and not present in OCR libraries. Another challenge lies in the range of typefaces used by printers, which were for most stretches of English printing imported from various other countries and so display characteristics adopted from these national traditions. It is still another kind of noise, however, that makes the engines ineffective: because they cannot distinguish line divisions (cannot focus their field of vision, in other words), they are not able to discriminate between letterforms and the various other blots that cloud page images. On more modern, higher quality images, of course, the OCR is more accurate, but the images that have been preserved in mass-digitization projects, and which therefore will be central to eMOP's automated process, were limited by the technologies of their historical moments. We will focus the engines on specified typefaces with the idea that giving them distinctive shapes to search out will improve results. In other words, the study of age-old technologies (books, movable type) can...