The Page Image: Towards a Visual History of Digital Documents
On September 30, 1991, over 200 researchers assembled in Saint Malo, France, to convene the first-ever conference on "document analysis and recognition."1 The meeting brought together researchers from all over the world who for roughly the previous decade had been slowly changing the paradigm through which they approached the problem of the machinic understanding of the digitized page. Instead of thinking in terms of "characters" and "recognition," which underlay the long-standing field of Optical Character Recognition (OCR), they were gradually moving towards a more global and formal understanding of the page image as a whole. Researchers in the field of Document Image Analysis, or DIA as it came to be known, discarded the common assumption that the letter or the text was the ultimate referent of the bibliographic page. They focused instead on the heterogeneous visual qualities of the page, or what they termed "the page image." "Document image analysis," writes George Nagy in a survey of twenty years of research in the field, is the "theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer."2 DIA researchers turned the page image into an analytical object.
In moving away from a text-centric understanding of the page, research in Document Image Analysis offers an important new way of thinking about the bibliographic page that is different from what has traditionally been the case in computational approaches to studying culture, but that has deep roots in the fields of book history, bibliography, and textual studies. Whether in the guise of "natural language processing" (NLP), "optical character recognition" (OCR), or "text mining," computational approaches to pages have remained heavily influenced by a text-centric mentality, using the page image as an (often imperfect) means to an end, an object to be passed through rather than studied as something potentially meaningful in itself. At the same time, the fast-growing field of "image analytics," which ranges from facial detection to the analysis of newspaper illustrations, has largely maintained the text-image divide that has long dominated the study of culture. Images are seen as independent of texts, whether as stand-alone objects or paratextual "illustrations." Emerging computational approaches to studying the past thus recapitulate long-standing disciplinary divisions and in the process reinforce textuality as the ideal object of study when it comes to documents.
Despite their numerous positive scholarly affordances, such text-centric approaches to the computational study of documents can constrain how we think about the past. First, our "machine-readable" coverage of the past (as opposed to machine-observable) is deeply biased in terms of both time and space. Currently, usable text data from digitized page images reliably stretches back in any representative way only into the nineteenth century, omitting well over two millennia of human writing. Similarly, while improvements are being made every day, OCR techniques still favor a very particular type of Roman-based font, which excludes non-Western print traditions like Chinese woodblocks, non-print traditions like medieval manuscripts, and even regionally eclectic print traditions like German Fraktur.
Second, the text-centeredness of many computational methods and techniques obscures the layers of technological mediation that produce and make digital documents available in the first place. Ryan Cordell has argued that we need to think more about how each digitized edition or OCR'd version of an historical edition is another "setting" of that text, bound by a set of historical conditions similar to those under which the original print (or manuscript) object was first produced.3 Like the particular printing press, house, and set of practices that governed the look and quality of a printed edition, OCR'd texts are subject to particular machinery, institutional contexts, and human practices of correction and composition ("cleaning") that produce distinct outputs. Similarly, Matthew Kirschenbaum has been a vocal advocate for the physicality of born-digital documents, which are subject to the constraints of computing hardware.4 And we have argued elsewhere that digitized page images should not be seen as universal and disembodied, available to...