Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers
I. The Viral Texts Project and Research Corpus
The Viral Texts Project (http://viraltexts.org) is an interdisciplinary and collaborative effort among the authors listed here, with contributions from project alumni Elizabeth Maddock Dillon, Kevin Smith, and Peter Roby. In the first iteration of the project, we focused on pre-Civil War newspapers in the Library of Congress's Chronicling America online newspaper archive (http://chroniclingamerica.loc.gov/), in large part because its text data are openly available for computational use. The pre-1861 holdings comprise 1.6 billion words from 41,829 issues of 132 newspapers. Many of the 132 newspapers included in this study are, in fact, iterations of continuously published entities that changed names or other qualities during their runs; we describe the way we grouped these publications into newspaper "families" in Section IV. We chose 1861 as our cut-off date not as a periodizing statement, but instead to demarcate a limited set of newspapers for our initial tests. The later in the nineteenth century one looks, the richer the Chronicling America archive; while our corpus includes some issues from the 1830s, then, the bulk of it comes from the 1840s and 1850s (Figure 1).
While the Chronicling America database is large, there are significant gaps in its holdings and problems with its data that influence our results. Chronicling America is not a single digitization project, but instead aggregates historical newspapers digitized through grant-funded, state-level projects.1 As such, there are statewide gaps in the archive where particular states have yet to participate. Currently, for instance, Chronicling America includes no newspapers from Massachusetts, the research team's home state, and, of course, a major center of nineteenth-century print culture. Even those states that contributed were unable to scan most of their historical newspapers. The bulk of historical newspapers remain undigitized, if not entirely lost. In addition, the raw text data that underlie the Chronicling America collection are, at best, messy. The text was automatically transcribed by Optical Character Recognition (OCR) software as newspaper pages were scanned. OCR makes frequent mistakes, particularly when trying to recognize text on worn or damaged historical pages or across the closely printed columns of nineteenth-century newspapers. In a pilot study matching known texts of 35 poems and stories against the Chronicling America database, we measured an average character error rate of between 5% and 15%, which leads to an average word error rate over 25%. The sheer scale of these digitization projects makes hand correction impossible.
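To make the relation between character and word error rates concrete, the following Python sketch shows one common way such rates can be measured: as edit distance between a known text and its OCR transcription, normalized by the length of the known text. It is an illustration only, not the project's own code. The function names, the sample strings, and the back-of-the-envelope estimate (which assumes that character errors are independent and that a word averages five characters) are all assumptions introduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a symbol from ref
                            curr[j - 1] + 1,      # insert a symbol from hyp
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, ocr_output) / len(reference)


def word_error_rate(reference, ocr_output):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)


def estimated_word_error_rate(cer, avg_word_length=5):
    """Rough estimate: a word is wrong if any of its characters is wrong.

    Assumes independent character errors and a hypothetical average
    word length; real OCR errors tend to cluster, so this is a sketch.
    """
    return 1 - (1 - cer) ** avg_word_length


if __name__ == "__main__":
    known_text = "the quick brown fox"   # hypothetical reference text
    ocr_text = "tbe quick hrown fox"     # hypothetical OCR transcription
    print(char_error_rate(known_text, ocr_text))
    print(word_error_rate(known_text, ocr_text))
    print(estimated_word_error_rate(0.05))
```

Because a single misrecognized character spoils the whole word that contains it, even a modest character error rate compounds into a much higher word error rate. Under the independence assumption above, a 5% character error rate already implies that roughly 23% of five-letter words contain at least one error.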
Given these limitations, our analysis necessarily misses far more reprinted pieces than it identifies. When we follow a link from any of our clusters and open a given newspaper page in Chronicling America, we typically see other reprinted texts (often indicated by the conventional "from the…" formula at the beginning or end of the text) on the very same page as the reprinted text we automatically discovered. Often these texts claim to be reprinted from newspapers that are not part of our current corpus, which explains why we have missed them. Other overlooked reprints are perhaps due to exceptionally poor OCR. Finally, we are aware of limitations to our discovery algorithm. Because of editorial changes from the nineteenth century and poor OCR from the twentieth and twenty-first centuries, the algorithm requires a text to be relatively long to be identified: a text must include enough 5-grams that can overlap with other texts and be recognized as a pattern, rather than just noise. While this method has produced impressive and reliable results from extremely messy data, we know that we are missing shorter reprinted texts, such as many lyric poems, that simply do not contain enough...