Computational methods for uncovering reprinted texts in antebellum newspapers

DA Smith, R Cordell, A Mullen - American Literary History, 2015 - academic.oup.com
The Viral Texts Project (http://viraltexts.org) is an interdisciplinary and collaborative effort among the authors listed here, with contributions from project alumni Elizabeth Maddock Dillon, Kevin Smith, and Peter Roby. In the first iteration of the project, we focused on pre-Civil War newspapers in the Library of Congress’s Chronicling America online newspaper archive (http://chroniclingamerica.loc.gov/), in large part because its text data are openly available for computational use. The pre-1861 holdings comprise 1.6 billion words from 41,829 issues of 132 newspapers. Many of the 132 newspapers included in this study are, in fact, iterations of continuously published entities that changed names or other qualities during their runs; we describe the way we grouped these publications into newspaper “families” in Section IV. We chose 1861 as our cut-off date not as a periodizing statement, but instead to demarcate a limited set of newspapers for our initial tests. The later in the nineteenth century one looks, the richer the Chronicling America archive; while our corpus includes some issues from the 1830s, then, the bulk of it comes from the 1840s and 1850s (Figure 1).
While the Chronicling America database is large, there are significant gaps in its holdings and problems with its data that influence our results. Chronicling America is not a single digitization project, but instead aggregates historical newspapers digitized through grant-funded, state-level projects. As such, there are statewide gaps in the archive where particular states have yet to participate. Currently, for
Oxford University Press