Some Big Problems with Big Data

As methods for performing research on large numbers of texts become more accessible to humanists, especially those studying periodicals, which were produced and consumed in such vast numbers that they invite study on a grand scale, we must continue to acknowledge the shortcomings of the data sets on which these methods rely. Otherwise we risk erasing historically marginalized voices from our presentations of literary history or flattening complexities that are crucial to understanding cultural history. The “gross reading” or macro approach to literary history, which typically uses computational methods to derive patterns and discoveries from large numbers of digitized texts, often claims or suggests comprehensiveness but rests fundamentally on incomplete or flawed data sets and methodologies. If we are to make meaningful claims about cultural history, we have to confront the limitations of our methods.

Large-scale digitization projects include gaps, the extent and importance of which are unknown. Proponents of gross reading have gone so far as to argue that we can now study literature in its entirety: that when we pose questions to databases of texts, we are actually studying all of what was published in a given time period and can therefore make conclusive claims about literary history. But even if HathiTrust or Google Books scans every unique book in every research library, we are only getting what a series of people have found fit to curate. It’s easy to imagine ephemeral, lowbrow texts or locally published and distributed materials that institutions did not preserve because they did not consider them worthwhile. We know, for example, that libraries often removed advertising from bound periodicals to save shelf space for “important” content, much to the chagrin of present-day scholars who are interested in advertising as documentation of the cultural imagination. In my own work on race in nineteenth-century children’s literature, I have found tragic absences in the curatorial record. For example, the very first periodicals ever created expressly for black children, Amelia E. Johnson’s the Joy, aimed specifically at African American girls, and the Ivy, which ran stories about black American history, both published in the late 1880s, are now seemingly lost. How many such texts, published on the margins of institutionally affirmed culture, are missing from the supposedly comprehensive record of big data? Today’s digital record is inherited largely from the curatorial decisions of the past, and bears the legacy of the sociopolitical omissions of previous decades and centuries. To gloss over these omissions by assuming that the digital record is comprehensive is to unreflectively affirm the dubious ethical and political history of institutional repositories. Sadly, the voices we most need to include in humanities scholarship are the ones most likely to be left out of text corpora.

Another complication in big-data treatments of digital texts is that the methodologies applied typically flatten all works as though they were equal contributors to the cultural record. A database of thousands or millions of works treats each of them as a single, unique instance. Not only does this bulldoze the complex publication records and varied forms of some works, but it also fails to account for vast disparities in the cultural reach of different texts. For example, one might compare two now-canonical texts: Moby-Dick sold just over 3,000 copies in Melville’s lifetime, whereas Uncle Tom’s Cabin sold 300,000 in its first year of publication in novel form.1 In a big-data view of nineteenth-century American novels, these books are considered equals, despite the fact that one sold nearly 100 times as many copies in a single year as the other did in its author’s lifetime. This is not an easy problem for digital humanists to solve: the sales figures for many novels are missing or incomplete. But it’s important to note that seemingly objective visualizations of nineteenth-century fiction are depicting a cultural landscape that is as divorced from the messy realities of material existence as, say, the idealized texts of mid-twentieth-century critical editions. This is a limitation of all textual scholarship, but it can be most sensitively addressed by a close examination of...