
5 Metadata
"To answer those questions you need good metadata." (Geoff Nunberg)
This chapter offers a first example of how the macroanalytic approach brings new knowledge to our understanding of literary history. It also begins the larger exploration of influence that forms a unifying thread in this book. The evidence presented here is primarily quantitative; it was gathered from a large literary bibliography using ad hoc computational tools. To an extent, this chapter is about harvesting some of the lowest-hanging fruit of literary history. Many decades before mass-digitization efforts, libraries were digitizing an important component of their collections in the form of online, electronic catalogs. These searchable bibliographies contain a wealth of information in the form of metadata. Consider, for example, Library of Congress call numbers and the Library of Congress subject headings, which they represent. Call numbers are a type of metadata that indicate something special about a book. Literary researchers understand that the “P” series is especially relevant to their work and that works classed as PR or PS have relevance at an even finer level of granularity: that is, English language and literature. This is an abundant, if somewhat general, form of literary data that can be processed and mined. Subject headings are an even richer source. Headings are added by human coders who take the time to check the text they are cataloging in order to determine, for example, whether it is fiction or nonfiction, whether it is folk literature or from the English Renaissance, and in the case of American literature whether it is a regional text from the northern, southern, central, or western region. This type of catalog metadata has been largely untapped as a means of exploring literary history.
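The call-number filtering described above can be illustrated with a minimal Python sketch. It assumes call numbers arrive as plain strings whose leading one to three letters give the Library of Congress class. The helper names (`lc_class`, `is_literary`) and the sample strings are hypothetical illustrations and are not drawn from the bibliography discussed in this chapter.

```python
import re
from collections import Counter

# Assumption: an LC call number begins with one to three class letters
# followed by a digit, e.g. "PS" in "PS1305 .A1 1885".
_CLASS_RE = re.compile(r"^\s*([A-Z]{1,3})\s*\d")


def lc_class(call_number):
    """Return the leading class letters of a call number, or None if absent."""
    match = _CLASS_RE.match(call_number.upper())
    return match.group(1) if match else None


def is_literary(call_number, classes=("PR", "PS")):
    """True when the call number falls in one of the given subclasses."""
    return lc_class(call_number) in classes


# Illustrative strings only; these are not records from any real catalog.
sample_call_numbers = [
    "PR4034 .P7 1996",
    "PS1305 .A1 1885",
    "QA76.9 .D3 2001",
    "PT2625 .A44 1960",
    "ps3515 .E37 O4",
]

# Keep only the PR and PS records, and tally every class that appears.
literary = [cn for cn in sample_call_numbers if is_literary(cn)]
class_counts = Counter(lc_class(cn) for cn in sample_call_numbers)

print(literary)
print(class_counts)
```

A real catalog export would likely need more forgiving parsing, since local cataloging practice varies; the pattern here is only a starting point.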
Even literary bibliographers have tended to focus more on developing comprehensive bibliographies than on how the data contained within them might be leveraged to bring new knowledge to our understanding of the literary record. In the absence of full text, this bibliographic metadata can reveal useful information about literary trends.
In 2003 Franco Moretti and I began a series of investigations involving two bibliographic data sets. Moretti’s data set was a bibliography of nineteenth-century novels: titles, authors, and publication dates, not rich metadata, but a lot of records, around 7,000 citations. Moretti’s work eventually led to a study of nineteenth-century novel titles published in Critical Inquiry (2009). My data set was a much smaller collection of about 800 works by Irish American authors.* The Irish American bibliography, however, was carefully curated and manually enriched with metadata indicating the geographic settings of the works, as well as the author gender, birthplace, age, and place of residence. Geospatial coordinates (longitudes and latitudes) indicating where each author was from and where each text was set were also added to the records. The Irish American database began as research in support of my dissertation, which explored Irish American literature in the western United States (Jockers 1997). In 2001 the original bibliography of primary materials was transformed into a searchable relational database, which allowed for quick and easy querying and sorting.
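To make the idea of a searchable relational database concrete, the following Python sketch uses the standard `sqlite3` module. It is not the schema of the actual Irish American database, which is not given here. The table names, column names, and the `pub_year` field are assumptions, and every row is a labeled placeholder rather than a real record. Author age and place of residence are omitted for brevity.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (
            author_id  INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            gender     TEXT,
            birthplace TEXT,
            origin_lat REAL,   -- where the author was from
            origin_lon REAL
        );
        CREATE TABLE works (
            work_id     INTEGER PRIMARY KEY,
            author_id   INTEGER NOT NULL REFERENCES authors(author_id),
            title       TEXT NOT NULL,
            pub_year    INTEGER,
            setting     TEXT,  -- geographic setting of the work
            setting_lat REAL,
            setting_lon REAL
        );
    """)
    # Placeholder rows; names, years, and coordinates are arbitrary.
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "Placeholder Author A", "F", "Placeholder City 1", 10.0, -20.0),
            (2, "Placeholder Author B", "M", "Placeholder City 2", 30.0, -40.0),
        ],
    )
    conn.executemany(
        "INSERT INTO works VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, 1, "Placeholder Title 1", 1900, "West", 11.0, -21.0),
            (2, 1, "Placeholder Title 2", 1905, "West", 11.0, -21.0),
            (3, 2, "Placeholder Title 3", 1910, "East", 31.0, -41.0),
        ],
    )
    return conn


def works_by_setting_and_gender(conn):
    """Count works per (setting, author gender), largest groups first."""
    return conn.execute("""
        SELECT w.setting, a.gender, COUNT(*) AS n
        FROM works AS w
        JOIN authors AS a USING (author_id)
        GROUP BY w.setting, a.gender
        ORDER BY n DESC, w.setting, a.gender
    """).fetchall()


conn = build_demo_db()
rows = works_by_setting_and_gender(conn)
for setting, gender, n in rows:
    print(setting, gender, n)
```

Splitting authors from works keeps author-level metadata in one place, so a single join can sort or group the bibliography by any combination of author and text attributes.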
The selection criteria for a work’s inclusion in the database were borrowed, with some minor variation, from those that Charles Fanning had established in his seminal history of Irish American literature, The Irish Voice in America: 250 Years of Irish-American Fiction (2000). To qualify for inclusion in the database, a writer must have some verifiable Irish ethnic ancestry, and the writer’s work must address or engage the matter of being Irish in America. Because of this second criterion, certain obviously Irish authors, such as F. Scott Fitzgerald and John O’Hara, are not represented in the collection. Both of these writers, as Fanning and others have explained, generally wanted to distance themselves from their Irish roots, so they avoided writing along ethnic lines. Thus, the database ultimately focused not simply on writers of Irish roots, but on writers of Irish roots who specifically chose to explore Irish identity in their prose. Determining how and whether a work got included in the database was sometimes a subjective process. Some of the decisions made could, and perhaps should, be challenged. A perfect example is the classification of Kathleen Norris as a Californian. Norris was raised and began her writing career in and among the Irish community of San Francisco. After marrying, though, she moved to
* The results of Moretti’s project were first...
