  • The Substantial Words Are in the Ground and Sea: Computationally Linking Text and Geography
  • Travis Brown, Jason Baldridge, Maria Esteva, and Weijia Xu

1. Introduction

In a recent Digital Humanities Quarterly essay on the future of geographical tools for scholarly research in the humanities, Tom Elliott and Sean Gillies speculate that by 2017, "all web-facing textual resources will be parsed (rightly or wrongly) for geographical content" (par. 13). Most of this geoparsing, they argue, will be done by "the search engines," which will automatically identify names and descriptions of places and match them with "coordinate data in reference datasets." Only a small body of texts produced by "academics and specialist communities" will be annotated with more sophisticated geographical information. Elliott and Gillies provide a compelling outline of the dangers and difficulties of relying on companies like Google for these kinds of research tools, and their own project, Pleiades, is a good example of a curatorial tool that members of a specialized community (in their case, classics scholars) can use to develop authoritative geographical resources. In this essay, however, we will propose a different model for the future of digital geography—and possibly digital projects more generally—in the humanities. Our more optimistic projection is that the next several years will see a narrowing, not a widening, of the divide between massively automated (and often wildly inaccurate) endeavors like Google Books, on the one hand, and scholarly projects based on professional curation, like Pleiades, on the other. We will present our own work on TextGrounder,1 a geoparsing system that learns relationships between words and places from large bodies of unannotated text, as one step toward this convergence.

Anyone who has ever looked at the "places mentioned in this book" list for a Google Books text has seen examples of the many ways that automated geographical annotation can fail. Elliott and Gillies find mislabeled references to Syracuse, New York, and Tempe, Arizona, in an 1854 translation of Xenophon's Anabasis, for example, and a glance at the map for the Oxford World Classics Edition of the King James Bible similarly turns up Bethesda, Maryland; Abilene, Texas; Dothan, Alabama; and a Bathsheba in the Caribbean. The problem is that place names are highly ambiguous: even in a monolingual context, a single place name, or toponym, may refer to dozens or even hundreds of different locations or regions on the earth. The difficulty of selecting the correct location or region for a given toponym—a task often referred to as toponym resolution—is compounded by the fact that individual documents may cover a wide geographical scope, and that individual toponyms may refer to many different kinds of geographical or geopolitical features or entities. "Washington," for example, is most prominently the name of a state in the US and the nation's capital, but it also refers to hundreds of towns, cities, counties, lakes, mountains, and streets around the world. One might expect the vocabulary of a document to provide clear clues as to the distribution of its geographical reference, but making use of non-toponym context in toponym resolution has proven to be difficult for a number of theoretical and practical reasons, and many current toponym resolution systems focus primarily on the place names in a document, in some cases ignoring other expressions entirely.

The toponym resolution system used by Google for the visualizations in Google Books seems to be particularly straightforward: it possibly does nothing more than choose the most prominent or populous entry in a database that maps toponyms to coordinates on the surface of the earth. We might additionally speculate, given the kinds of errors made by the system, that Google's algorithm gives places in the US priority over the rest of the world. For many users, the value of the geographical analysis and visualization in Google Books is undermined by the fact that Google does not currently document the methods used to create these maps; nevertheless, Google's data and mapping tools have proven widely useful as a platform for other projects with their own analytical tools.
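The "most populous entry wins" strategy speculated about above can be sketched in a few lines. The gazetteer entries and coordinates here are purely illustrative, and the function name is our own; this is a hypothetical reconstruction of such a baseline, not Google's actual implementation:

```python
# A minimal sketch of a naive population-based toponym resolver:
# given an ambiguous place name, simply pick the candidate with the
# largest population, ignoring all document context. The gazetteer
# below is a toy example, not drawn from any real dataset.

from typing import Optional, Tuple

# toponym -> list of (latitude, longitude, population) candidates
GAZETTEER = {
    "washington": [
        (38.9072, -77.0369, 689545),    # Washington, D.C.
        (47.7511, -120.7401, 7705281),  # Washington State (centroid)
        (54.9000, -1.5200, 67085),      # Washington, England
    ],
}

def resolve_naive(toponym: str) -> Optional[Tuple[float, float]]:
    """Return coordinates of the most populous candidate, or None."""
    candidates = GAZETTEER.get(toponym.lower())
    if not candidates:
        return None
    lat, lon, _ = max(candidates, key=lambda c: c[2])
    return (lat, lon)
```

Because the resolver never consults context, a Bible passage mentioning "Washington Street" and a history of the Pacific Northwest would receive identical answers, which is exactly the failure mode behind the Google Books errors described above.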

One example of a more sophisticated approach that builds on Google's resources...
