Abstract

Corpora of documents whose metadata includes GPS coordinates have recently become widely available through online social media such as Twitter. This has created opportunities for statistical corpus methods that describe the geographical spread of words, but such techniques do not appear to be widely used in corpus linguistics and lexicography. This paper presents several methods for describing the spread of a set of points corresponding to documents that contain a given word, and applies these methods to a corpus of GPS-tagged tweets from Twitter. In experiments on known regionalisms, we show that these methods could be used to help identify such expressions. We analyze the words in the corpus identified as having the most geographically restricted usage and identify some expressions that appear to be previously undocumented regionalisms with highly localized usage.
