Statistical Methods for Identifying Local Dialectal Terms from GPS-Tagged Documents
- Dictionaries: Journal of the Dictionary Society of North America
- Dictionary Society of North America
- Number 35, 2014
- pp. 248-271
- DOI: 10.1353/dic.2014.0020
Corpora of documents whose metadata includes GPS coordinates have recently become widely available through online social media such as Twitter. This has created opportunities for statistical corpus methods that describe the geographical spread of words, but such techniques do not appear to be widely used in corpus linguistics and lexicography. This paper presents several methods for describing the spread of a set of points corresponding to the documents that contain a given word, and applies these methods to a corpus of GPS-tagged tweets. In experiments on known regionalisms, we show that these methods could be used to help identify such expressions. We analyze the words in the corpus identified as having the most geographically restricted usage and identify some expressions that appear to be previously undocumented regionalisms with highly localized usage.
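The abstract does not name the specific spread measures used, so the following is only an illustrative sketch of the general idea: given the GPS coordinates of documents containing a word, summarize how tightly they cluster. It assumes one simple measure, the mean great-circle distance from each point to the set's centroid, which is not necessarily among the paper's methods. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def centroid(points):
    """Centroid on the sphere: sum the 3D unit vectors, then project back."""
    x = y = z = 0.0
    for lat, lon in points:
        lat, lon = math.radians(lat), math.radians(lon)
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))


def mean_distance_to_centroid(points):
    """Assumed spread measure: average distance (km) from points to centroid."""
    c = centroid(points)
    return sum(haversine_km(p, c) for p in points) / len(points)


# Made-up coordinates for illustration only (not data from the paper).
local_word = [(40.44, -79.99), (40.46, -80.02), (40.41, -79.95), (40.50, -80.07)]
widespread_word = [(40.71, -74.01), (34.05, -118.24), (41.88, -87.63),
                   (29.76, -95.37), (25.76, -80.19)]

print(mean_distance_to_centroid(local_word))
print(mean_distance_to_centroid(widespread_word))
```

Under this assumed measure, a word whose documents cluster in one metropolitan area yields a much smaller value than a word used across the country, so sorting a vocabulary by the measure would surface candidates for localized usage. A realistic analysis would also need to account for word frequency and for the uneven geographic distribution of the corpus itself.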