
American Speech 75.3 (2000) 237-239




The Discipline

Data Mining

Allan Metcalf, MacMurray College


The great accomplishment of the twentieth century in American English was the amassing of data. The great accomplishment of the twenty-first century will be analyzing that data, developing techniques that in turn can be applied to the even vaster amounts of data becoming available on the Internet. We have much to do.

The catalog of twentieth-century data is long, impressive, and well known to readers of American Speech: countless studies in Dialect Notes, American Speech, and PADS; the Linguistic Atlas of the United States and Canada, still under way in its many branches but already with huge amounts of data for most of the eastern half of the United States; the Dictionary of American English (1938-44) and Dictionary of Americanisms (1951); the great Dictionary of American Regional English (DARE 1985-) and the Random House Historical Dictionary of American Slang (1994-), both now at the letter O; and, at the end of the century, the Atlas of North American English (2000).

Truly, with all of these projects, the data has piled up faster than it could be mined. I can remember in the 1970s, when I was first introduced to the wonders of these vast studies, thinking how pitiful in comparison were the studies that analyzed them--impressive because based on so much data, but pitifully few.

And then, as the century approached its end, technology began to offer hope for mastering the data. In the beginning it was painfully slow; I remember the effort involved, as well as the triumph, just in getting Camwill to produce a phonetic typeball for the IBM Selectric typewriter.

Those involved in the giant projects were aware of the monumental challenge of interpretation and analysis. When the computer began to be available for storing and manipulating data, dialectologists were often at the forefront trying to make the most of it. Computers weren't entirely ready for DARE, or for the Linguistic Atlas of the Middle and South Atlantic States (LAMSAS 1980-), but their editors went ahead anyhow, pioneering new techniques at considerable cost.

Now computers and the World Wide Web are powerful enough to handle any amount of data with aplomb. They can even bring it to one's own desktop, as is already the case with LAMSAS materials at the University of Georgia. From anywhere in the world you can log on to the LAMSAS Web site and conduct your own search of its records.

The challenge for the twenty-first century is to develop sophisticated data-mining techniques worthy of the valuable material we have inherited. Which dialect differences are most significant? How can we best characterize regional, local, and ethnic dialects? What changes are occurring? We have enough data now, and enough computer sophistication, that we should be able to address questions like these with some confidence, if we can figure out how to ask the questions.

Thanks to the World Wide Web, the twenty-first century promises even vaster amounts of data to mine. Even today, a Web search offers opportunities previously undreamed of.

Here's a simple example. In their column in a recent ADS newsletter, the editors of DARE asked for help with ploye, "'a type of buckwheat pancake'. This is supposed to be of French-Canadian origin and used in northern Maine. If you've heard it, please also indicate how it is pronounced" ("Rutz Around in Your Rumpelkammer" 2000). One click of a search engine led straight to the Web site of a farm in Fort Kent, Maine, with all the answers: "Ployes (rhymes with boys) have been favored by the Bouchard Family for many generations. Our Ployes brand mix closely follows the recipe created by the French speaking exiles from Nova Scotia after their arrival in Northern Maine's St. John River Valley in 1785."

The Web also is beginning to be enormous enough to allow testing of what we might call the "null hypothesis," that is, the question of whether an obscure word ever is actually used...
