- A Frequency Dictionary of Contemporary American English: word sketches, collocates, and thematic lists
Routledge's growing list of Frequency Dictionaries of modern languages may seem, at first glance, either a welcome novelty or a throwback.
Before the personal computer became ubiquitous, and even before the now-quaint terminal/mainframe kind of computing used before the PC, computational linguistics and corpus analysis still existed. But without networks or even floppy disks to distribute the output of these analyses, some lucky academics, with the support of some good-hearted academic presses, distributed the output (and often the raw input) of their proto-computational linguistics work in the original and natural medium for all knowledge: the printed book.
Few such books have aged well, and often the best place to find them is at library discard sales. It was tempting, at first, to lump Routledge's series in with this line of publishing, and to call them a series of printed books containing a frozen output of frequency-sorted wordlists that have come from corpus analysis. As corpus outputs go, the frequency-sorted list is one of the most basic and elementary: knowing the count of a token's occurrences is usually essential, but is often only the first step toward asking or answering more interesting questions.
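For readers unfamiliar with how elementary this output is, a frequency-sorted list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenizer is a crude regular expression, the sample text is invented, and the function name `frequency_list` is hypothetical. It is not the method used for any title in the Routledge series.

```python
import re
from collections import Counter

def frequency_list(text, top_n=None):
    """Return (token, count) pairs sorted by descending count, then alphabetically.

    Assumption: tokens are lowercase runs of letters and apostrophes.
    A real corpus pipeline would also lemmatize and tag parts of speech.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Invented sample text, for illustration only.
sample = "The cat saw the dog. The dog saw a bird."
ranked = frequency_list(sample)
for token, count in ranked:
    print(f"{count}\t{token}")
```

The raw count is the whole of the result here; collocates, word sketches, and thematic groupings all require further analysis on top of it.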
Since Routledge's printed series is plainly not intended as an input to further processes within silicon-based computers, a purely corpus-analytic perspective on them is not useful. The Frequency Dictionaries are explicitly positioned toward learners of the given language. From postings on the Corpora-List (http://mailman.uib.no/listinfo/corpora), where several requests for authors/editors have appeared over the years, it seems that these dictionaries began as individually commissioned works without an overly specific central methodology. From a look at the other works in the series, it seems that the books come from isolated linguists using the best corpus and analysis they have available to generate the lists. There follows a moderate amount of presumably mixed editorial-computational work to generate additional features, shared among many of the titles, to increase the value to learners.
The Frequency Dictionary of Contemporary American English benefits tremendously from this best-available approach. Davies' Corpus of Contemporary American English (http://www.americancorpus.org/) is large, balanced, and richly annotated as to the subject matter and mode of communication. The corpus itself is intended as a tool for learners, so in effect the FDCAE is the authors' guided tour through a powerful corpus-based learning tool. It can serve both as a standalone reference, and as a guide to getting useful information from the COCA or any other corpus tool.
The potential value to learners is tremendous. "Advanced Learners" of English have had recourse to special high-level learner dictionaries for over sixty years now, but the classification of "advanced learner" of other languages seems hardly to exist for the publishing world.
Given the size of the English-learner market compared to the market for learning any other language, not even the best-hearted publisher could be faulted for staying out of the money pit of commissioning a full-fledged ALD of any other language (with exceptions that shift with the political/economic winds). But Routledge may be in the middle of accomplishing something that is very good-hearted indeed: creating a whole line of quasi-ALDs that are rather inexpensive editorially, but offer a utility far greater than their profit margin.
But there are complications. For the same reason that we largely see only English ALDs, more-sophisticated corpus-analytical tools and trained models - for part-of-speech tagging, lemmatization, parsing, etc. - are most dependable for English. The availability and quality of tools for other languages can vary widely.
This leads to computational output that is only as sophisticated as the tools or trained models that are doing the analysis. For a language without a solid parsing model, you might only see part-of-speech tags and lemmas. For a language lacking adequate POS tagging...