In lieu of an abstract, here is a brief excerpt of the content:

Computing Business Multiwords: Computational Linguistics in Support of Lexicography

David Jost and Win Carus

Computational and statistical techniques for identifying, normalizing, clustering, classifying, selecting, and defining multiwords can revolutionize lexicographical processes. These techniques rely on two basic ideas: (a) the basic statistical insight that a well-selected sample can serve as a surrogate for the larger universe from which it is drawn; and (b) the fact that, given such a well-selected sample, it is possible to derive other data that can guide the organization of both lexicographical processes and data itself.

Almost no one questions the fundamental contribution of corpora and statistical techniques to contemporary lexicographical practice, as presented in Armstrong (1994), Biber, Conrad, and Reppen (1998), Kennedy (1998), and Thomas and Short (1996). Recent improvements in programming languages, computing power, and storage capacity now permit the use of hitherto prohibitively expensive and complex computational and statistical methods on ordinary desktop computers.

The techniques presented in this article are not unique and are hardly specific to lexicography, as can be seen in Charniak (1993), Manning and Schutze (1999), and Woods, Fletcher, and Hughes (1986). What is novel is an approach to lexicography framed and guided by these techniques. In this article we focus on their use for dictionaries of contemporary business English; these techniques, however, can be used in any dictionary work. This article will cover the very large classes of multiwords, mostly in fact multi-part proper and common nouns. These techniques can be applied to idioms and phrasal verbs as well, but these classes are less likely to produce new terms in business English than multi-part nouns.

Examples of low-, medium-, and high-frequency business multiwords taken from the TREC Wall Street Journal corpus, The American Heritage Dictionary, 4th Edition (2000) and Wall Street Words (1997) demonstrate the utility and limitations of various methods — pattern matching, statistical methods, and multiword generation used with concordance tools and enhanced by morphological analysis, stemming, part-of-speech tagging, parsing, lexical context, and semantic features — that we will describe as we treat them in the course of the paper. Statistical sampling and bootstrapping techniques, also to be described, demonstrate how to maximize the amount of information derived from a corpus with the minimum of lexicographic review.

The following matrix frames the tasks before the computational lexicographer:

Table 1

  in lexicon & in corpus        |  in lexicon & not in corpus
  in corpus & not in lexicon    |  not in lexicon & not in corpus

Ideally, all terms should be found in the lexicon and the corpus. If a term is found in the lexicon but either of low frequency in or absent from the corpus, this term is of doubtful significance. If a term is found with some frequency in the corpus and not in the lexicon, it is a candidate for inclusion. It is also important not to ignore terms found in the lower-right quadrant, since they raise questions about the comprehensiveness and representativeness of the corpus and lexicon. The discussion begins with the simplest techniques. Although simple, they have potentially substantial storage requirements.
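As a concrete illustration (not drawn from the article itself), the following Python sketch shows how a dictionary's term list and a table of corpus frequencies might be sorted into the quadrants of Table 1; the term lists, counts, and the min_freq threshold are hypothetical, and the fourth quadrant cannot be enumerated directly from the data at hand.

```python
from collections import Counter

def quadrant_report(lexicon_terms, corpus_term_counts, min_freq=1):
    """Sort observed multiwords into three of the four quadrants of Table 1.

    lexicon_terms: set of multiwords already entered in the dictionary.
    corpus_term_counts: Counter mapping extracted multiwords to raw corpus
        frequencies.
    min_freq: frequency below which a corpus term is treated as absent.

    The fourth quadrant (not in lexicon & not in corpus) cannot be listed
    from these inputs; it only becomes visible when such terms surface later.
    """
    in_corpus = {t for t, n in corpus_term_counts.items() if n >= min_freq}
    return {
        "in lexicon & in corpus": lexicon_terms & in_corpus,      # confirmed
        "in lexicon & not in corpus": lexicon_terms - in_corpus,  # doubtful
        "in corpus & not in lexicon": in_corpus - lexicon_terms,  # candidates
    }

# Hypothetical data for illustration only.
lexicon = {"market share", "junk bond", "golden parachute"}
counts = Counter({"market share": 412, "junk bond": 77, "leveraged buyout": 160})
print(quadrant_report(lexicon, counts))
```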
Take, for example, the validation (the presence and frequency in a corpus) of a multiword list of business terms selected from the American Heritage Dictionary, 4th Edition (AHD4) and Wall Street Words (WSW). These multiwords were validated using a corpus of five years of Wall Street Journal (WSJ) data (1987-92), which also can be used to demonstrate how the use of these terms varies over time. Example 1 (all examples are found in Appendix 1) shows how multiword terms containing market vary in frequency from 1987 to 1992. Such frequency analysis is simple to apply after the fact, but it provides information only about the upper quadrants of our matrix. It is harder to find words that you should have included but did not. These examples only show raw, unadjusted frequencies. One should add information on the distribution of these terms over the document collection to produce an "adjusted frequency." It should be clear that 10 instances of a term found in one document are, all other things being equal, less valuable lexicographically than one term found 10 times...
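The excerpt does not give a formula for this adjusted frequency. As a minimal sketch of the idea, one simple choice is to scale raw frequency by the share of documents in which the term occurs; the scaling, the term, and the toy document collection below are assumptions for illustration, not the authors' method.

```python
from collections import Counter

def adjusted_frequency(term, docs):
    """Raw frequency scaled by document dispersion, so hits concentrated in
    a single document count for less than the same number of hits spread
    across the collection.

    docs: iterable of documents, each a list of multiwords already
    extracted from that document.
    """
    per_doc = [Counter(doc)[term] for doc in docs]
    raw = sum(per_doc)                              # total occurrences
    doc_freq = sum(1 for n in per_doc if n > 0)     # documents containing term
    return raw * doc_freq / len(per_doc) if per_doc else 0.0

# Hypothetical collection of ten documents.
concentrated = [["junk bond"] * 10] + [[] for _ in range(9)]  # 10 hits, 1 doc
dispersed = [["junk bond"] for _ in range(10)]                # 10 hits, 10 docs
print(adjusted_frequency("junk bond", concentrated))  # 1.0
print(adjusted_frequency("junk bond", dispersed))     # 10.0
```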
