New Words and Corpus Frequency¹

Ian Brookes
Collins English Dictionaries

Dictionaries: Journal of the Dictionary Society of North America 28 (2007), 142-145

¹ I am grateful to my colleagues Vicky Aldus and Katie Brooks for sharing their experiences of using frequency lists, and to Ruth O'Donovan, corpus development manager at Chambers Harrap, for developing the tools described in this paper.

At the time of writing (fall 2007), I am in the process of selecting 1000 new words to be added to the text of The Chambers Dictionary. Since I do not (contrary to the assumption of many readers and journalists) have scope to add every word that has ever appeared in print, there must be some kind of selection process to determine which words are granted access to the dictionary, and which are politely asked to wait their turn. Moreover, since I neither select the words merely on the grounds of personal whim nor pick potential words at random in some kind of lexicographical tombola, there needs to be a systematic basis to this selection process.

The Rationale for Frequency Lists

What lexicographers do in such circumstances, of course, is to consult a corpus, and I am fortunate to have access to the Chambers Harrap International Corpus (CHIC), a corpus of over 400 million words of current English. To assist in the task of selecting new words, this corpus was used to rank all of the candidate words identified by the Chambers Harrap language-monitoring program² by frequency, the assumption being that words that score highly on corpus frequency will be the strongest candidates for inclusion in a dictionary. However, even a casual glance at the resultant list reveals that it will not do simply to throw the top-ranking words into the dictionary (although such a course of action might have saved the editorial team a lot of time and heartache).
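
The ranking step described above might be sketched roughly as follows. This is only an illustration and not the tooling actually used at Chambers Harrap: the function name rank_candidates, the assumption that the corpus is available as a flat list of word tokens, and the toy data are all invented for the example. It counts how often each candidate occurs and reports the figure per million words so that results from corpora of different sizes can be set side by side.

```python
from collections import Counter


def rank_candidates(candidates, corpus_tokens):
    """Rank candidate words by how often they occur in a tokenized corpus.

    Hypothetical helper: assumes single-word candidates and a corpus that
    has already been split into word tokens.
    """
    counts = Counter(token.lower() for token in corpus_tokens)
    total = sum(counts.values())
    ranked = []
    for word in candidates:
        count = counts[word.lower()]
        # Normalize to occurrences per million words of corpus text.
        per_million = count * 1_000_000 / total if total else 0.0
        ranked.append((word, count, per_million))
    # Highest count first; ties are listed alphabetically.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


# Toy data, invented purely for illustration.
tokens = "the blog linked to a podcast and the blog quoted a wiki".split()
for word, count, per_million in rank_candidates(["podcast", "blog", "wiki"], tokens):
    print(word, count, round(per_million))
```

A list produced in this way is only raw material: it says how often a form occurs, not whether it meets a dictionary's criteria for inclusion.
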
Some intervention is required to sort out the true lexicographic gold from the computer-generated dross. In this paper I will list a few of the things that lexicographers need to take into account when assessing computer-generated frequency statistics. In the process I hope to demonstrate that, although computers are extremely useful tools, there is still a place for human intelligence in the production of dictionaries.

² The Chambers Harrap language-monitoring program identifies candidate words through a combination of sources, including a directed reading programme and a "monitor corpus," which automatically scans recent texts for words not included in the dictionary.

Assessing Frequency Lists of Candidate Words

There are many reasons why a word that occurs frequently in a corpus may be excluded from a dictionary. Here are just some of the points that arose during the process of selecting words based on the evidence of the CHIC:

• It is crucial to get a feel for the nature of the corpus that is being used. However large a corpus is, and whatever efforts have been made to give it balance in terms of subject matter and regional coverage, it will still have its foibles. Lexicographers should develop a feel for which types of word are well represented and which are less so. Indeed, to mitigate any idiosyncrasies of the CHIC, we also ran the list of candidate words against another corpus and looked for words which occurred disproportionately more often in the second.

• It is equally important to have a clear idea of the principles of inclusion for the dictionary on which you are working. This was particularly evident when dealing with terms automatically identified by the monitor corpus, but it is also true of some words suggested by readers. Among the items proposed for inclusion were several that might fall foul of the dictionary's traditional selection criteria.
The Chambers Dictionary, for example, tends to disregard many two-word compounds (for example, content provider, job market) on the grounds that they are deducible; other dictionaries might be more generous in including such material. One of the principles for adding new words to The Chambers Dictionary is that words should not only be established in use, but should also be used over a relatively broad area. The...