
Monitoring the Stability of a Growing Organic Corpus, with special reference to Sepedi and Xitsonga

D. J. Prinsloo and Gilles-Maurice de Schryver

Dictionaries: Journal of the Dictionary Society of North America 22 (2001)

In this article it is assumed that the modern lexicographer who is drafting the plan of especially a new dictionary for a particular language will have electronic corpora at his/her disposal. An electronic corpus can be defined in an oversimplified way as a computerised collection of texts. The total number of running words, or "tokens," in all those texts together, corresponds to the size of the corpus, while the total number of different words is referred to as "types." For instance, a small corpus consisting of 10 chapters of a Ghent University annual report has 5,502 tokens, but only 1,246 types. The lexicographer then utilises sophisticated computer interface packages, called "corpus query tools," to analyse such a corpus in various ways.

Firstly, the total number of occurrences of a particular word, the "raw frequency," and the spread of that word across texts are studied. For example, from Table 1, which shows the top 10 types in the annual-report corpus, one sees that the word university has a total frequency of 68 (which corresponds to 1.24% of all the tokens in the corpus). The spread or distribution of university across the different texts is shown in Table 2. From Table 2 one sees that the spread of words can be studied across texts (16 times in chapter 8, 6 times in chapter 2, etc.) and text-internally (each stroke in the dispersion plots corresponds to an occurrence of university). Raw, across-text and text-internal frequencies assist in making decisions either for inclusion of data in or for omission from the dictionary. Although frequencies adjusted by text distribution enable a more detailed analysis, we will use raw frequencies throughout this article in order not to burden our reasoning unnecessarily.

Table 1 Top 10 types with their total or 'raw' frequency

 #   Word         Freq.   %
 1   the          472     8.58
 2   of           204     3.71
 3   and          190     3.45
 4   in           189     3.44
 5   to            95     1.73
 6   for           92     1.67
 7   a             84     1.53
 8   university    68     1.24
 9   research      64     1.16
10   RUG           53     0.96

Table 2 Spread of one word across texts and within the texts themselves

 #    Text     # words
 1    Ch.08      742
 2    Ch.02      301
 3    Ch.04      156
 4    Ch.05    1,504
 5    Ch.03      358
 6    Ch.01      484
 7    Ch.06      936
 8    Ch.07      678
 9    Ch.09      233
10    Ch.10      110
Sum            5...

[The Hits, Hits per 1,000 and Text-internal distribution columns are cut off in this excerpt.]

Secondly, a corpus query tool generates concordance lines from the corpus in which the word under study is put into context. As an illustration of the latter, consider Table 3, which displays a selection of concordance lines generated from a corpus consisting of letters from the principal to staff members of the University of Pretoria. In this example the word staff runs from top to bottom on the computer screen, with contexts given to the left and right of it. The lexicographer utilises such concordance lines not only to retrieve sense distinctions and to compile definitions, but also to find typical examples of usage, word clusters, collocations, etc.

In the literature many debates have been conducted on the ideal size, balance and representativeness of corpora, as well as on how corpora should be structured to obtain such ideals. See, for instance, Summers (1993, 186 and 190; 1996, 6), Kilgarriff (1997, 150), Kruyt and Dutilh (1997, 230) and Kennedy (1998, 20, 52, 56-7, 62, and 73), for detailed discussions of these issues. In De Schryver and Prinsloo (2000) the authors opted, in terms of Atkins, Clear and Ostler (1992), for the concept of an "organic corpus":

... a corpus may be thought of as organic, and must be allowed to grow and live if it is to reflect a growing, living language. ... In order to approach a "balanced" corpus, it is practical to adopt a method of successive approximations. First, the corpus builder attempts to create a representative corpus. Then this corpus is used and analysed and its strengths and weak...
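The basic counts discussed above (tokens, types, raw frequency as a percentage of all tokens, hits per 1,000 words, and concordance lines) can be illustrated with a minimal Python sketch. It is not the corpus query tool used by the authors: the tokenisation rule, the function names and the sample sentence are all assumptions introduced here for illustration only. The only figures taken from the article are those passed to `percentage` and `hits_per_thousand`.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive, assumed rule)."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def frequency_profile(tokens):
    """Return (number of tokens, number of types, raw frequency of each type)."""
    counts = Counter(tokens)
    return len(tokens), len(counts), counts


def percentage(freq, total_tokens):
    """Raw frequency expressed as a percentage of all tokens in the corpus."""
    return round(100 * freq / total_tokens, 2)


def hits_per_thousand(hits, words_in_text):
    """Occurrences of a word in one text, normalised to 1,000 running words."""
    return round(1000 * hits / words_in_text, 1)


def concordance(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for every hit of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, invented for this sketch (not from the article's corpora).
sample = ("The university funds research. Research at the university grows, "
          "and the staff of the university agree.")
tokens = tokenize(sample)
n_tokens, n_types, counts = frequency_profile(tokens)
print(n_tokens, "tokens,", n_types, "types")
print(counts.most_common(3))

# Figures from the article: 'university' occurs 68 times among 5,502 tokens,
# and 16 of those occurrences fall in chapter 8, which has 742 words.
print(percentage(68, 5502))
print(hits_per_thousand(16, 742))

# Concordance lines with the keyword aligned in a centre column.
for left, kw, right in concordance(tokens, "staff", width=3):
    print(f"{left:>25}  {kw}  {right}")
```

Normalising to hits per 1,000 words makes chapters of very different lengths comparable, which is why a spread table of this kind reports it alongside the raw hits.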
