Computational methods for corpus annotation and analysis by Xiaofei Lu
Corpus annotation refers to the process of introducing interpretive linguistic information into a corpus. In spite of earlier criticisms of annotation (e.g. Sinclair 2004a,b; see McEnery, Xiao, & Tono 2006:31–32, McEnery & Hardie 2012:153–57 for further discussion), there is a growing consensus that annotation enriches a corpus (McEnery & Hardie 2012:31) and represents ‘added values’ (McEnery et al. 2006:30), substantially broadening the range of research questions that corpora can help to address. Given the time and cost of manual annotation, a range of computational tools (many freely available) has been developed to automate annotation, or to assist in semiautomatic annotation, at the morphological, lexical, syntactic, semantic, and discourse levels. Xiaofei Lu’s book provides an up-to-date, hands-on, practical guide to using such computational tools in automatic corpus annotation and analysis.
The book comprises eight chapters. Ch. 1 sets the scene for the book by discussing its objectives and rationale, explaining the importance and necessity of corpus annotation in linguistic research, and outlining the book’s organization, with an overview of each chapter. Here L justifies his decision to focus on tools that are accessed through command-line interfaces: corpus annotation and analysis tools with a graphical user interface (GUI) or a web-based interface are usually quite intuitive or are accompanied by detailed user manuals, and so need less guidance.
Given the book’s focus on tools with command-line interfaces and the author’s assumption (correct, in my view) that the reader lacks prior experience with such interfaces, Ch. 2 introduces the command-line interface and illustrates the basic commands for file system management (e.g. creating, renaming, moving, deleting files/directories), as well as some commonly used commands and tools for text processing (e.g. pattern matching and regular expressions).
Chs. 3 and 4 focus respectively on corpus annotation and corpus analysis at the lexical level. Ch. 3 introduces part-of-speech (POS) tagging (including tokenization and segmentation) and lemmatization, and provides step-by-step instructions for downloading, installing, and running the Stanford POS tagger and the TreeTagger via the command-line interface. Publicly available GUI versions of the two tools for use on Windows systems—as well as some other tools, such as the Stanford tokenizer for English, Chinese, and Arabic, the web-based CLAWS POS tagger, and the Morpha lemmatizer for English—are also briefly introduced.
Ch. 4 exemplifies how lexically annotated corpora can be analyzed, in the form of various types of frequency lists and n-grams (e.g. word form, POS, lemma, and their combinations), via the command-line interface. It also illustrates lexical richness analysis, that is, lexical density (in terms of the proportion of content words), lexical variation (in terms of the type-token ratio and its variants), and lexical sophistication (in terms of the proportions of words of different frequency bands), using various command-line interface tools and other Windows- or web-based tools that are publicly available.
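To make the measures named above concrete, here is a minimal sketch in Python. It is not taken from the book, which works with command-line tools. It assumes a hypothetical toy sample of (word, POS) pairs with Penn Treebank-style tags, and it assumes that content words are those tagged as nouns, verbs, adjectives, or adverbs; the function names are illustrative only.

```python
from collections import Counter

# Hypothetical toy input: (word, POS) pairs with Penn Treebank-style tags.
tagged = [
    ("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
    ("the", "DT"), ("mat", "NN"), ("and", "CC"), ("the", "DT"),
    ("dog", "NN"), ("slept", "VBD"),
]

# Assumption: content words are nouns, verbs, adjectives, and adverbs,
# identified here by the prefix of their POS tag.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")


def frequency_list(items):
    """Return (item, count) pairs sorted from most to least frequent."""
    return Counter(items).most_common()


def ngrams(items, n):
    """Return all contiguous n-grams of the sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def type_token_ratio(words):
    """Lexical variation: number of distinct word forms over number of tokens."""
    return len(set(words)) / len(words) if words else 0.0


def lexical_density(tagged_tokens):
    """Lexical density: proportion of tokens that are content words."""
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, pos in tagged_tokens if pos.startswith(CONTENT_PREFIXES))
    return content / len(tagged_tokens)


words = [word for word, _ in tagged]
word_freqs = frequency_list(words)          # word-form frequency list
bigram_freqs = frequency_list(ngrams(words, 2))  # bigram frequency list
ttr = type_token_ratio(words)
density = lexical_density(tagged)
```

The raw type-token ratio is sensitive to text length, which is one reason its variants exist; the sketch shows only the basic ratio.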
Chs. 5 and 6 move the discussion from the lexical to the syntactic level by focusing on syntactic parsing and syntactic analysis, respectively. Ch. 5 discusses two grammar formalisms, namely phrase structure grammar and dependency grammar, and introduces two syntactic parsers based on these formalisms: the Stanford parser and Collins’s parser.
Ch. 6 consists of two parts. The first introduces some key concepts of tree relationships and provides a tutorial on downloading, installing, and running Tregex, a search engine that effectively queries parsed corpora to retrieve parse trees on the basis of tree relationships and regular expressions. The second reviews a range of metrics that measure syntactic complexity in first and second language acquisition research, and introduces a number of software tools that can be used to automate syntactic-complexity analysis based on such metrics.
Ch. 7 discusses the analysis of semantic fields and propositions at the semantic level, conversational acts at the pragmatic level, and coherence-cohesion and text structure at the discourse level. These types of corpus analysis...