In the movie Back to the future (1985), the main character is transported back thirty years into the 1950s, but the knowledge and experience that he took from the 1980s provide for a happy ending. This is somewhat analogous to the two recent corpora under discussion: the British component of the International Corpus of English (ICE-GB) and the Diachronic Corpus of Present-Day Spoken English (DCPSE). These corpora are reminiscent of corpora from thirty to forty years ago, when million-word corpora were the norm. In terms of relationships to older corpora, it is also interesting to note that half of the DCPSE actually IS a corpus from the late 1960s to early 1980s. But as in the movie, these corpora have been (re)structured and annotated in ways that make them much more useful than other small corpora of bygone days, and I believe that both still fill an important niche in today's world of English corpora.
ICE-GB is the British component of the International Corpus of English, a project that will eventually contain components from approximately twenty English-speaking countries (see http://www.ucl.ac.uk/english-usage/ice/). Each component contains one million words: 600,000 spoken and 400,000 written. ICE-GB is without a doubt the most advanced component of the overall ICE project in terms of its annotation and interface, and that annotation and interface serve as the focal point of this review. DCPSE, the second corpus under review, is composed of two parts. The first is the 600,000-word spoken part of ICE-GB and the second is the 400,000-word London-Lund Corpus, a corpus of spoken British English from the late 1960s through the early 1980s. In addition to the corpora, users receive ICECUP ('ICE Corpus Utility Program') 3.1, the software and interface that are used to access the two corpora.
The ICECUP corpus interface allows users to interact with the corpus and corpus data in a number of different ways. Via expanding tree diagrams, users can browse through the corpora (based on several different criteria), and can limit the search to a particular 'node' of the corpus, or to texts identified by a given speaker or text variables. They can search through a lexicon of all forms in the corpus, or a 'grammaticon' of all syntactic tags and the associated words. In the 'Keyword in context' display, there are many options for customizing the display, increasing and decreasing context, and so on.
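For readers unfamiliar with a 'Keyword in context' display, the idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not ICECUP's implementation: the function name `kwic`, the `width` parameter, the `*` wildcard handling, and the sample sentence are all assumptions made for the example.

```python
import re


def kwic(tokens, pattern, width=3):
    """Return (left context, keyword, right context) rows for matching tokens.

    `pattern` is a word form in which '*' stands for any run of characters.
    `width` is the number of context tokens shown on each side.
    Both names are hypothetical and chosen only for this sketch.
    """
    # Translate the simple '*' wildcard into an anchored regular expression.
    regex = re.compile(
        "^" + ".*".join(re.escape(part) for part in pattern.split("*")) + "$",
        re.IGNORECASE,
    )
    rows = []
    for i, token in enumerate(tokens):
        if regex.match(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# An invented sample sentence, split naively on whitespace.
sample = "she ended up leaving early and he ends up watching the film".split()

for left, key, right in kwic(sample, "end*", width=2):
    print(f"{left:>20} | {key} | {right}")
```

Raising or lowering `width` corresponds to increasing or decreasing the context shown around each hit.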
Users can carry out basic searches via the 'text fragment search', such as end* up <V> for ended up leaving, ends up watching, and so on. The heart of the search engine, however, is an extremely powerful (and yet relatively easy to use) interface that looks for 'fuzzy tree fragments'. Users create chart-like maps of the query by adding nodes and indicating part of speech, word forms, wildcards, and the like. The power of the fuzzy-tree-fragment searches comes from the fact that the corpora are not just tagged for part of speech, but are also parsed. Thus users can search for complex syntactic structures like 'notional direct objects', 'floating NP postmodifiers', 'cleft operators', and more than fifty other features. Users can also save query results, and can later combine queries at virtually any level of complexity. And lest users think that it is all too complex, the creators have written a 340+ page book (Exploring natural language) that guides them carefully and clearly through the full range of possibilities. In summary, I am not aware of any other interface or parsed corpora that allows users to perform such complex...