Natural language processing for online applications: Text retrieval, extraction and categorization. By Peter Jackson and Isabelle Moulinier. Philadelphia: John Benjamins, 2002. Pp. 225. ISBN 1588112500. $75 (Hb).

The title of this excellent book does not do it full justice. The phrase ‘online applications’ suggests a book about natural language processing (NLP) for websites, whatever that might mean; in fact the subject matter is much more comprehensive, comprising the major areas of industrially important NLP (exclusive of speech recognition). Full chapters are devoted to information retrieval, information extraction, and content-based text categorization, and shorter but significant coverage is given to entity identification and summarization. Every topic is discussed from both symbolic and numeric perspectives. The stated intended audience is more industrial than academic, but academics will find this book more than useful, and I would not hesitate to use it as a first reading in a seminar on any of the topics covered in the book.

An introductory chapter describes realistic goals for NLP, gives examples of what makes it difficult, and sketches statistical and nonstatistical approaches. It also provides an overview of the workaday issues to be dealt with and tools for approaching them—tokenization, sentence segmentation, stemming, part-of-speech tagging, and the like.

Ch. 2 covers information retrieval. It features a very lucid explanation of indexing and of query processing, as well as clear explanations of relevance ranking with the ‘term frequency * inverse document frequency’ weighting measure and of vector space representations.
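
For readers who have not met the weighting measure named above, a minimal sketch may help. It is not taken from the book: the function names `tf_idf` and `cosine`, the whitespace tokenizer, and the choice of raw term frequency multiplied by log(N/df) are all assumptions made for illustration, and real systems use many variants of each.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: raw term frequency times log(N / document frequency).
    """
    # Naive whitespace tokenization; an assumption for this sketch only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical three-document collection.
docs = ["the court ruled", "the court adjourned", "the protein binds"]
vectors = tf_idf(docs)
```

In this toy collection a term that occurs in every document receives a weight of zero, rarer terms receive higher weights, and documents that share a rare term end up closer in the vector space than documents that share only common ones.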

Ch. 3 covers information extraction (IE), a less-ambitious approach to natural language understanding in which a system tries not to ‘understand’ a text but rather just to extract specific sorts of facts from it—say, assertions about molecular binding events in molecular biology abstracts. This chapter is a stand-out and to my knowledge is the best overview of IE currently available. Like most discussions of IE, it covers FASTUS (Finite State Automata-based Text Understanding System), but unlike most discussions of IE, it gives a very detailed and explicit description of it. This chapter also includes substantial material on the authors’ experience with a less familiar domain, IE from legal texts.

Ch. 4 covers text categorization, which the authors define as ‘sorting documents by content’ (119). Not surprisingly, this is the most machine-learning-oriented chapter of the book and is where Bayesian classifiers, decision trees, various linear classifiers, and clustering algorithms are discussed.

Ch. 5 combines shorter coverage of two topics: named entity recognition (the location, demarcation, and classification of named things in text, e.g. finding Denver Buffalo Co., Denver, CO, and John Denver in an article and classifying them correctly as a restaurant, a place, and a person, respectively) and summarization. The discussion of summarization is especially noteworthy, being as lucid an overview as anything one might find on this subject.

Some special features of the book include solid coverage of evaluation techniques in every chapter, excellent endnotes, and references to exactly the right stuff. However, the most salient feature of this book is the clear and cogent writing. It reads much like a series of well-written review articles and is actually enjoyable to read while not skimping at all on technical detail.

K. Bretonnel Cohen
University of Colorado