5 Evaluation
From the discussion in previous chapters, it is clear that automated text mining and effective information retrieval can help realize a wide range of biological and medical goals. These goals vary in scope and domain; some examples, ordered in ascending level of difficulty, include the following:

• Supporting the curation of gene and protein information in organism-specific databases through focused, accurate retrieval;

• Providing easy access to information about bio-entities within displayed text by highlighting and hyperlinking such entities;

• Automatically reconstructing models of molecular networks from the published literature (an ambitious and not always well-defined task).

Generalizing beyond explicit words, sentences, or documents, text can be viewed more broadly as an additional, rich source of data complementing other forms of data such as gene and protein sequences, expression data, mutations and other genetic variations, and many others. We revisit some of these examples in detail in chapter 6.

While much work has been done in all these directions for over a decade now, a critical question is how well such systems perform. Every computational retrieval, extraction, and NLP system used in practice is judged by its performance, which is, in turn, evaluated either informally, through popularity and user satisfaction, or formally, through well-established performance measures. In this chapter, we focus on the latter: the formal evaluation of text mining and retrieval systems. Measuring user satisfaction is a topic better covered by other disciplines, such as human-computer interaction or marketing, but we briefly touch on it here, primarily concerning its relationship to formal evaluation initiatives within the biomedical domain.

5.1 Performance Evaluation in Text Retrieval and Extraction

The evaluation of any system aims to answer a single question: How good is it? This question is best answered by posing yet another question: Good for what? The latter leads us to three main components, which are at the core of any text mining evaluation:

Task: A clear statement of what the evaluated system is supposed to achieve or do;

Gold Standard: A corpus of instances of the task at hand, along with the correct solutions assigned to the corpus;

Evaluation Metric: Objective functions by which one can quantitatively measure the performance of the system with respect to the task at hand. Such metrics are typically calculated from the results the system produces when applied to the gold standard.

Thus, an evaluation is carried out to quantitatively measure the merit of a text mining system with respect to the specific tasks the system is designed to perform. If a system is supposed to retrieve text relevant to a user's needs, we should evaluate it based on how relevant the retrieved documents are to that same user's needs. It is important to note that quality can only be accurately evaluated with respect to specific, well-defined tasks. Thus, if there are multiple dimensions or parameters on which a system should be evaluated (such as ease of use, response time, or generalizability to other domains), all these dimensions need to be specified when defining the task and carefully taken into consideration when designing the gold standard and the evaluation metrics. We should not evaluate a system based on criteria that were not specified at the outset.
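To make the three components concrete, consider the following minimal sketch, which is not drawn from the chapter itself: the task is document retrieval for a single query, the gold standard is a set of hypothetical relevance judgments, and the evaluation metric is the standard pair of precision and recall together with their harmonic mean, the F1 score. All document IDs and names below are invented for illustration.

    # Minimal sketch of the three components: a task (document retrieval),
    # a gold standard (relevance judgments), and an evaluation metric
    # (precision, recall, and F1). All IDs below are hypothetical.

    def precision_recall_f1(retrieved, relevant):
        """Score a system's retrieved set against the gold-standard relevant set."""
        retrieved, relevant = set(retrieved), set(relevant)
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        return precision, recall, f1

    # Gold standard: documents judged relevant to one query.
    gold_relevant = {"PMID:101", "PMID:102", "PMID:105"}
    # System output: documents the system retrieved for the same query.
    system_retrieved = {"PMID:101", "PMID:103", "PMID:105", "PMID:107"}

    p, r, f = precision_recall_f1(system_retrieved, gold_relevant)
    print(f"precision={p:.2f}  recall={r:.2f}  F1={f:.2f}")
    # precision=0.50  recall=0.67  F1=0.57

On this toy data, the system retrieves four documents, two of which the gold standard judges relevant, giving a precision of 0.50 and a recall of 0.67.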
Evaluation is typically carried out by comparing several systems that perform the same task on the same dataset. Specifically, for the comparison to be valid, the systems must be run on the same text corpus. To know which systems actually perform better, we must also know what the expected correct results are. Thus, as part of the task setting, we have a text corpus for which the correct answers have already been given, and the systems try to replicate these correct answers. The text corpus, along with the correct answers, is typically referred to as the gold standard for the task. A common underlying assumption is that the gold standard data used in the evaluation is free of errors, although clearly such an ideal scenario rarely occurs in practice. Careful analysis of the results at the end of the evaluation is likely to help expose and address imperfections in the data. Moreover, recognizing that the data may be noisy and imperfect motivates the development of more robust systems that can perform well even in the face of missing or noisy data. To...
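Continuing the hypothetical sketch above, comparative evaluation amounts to scoring every system against the same corpus and the same gold standard, so that the resulting numbers are directly comparable. The systems and judgments below are again invented for illustration.

    # Hypothetical comparative evaluation: every system is run on the same
    # corpus and scored against the same gold standard, so the resulting
    # scores are directly comparable.

    gold = {"PMID:101", "PMID:102", "PMID:105"}

    runs = {
        "system_A": {"PMID:101", "PMID:103", "PMID:105", "PMID:107"},
        "system_B": {"PMID:101", "PMID:102", "PMID:105", "PMID:109"},
    }

    for name, retrieved in sorted(runs.items()):
        tp = len(retrieved & gold)
        precision = tp / len(retrieved)
        recall = tp / len(gold)
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        print(f"{name}: P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
    # system_A: P=0.50  R=0.67  F1=0.57
    # system_B: P=0.75  R=1.00  F1=0.86

Here system_B would be judged the stronger system on this gold standard; had the two systems been scored on different corpora or against different judgments, no such conclusion could be drawn.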