1. Introduction
- The MIT Press
The current millennium started with the sequencing of the human genome. There are now thousands of sequenced genomes available, covering a wide range of organisms and a broad collection of individuals within the human population. Additionally, there is a multitude of datasets characterizing dynamic aspects of cells such as molecular abundances, interactions, and localizations. The hope is that in knowing and analyzing the sequences of such genomes and associated data, scientists are opening the “book of life” and will be able to understand the intricate processes governing life, death, and disease at the most basic molecular level. However, the enterprise of understanding this book of life is one of enormous complexity, requiring the sustained efforts of many researchers working in a wide range of scientific areas.
Knowledge about biological entities and processes has been acquired by thousands of scientists through decades of experimentation and analysis. This knowledge is often represented in text form. Much of it is published in the vast biomedical literature, but there are many other sources of scientific information in text form, such as lab notebooks and web pages. Currently, there are numerous organized efforts focused on representing some of this information in structured and accessible formats within publicly available databases and repositories. The goal of such efforts is to enable scientists to quickly relate and compare new findings with previous ones in the hope of expediting productive discovery and research.
A notable characteristic of the current era is massive-scale production of data and information. For example, numerous types of high-throughput methods, including genome-scale sequencing, have transformed biology into a data-rich science. At the same time, there has been a steady and overwhelming increase in the number of scientific publications. For example, citations for almost four million articles were added to the PubMed database [186] in the period from 2006 to 2010. This is more than twice as many as were added during a similar five-year period 20 years earlier. These parallel trends give rise to a situation in which biomedical researchers have more data than ever to analyze and interpret, while there is far more available background knowledge that must be taken into account for the analysis and interpretation of these data.
Consider a typical case in which a high-throughput biological experiment results in significant responses for hundreds of genes. The scientific community’s knowledge about these genes and their relationships to one another is distributed across numerous databases and thousands of articles. This situation, and many variations of it, calls for a significantly larger role to be played by automated text-analysis methods in the biomedical sciences.
Consequently, there has been a surge of interest in biomedical text analysis over the past decade. Researchers from a wide range of disparate communities (including natural language processing, information retrieval, biomedical informatics, and the life sciences) have contributed ideas and applications to this enterprise.
Despite the progress that has been made in biomedical text mining, this technology is not nearly as widely exploited in the biomedical domain as it could be. The goal of this book is to introduce researchers from a variety of backgrounds to the key ideas in biomedical text mining. In particular, we discuss (1) the distinct text-mining tasks that have been framed, (2) the principal challenges that are involved in addressing these tasks, (3) a broad and versatile toolbox of methods for accomplishing these tasks, (4) methodology for empirically evaluating text-mining systems, and (5) the ways in which text-mining methods are being applied to address several challenging problems in biomedicine.
This book provides a structured introduction to biomedical text mining from two perspectives. One perspective is application oriented. The main chapters of the book are organized around the principal tasks that are addressed by text-mining systems, and chapter 6 describes how such systems have been assembled and applied in several significant biomedical applications. A second perspective is method oriented. We describe methods that can be employed for a wide variety of applications, focusing on the principles underlying these methods. Additionally, we discuss similarities and differences among these methods that are independent of any specific application.
1.1 What Is Biomedical Text Mining?
The terms text mining, literature mining, or text data mining have seen much use within the biomedical domain during the past decade [4, 5, 50, 91, 128, 214, 284]. The general research area...