2. Fundamental Concepts in Biomedical Text Analysis
- The MIT Press
The development of the Internet has made it easy for biologists to create databases and online portals representing various aspects of biological knowledge and to make these resources publicly available. Although there are hundreds of such online resources¹ representing biological knowledge in a structured format, much of the scientific community’s knowledge is represented only as unstructured text. A structured format is one in which information is organized and represented in a formal and predefined manner. For example, a relational database consists of multiple tables corresponding to predefined relations. Each table is defined by a fixed set of fields, each of which has a prespecified meaning and data type. By contrast, information represented in ordinary natural language does not have this structured format. Sentences may describe ideas abstractly or concretely, directly or obliquely. Moreover, sentences describing the same or similar ideas may have very different syntax and employ very different vocabularies.

¹ The interested reader is referred to the Nucleic Acids Research annual Database Issue, which catalogs many of these systems [72].

There are several reasons why a large amount of scientific knowledge is represented only in free-text form. First, most structured databases suffer from a “curation bottleneck.” Typically, the contents of these databases are populated and maintained by scientists, known as curators, who have expertise in the area covered by the database. The curation bottleneck means that the completeness of a database is limited by the rate at which these curators can find relevant articles, extract the information of interest from them, and enter this information in a structured format into the database. In short, it is often the case that curators cannot keep up with the pertinent literature. Second, there are important facets of the biomedical literature that are not represented by existing databases. For example, although there are many databases that characterize the functions of individual genes in various organisms, these databases generally do not describe how the gene functions are disrupted by pathogens that may infect the organisms. Moreover, the nuances and qualifications that typically accompany descriptions of scientific findings in articles are often not represented in structured databases, even when the findings themselves are referenced.

For these reasons, there is a compelling need for methods that can exploit the vast web of knowledge that is represented in text sources. In this chapter, we describe some of the text sources that contain large amounts of biomedical knowledge and discuss the fundamental tasks of natural language processing (NLP). Natural language processing is an area that brings together methods from linguistics and computer science in order to automatically analyze and elicit meaning from text that is written in a natural language, such as English. Finally, we discuss the concepts of controlled vocabularies and ontologies and explain how these concepts are connected to the topic of biomedical text mining.

2.1 Biomedical Text Sources

Biomedical knowledge represented using natural language is found in many online resources. The most accessible source of biomedically relevant text is PubMed [186], which is an online database of journal citations and abstracts. PubMed is managed by the National Library of Medicine, which is part of the US National Institutes of Health (NIH). The largest component of PubMed is a database called MEDLINE that indexes more than 5,000 biomedical journals on a regular basis. In addition to citation information and abstracts, MEDLINE entries also include index terms from a controlled vocabulary called Medical Subject Headings (MeSH), which is discussed in section 2.5.
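As a rough illustration of how a MEDLINE-style entry mixes structured fields with free text, the following Python sketch models an entry as a record with fixed, typed fields plus an unstructured abstract. The class name `CitationRecord`, its field names, the helper `records_indexed_with`, and all sample data are hypothetical and chosen only for illustration; they do not reflect the actual MEDLINE record format, and the index terms shown are illustrative strings that are not guaranteed to match actual MeSH headings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CitationRecord:
    """Hypothetical, simplified stand-in for a MEDLINE-style entry."""
    title: str       # citation information (structured)
    journal: str     # citation information (structured)
    year: int        # citation information (structured)
    abstract: str    # unstructured free text
    # Index terms drawn from a controlled vocabulary (MeSH in the real system).
    mesh_terms: List[str] = field(default_factory=list)


def records_indexed_with(records, term):
    """Return the records whose index terms include `term`.

    Matching is exact on the whole term, ignoring case. The abstract text
    is never consulted: only the structured index terms are searched.
    """
    wanted = term.lower()
    return [r for r in records if wanted in (t.lower() for t in r.mesh_terms)]


# Made-up example data, for illustration only.
records = [
    CitationRecord(
        title="Example article A",
        journal="Example Journal",
        year=2001,
        abstract="We describe how a pathogen disrupts the function of a host gene.",
        mesh_terms=["Host-Pathogen Interactions", "Genes"],
    ),
    CitationRecord(
        title="Example article B",
        journal="Another Example Journal",
        year=2003,
        abstract="A survey of curated databases.",
        mesh_terms=["Databases, Factual"],
    ),
]

hits = records_indexed_with(records, "genes")
print([r.title for r in hits])
```

Searching the index terms is a simple lookup because their meaning is fixed in advance. Answering the same question from the abstracts alone would require interpreting natural-language sentences, which is the problem the NLP methods in this chapter address.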
PubMed is a superset of MEDLINE that includes other citations, such as articles that are awaiting MeSH indexing before being included in MEDLINE, articles that were published in a given journal before it was selected for inclusion in MEDLINE, and other special cases. The PubMed portal [186] provides links to the full text of many articles and a variety of other services, such as references to similar articles and the ability to save and automatically update queries.

Another obvious source of biomedically relevant text is the primary scientific literature. Virtually every biological journal has an associated website that makes published articles available in electronic form. For many of these journals, access to the full text of the articles requires either having an institutional subscription to the journal or paying a per-article fee. Commonly, online journal articles are published in...