3. Information Retrieval
- The MIT Press
In its most basic form, information retrieval is the task of finding a set of relevant documents in a large text collection. Naturally, the relevance of a document depends on our particular information need at a given moment. Most of us perform information retrieval on a daily basis, using search engines such as Google or, for searches specific to the biomedical domain, PubMed.

The typical retrieval task performed using such search engines is known as ad hoc retrieval. Under this retrieval scenario, a user specifies a query, which is most often a Boolean combination of terms or words, and hopefully obtains all and only the documents within the database that satisfy the query conditions. Different users issue different queries while the same user may issue different queries at different times. During a short interactive session, even the same user may change his or her queries to express a refined or an altogether different information need.

It is often difficult to accurately express the information need using Boolean combinations alone. As experience shows us, when trying to express our needs as queries to search engines, relevant documents can be missed while irrelevant ones are retrieved. We will discuss the causes for these phenomena in section 3.2. To address the problem, another form of ad hoc retrieval is based on similarity queries. Under this framework, instead of forming an explicit Boolean combination of terms, a set of terms or words constitutes the query and is matched against the text collection using some similarity criteria. The documents that are most similar to the query are retrieved. We discuss this paradigm in detail in section 3.3.

In contrast to ad hoc retrieval, where queries are issued at any time an information need arises, another task in information retrieval is text categorization.
In this case, the goal is to partition a set of documents into a number of categories, where the documents in each category share a topic of interest. Such categories may be a priori defined by experts, in which case we usually refer to the categorization task as classification. For instance, a collection of medical documents may be categorized based on the disease they discuss, where documents discussing lung cancer form one category and those that discuss leukemia form another. Alternatively, categories may be automatically uncovered or discovered by the categorization system itself, in which case we refer to the categorization process as clustering. These distinctions are further discussed later in the chapter.

As each category is characterized by a topic, a category can be viewed as a collection of documents satisfying a certain query that defines the topic. Thus categorization turns out to be a retrieval task in which the set of queries characterizing the categories is fixed, and each document is tested against these queries to decide its category. Specific types of text categorization under this view are known as routing and filtering. We discuss text categorization and its subcomponents in section 3.5.

3.1 Example: The BRCA1 Pathway (Revisited)

Let us revisit the example given in chapter 1 of searching for information about the FANCF protein, which is involved in the BRCA1 pathway. Suppose we are specifically looking for information about FANCF’s interaction with other proteins. To start, we perform a search using the Google search engine in the simplest manner. As Google’s default Boolean operator is AND, we simply type the terms FANCF BRCA1 interaction, which is interpreted as a request to retrieve the documents containing all three words: FANCF and BRCA1 and interaction. The retrieval produces a set of 4,370 hits,¹ covering websites and documents containing all three terms.
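The AND query described above, and the earlier view of categorization as a fixed set of queries, can both be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `tokenize`, `boolean_and`, and `categorize` are hypothetical, the four-document collection is invented toy text rather than real search output, and real search engines rely on indexes and ranking rather than a linear scan like this one.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_and(query, documents):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    return [doc_id for doc_id, text in documents.items()
            if terms <= tokenize(text)]  # subset test = AND over all terms


def categorize(documents, category_queries):
    """Test the documents against a fixed set of AND queries, one per category."""
    return {name: boolean_and(query, documents)
            for name, query in category_queries.items()}


# Invented toy collection (an assumption, not real retrieval results).
documents = {
    "d1": "FANCF and BRCA1: no direct interaction reported",
    "d2": "BRCA1 interaction with FANCA",
    "d3": "Leukemia treatment outcomes",
    "d4": "Lung cancer screening and BRCA1",
}

# Ad hoc retrieval: the user issues a query when the need arises.
hits = boolean_and("FANCF BRCA1 interaction", documents)

# Categorization: the queries are fixed and each document is tested against them.
categories = categorize(
    documents,
    {"lung cancer": "lung cancer", "leukemia": "leukemia"},
)

print(hits)
print(categories)
```

Note that the toy document `d1` satisfies the query even though its text denies an interaction: matching all three terms says nothing about what the document asserts about them.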
Many of these documents discuss BRCA1 interactions with other proteins or entities. For instance, one of them [65] states that BRCA1 directly interacts with the protein FANCA, but not with other FANC proteins, and specifically not with FANCF. Clearly, looking through the 4,370 retrieved sites and references to actually find the proteins interacting with FANCF within the BRCA1 pathway is still a difficult and time-consuming task. To narrow our search, we can simply use a limited database rather than the whole web and issue the same Boolean query against the biomedical database PubMed.

1. Query results from www.google.com obtained on June 6, 2011.

In this case, typing in the terms “FANCF BRCA1 interaction” results in just four...