In lieu of an abstract, here is a brief excerpt of the content:

6 Putting It All Together: Current Applications and Future Directions

Throughout the previous chapters we have covered a variety of text-mining methods applicable to the broad range of tasks involved in obtaining information from text. At the beginning of chapter 1 we listed several goals within the biomedical domain that can be realized through the use of text. In this chapter we provide examples of systems and tools that have been developed to support such specific biomedical goals, and discuss in more detail the text-based methods that they employ.

6.1 Recognizing and Linking Bioentities

The identification of bioentities such as genes, proteins, small molecules, drugs, and diseases, as described in chapter 4, can support numerous tasks of interest, including: (1) accessing the literature relevant to specific bioentities, (2) identifying relationships among different entities, and (3) linking across a variety of data sources that provide further information about such entities.

A system that makes highly effective use of finding genes, proteins, and small molecules in text is the Reflect system [177, 195]. Reflect was the winning system in the Elsevier Grand Challenge 2008 [63], which aimed to develop and showcase systems that improve access to and use of scientific information published online in databases and journals. Reflect uses a simple dictionary-based approach (as discussed in section 4.1.1) to identify genes and proteins in text. It provides a lightweight but important enhancement to web browsers, highlighting and hyperlinking in real time the genes, proteins, and small molecules mentioned within any text displayed by the browser. The text in an HTML document is processed by the Reflect server before it is displayed.
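To make the dictionary-based idea concrete, the following is a minimal sketch in Python of matching text against a name dictionary and wrapping each hit in a hyperlink. It is not Reflect's actual implementation: the dictionary entries, the placeholder identifiers, the function names, and the case-insensitive whole-word matching are all assumptions made for illustration, and the sketch operates on plain text rather than on a full HTML document.

```python
import html
import re

# Hypothetical toy dictionary: surface form -> (entity type, identifier).
# The entries and identifiers are illustrative placeholders only.
DICTIONARY = {
    "NR3C1": ("protein", "ID:0001"),
    "glucocorticoid receptor": ("protein", "ID:0001"),
    "corticosterone": ("chemical", "ID:0002"),
}


def build_pattern(names):
    """Compile one regex that matches any dictionary name as a whole word."""
    # Longest names first, so multi-word names win over shorter ones.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def tag_entities(text, dictionary=DICTIONARY):
    """Return the text as HTML with each dictionary match wrapped in a link."""
    # Case-insensitive lookup is an assumption of this sketch.
    lookup = {name.lower(): entry for name, entry in dictionary.items()}
    pattern = build_pattern(dictionary)
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        kind, ident = lookup[match.group(0).lower()]
        # Escape the untouched text between matches.
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append(
            '<a class="{}" href="#{}">{}</a>'.format(
                kind, html.escape(ident), html.escape(match.group(0))
            )
        )
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(tag_entities("NR3C1 binds corticosterone."))
```

Listing both "NR3C1" and "glucocorticoid receptor" under the same placeholder identifier shows how synonyms can resolve to one entity. A production dictionary would be far larger, and a single regular-expression alternation would likely be replaced by a more scalable matching structure.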
Organism names are recognized first; protein and gene names, as well as identifiers of small molecules, are then found through matching against a comprehensive dictionary. The dictionary contains names and synonyms of the relevant entities and is constructed by compiling names from multiple public resources on genes, proteins, and chemicals, while allowing multiple orthographic variations. The text is then displayed with the recognized named entities highlighted and linked to information that is shown in pop-ups when the entities are clicked. Figure 6.1 illustrates the pop-ups and their links to the main web page. The information displayed for a selected gene or protein includes sequence and domain information (from the SMART database [136]), a graph showing significant interaction partners (from the STITCH database [232]), the best-matching three-dimensional protein structure (according to the Protein Data Bank [183]), and information about the likely subcellular location of the protein. For small molecules, the two-dimensional structure is shown (from PubChem [184]), as well as significant interactions in which the molecules participate (from the STITCH database [232]).

Figure 6.1 A screenshot illustrating the Reflect system. Genes, proteins, and small molecules are highlighted in the article shown in the browser window on the left. The pop-up windows on the right show detailed information about a selected protein (NR3C1) and a selected small molecule (corticosterone). (Courtesy of Sean O’Donoghue, December 2011.)

Another readily accessible system that makes extensive use of both bioentity identification and the linking of entities throughout the biomedical literature is the iHOP web-based tool [93, 94, 103].
The iHOP system links genes and proteins with the scientific literature discussing them and detects relationships among genes and proteins through their co-occurrence in the literature. Unlike Reflect, which processes one web page at a time on demand, iHOP extracts its information by methodically pre-processing PubMed. The system scans through millions of PubMed abstracts and provides access to those that discuss genes and proteins. Given a gene or a protein name, iHOP displays the sentences containing the name, while highlighting and hyperlinking other entities that may be of relevance, such as genes, proteins, and MeSH terms (see section 2.5). Using a dictionary-based approach, genes and proteins are identified, as are MeSH terms, organisms, and verbs denoting potential interactions. The identified entities are displayed to the user in the context of the sentences containing them. Text denoting other genes and proteins is highlighted and hyperlinked to the sentences discussing these entities. Verbs denoting potential interactions are clearly shown. MeSH terms are also highlighted and hyperlinked to other sentences containing the same MeSH terms. A mechanism is provided for extending the query to Google or...
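The sentence-level co-occurrence idea described for iHOP can be illustrated with a minimal Python sketch. It is not iHOP's implementation: the synonym table, the naive sentence splitter, and the function names are hypothetical, and a real system would pre-process PubMed at scale with a far richer dictionary that also covers MeSH terms, organisms, and interaction verbs.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical synonym table: surface form -> canonical gene name.
GENE_SYNONYMS = {
    "TP53": "TP53",
    "p53": "TP53",
    "MDM2": "MDM2",
    "BRCA1": "BRCA1",
}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [part.strip() for part in parts if part.strip()]


def find_genes(sentence, synonyms=GENE_SYNONYMS):
    """Return the canonical names of all dictionary genes in a sentence."""
    found = set()
    for surface, canonical in synonyms.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", sentence, re.IGNORECASE):
            found.add(canonical)
    return found


def build_index(abstracts):
    """Index sentences by gene and count sentence-level co-occurrences."""
    sentences_by_gene = defaultdict(list)
    cooccurrence = defaultdict(int)
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            genes = find_genes(sentence)
            for gene in genes:
                sentences_by_gene[gene].append(sentence)
            # Each unordered pair of genes in one sentence counts once.
            for pair in combinations(sorted(genes), 2):
                cooccurrence[pair] += 1
    return sentences_by_gene, cooccurrence


if __name__ == "__main__":
    toy_abstracts = [
        "MDM2 binds p53. BRCA1 is unrelated here.",
        "TP53 is regulated by MDM2.",
    ]
    by_gene, pairs = build_index(toy_abstracts)
    for sentence in by_gene["TP53"]:
        print(sentence)
    print(dict(pairs))
```

Looking up a gene in `sentences_by_gene` returns the sentences to display for it, and the pair counts give a crude signal of which genes may be related. Co-occurrence alone does not establish an interaction.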
