Text mining and information retrieval
In the past several years, many projects have been launched to digitize the information assets of organizations and branches of knowledge. By facilitating access to digital resources and improving the quality of their encoding and metadata, these projects have also motivated the development of more effective techniques and strategies for the research and analysis of textual material, whether on the web or within documentation generated by organizations.
It is in this context that interest in text-mining techniques for the analysis and management of digital information has arisen. Text mining is itself an important research area whose concepts and techniques derive mainly from work in data mining, artificial intelligence, and machine learning. However, conventional text mining (automatic clustering, automatic categorization, named entity recognition, etc.) also integrates concepts and techniques that have emerged from the field of computational linguistics. Text mining thus combines numerical analytical techniques with symbolic approaches based on the linguistic processing of textual data.
Recently, several researchers (Feldman and Sanger 2006; Ibekwe-SanJuan 2007; Srivastava and Sahami 2009; Weiss et al. 2005) have stressed the importance of exploring the appropriateness of applying techniques of text mining to information retrieval from the perspective of information science. In this effort, some online search-engine prototypes based on automatic text clustering have been developed (Carrot search, Clusty, etc.). Preliminary results of these prototypes have proven to be most relevant. However, few studies have actually explored and rigorously evaluated the appropriateness of using text-mining techniques in the context of information retrieval.
This special issue of the Canadian Journal of Information and Library Science presents the results of recent research into text-mining techniques from an information retrieval perspective. The four articles in this issue were selected from sixteen submitted for review; the evaluation was carried out by leading specialists in these fields.
The paper by Andreani and his colleagues, “Normalisation des entités nommées : allier règles déclaratives, ressources endogènes et processus centré sur l’utilisateur,” presents techniques for standardization of named entities encountered in documents processed by TecKnowMetrix. Named entities are expressions that denote individuals or “unique entities” such as geographic locations, names of persons, organizations or products, dates, etc. The named entities considered by these authors are primarily names of organizations, which are pervasive in their corpus of patents, scientific publications, and techno-economic press articles. Their concern is to reduce all variants of the same named entity, such as Mitsubishi, Mitsubishi KK, and Mitsubishi Corp., to a single canonical form. This standardization is essential to ensure the reliable identification of all information relating to a particular entity. It cannot rely on linguistic resources such as lexical or terminological dictionaries, however, since it applies to proper names, whose forms are subject to little or no regulation.
The normalization techniques used include (1) the use of specific vocabularies that identify the presence, the type, and sometimes the nationality of an organization, and a dictionary of country names, (2) the decomposition of complex names into their constituent base, (3) the gradual rewriting of organization names to conform to a standardized version, (4) a match of the resulting form to other, previously standardized forms, and (5) the identification of common sub-sequences of a given expression within the corpus. Their interactive approach is based on rules and on validation by a user.
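As a rough illustration of steps (1) to (4), the following Python sketch reduces the Mitsubishi variants mentioned above to one canonical form. It is not the system described by Andreani and his colleagues: the vocabulary in `TYPE_MARKERS` and `STRIPPABLE`, the function names, and the choice to accept the first form seen as canonical (standing in for validation by a user) are all assumptions made for the example.

```python
import re

# Hypothetical vocabulary: tokens that signal an organization's type.
TYPE_MARKERS = {
    "kk": "business",
    "corp": "business",
    "corporation": "business",
    "inc": "business",
    "ltd": "business",
    "university": "academic",
}

# Hypothetical list of legal-form tokens that are removed from the base name.
STRIPPABLE = {"kk", "corp", "corporation", "inc", "ltd"}


def tokenize(name):
    """Lowercase a raw name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def normalize(name, known_forms):
    """Return (canonical_form, org_type) for a raw organization name.

    known_forms maps a base key to a previously accepted canonical form.
    A new key is accepted automatically here; an interactive system
    would ask a user to validate it instead.
    """
    org_type = None
    base = []
    for token in tokenize(name):
        # Step 1: use the vocabulary to detect the organization type.
        if org_type is None and token in TYPE_MARKERS:
            org_type = TYPE_MARKERS[token]
        # Steps 2 and 3: keep only the base tokens of the name.
        if token not in STRIPPABLE:
            base.append(token)
    key = " ".join(base)
    # Step 4: match against previously standardized forms.
    if key not in known_forms:
        known_forms[key] = key.title()
    return known_forms[key], org_type


known = {}
for variant in ["Mitsubishi", "Mitsubishi KK", "Mitsubishi Corp."]:
    print(variant, "->", normalize(variant, known))
```

All three variants map to the same canonical string, while the type is only recovered when a marker such as "KK" or "Corp." is present. Step (5), the identification of common sub-sequences across the corpus, is left out of this sketch.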
Results from an expert’s evaluation are presented, distinguishing between cases where standardization is correct, partial, or incorrect (with noise or silence). Exact normalizations (depending on the type of publication) represent an average of 84% for names of organizations, 83.8% for type (“academic” or “business”), and 62.4% for organizations’ country of origin. Results are lower when combining two or three criteria. More specifically, the evaluation highlights the fact that 86.4% of named entities have had to undergo standardization before reaching the canonical expected form, indicating that the procedure is essential for efficient retrieval.
The article illustrates the usefulness of text-mining techniques (here, based significantly on natural language processing...