The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future
J. Stephen Downie

Music Information Retrieval (MIR) is a multidisciplinary research endeavor that strives to develop innovative content-based searching schemes, novel interfaces, and evolving networked delivery mechanisms in an effort to make the world's vast store of music accessible to all. Some teams are developing "Query-by-Singing" and "Query-by-Humming" systems that allow users to interact with their respective music search engines via queries that are sung or hummed into a microphone (e.g., Birmingham et al. 2001; Haus and Pollastri 2001). "Query-by-Note" systems are also being developed wherein searchers construct queries consisting of pitch and/or rhythm information (e.g., Pickens 2000; Doraisamy and Rüger 2002). Input methods for Query-by-Note systems include symbolic interfaces as well as both physical (MIDI) and virtual (Java-based) keyboards. Some teams are working on "Query-by-Example" systems that take pre-recorded music in the form of CD or MP3 tracks as their query input (e.g., Haitsma and Kalker 2002; Harb and Chen 2003). The development of comprehensive music recommendation and distribution systems is a growing research area (e.g., Logan 2002; Pauws and Eggen 2002). The automatic generation of playlists for use in personal music systems, based on a wide variety of user-defined criteria, is the goal of this branch of MIR research. Other groups are investigating the creation of music analysis systems to assist those in the musicology and music theory communities (e.g., Barthélemy and Bonardi 2001; Kornstädt 2001). Overviews of MIR's interdisciplinary research areas can be found in Downie (2003), Byrd and Crawford (2002), and Futrelle and Downie (2002).
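To make the symbolic query approaches concrete, the following is a minimal, illustrative sketch, not drawn from any of the cited systems, of the kind of matching a Query-by-Note or Query-by-Humming engine might perform: melodies are reduced to a Parsons-style pitch-contour code (U = up, D = down, R = repeat) and ranked by edit distance against a small, hypothetical collection. All function names and the toy data are assumptions for illustration only.

```python
def parsons_code(pitches):
    """Reduce a pitch sequence (e.g., MIDI note numbers) to a contour string."""
    code = []
    for prev, curr in zip(pitches, pitches[1:]):
        code.append("U" if curr > prev else "D" if curr < prev else "R")
    return "".join(code)

def edit_distance(a, b):
    """Classic Levenshtein distance between two contour strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j] is the old row value, dp[j-1] the new one, prev the diagonal
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

# Hypothetical mini-collection of incipits, given as MIDI pitch numbers.
collection = {
    "Ode to Joy":       [64, 64, 65, 67, 67, 65, 64, 62],
    "Twinkle, Twinkle": [60, 60, 67, 67, 69, 69, 67],
}

def search(query_pitches, top_k=2):
    """Rank the collection by contour edit distance to the query."""
    q = parsons_code(query_pitches)
    ranked = sorted(
        (edit_distance(q, parsons_code(p)), title) for title, p in collection.items()
    )
    return ranked[:top_k]

if __name__ == "__main__":
    # A sung or hummed query would first be transcribed to pitches; here we fake it
    # with a transposed "Ode to Joy" incipit, which matches with distance 0.
    print(search([62, 62, 64, 65, 65, 64, 62, 60]))
```

Real query-by-humming engines must additionally cope with transcription errors, tempo variation, and approximate rhythm, which is precisely why shared test collections and evaluation metrics of the kind discussed below matter.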

This article begins with an overview of the current scientific problem facing MIR research. Entitled "Current Scientific Problem," the opening section also provides a brief explication of the Text Retrieval Conference (TREC) evaluation paradigm that has come to play an important role in the community's thinking about the testing and evaluation of MIR systems. The sections that follow, entitled "Data Collection Method" and "Emergent Themes and Commentary," report upon the findings of the Music Information Retrieval (MIR)/Music Digital Library (MDL) Evaluation Frameworks Project, with issues surrounding the creation of a TREC-like evaluation paradigm for MIR as the central focus. "Building a TREC-Like Test Collection" follows next and highlights the progress being made concerning the establishment of the necessary test collections. The "Summary and Future Research" section concludes this article and highlights some of the key challenges uncovered that require further investigation.

Current Scientific Problem

Notwithstanding the promising technological advancements being made by the various research teams, MIR research has been plagued by one overarching difficulty: there has been no way for research teams to scientifically compare and contrast their various approaches. This is because there has existed no standard collection of music against which each team could test its techniques, no standardized sets of performance tasks, and no standardized evaluation metrics.

The MIR community has long recognized the need for a more rigorous and comprehensive evaluation paradigm. A formal resolution expressing this need was passed on 16 October 2001 by the attendees of the Second International Symposium on Music Information Retrieval (ISMIR 2001). (See music-ir.org/mirbib2/resolution for the list of signatories.)

Over a decade ago, the National Institute of Standards and Technology (NIST) developed a testing and evaluation paradigm for the text-retrieval community, called the Text REtrieval Conference (TREC; see trec.nist.gov). Under this paradigm, each text retrieval team is given access to a standardized, large-scale test collection of text; a standardized set of test queries; and a standardized evaluation of the results each team generates.
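To make the notion of a standardized evaluation of results concrete, here is a small, hypothetical sketch (not the actual TREC scoring software) of one common TREC-style metric, average precision, computed from a team's ranked result list and a shared set of relevance judgments. The run and judgment data are invented for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the ranks where relevant items appear."""
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical run for one query: a team's ranked output and pooled judgments.
run = ["doc42", "doc07", "doc13", "doc99", "doc55"]
qrels = {"doc07", "doc55", "doc61"}

print(f"AP = {average_precision(run, qrels):.3f}")
# Relevant items appear at ranks 2 and 5 -> (1/2 + 2/5) / 3 = 0.300
```

Because every team is scored against the same collection, queries, and judgments, such figures can be compared directly across systems, which is the capability the MIR community has lacked.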

The TREC approach to evaluation can also be thought of as an annual cycle of events. In the late fall of each year, NIST sends out its calendar and call for participation. By the end of February, the interested teams have signed up for these events. In 2001, there were 87 participating groups, representing 21 different countries (Voorhees 2002). The official test collections are then sent out to the participants in March. Over the course of the spring...
