
A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures
Adam Berenzweig, Daniel P.W. Ellis, Beth Logan, and Brian Whitman

A valuable goal in the field of Music Information Retrieval (MIR) is to devise an automatic measure of the similarity between two musical recordings based only on an analysis of their audio content. Such a tool, a quantitative measure of similarity, can be used to build classification, retrieval, browsing, and recommendation systems. Developing such a measure, however, presupposes some ground truth: a single underlying similarity that constitutes the desired output of the measure. Music similarity is an elusive concept, wholly subjective, multifaceted, and a moving target, but one that must be pursued to support applications that automatically organize large music collections.

In this article, we explore music-similarity measures in several ways, motivated by different types of questions. We are motivated first by the desire to improve automatic, acoustic-based similarity measures. Researchers from several groups have recently tried many variations of a few basic ideas, but it remains unclear which are best suited to a given application. Few authors compare multiple techniques, and results from different authors cannot be compared at all, because they lack the required common ground: a common database and a common evaluation method.

Of course, to improve any measure, we need an evaluation methodology: a scientific way of determining whether one variant is better than another. Otherwise, we are left to intuition, and nothing is gained. In our previous work (Ellis et al. 2002), we examined several sources of human opinion about music similarity, on the premise that human opinion must be the final arbiter of music similarity, because it is a subjective concept. However, as expected, there are as many opinions about music similarity as there are people to ask, and so the second question is how to unify the various sources of opinion into a single ground truth. As we shall see, this turns out to be the wrong way to frame the problem, and we instead develop the concept of a "consensus truth" rather than a single ground truth.

Finally, armed with these evaluation techniques, we provide an example of a cross-site evaluation of several acoustic and subjective similarity measures. We address two main research questions. Regarding the acoustic measures, which feature spaces and which modeling and comparison methods are best? Regarding the subjective measures, which provides the best single ground truth, that is, which agrees best on average with the other sources?
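
To make the second question concrete, one simple way to ask which source "agrees best on average with the other sources" is to score each source by its mean pairwise agreement with all the others. The sketch below uses top-N overlap between ranked lists of similar artists as the agreement metric; the source names, rankings, and the overlap measure itself are illustrative assumptions, not the article's actual data or evaluation metric.

```python
# Hedged illustration: score each opinion source by its average
# top-N overlap with every other source's ranked similarity list.
# Top-N overlap is an assumed stand-in for the article's own metric.
from itertools import combinations

def top_n_overlap(a, b, n=10):
    """Fraction of the top-n items that two ranked lists share."""
    return len(set(a[:n]) & set(b[:n])) / n

def mean_agreement(sources, n=10):
    """Average each source's overlap against all other sources."""
    names = list(sources)
    scores = {name: [] for name in names}
    for x, y in combinations(names, 2):
        o = top_n_overlap(sources[x], sources[y], n)
        scores[x].append(o)
        scores[y].append(o)
    return {name: sum(v) / len(v) for name, v in scores.items() if v}

# Hypothetical ranked "artists similar to one query artist" per source:
rankings = {
    "survey":    ["a", "b", "c", "d"],
    "playlists": ["b", "a", "d", "e"],
    "experts":   ["a", "c", "b", "f"],
}
print(mean_agreement(rankings, n=4))  # highest score = best consensus source
```

The source whose rankings overlap most, on average, with everyone else's is the strongest candidate for a single ground truth, which is exactly the comparison posed above.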

In the process of answering these questions, we address some of the logistical difficulties peculiar to our field, such as the legal obstacles to sharing music between research sites. We believe this is one of the first and largest cross-site evaluations in MIR. Our work was conducted in three independent labs (LabROSA at Columbia, MIT, and HP Labs in Cambridge), yet by carefully specifying our evaluation metrics, and by sharing data in the form of derived features (which presents little threat to copyright holders), we were able to make fine distinctions between algorithms running at each site. We see this as a powerful paradigm that we would like to encourage other researchers to use.
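
As a concrete sketch of the derived-features idea: each site can compute compact per-frame features from its licensed audio and exchange only those feature matrices, so the copyrighted recordings never leave the site. The snippet below assumes MFCC features computed with the librosa library purely for illustration; the article's specific feature set and tools are not reproduced here.

```python
# Minimal sketch: share derived features instead of audio.
# MFCCs via librosa are an assumed choice for illustration only.
import numpy as np
import librosa

def extract_and_save_features(audio_path, out_path, n_mfcc=20):
    """Compute MFCC frames for one recording and save them for exchange.

    Only the derived feature matrix is written out, so the original
    copyrighted audio never has to leave the site that licensed it.
    """
    y, sr = librosa.load(audio_path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    np.save(out_path, mfcc.astype(np.float32))

# Usage (paths are hypothetical):
# extract_and_save_features("some_song.wav", "some_song_mfcc.npy")
```

Because such features are low-dimensional summaries from which listenable audio is not readily recoverable, they can be shared across sites with little risk to copyright holders while still supporting identical, carefully specified evaluation metrics at each site.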

Finally, a note about the terminology used in this article. To date, we have worked primarily with popular music, and our vocabulary is slanted accordingly. Unless noted otherwise, when we refer to "artists" or "musicians," we mean the performer rather than the composer (though the two are frequently the same). Also, when we refer to a "song," we mean a single recording of a performance of a piece of music, not an abstract composition, and not necessarily vocal music.

This article is organized as follows. First, we examine the concept of music similarity and review prior work. We then describe the algorithms and data sources used in this article. Next, we describe our evaluation methodologies in detail and discuss the issues raised by a multi-site evaluation. We then present our experiments and results. Finally, we offer conclusions and suggestions for future directions.

Music Similarity

The concept of similarity has been studied many times in...
