In lieu of an abstract, here is a brief excerpt of the content:

  • Melody Detection in Polyphonic Musical Signals: Exploiting Perceptual Rules, Note Salience, and Melodic Smoothness
  • Rui Pedro Paiva, Teresa Mendes, and Amílcar Cardoso

Melody extraction from polyphonic audio is a research area of increasing interest. It has a wide range of applications, including music information retrieval (MIR; particularly query-by-humming, where the user hums a tune to search a database of musical audio), automatic melody transcription, performance and expressiveness analysis, extraction of melodic descriptors for music-content metadata, and plagiarism detection, to name but a few. The area has become increasingly relevant in recent years as digital music archives continue to expand. This state of affairs presents new challenges to music librarians and service providers regarding the organization of large-scale music databases and the development of meaningful methods of interaction and retrieval.

In this article, we address the problem of melody detection in polyphonic audio following a multistage approach inspired by principles from perceptual theory and musical practice. Our system comprises three main modules: pitch detection; determination of musical notes (with precise temporal boundaries, pitches, and intensity levels); and identification of melodic notes. The main contribution of this article lies in the last module, where we propose a number of rule-based systems that attempt to extract, from the whole set of detected notes, those that convey the main melodic line. The system performs satisfactorily on a small database we collected and on the database created for the ISMIR 2004 melody extraction contest; however, its performance decreased on the MIREX 2005 database.
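To make the three-stage structure concrete, the following is a highly simplified sketch of such a multistage architecture, not the authors' actual algorithm: the FFT-peak pitch detector, the 50-cent merging tolerance, and salience computed as summed spectral magnitude are all illustrative stand-ins for the dedicated pitch-salience and note-determination methods described in the article.

```python
import numpy as np

def frame_pitches(signal, sr, frame_len=2048, hop=1024, fmin=80.0, fmax=1000.0):
    """Stage 1 (sketch): one dominant pitch candidate per frame via an FFT peak."""
    window = np.hanning(frame_len)
    pitches = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
        band = (freqs >= fmin) & (freqs <= fmax)
        k = np.argmax(spec * band)  # strongest in-band spectral bin
        pitches.append((start / sr, freqs[k], spec[k]))
    return pitches

def form_notes(pitches, cents_tol=50.0):
    """Stage 2 (sketch): merge consecutive frames of stable pitch into notes."""
    notes = []
    for t, f, sal in pitches:
        if notes and abs(1200 * np.log2(f / notes[-1]["pitch"])) < cents_tol:
            notes[-1]["end"] = t
            notes[-1]["salience"] += sal
        else:
            notes.append({"start": t, "end": t, "pitch": f, "salience": sal})
    return notes

def select_melody(notes, n=1):
    """Stage 3 (sketch): keep the most salient note(s) as the melodic line."""
    return sorted(notes, key=lambda note: -note["salience"])[:n]
```

On a monophonic test tone this skeleton behaves as expected; the interesting (and hard) cases are of course polyphonic mixtures, where stage 3 must disentangle the melody from accompaniment.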

Related Work

Previous work on the extraction of symbolic representations from musical audio has concentrated mainly on the problem of full music transcription, which requires accurate multi-pitch estimation to extract all the fundamental frequencies present (Martin 1996; Bello 2003; Klapuri 2004). However, present solutions are neither sufficiently general nor sufficiently accurate: the proposed approaches impose several constraints on the musical material, namely on the maximum number of concurrent instruments, the musical style, or the types of instruments present.

Comparatively little work has been conducted on melody detection in polyphonic audio. However, it is becoming a very active area in music information retrieval, as confirmed by the amount of work devoted to the ISMIR 2004 and MIREX 2005 evaluations. Several different approaches have been proposed in recent years (Goto 2001; Brossier, Bello, and Plumbley 2004; Eggink and Brown 2004; Marolt 2004, 2005; Paiva, Mendes, and Cardoso 2004, 2005b; Dressler 2005; Poliner and Ellis 2005; Ryynänen and Klapuri 2005; Vincent and Plumbley 2005; Gómez et al. 2006). (A few of these systems were originally published in the non-peer-reviewed online proceedings of MIREX 2005 and, to our knowledge, have not been published elsewhere.)

Most current systems, including ours, comprise a front-end for frequency analysis (e.g., the Fourier transform, autocorrelation, auditory models, multi-rate filterbanks, or Bayesian frameworks), peak picking and tracking (in the magnitude spectrum, in a summary autocorrelation function, or in a pitch probability density function), and post-processing for melody identification (primarily rule-based approaches built on perceptual rules of sound organization, musicological rules, path-finding in networks of notes, etc.). One exception is Poliner and Ellis (2005), who follow a different strategy, approaching melody detection as a classification task using Support Vector Machines.
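As an idealized illustration of the path-finding flavor of post-processing, the sketch below runs a Viterbi-style dynamic program over hypothetical per-frame pitch candidates, trading candidate salience against pitch jumps so that a smooth melodic line is favored. The candidate format, the linear jump penalty, and its weight are invented for the example and are not taken from any of the cited systems.

```python
def smooth_path(frames, jump_penalty=0.01):
    """Viterbi-style search: choose one (freq_hz, salience) candidate per frame,
    trading salience against pitch jumps as a crude melodic-smoothness rule."""
    scores = [sal for _, sal in frames[0]]
    backptrs = []
    for i in range(1, len(frames)):
        prev = frames[i - 1]
        cur_scores, cur_ptrs = [], []
        for freq, sal in frames[i]:
            # Best predecessor: high accumulated score, small pitch jump.
            costs = [scores[j] - jump_penalty * abs(freq - prev[j][0])
                     for j in range(len(prev))]
            j = max(range(len(prev)), key=costs.__getitem__)
            cur_scores.append(costs[j] + sal)
            cur_ptrs.append(j)
        scores = cur_scores
        backptrs.append(cur_ptrs)
    # Backtrack from the best final candidate.
    k = max(range(len(scores)), key=scores.__getitem__)
    path = [k]
    for ptrs in reversed(backptrs):
        k = ptrs[k]
        path.append(k)
    path.reverse()
    return [frames[i][path[i]][0] for i in range(len(frames))]
```

Even when a spurious candidate is slightly more salient in isolation, the accumulated jump penalty steers the search back to the continuous line, which is the intuition behind smoothness-based melody identification.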

Melody Definition

Before describing our system, it is important to clarify what we mean by the term melody. An important aspect of the perception of the main melodic stream in an ensemble is the phenomenon of figure-ground organization in audio. This is related to the "tendency to perceive part of . . . the auditory scene as 'tightly' organized objects or events (the figure) standing out against a diffuse, poorly organized background (the ground)" (Handel 1989, p. 551). In this respect, Leonard Meyer wrote:

"the musical field can be perceived as containing: (1) a single figure without any ground at all, as, for instance, in a piece for solo flute; (2) several figures without any ground, as in a polyphonic composition in which the several parts are clearly segregated and are equally, or almost equally, well shaped; (3) one...
