  • Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics
  • Wei-Ho Tsai, Dwight Rodgers, and Hsin-Min Wang

Music is more than just a collection of notes, beats, chords, and rhythms. It is an information-rich medium in itself, capable of conveying various concepts from the concrete to the abstract—concepts that cannot be exhaustively described and ascribed textually to every music recording. As music material encoded in digital formats rapidly grows in size and number, finding the desired music information from an enormous amount of available options can be a difficult task. This problem has consequently motivated recent research towards content-based retrieval of music information. Many techniques have been developed for extracting information of interest from music residing in a digital audio signal, such as melodic content (Akeroyd et al. 2001; Durey and Clements 2002), the instrument(s) involved (Herrera et al. 2000; Eronen 2003), genre (Tzanetakis and Cook 2002; Xu et al. 2003), and the artist or singer (Whitman et al. 2001; Kim and Whitman 2002; Liu and Huang 2002). As an independent capability or as part of a music information retrieval system, techniques to automatically organize a collection of music recordings are needed in order to lessen or replace human documentation efforts. In response to this need, this study investigates the problem of clustering undocumented music recordings based on their associated singer, which also serves as an essential initial step toward full transcription of music data.

Singer-based clustering is especially useful for recognizing cameos and "covers" in music collections that may be unlabeled or insufficiently labeled. Covers, in which a singer performs a song written or made famous by a different artist, are quite common in popular music. In music collections labeled only by song name, singer-based clustering can be used to distinguish the original and covered versions of a song and even determine the artist performing the cover. Cameos, or guest appearances in a song, are substantially less common than covers, but often occur in recordings of live concerts. Music collections that are labeled with song name and artist name may still fail to include the names of artists making cameo appearances. Again, singer-based clustering can address this problem. Lastly, the careers of many artists involve collaboration with several different bands, each with a different name. Music collections that are labeled only by band name, as opposed to singer name, must be cross-referenced with band membership data to determine singer information. Even so, because the relationship between band and singer may be many-to-many, this cross-referencing is insufficient to determine the singer of a given song, and again singer-based clustering may be applied.

Before clustering music recordings by singer, we must detect and exploit the underlying characteristics of the singers' voices. This task resembles the recently emerging research on clustering or segmentation of spoken data based on their associated speakers (Kimber et al. 1995; Jin et al. 1997). However, because the lion's share of popular music contains background accompaniment during most or all vocal passages, it is infeasible to acquire voice-only data directly for deriving the desired singer's vocal characteristics, as speaker-based clustering or segmentation generally does. In our earlier work (Tsai and Wang, in press), we proposed a statistical method that leverages approximate estimation of a piece's music background to build a reliable model for the solo voice. The method has been shown to be effective for automatic singer recognition, in which a set of singers' reference models is created off-line using pre-collected music data labeled with singer identity, and unknown music recordings are then tested by stochastic matching against the singers' reference models. In contrast to such a supervised singer-recognition problem, this study extends our statistical modeling of singer voice characteristics to operate in an unsupervised manner, which assumes that no prior information is available regarding the singers involved or the number of singers. Special efforts are also made to compare the similarity among singers' voices and to determine the total number of unique singers in a collection of popular music recordings.

Problem Formulation

Given a set of M unlabeled...
