Influences of Signal Processing, Tone Profiles, and Chord Progressions on a Model for Estimating the Musical Key from Audio
Tonality analysis is an important part of studying music, and in recent years, automatic key estimation from audio input has become a core task in music information retrieval. The primary key is often given in the title of classical music compositions, together with some information about the structure or style, such as Sonata in F or Scherzo in G, which suggests that composers consider it a significant means of classification. This article investigates the effects of three aspects common to many key-estimation algorithms: low-level digital signal processing parameters, tone-profile values, and harmonic progression through time, using a key-estimation algorithm based on a hidden Markov model (HMM).
The terms "key" and "tonality" are often used interchangeably. In this article, we use "key" to refer to a single, discrete tonal center and its associated scale, with the acknowledgment that there may be simultaneous keys present in a given piece of music. We use "tonality" to refer to the more abstract concept of the music's relationship to all possible keys, which can be modeled as a position within some kind of psychologically informed geometrical tonal space, such as Chew's Spiral Array (Chew 2001).
We begin by explaining the parameters under investigation, and then we describe the technique using HMMs on which we base our investigations. We also explain how the model has been altered to test the importance of DSP parameters, tone profile values, and harmonic progression through time. We provide details of our experiments, which test the algorithm for global key estimation on a set of Beatles songs and a set of preludes and fugues from J. S. Bach's Well-Tempered Clavier. We then present the results, which strongly support the use of musically informed tone profiles, and which also suggest that limiting the frequency range is beneficial. Large variations in the results show that investigation into the low-level parameters is worthwhile. Finally, we summarize our findings and suggest some directions for further research.
Some Important Issues in Tonality Estimation
Several aspects of automatic tonality estimation are common to many published algorithms. We investigate three of these aspects in the context of one particular tonality-estimation method that is based on our previous work (Noland and Sandler 2006, 2007).
Low-Level Digital Signal Processing
To work with music in digital-audio form rather than in a symbolic representation such as MIDI, some low-level digital signal processing (DSP) is required to transform the raw audio into a form that is useful for tonality estimation. Most audio tonality-estimation systems use a logarithmically spaced frequency representation of the signal as their foundation, because its frequency bins map directly onto the notes of the equal-tempered chromatic scale.
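The correspondence between logarithmic frequency spacing and the chromatic scale can be illustrated with a minimal sketch. It assumes a reference tuning of A4 = 440 Hz and MIDI-style note numbering, neither of which is specified in this article, and the function names are hypothetical. It is not the feature extraction used in our system; it shows only that adjacent semitones differ by a constant frequency ratio, so equal steps on a logarithmic axis correspond to equal steps in pitch.

```python
import math

# Assumed reference tuning (not taken from the article): A4 = 440 Hz, MIDI note 69.
A4_HZ = 440.0
A4_MIDI = 69
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def semitone_center_hz(midi_note: int) -> float:
    """Center frequency of an equal-tempered semitone (hypothetical helper)."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def nearest_pitch_class(freq_hz: float) -> str:
    """Name of the equal-tempered pitch class closest to freq_hz."""
    # log2 turns the constant frequency ratio between semitones into a constant step.
    midi_note = round(A4_MIDI + 12 * math.log2(freq_hz / A4_HZ))
    return NOTE_NAMES[midi_note % 12]


if __name__ == "__main__":
    for f in (261.63, 440.0, 880.0):
        print(f, nearest_pitch_class(f))
```

Because the pitch class is taken modulo 12, frequencies an octave apart (here 440 Hz and 880 Hz) fall into the same class, which is the property that makes such representations convenient as input to key estimation.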
Calculation of these features often starts with downsampling to reduce the amount of data and therefore the running time. However, it is well known that frequencies above half the sampling rate cannot be represented, so they must be removed before downsampling, which causes any high-frequency information to be lost. With real filters it is not possible to completely suppress the high frequencies, only to reduce their level, so after downsampling, there will always be small components derived from them that appear at a lower and usually inharmonic pitch. This is known as aliasing. Downsampling also results in fewer samples per second, which means that the time resolution of the signal is reduced by the same factor as the sampling rate. For a more detailed introduction to DSP, see for example McClellan, Schafer, and Yoder (1998).
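The arithmetic behind aliasing can be made concrete with a short sketch. The sampling rate of 44,100 Hz, the downsampling factor of 4, and the 7,000 Hz test component are illustrative assumptions rather than values taken from our experiments, and the function names are hypothetical. The sketch computes where a sinusoid that survives the anti-aliasing filter would reappear after downsampling; it does not perform any filtering itself.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def alias_frequency_hz(freq_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a real sinusoid sampled at sample_rate_hz.

    Frequencies above the Nyquist limit fold back into the range
    [0, sample_rate_hz / 2].
    """
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


if __name__ == "__main__":
    original_rate = 44100.0        # assumed input sampling rate
    factor = 4                     # assumed downsampling factor
    new_rate = original_rate / factor

    print("new rate:", new_rate)                 # samples per second after downsampling
    print("new Nyquist:", nyquist_hz(new_rate))  # components above this must be filtered out
    # A residual 7,000 Hz component that the filter only attenuates:
    print("alias of 7000 Hz:", alias_frequency_hz(7000.0, new_rate))
```

Under these assumptions the new sampling rate is 11,025 Hz, and a residual 7,000 Hz component reappears at 4,025 Hz, a frequency with no simple harmonic relationship to the original. The number of samples per second falls by the same factor of 4, which is the loss of time resolution described above.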
It is then necessary to divide the audio into frames, and decisions must be made regarding frame length (also called the window size) and hop size (which determines the overlap between frames). We must also decide whether to use a data-driven technique for calculating optimal frame sizes such as beat detection (Davies and Plumbley 2004) or tonal-change detection (Harte, Sandler, and Gasser...