
The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems
Christopher Ariza

Procedural or algorithmic approaches to generating music have been explored in the medium of software for over fifty years. Occasionally, researchers have attempted to evaluate the success of these generative music systems by measuring the perceived quality or style conformity of isolated musical outputs. These tests are often conducted in the form of comparisons between computer-aided output and non-computer-aided output. The model of the Turing Test (TT), Alan Turing’s proposed “Imitation Game” (Turing 1950), has been submitted and employed as a framework for these comparisons. In this context, it is assumed that if machine output sounds like, or is preferred to, human output, the machine has succeeded. The nature of this success is rarely questioned, and is often interpreted as evidence of a successful generative music system. Such listener surveys, within necessary statistical and psychological constraints, may be pooled to gauge common responses to and interpretations of music—yet these surveys are not TTs. This article argues that Turing’s well-known proposal cannot be applied to executing and evaluating listener surveys.

Whereas pre-computer generative music systems have been employed for centuries, the idea of testing the output of such systems appears to have emerged only with computer implementation. One of the earliest tests is reported in Hiller (1970, p. 92): describing the research of Havass (1964), Hiller reports that, at a conference in 1964, Havass conducted an experiment to determine if listeners could distinguish computer-generated and traditional melodies. Generative techniques derived from the fields of artificial intelligence (AI; for example, neural nets and various learning algorithms) and artificial life (e.g., genetic algorithms and cellular automata) may be associated with such tests due to explicit reference to biological systems. Yet, since only the output of the system is tested (that is, system and interface design are ignored), any generative technique can be employed. These tests may be associated with the broader historical context of human-versus-machine tests, as demonstrated in the American folk-tale of John Henry versus the steam hammer (Nelson 2006) or the more recent competition of Garry Kasparov versus Deep Blue (Hsu 2002).

Some tests attempt to avoid measures of subjective quality by measuring perceived conformity to known musical artifacts. These musical artifacts are often used to create the music being tested: they are the source of important generative parameters, data, or models. The design goals of a system provide context for these types of tests. Pearce, Meredith, and Wiggins (2002, p. 120) define four motivations for the development of generative music systems: (1) composer-designed tools for personal use, (2) tools designed for general compositional use, (3) “theories of a musical style . . . implemented as computer programs,” and (4) “cognitive theories of the processes supporting compositional expertise . . . implemented as computer programs.” Such motivational distinctions may be irrelevant if the system is used outside of the context of its creation; for this reason, system-use cases, rather than developer motivations, might offer alternative distinctions. The categories proposed by Pearce, Meredith, and Wiggins can be used to generalize about two larger use cases: systems used as creative tools for making original music (motivations 1 and 2, above), and systems that are designed to computationally model theories of musical style or cognition (motivations 3 and 4). These two larger categories will be referred to as “creative tools” and “computational models.” Although design motivation is not included in the seven descriptors of computer-aided algorithmic systems proposed in Ariza (2005), the “idiom affinity” descriptor is closely related: systems with singular idiom affinities are often computational models.

Explicitly testing the output of generative music systems is uncommon. As George Papadopoulos and Geraint Wiggins (1999, p. 113) observe, research in generative music systems demonstrates a “lack of experimental methodology.” Furthermore, “there is usually no evaluation of the output by real experts.” Similarly, Pearce, Meredith, and Wiggins (2002, p. 120), presumably describing all types of generative music systems, state that “researchers often fail to adopt suitable methodologies for the development and evaluation of composition programs and this, in turn, has compromised the practical or theoretical value of...
