Systems for real-time, high-quality, low-latency audio over the Internet that take advantage of high-speed networks are available and have been used over the last several years for distributed concerts and other musical applications (Renaud, Carôt, and Rebelo 2007). Setting up one of these distributed sessions is, however, still very difficult. Most musicians involved in such sessions have experienced the disheartening amount of rehearsal time that is lost to adjusting the connection rather than playing music.
Keeping delay to a minimum is one of the main goals when tuning network parameters. Delay is known to be disruptive in musical performance (Chafe and Gurevich 2004), so a sensible goal is to reduce it as much as possible. Often, there is a tradeoff with audio quality: under problematic network conditions, the longer the latency, the better the audio (i.e., the fewer the dropouts). For users who are not familiar with the protocols and delivery mechanisms of the Internet Protocol Suite (commonly known as TCP/IP), understanding the meaning of these protocols’ parameters can be daunting. (In particular, we use the User Datagram Protocol [UDP], which is part of the Internet Protocol Suite.)
We present here a server-based application that helps users tune these parameters intuitively using “auditory displays” (Chafe and Leistikow 2001). With it, musicians tune their network connection much as they tune their instruments: by ear. The implementation is part of the JackTrip application (Cáceres and Chafe 2009), a software program for low-latency, high-quality, and multi-channel audio streaming over TCP/IP wide-area networks (WANs). JackTrip is a peer-to-peer system that can interconnect many bidirectional nodes.
The design and architecture are geared first towards implementing this quality-of-service (QoS) evaluation method. The architecture has also been extended to provide other types of service, in particular a central “mixing hub” to control audio in a concert that involves multiple locations.
QoS Evaluation Metrics
Comer gives a good definition of QoS:
The term Quality of Service (QoS) refers to statistical performance guarantees that a network system can make regarding loss, delay, throughput, and jitter (Comer 2005, p. 510).
Most of the networks available today offer best-effort delivery, i.e., they do not provide any specific level of QoS. This infrastructure can be problematic because sound is unforgiving with regard to packet loss and jitter; any lost data is immediately audible. In evaluating a particular connection, we want to know its “instantaneous” QoS, that is, to assess its quality at any given moment. Users should be able to adjust their settings to achieve the optimal quality given the current bandwidth and congestion conditions. This adjustment should be a convenient and conscious part of setting up. The connection should also be monitored for longer-term changes: a connection that is perfectly clean at 1 AM can become congested at 9 AM. A bad connection today can be a surprisingly good one a year from now when intermediate network upgrades are put in place, or when the user asks that their service be enhanced.
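As a rough illustration of how two of these quantities, loss and jitter, could be measured for a single connection, the sketch below computes both from a log of received packets. The function name `loss_and_jitter`, the log format, and the jitter measure (mean absolute deviation of inter-arrival gaps from their nominal spacing) are assumptions made for this example; they are not taken from JackTrip.

```python
from statistics import mean


def loss_and_jitter(packets, sent_count, period_s):
    """Estimate packet loss and jitter from a receive log.

    packets:    list of (sequence_number, arrival_time_s), one entry per
                packet that arrived (assumed to contain no duplicates).
    sent_count: number of packets the sender emitted.
    period_s:   nominal time between consecutive packets at the sender.
    """
    received = sorted(packets)  # order by sequence number
    loss_ratio = 1.0 - len(received) / sent_count

    # For each pair of consecutive received packets, compare the observed
    # arrival gap with the gap implied by their sequence numbers.
    deviations = [
        abs((t1 - t0) - (s1 - s0) * period_s)
        for (s0, t0), (s1, t1) in zip(received, received[1:])
    ]
    jitter_s = mean(deviations) if deviations else 0.0
    return loss_ratio, jitter_s


# Hypothetical log: packet 3 never arrived, packet 4 arrived 1 ms late.
log = [(0, 0.000), (1, 0.003), (2, 0.006), (4, 0.013)]
loss, jitter = loss_and_jitter(log, sent_count=5, period_s=0.003)
print(f"loss = {loss:.1%}, jitter = {jitter * 1000:.3f} ms")
```

Figures like these summarize a connection after the fact; the auditory approach described in this article aims instead at letting the listener perceive such irregularities as they happen.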
A connection is presently either tuned by trial and error, or is set automatically by an adaptive mechanism that changes the data rate depending on bandwidth availability (Qiao et al. 2008). Adaptive methods are typically found in unidirectional streaming and have a disadvantage for bidirectional high-quality audio. Latency is a parameter we want to keep constant. To accommodate changing amounts of jitter, adaptive methods can arbitrarily increase and decrease the local buffering, affecting total latency in a way that is very disruptive for musical performance.
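The tradeoff behind a constant-latency setting can be sketched with a toy model: each packet must arrive before a fixed playout deadline, and any packet that misses it is heard as a dropout. The function name `late_packets` and the delay values below are hypothetical and chosen only for illustration.

```python
def late_packets(arrival_delays_s, playout_delay_s):
    """Count packets whose network delay exceeds a fixed playout delay.

    With a fixed playout delay, total latency stays constant; packets that
    arrive after their deadline cannot be played and become dropouts.
    """
    return sum(1 for delay in arrival_delays_s if delay > playout_delay_s)


# Hypothetical one-way delays (seconds) for six consecutive packets.
delays = [0.020, 0.021, 0.035, 0.020, 0.027, 0.022]

for playout in (0.025, 0.030, 0.040):
    dropped = late_packets(delays, playout)
    print(f"playout delay {playout * 1000:.0f} ms -> {dropped} late packet(s)")
```

In this model a longer playout delay absorbs more jitter at the cost of added latency, and the delay stays fixed once chosen, in contrast to the adaptive schemes described above.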
In this article, we describe an implementation of a tool that lets musicians tune a connection completely by ear. Parameters like buffer size, sampling rate, packet size, and packet redundancy, among others, can be adjusted using this “auditory display” mechanism.
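To make the interaction between these parameters concrete, the following sketch derives a few figures from them. It assumes uncompressed PCM audio, ignores protocol header overhead, and treats redundancy as a simple multiplier on the data rate; the function name `stream_figures` and its parameter names are invented for this example and do not correspond to JackTrip options.

```python
def stream_figures(sample_rate_hz, frames_per_packet, channels,
                   bits_per_sample, queue_packets, redundancy=1):
    """Derive timing and data-rate figures from basic stream parameters.

    Assumptions: uncompressed PCM payload, no header overhead, and a
    redundancy factor that multiplies the payload data rate.
    """
    packet_duration_s = frames_per_packet / sample_rate_hz
    packets_per_second = sample_rate_hz / frames_per_packet
    payload_bytes = frames_per_packet * channels * bits_per_sample // 8
    return {
        # Audio time carried by one packet.
        "packet_duration_s": packet_duration_s,
        # Latency added by a receive queue holding `queue_packets` packets.
        "buffer_latency_s": queue_packets * packet_duration_s,
        # Audio payload per packet, in bytes.
        "payload_bytes": payload_bytes,
        # Payload data rate in bits per second, including redundancy.
        "payload_bitrate_bps": payload_bytes * 8 * packets_per_second * redundancy,
    }


# Hypothetical settings: 48 kHz, 128 frames per packet, stereo, 16 bit,
# and a receive queue of 4 packets.
figures = stream_figures(48000, 128, 2, 16, queue_packets=4)
for name, value in figures.items():
    print(f"{name}: {value}")
```

Under these assumptions, halving the packet size halves the buffering latency for the same queue length but doubles the number of packets sent per second, which is one reason the parameters are usually tuned together rather than in isolation.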
“Pinging” the Network, Acoustically
The advantages of evaluating very fine-grained jitter and packet loss using these “auditory displays” have been previously discussed in the literature (Chafe and Leistikow 2001). The method consists of listening to a pitched sound in order to assess delay...