Volatility in School Test Scores: Implications for Test-Based Accountability Systems

Thomas J. Kane; Douglas Staiger

doi:10.1353/pep.2002.0010

Brookings Papers on Education Policy 2002 (2002) 235-283

By the spring of 2000, forty states had begun using student test scores to rate school performance. Twenty states have gone a step further and are attaching explicit monetary rewards or sanctions to a school's test performance. For example, California planned to spend $677 million on teacher incentives in 2001, providing bonuses of up to $25,000 to teachers in schools with the largest test score gains. We highlight an underappreciated weakness of school accountability systems--the volatility of test score measures--and explore the implications of that volatility for the design of school accountability systems.

The imprecision of test score measures arises from two sources. The first is sampling variation, which is a particularly striking problem in elementary schools. With the average elementary school containing only sixty-eight students per grade level, the amount of variation stemming from the idiosyncrasies of the particular sample of students being tested is often large relative to the total amount of variation observed between schools. The second arises from one-time factors that are not sensitive to the size of the sample; for example, a dog barking in the playground on the day of the test, a severe flu season, a disruptive student in a class, or favorable chemistry between a group of students and their teacher. Both small samples and other one-time factors can add considerable volatility to test score measures.

Initially, one might be surprised that school mean test scores would be subject to such fluctuations, because one would expect any idiosyncrasies in individual students' scores to average out. Although the averaging of students' scores does help lessen volatility, even small fluctuations in a school's score can have a large impact on a school's ranking, simply because schools' test scores do not differ dramatically in the first place. This reflects the long-standing finding from the Coleman report (Equality of Educational Opportunity, issued in 1966), that less than 16 percent of the variance in student test scores is between schools. ¹ We estimate that the confidence interval for the average fourth-grade reading or math score in a school with sixty-eight students per grade level would extend from roughly the 25th to the 75th percentile among schools of that size.
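
The arithmetic behind this kind of statement can be sketched with a back-of-envelope calculation. The snippet below is an illustration only, not the authors' estimation procedure: it assumes normally distributed scores, treats the 16 percent figure as the between-school share of variance (the text says the true share is lower), considers sampling variation alone, and places the school at the median. The function name `percentile_interval` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def percentile_interval(n_students, between_share, level=0.95):
    """Percentile range, among schools, spanned by a confidence interval
    around the mean score of a school sitting at the median.

    Scores are standardized so the student-level variance is 1.
    between_share is the assumed fraction of that variance lying
    between schools; the remainder is within-school variance.
    """
    std_normal = NormalDist()
    within_var = 1.0 - between_share
    # Sampling standard deviation of a school mean based on n_students
    se_mean = sqrt(within_var / n_students)
    # Two-sided critical value for the requested confidence level
    z = std_normal.inv_cdf(0.5 + level / 2)
    # Half-width of the interval, rescaled to between-school SD units
    half_width_schools = z * se_mean / sqrt(between_share)
    low = std_normal.cdf(-half_width_schools)
    high = std_normal.cdf(half_width_schools)
    return 100 * low, 100 * high


low, high = percentile_interval(n_students=68, between_share=0.16)
print(f"percentiles {low:.0f} to {high:.0f}")
```

Under these assumptions the interval for a sixty-eight-student grade runs from about the 29th to the 71st percentile. Lowering `between_share` below 0.16 widens it further, which is in line with the range the text reports.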

Such volatility can wreak havoc in school accountability systems. To the extent that test scores bring rewards or sanctions, school personnel are subjected to substantial risk of being punished or rewarded for results beyond their control. Moreover, to the extent such rankings are used to identify best practice in education, virtually every educational philosophy is likely to be endorsed eventually, simply adding to the confusion over the merits of different strategies of school reform. For example, when the 1998-99 Massachusetts Comprehensive Assessment System test scores were released in November of 1999, the Provincetown district showed the greatest improvement over the previous year. The Boston Globe published an extensive story describing the various ways in which Provincetown had changed educational strategies between 1998 and 1999, interviewing the high school principal and several teachers. ² As it turned out, they had changed a few policies at the school--decisions that seemed to have been validated by the improvement in performance. One had to dig a bit deeper to note that the Provincetown high school had only twenty-six students taking the test in tenth grade. Given the wide distribution of test scores among students in Massachusetts, any grouping of twenty-six students is likely to yield dramatic swings in test scores from year to year--that is, large relative to the distribution of between-school differences. In other words, if the test scores from one year are the indicator of a school's success, the Boston Globe and similar newspapers around the country will eventually write similar stories praising...
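
To see why a cohort of twenty-six students can produce large year-to-year swings, the following rough sketch computes the standard deviation of the change in a school's mean score that sampling alone would generate, expressed relative to the spread of scores between schools. It is an illustration under stated assumptions, not a calculation from the article: two independent cohorts of equal size, no true change in the school, student-level variance standardized to 1, and an assumed between-school variance share of 0.16. The function name `change_sd_in_school_sds` is hypothetical.

```python
from math import sqrt


def change_sd_in_school_sds(n_students, between_share):
    """SD of the year-to-year change in a school's mean score that is
    due to sampling alone, in units of the between-school SD.

    Assumes two independent cohorts of n_students each and no true
    change in the school's underlying performance.
    """
    within_var = 1.0 - between_share
    # Sampling variance of a single year's school mean
    var_one_year = within_var / n_students
    # The difference of two independent means has twice that variance
    sd_change = sqrt(2 * var_one_year)
    return sd_change / sqrt(between_share)


# Compare a very small cohort with larger ones
for n in (26, 68, 500):
    print(n, round(change_sd_in_school_sds(n, between_share=0.16), 2))
```

With twenty-six students, the sampling-driven change has a standard deviation of roughly 0.6 between-school standard deviations under these assumptions, so a swing large enough to move a school far up or down a ranking needs no change in practice at all.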
