
15
Designing Reliable Impact Evaluations

Larry L. Orr
Johns Hopkins University

Stephen H. Bell
Abt Associates Inc.

Jacob A. Klerman
Abt Associates Inc.

This chapter reviews the U.S. experience in evaluation of job training programs over the past 40 years, examines why it is so difficult to reliably estimate the impacts of training programs with nonexperimental methods, and discusses ways to make experimental evaluations more feasible and cost-effective. We focus exclusively on impact evaluations, studies that seek to measure the contribution of a training program to improving worker outcomes above and beyond what the same workers would have achieved without the training (known as the counterfactual). Other types of workforce-focused evaluations—such as process studies of program implementation, or participation analyses that examine program targeting—while important, are not considered here.

A major distinction in our discussion is between experimental impact evaluation methods and nonexperimental impact evaluation methods. The experimental method randomly assigns eligible applicants for a training program to two groups, a treatment group that is allowed to enter the program and a control group that is not. Unless the training improves treatment group outcomes, the subsequent outcomes of the two groups will differ only by chance. The difference in average outcomes between the treatment and control groups, tested for statistical significance (to rule out chance as the explanation of the observed difference), is the measure of program impact.

Nonexperimental impact evaluation methods also measure outcomes for a sample of training program participants, but—not having done random assignment—have no comparable control group; instead, participants' preprogram earnings or the earnings of some set of nonparticipants (called a comparison group) must be used as the counterfactual. The challenge is how to find a valid comparison group and then how to control for any remaining background differences between the treatment group and the comparison group. The obvious approach is to select the comparison group from those who were eligible for the program but chose not to enroll. However, given that they chose not to enroll, they must be different from those who chose to enroll. The alternative is to choose a comparison group from among those not eligible to enroll (e.g., from a different time period or a different geographic area, or not meeting one of the enrollment conditions). Again, whatever condition makes the comparison group ineligible to enroll will also make its members different from those who did enroll. Of course, a nonexperimental evaluation can, and would, control for observed differences between the treatment group and the comparison group, but nothing guarantees either that the only differences are in observed characteristics or that the form of the correction for those observed differences is correct. Thus, as we argue in detail below, those commissioning nonexperimental evaluations will always be left with the nagging concern that the nonexperimental methods chosen were not successful in producing accurate impact estimates.
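To fix ideas, the two approaches can be put in simple notation (the notation is ours, added for illustration; it does not appear in the chapter). Writing \bar{Y}_T and \bar{Y}_C for mean post-program earnings in the treatment and control groups, s_T^2 and s_C^2 for the corresponding sample variances, and n_T and n_C for the group sizes, the experimental impact estimate and a conventional significance test are

\[
  \hat{\Delta} = \bar{Y}_T - \bar{Y}_C,
  \qquad
  t = \frac{\hat{\Delta}}{\sqrt{s_T^2/n_T + s_C^2/n_C}} .
\]

Because random assignment makes the two groups alike in expectation, \bar{Y}_C is an unbiased estimate of the counterfactual. A nonexperimental study replaces \bar{Y}_C with the (regression- or matching-adjusted) mean outcome of a comparison group, so its estimate is unbiased only if the adjustment removes every relevant difference between trainees and the comparison group, which is precisely the assumption questioned above.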
A BRIEF OVERVIEW OF U.S. EVALUATIONS OF TRAINING PROGRAM IMPACTS

Serious evaluation of government employment and training programs began in the United States in the 1960s, with nonexperimental impact analyses of programs funded by the Manpower Development and Training Act (MDTA). To estimate training impacts, analysts needed estimates of earnings with training and estimates of the counterfactual—what earnings would have been, for the same individuals, without training. Earnings with training were observed. The challenge was to estimate earnings without training.

Some early MDTA studies took trainees' preprogram earnings as the benchmark. The impact of treatment could then be estimated as the change in earnings from before training to after training.1 This approach clearly gave estimates of program impacts that were too large, and the reason was straightforward. People generally enter job training programs when they are at a low point in their labor market trajectory—e.g., when they are unemployed. As a result, their earnings tend to rise, often quite substantially, even without training's assistance. The pre–post change measure credited this natural rebound to the employment and training intervention, giving the appearance of a program impact where there was none.

As it became clear that preprogram earnings were not a good counterfactual, MDTA analysts turned to comparison group strategies, in which training participants' counterfactual earnings were estimated using a comparison group of similar workers who did not enroll in training. As...
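The upward bias of the pre–post estimator described above can be made explicit with the same sort of notation (again ours, not the chapter's). Write \bar{Y}_{pre} and \bar{Y}_{post} for trainees' mean earnings before and after training, and \bar{Y}^0_{post} for what their post-training earnings would have been without the program; then

\[
  \hat{\Delta}_{\text{pre-post}}
  = \bar{Y}_{post} - \bar{Y}_{pre}
  = \underbrace{\left(\bar{Y}_{post} - \bar{Y}^0_{post}\right)}_{\text{true impact}}
  + \underbrace{\left(\bar{Y}^0_{post} - \bar{Y}_{pre}\right)}_{\text{natural rebound}} .
\]

Because trainees tend to enter the program at a low point in their earnings, the rebound term is positive even when the program has no effect, so the pre–post estimator overstates the true impact.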
