
Instructional Quality and Student Learning in Higher Education: Evidence from Developmental Algebra Courses
Little is known about the importance of instructional quality in American higher education because few recent studies have had access to direct measures of student learning that are comparable across sections of the same course. Using data from two developmental algebra courses at a large community college, I found that student learning varied systematically across instructors and was correlated with observed instructor characteristics including education, full-time status, and experience. Instructors appeared to have effects on student learning beyond their impact on course completion rates. A variety of robustness checks suggested that these results were not driven by nonrandom matching of students and instructors based on unobserved characteristics or by censoring of the dependent variable due to students who dropped the course before the final exam.
instructional quality, student learning, community colleges, developmental algebra
Introduction
It is well documented that student learning varies substantially across classrooms in elementary and secondary schools (Hanushek & Rivkin, 2010). Yet very little is known about the importance of instructional quality in America’s colleges and universities. If instructional quality in higher education varies significantly across classrooms in the same course at the same campus, then reforming instructor recruitment, professional development, and retention policies could have significant potential to improve student outcomes. Unlike K–12 schools, many postsecondary institutions, especially community colleges and less-selective four-year colleges, have significant staffing flexibility because they employ a large (and growing) percentage of part-time and untenured faculty, and thus are particularly well positioned to make good use of such evidence.
This article empirically examines how much student learning varies across different sections of the same courses at a large community college campus in California. This article contributes to a relatively sparse existing literature by applying modern statistical methods to student-level administrative data in order to examine the association between student performance on common assessments and instructor characteristics, both measured and unmeasured.
There is a significant related literature aimed at measuring the validity of student ratings of courses and instructors by examining the relationship between these ratings and student achievement on a common exam across multiple sections of the same course. Cohen (1981) reviews 41 such multisection validity studies, which cover 68 multisection courses and were mostly published in the 1970s. Most of these studies suffer from significant methodological limitations, such as the lack of random assignment or other means to control for differences in entering student ability. Of the 68 courses covered by the review, 69% provided no evidence of baseline equivalence of student characteristics across sections and 63% did not include any statistical controls for student ability.^{1}
Feldman (1989) reanalyzed data from Cohen’s (1981) meta-analysis, with a focus on studies that examined student ratings of specific instructional practices instead of overall ratings of instructors. Both meta-analyses find evidence of positive correlations between student ratings and student achievement, although those relationships vary significantly across studies and are subject to the methodological caveats discussed above. The production of multisection validity studies more or less stopped after the mid-1980s (Galbraith, Merrill, & Klein, 2012). Additionally, several of the studies that have been conducted more recently raise serious questions about whether student evaluations are a good predictor of actual learning (Carrell & West, 2010; Galbraith, Merrill, & Klein, 2012; Sheets, Topping, & Hoftyzer, 1995; Weinberg, Hashimoto, & Fleisher, 2010).^{2}
The present study is related to multisection validity studies in that it used a consistent measure of student achievement to compare outcomes across multiple sections of the same courses. But it was not a multisection validity study in the traditional sense because it did not seek to validate course surveys, but rather to systematically investigate the empirical importance of instructors using modern statistical methods. In this sense, the present study is most closely related to a significant body of work documenting that K–12 teachers have a significant impact on student achievement but their effectiveness is at most weakly correlated with observable characteristics such as education and general measures of ability. These studies attempted to measure the “value-added” that teachers contribute to their students’ achievement by comparing the progress that students of different teachers make on standardized tests, taking into account the demographic characteristics of the students, classrooms, and schools.
Hanushek and Rivkin (2010) summarized a number of studies in K–12 education and found that a one standard deviation increase in value-added is associated with an average increase in student achievement of 0.15 and 0.11 standard deviations in math and reading, respectively. This research was challenged on the grounds that value-added measures are biased by nonrandom sorting of students into different teachers’ classrooms (Rothstein, 2009). However, value-added measures were validated by random assignment studies (Kane & Staiger, 2008; Kane, McCaffrey, Miller, & Staiger, 2013), a strong quasi-experimental analysis (Chetty, Friedman, & Rockoff, 2014a), and long-term outcomes such as college attendance and income (Chetty, Friedman, & Rockoff, 2014b).
The goal of the present study was to add to the small number of studies that have begun to extend the value-added literature into postsecondary education using modern methods, namely student-level administrative data, as rich a set of control variables as possible, and repeated observations of the same instructors over time. The most credible study of this topic estimated instructor effects on student performance in courses at the U.S. Air Force Academy, and found that a standard deviation in professor quality corresponds to 0.05 standard deviations in student achievement (Carrell & West, 2010). Other studies include Bettinger and Long’s (2004) analysis of Ohio data, which does not include any direct measures of student learning; Hoffman and Oreopoulos’s (2009) study of course grades at a Canadian university; and Watts and Bosshardt’s (1991) study of an economics course at Purdue University in the 1980s.
In sum, there are very few recent empirical studies of the variation in student performance across different sections of the same course, and only one recent study from the U.S. that included direct measures of student learning and dealt with potential nonrandom matching of students and instructors. The dearth of evidence on postsecondary student learning is in large part the result of data limitations. The kinds of standardized measures of learning outcomes common in K–12 education are rare at the college level, so the key challenge for research in this area is to gather student-level data from courses with reasonably large enrollments that have administered the same summative assessment to students in all sections of each course for several semesters. Common final exams are uncommon in practice because they present logistical challenges to the institution (such as agreeing on the content of the exam and finding space to administer it at a common time) and run against traditions of faculty independence.
This article overcomes many of these limitations by using data from Glendale Community College in California, which has administered common final exams in two developmental algebra courses for the past decade. Developmental, or remedial, courses cover material that is below college level with the aim of preparing students to take college-level coursework. Remedial courses at community colleges form a significant slice of American higher education. Forty-four percent of U.S. undergraduates attend community colleges (American Association of Community Colleges, 2012), and 42% of students at two-year colleges take at least one remedial course (National Center for Education Statistics, 2012).
I used these data to assess both student learning and instructional quality. “Student learning” refers to student mastery of algebra (a subject most students should have been exposed to in high school), which in this article is measured using scores on common final exams, as described below. “Instructional quality” refers not to any measure of actions taken in the classroom (such as observations of class sessions), but rather to the full set of classroom interactions that affect student learning, including the ability of the instructor, the quality of instruction delivered by that instructor (including curriculum, teaching methods, etc.), and other classroom-level factors such as peer effects. I measured the quality of instruction as how well students in a given section performed on the common final exam.
My analysis of data from eight semesters that cover 281 sections of algebra taught by 76 unique instructors indicates that student learning varies systematically across instructors and is correlated with observed instructor characteristics including education, full-time status, and experience. Importantly, instructors appear to have effects on student learning beyond their impact on course completion rates. These results do not appear to be driven by nonrandom matching of students and instructors based on unobserved characteristics, but should not be regarded as definitive given the limited scope of the dataset.
Institutional Background and Data
This study took advantage of a sophisticated system of common final exams that are used in two developmental math courses, elementary and intermediate algebra, at Glendale Community College (GCC). GCC is a large, diverse campus with a college-credit enrollment of about 25,000 students.^{3} New students are placed into a math course based on their score on a math placement exam unless they have taken a math course at GCC or another accredited college or have a qualifying score on an AP math exam. The first course in the main GCC math sequence is arithmetic and pre-algebra. This article uses data from the second and third courses, elementary algebra and intermediate algebra. These courses are both offered in one- and two-semester versions, and the same common final exams are used at the end of the one-semester version and the second semester of the two-semester version. Students must pass elementary algebra with a C or better in order to take intermediate algebra, and must achieve a C or better in intermediate algebra in order to begin taking college-level math classes.^{4}
The algebra common final system has existed in its current form for about five years. The exams are developed for each course (each semester) by a coordinator, who receives suggestions from instructors. However, instructors do not see the exam until shortly before it is administered. In order to mitigate cheating, two forms of the same exam are used and instructors do not proctor the exams of their own students. Instructors are responsible for grading a randomly selected set of exams using right/wrong grading of the questions, which are all open-ended (i.e., not multiple-choice). The questions on the common final largely fall into one of the following categories of tasks: graphing equations, simplifying expressions, and solving an equation or set of equations for one or more unknowns.^{5} Instructors do maintain some control over the evaluation of their students in that they can regrade their own students’ final exams using whatever method they see fit (such as awarding partial credit), but only the grade based on right/wrong grading (and thus consistent across instructors) is available in my data extract.^{6}
My data extract includes the number of items correct (usually out of 25 questions) for each student who took the final exam in the eight semesters from Spring 2008 through Fall 2011. The common final exam data are linked to administrative data on students and instructors obtained from GCC’s institutional research office. The administrative data contain 14,220 observations of 8,654 unique students. Background data on students include their math placement level (upon entry to the college), race/ethnicity, gender, receipt status of a Board of Governors (BOG) fee waiver (a proxy for financial need), birth year and month, units (credits) completed, units attempted, and cumulative GPA (with the latter three variables measured as of the beginning of the semester in which the student is taking the math course). The administrative records also indicate the student’s grade in the algebra course and the days and times the student’s section met.
The student records are linked to data on instructors using an anonymous instructor identifier. The instructor data, which cover 76 unique instructors of 281 sections over eight semesters, include education level (master’s, doctorate, or unknown), full-time status, birth year and month, gender, ethnicity, years of experience teaching at GCC, and years of experience teaching the indicated course (with both experience variables top-coded at 12 years).
Table 1 shows summary statistics for students and instructors by algebra course (statistics for instructors are weighted by student enrollment). Each record in the administrative data is effectively a course attempt, and 23% of the records are for students who dropped the course in the first two weeks of the semester. Of these students, about one-fifth enrolled in a different section of the same course in the same semester. Given the significant fall-off in enrollment early in the semester, Table 1 shows summary statistics for both the original population of students enrolled in the course and the subgroup remaining enrolled after the early-drop deadline. However, excluding these students does not qualitatively alter the pattern of summary statistics, so I focus my discussion on the statistics based on all students.
Glendale students are a diverse group. About one-third are Armenian, and roughly the same share are Latino, with the remaining students a mix of other groups including no more than 10% white (non-Armenian) students. Close to 60% are female, about half are enrolled full-time (at least 12 units), two-thirds received a BOG waiver of their enrollment fees, and the average student is 24 years old. The typical student had completed 27 units as of the beginning of the semester, and the 90% who had previously completed at least one grade-bearing course at Glendale had an average cumulative GPA of approximately a C+. Student characteristics are fairly similar in elementary and intermediate algebra, except that intermediate students are less likely to be Latino, have modestly higher grades and more credits completed, and (unsurprisingly) higher math placement levels.
The typical instructor in my data was a part-time employee with a master’s degree who taught a section of 52–55 students that dropped to 41–42 students by two weeks into the semester. Only 10–14% had doctoral degrees, and terminal degree was unknown for 20%. Full-time instructors taught 16–19% of students, and the average instructor had 6–7 years of experience teaching at Glendale Community College, with 4–5 of those years teaching the algebra course.^{7}
Student success rates, in terms of the traditional metrics of course pass rates, were similar in the two algebra courses, as shown in Table 2. Just under 80% made it past the two-week early-drop deadline, 58% completed the course (i.e., did not drop early or withdraw after the early-drop deadline), just over half took the final exam, just under half earned a passing grade, and 36–38% earned a grade of C or better (which is needed to be eligible to take the next course in the sequence or receive transfer credit from another institution). Among students who did not drop early in the semester, close to two-thirds took the final, most of whom passed the course (although a significant number did not earn a C or better).
The typical student who took the final exam answered 38% of the questions correctly in elementary algebra and 32% in the intermediate course. The distribution of scores (number correct out of 25) is shown in Figure 1 for the semesters in which a 25-question exam was used. Students achieved a wide range of scores, but few received very high scores. In order to facilitate comparison of scores across both courses and semesters, I standardized percent correct by test (elementary or intermediate) and semester to have a mean of zero and standard deviation of one. I associated a student’s final exam score only with the records corresponding to their successful attempt at completing the course; I did not associate it with records corresponding to sections that they switched out of.
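The standardization step described above can be illustrated with a short sketch. It assumes a hypothetical record layout of (course, semester, percent correct) tuples and uses the population standard deviation; neither detail is specified in the article, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_group(records):
    """Return z-scores of percent correct within each (course, semester) group.

    `records` is a list of (course, semester, percent_correct) tuples.
    This layout is a hypothetical stand-in for the exam data.
    """
    groups = defaultdict(list)
    for course, semester, pct in records:
        groups[(course, semester)].append(pct)

    # Mean and (population) standard deviation of each course-by-semester exam.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    z_scores = []
    for course, semester, pct in records:
        mu, sigma = stats[(course, semester)]
        # Guard against a group in which every student has the same score.
        z_scores.append((pct - mu) / sigma if sigma > 0 else 0.0)
    return z_scores


# Illustrative (made-up) percent-correct values.
records = [
    ("elementary", "2008S", 20.0),
    ("elementary", "2008S", 40.0),
    ("elementary", "2008S", 60.0),
    ("intermediate", "2008S", 10.0),
    ("intermediate", "2008S", 30.0),
]
z = standardize_by_group(records)
```

Within each course-by-semester group the resulting scores have mean zero and standard deviation one, so a score of 0.5 means the same thing (half a standard deviation above that exam's average) regardless of how hard a particular exam was.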
The common final exams used at GCC are developed locally, not by professional psychometricians, and thus do not come with technical reports indicating their test-retest reliability, predictive validity, etc. However, I am able to validate the elementary algebra test by estimating its predictive power vis-à-vis performance in intermediate algebra. Table A1 shows the relationship between student performance in beginning algebra, as measured by final grade and exam score, and outcomes in intermediate algebra. The final grade and exam score are fairly strongly correlated (r = 0.79), so the multivariate results should be interpreted with some caution. Table A1 indicates that a one-standard-deviation increase in elementary algebra final exam score is correlated with an increase in the probability of taking intermediate algebra of 13 percentage points, an increase in the probability of passing with a C or better of 17 percentage points, and an increase in the intermediate exam score of 0.57 standard deviations. The latter two of these three correlations are still sizeable and statistically significant after controlling for the letter grade received in elementary algebra.
Methodology
I estimated the relationship between student outcomes in elementary and intermediate algebra and the characteristics of their instructors using regression models of the general form:
Y_{ijct} = α + β T_{jt} + δ X_{it} + γ_{ct} + ε_{ijct}
where Y_{ijct} is the outcome of student i of instructor j in course c in term t, α is a constant, T_{jt} is a vector of instructor characteristics, X_{it} is a set of student control variables, β and δ are the corresponding coefficient vectors, γ_{ct} is a set of course-by-term fixed effects, and ε_{ijct} is a zero-mean error term. Standard errors were adjusted for clustering by instructor, as that is the level at which most of the instructor characteristics vary. All models were estimated via ordinary least squares (OLS), but qualitatively similar results are obtained using probit models for binary dependent variables.
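To make the role of the course-by-term fixed effects concrete, the following is a minimal sketch of the within (demeaning) estimator for a single instructor characteristic. It is a deliberate simplification of the model described above: it omits the student controls and the instructor-clustered standard errors, and the function name, group labels, and data are all hypothetical.

```python
from collections import defaultdict


def within_estimator(y, x, group):
    """OLS slope of y on x after removing group (course-by-term) means.

    With one regressor and a full set of group dummies, this is
    numerically the same slope as OLS with the dummies included.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # sum of y, sum of x, count
    for yi, xi, g in zip(y, x, group):
        s = sums[g]
        s[0] += yi
        s[1] += xi
        s[2] += 1

    num = 0.0
    den = 0.0
    for yi, xi, g in zip(y, x, group):
        sum_y, sum_x, n = sums[g]
        y_dm = yi - sum_y / n  # outcome relative to its course-by-term mean
        x_dm = xi - sum_x / n  # characteristic relative to its group mean
        num += x_dm * y_dm
        den += x_dm * x_dm
    if den == 0:
        raise ValueError("x does not vary within any group")
    return num / den


# Made-up data: score = 0.25 * full_time + a course-by-term effect, no noise.
group = ["elem-F08", "elem-F08", "elem-F08", "int-F08", "int-F08", "int-F08"]
full_time = [1, 0, 0, 1, 1, 0]
effect = {"elem-F08": 0.5, "int-F08": -0.3}
score = [0.25 * ft + effect[g] for ft, g in zip(full_time, group)]
beta_hat = within_estimator(score, full_time, group)
```

Because the group means absorb any difference in exam difficulty across courses and terms, the recovered slope reflects only comparisons among students who took the same course in the same term.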
The instructor characteristics included in the model were education (highest degree earned), full-time status, and years of experience teaching at GCC. I also included dummy variables identifying instructors with missing data, but only report the coefficients on those variables if there were a nontrivial number of instructors with missing data on a given variable. Student controls, which are included in some but not all models, included race/ethnicity, indicator for receipt of a BOG waiver, age, full-time status, cumulative GPA at the start of the term (set to zero when missing, with these observations identified by a dummy variable), units completed at the start of the term, and math placement level. The course-by-term effects captured differences in the difficulty of the test across terms and algebra levels (elementary and intermediate), as well as any unobserved differences between students in the same algebra level but different courses (i.e., the one- vs. two-semester version).
I also estimated models that replaced the instructor characteristics with instructor-specific dummies. These models were estimated separately by semester and included course dummies as well as student control variables. Consequently, the estimated coefficients on the instructor dummies indicated the average outcomes of the students of a given instructor in a given semester compared to similar students that took the same course in the same semester with a different instructor. I also created instructor-level averages of the instructor-by-term estimates that adjusted for sampling variability using the Bayesian shrinkage method described by Kane, Rockoff, and Staiger (2007). This adjustment shrinks noisier estimates of instructor effects (e.g., those based on smaller numbers of students) toward the mean for all instructors.
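The intuition behind the shrinkage adjustment can be sketched in a few lines. This is a generic empirical Bayes calculation offered as an illustration, not the exact procedure of Kane, Rockoff, and Staiger (2007); the function name, the variance decomposition used, and the numbers are assumptions.

```python
from statistics import mean, pvariance


def shrink_estimates(estimates, std_errors):
    """Shrink noisy instructor effects toward the grand mean.

    Each estimate is weighted by its reliability, defined here as
    signal variance / (signal variance + that estimate's sampling variance).
    """
    grand_mean = mean(estimates)
    # Variance of true effects, approximated as the total variance of the
    # estimates minus the average sampling (noise) variance.
    noise_var = mean([se ** 2 for se in std_errors])
    signal_var = max(pvariance(estimates) - noise_var, 0.0)

    shrunk = []
    for est, se in zip(estimates, std_errors):
        denom = signal_var + se ** 2
        reliability = signal_var / denom if denom > 0 else 0.0
        shrunk.append(grand_mean + reliability * (est - grand_mean))
    return shrunk


# Made-up estimates: instructors 3 and 4 taught fewer students,
# so their effects are measured with larger standard errors.
estimates = [0.40, -0.40, 0.40, -0.40]
std_errors = [0.10, 0.10, 0.30, 0.30]
shrunk = shrink_estimates(estimates, std_errors)
```

In this example the two precisely estimated effects stay close to their raw values, while the two noisy ones are pulled substantially toward zero, which is the behavior the text describes.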
The primary outcomes examined in this article are: whether the student took the final exam, whether the student passed the course with a grade of C or better (needed to progress to the next course in the math sequence), and the student’s standardized score on the final exam. The estimates thus indicate the correlation between instructor characteristics (or the identity of individual instructors, in the case of the fixed effects models) and student outcomes, conditional on any control variables included in these models. These estimates cannot be interpreted as the causal effect of being taught by an instructor with certain characteristics (or a specific instructor) if student assignment to sections is related to unobserved student characteristics that influence achievement in the course. For example, if highly motivated students on average try to register for a section with full-time (rather than part-time) instructors, then the estimate of the difference between full- and part-time instructors will be biased upwards.
As discussed above, the fact that students in K–12 education are not randomly assigned to classrooms is well documented, but there is also evidence that “value-added” models that take into account students’ achievement prior to entering teachers’ classrooms can produce teacher effect estimates that are not significantly different from those obtained by randomly assigning students and teachers. The challenges to the identification of causal effects related to the nonrandom matching of students and instructors may be more acute in postsecondary education for at least two reasons. First, the prior-year test scores that serve as a proxy for student ability and other unmeasured characteristics in much research on K–12 education are not usually available in higher education. This study was able to partly overcome this concern using a relatively rich set of control variables that included cumulative GPA at the beginning of the semester. Additionally, I was able to estimate results for intermediate algebra that condition on the elementary algebra final exam score (for students who took both courses at GCC during the period covered by my data).
Second, college students often select into classrooms, perhaps based on the perceived quality of the instructor (as opposed to being assigned to a classroom by a school administrator, as is the case in most elementary and secondary schools). At GCC, students are assigned a registration time when they can sign up for classes, and certain populations of students receive priority, including former foster children, veterans, and disabled students. Discussions with administrators at GCC indicate that students sometimes select sections based on instructor ratings on the “Rate my Professor” web site (GCC does not have a formal course evaluation system), but, anecdotally, this behavior has decreased since the use of common final exams has increased consistency in grading standards. An approximate test for the extent to which nonrandom matching of students to instructors affects the estimates reported below is to compare results with and without control variables. The fact that they are generally similar suggests that sorting may not be a significant problem in this context.
To the extent that students do nonrandomly sort into classrooms, they may have stronger preferences for classes that meet at certain days/times than they do for specific instructors. However, the descriptive statistics disaggregated by course meeting time shown in Table A2 indicate that any sorting that occurs along these lines is not strongly related to most student characteristics. A few unsurprising patterns appear, such as the proclivity of part-time and older students to enroll in sections that meet in the evening. Table A2 includes a summary measure of student characteristics: the student’s predicted score on the final exam based on their characteristics (estimated using data from the same course for all semesters except for the one in which the student is enrolled). This metric indicates that students enrolled in sections with later start times have somewhat more favorable characteristics in terms of their prospective exam performance, but not dramatically so.
I also used these predicted scores to examine whether instructors were systematically matched to students with favorable characteristics. Specifically, I aggregated the predicted scores to the instructor-by-term level and calculated the correlation between the average predicted scores of an instructor’s students in a given term and in the previous term. The correlation, while nonzero, was relatively weak (r = 0.20). Excluding students who drop the course early in the semester (or switch to another section), another source of sorting, further reduced the correlation to r = 0.11. In the results section below I show that these correlations were much weaker than the term-to-term correlation in instructors’ estimated effects on students’ actual exam scores.
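The term-to-term correlation described above can be sketched as follows. The sketch assumes the predicted scores have already been averaged to the instructor-by-term level and that terms are indexed by consecutive integers; the dictionary layout, function names, and values are hypothetical.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / sqrt(var_a * var_b)


def term_to_term_pairs(avg_by_instructor_term):
    """Pair each instructor-term average with the same instructor's prior term.

    `avg_by_instructor_term` maps (instructor_id, term_index) to the mean
    predicted score of that instructor's students in that term.
    """
    previous, current = [], []
    for (instructor, term), value in sorted(avg_by_instructor_term.items()):
        lagged = avg_by_instructor_term.get((instructor, term - 1))
        if lagged is not None:  # skip an instructor's first observed term
            previous.append(lagged)
            current.append(value)
    return previous, current


# Made-up instructor-by-term averages of predicted scores.
averages = {
    ("A", 1): 0.10, ("A", 2): 0.20, ("A", 3): 0.15,
    ("B", 1): -0.05, ("B", 2): -0.10,
    ("C", 2): 0.00, ("C", 3): 0.05,
}
previous, current = term_to_term_pairs(averages)
r = pearson_r(previous, current)
```

A correlation near zero would suggest that an instructor who draws students with favorable characteristics in one term is no more likely to do so in the next, which is the sense in which weak correlations argue against persistent sorting.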
An additional complication in the analysis of learning outcomes in postsecondary education is the censoring of final exam data created by students dropping the course or not taking the final.^{8} In the algebra courses at GCC, 47% of students enrolled at the beginning of the semester do not take the final exam. Students who do not take the final exam have predicted scores (based on their characteristics) 0.18 standard deviations below students who do take the exam. The censoring of the exam data will bias the results to the extent that more effective instructors are able to encourage students to persist through the end of the course. If the marginal students perform below average, then the average final exam score of the instructor’s students will understate her true contribution.
Data aggregated to the instructor-by-term level indicated that the share of students that took the final was negatively correlated with the average exam score: more students taking the final means a lower score, on average. But the correlation is quite weak (r = –0.15), suggesting that much of the variation in section-level performance on the exam is unrelated to attrition from the course. This would be the case if, for example, dropout decisions often have nonacademic causes such as unexpected financial or family issues.
I also addressed the issue of missing final exam data for course dropouts below by estimating models that impute missing final exam scores in two different ways. First, I made the most pessimistic assumption possible by imputing missing scores as the minimum score of all students in the relevant course and term. Second, I made the most optimistic assumption possible by using the predicted score based on student characteristics. This assumption is optimistic because the prediction was based on students who completed the course, whereas the dropouts obviously did not and thus were unlikely to achieve scores as high as those of students with similar characteristics who completed the course. Below I show that the general pattern of results is robust to using the actual and imputed scores, although of course the point estimates are affected by imputing outcome data for roughly half the sample.
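The two imputation rules described above can be written out as a short sketch. The dictionary keys (`course_term`, `score`, `predicted`) and the example values are hypothetical, and the sketch assumes every course-term has at least one observed score.

```python
def impute_scores(students):
    """Return (pessimistic, optimistic) score lists with missing values filled.

    Each student is a dict with keys:
      course_term - label of the course-by-term cell
      score       - standardized exam score, or None if the exam was not taken
      predicted   - score predicted from student characteristics
    """
    # Lowest observed score in each course-by-term cell.
    minimum = {}
    for s in students:
        if s["score"] is not None:
            key = s["course_term"]
            minimum[key] = min(minimum.get(key, s["score"]), s["score"])

    pessimistic, optimistic = [], []
    for s in students:
        if s["score"] is not None:
            # Observed scores are left unchanged under both assumptions.
            pessimistic.append(s["score"])
            optimistic.append(s["score"])
        else:
            # Pessimistic: the worst observed score in the same course and term.
            pessimistic.append(minimum[s["course_term"]])
            # Optimistic: the score predicted for a completer with the same traits.
            optimistic.append(s["predicted"])
    return pessimistic, optimistic


# Made-up standardized scores; the third student did not take the final.
students = [
    {"course_term": "elem-F10", "score": -1.2, "predicted": -0.4},
    {"course_term": "elem-F10", "score": 0.8, "predicted": 0.3},
    {"course_term": "elem-F10", "score": None, "predicted": -0.1},
]
low, high = impute_scores(students)
```

Estimating the same model on both filled-in outcomes brackets the result that would be obtained if the missing students' true scores lay anywhere between the two assumptions.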
Results
I began with a simple analysis of variance that produces estimates of the share of variation in student outcomes in algebra that are explained by various combinations of instructor and student characteristics. I estimated regression models of three outcomes (taking the final exam, passing with a grade of C or better, and final exam score) and report the adjusted R-squared value in Figure 2.^{9} The baseline model included only term-by-course effects, which explain very little of the variation in student outcomes. Adding instructor characteristics (education, full-time status, and experience teaching at GCC) increased the share of variance explained by a small amount, ranging from about 0.5% for final taking and successful completion rates to about 1% for final exam scores.
Replacing instructor characteristics with instructor fixed effects had a more noticeable effect on the share of variance explained, increasing it by 1.9–2.5% for the course completion outcomes and by almost 8% for final exam scores. Adding student controls to the model, which themselves explain 10–18% of the variation in outcomes, did not alter the pattern of results without controls: instructor education, full-time status, and experience explained much less variation in outcomes than instructor fixed effects.
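For readers less familiar with the statistic, the adjusted R-squared penalizes the ordinary R-squared for the number of regressors, which matters when comparing a model with a handful of instructor characteristics to one with dozens of instructor dummies. The sketch below uses the standard textbook formula; the function name and the toy numbers are illustrative only.

```python
def adjusted_r_squared(y, fitted, n_regressors):
    """Adjusted R-squared given outcomes, fitted values, and regressor count.

    `n_regressors` excludes the constant term.
    """
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total
    r2 = 1 - ss_res / ss_tot
    # Degrees-of-freedom correction for the number of regressors.
    return 1 - (1 - r2) * (n - 1) / (n - n_regressors - 1)


# Made-up outcomes and fitted values from a one-regressor model.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
adj = adjusted_r_squared(y, fitted, n_regressors=1)
```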
The estimated relationships between instructor characteristics and student outcomes in pooled data for elementary and intermediate algebra are reported in Table 3.^{10} Education was the only variable that was statistically significantly related to the rates at which students take the final exam and successfully complete the course (with a C or better): the students of instructors with doctoral degrees were 5–7 percentage points less likely to experience these positive outcomes as compared to the students of instructors with master’s degrees. Students of full-time instructors were 3–4 percentage points more likely to take the final and earn a C or better than students of part-time instructors, but these coefficients are not statistically significant. The point estimates for instructor experience did not follow a consistent pattern.
Instructor characteristics were more consistent predictors of student performance on the final exam. Having an instructor with a doctoral degree, as compared to a master’s degree, was associated with exam scores that were 0.15–0.17 standard deviations lower (although only the result with student controls was statistically significant and only at the 10% level). The students of full-time instructors scored 0.21–0.25 standard deviations higher than their counterparts in the classrooms of part-time instructors. Returns to instructor experience at GCC were not consistently monotonic, but suggested a large difference between first-time and returning instructors of about 0.20 standard deviations. This result could reflect both returns to experience as well as differential attrition, in which higher quality instructors are more likely to remain at GCC.^{11}
The coefficient estimates were not substantially altered by the addition of student-level control variables, suggesting that students did not sort into sections in ways that were systematically related to both their academic performance and the instructor characteristics examined in Table 3. Given the dearth of data on student learning in postsecondary education, the estimated coefficients on the control variables, reported in Table A3, are interesting in their own right. Results that were consistent across all three outcomes included higher performance by Asian students and lower performance by black and Latino students (all compared to white/Anglo students), and better outcomes for older students and for women. One of the strongest predictors of outcomes is cumulative GPA at the start of the term, with an increase of one GPA point (on a four-point scale) associated with an increase in final exam score of 0.39 standard deviations. Students who are new to GCC (about 10% of students), as proxied by their missing a cumulative GPA, outperform returning students by large margins.^{12}
Tables 4 and 5 report the results of three robustness checks. First, I included controls for the time of day that the course met to account for any unobserved student characteristics associated with their scheduling preferences. Adding this control left the results largely unchanged (second column of Table 4). Second, using imputed likely minimum and maximum scores for students who did not take the final exam had a larger impact on the point estimates, as would be expected from roughly doubling the sample, but the general pattern of results was unchanged (last two columns of Table 4). The experience results were the most sensitive to this change.
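The logic of the second robustness check can be illustrated with a minimal Python sketch. It is not the code used in the study: the data are made up, and the imputation rule (filling every non-taker with the lowest or the highest observed score) is an assumption standing in for the "likely minimum and maximum scores" described above. The function names `impute` and `gap_under_imputation` are hypothetical.

```python
from statistics import mean

def impute(scores, fill):
    """Replace missing scores (None) with the given fill value."""
    return [fill if s is None else s for s in scores]

def gap_under_imputation(group_a, group_b, fill):
    """Mean score gap (group_a minus group_b) after imputing non-takers."""
    return mean(impute(group_a, fill)) - mean(impute(group_b, fill))

# Hypothetical standardized exam scores; None marks a student who
# did not take the final exam.
full_time = [0.5, 1.0, None, 0.0]
part_time = [-0.5, None, None, 0.5]

# Assumed imputation rule: lowest and highest observed scores.
observed = [s for s in full_time + part_time if s is not None]
low_fill, high_fill = min(observed), max(observed)

gap_low = gap_under_imputation(full_time, part_time, low_fill)
gap_high = gap_under_imputation(full_time, part_time, high_fill)
print(gap_low, gap_high)
```

If the estimated gap keeps the same sign under both scenarios, censoring from students who skipped the final is less likely to explain the pattern, which is the spirit of the check reported in the last two columns of Table 4.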
Finally, I estimated a “value-added” type model for intermediate algebra scores only, where elementary algebra scores were used as a control variable. The advantage of this model was that the elementary algebra score was likely to be the best predictor of performance in intermediate algebra, but this came at the cost of only being able to use data from one of the two courses, and only for students who completed the lower-level course at GCC during the period covered by my data. Consequently, the results were much noisier than the main estimates, and Table 5 indicates that simply restricting the sample to students with elementary algebra scores available in the data changed the results somewhat, especially the estimated returns to experience. However, adding the elementary algebra score as a control left the pattern of results largely unchanged. In this analysis, the most robust finding was the substantial difference in student performance between instructors with doctoral and master’s degrees (in favor of the latter).
The outcomes examined thus far were all measured during the term that the student was in the instructor’s class. It could be the case that instructors work to maximize performance on the final exam at the expense of skills that have longer-term payoffs, as in Carrell and West’s (2009) study of the U.S. Air Force Academy (although in their case the short-term outcome was course evaluations). In Table A4, I show the estimated relationship between the characteristics of elementary algebra instructors and three outcomes of their students after taking the course: whether they take intermediate algebra, whether they pass with a C or better, and their score on the intermediate algebra exam.^{13}
The results were imprecisely estimated given the reduced sample size. The point estimates for full-time instructors were generally positive, but usually not large enough to be statistically significant. The results for experience indicate, for the final exam only, that students of first-year instructors fare better in the follow-on course than those of veteran instructors, but this result was fairly sensitive to the inclusion of control variables and is based on only 27% of the students who took elementary algebra. In sum, the results in Table A4 do not bolster the results based on immediate learning outcomes, but they are not convincing enough to undermine them either.
The analysis of variance indicated that instructor fixed effects explain a much greater share of the variation in learning outcomes than the handful of instructor characteristics available in the administrative data. As explained in the methodology section above, I estimated instructor-by-section effects separately by term and included the same student-level controls used in the analysis of instructor characteristics. The standard deviations of the estimated instructor effects on taking the final exam and the final exam score are shown in Table 6. The 267 estimated instructor-by-term effects have a standard deviation of 0.11 for taking the final (i.e., 11 percentage points) and 0.37 for the exam score (i.e., 0.37 student-level standard deviations). The correlation between the two is –0.10, similar to the correlation for the unadjusted data discussed above.
Part of the variability in the instructor-by-term effect estimates resulted from sampling variation, especially with the relatively small numbers of students enrolled in individual classrooms (and the even smaller number who take the final). This variability tended to average out over multiple terms. The second row of Table 6 shows that averaging all available data for the 76 instructors in the data produced a standard deviation of instructor-level effects of 0.09 for taking the final and 0.31 for the exam score. Shrinking these estimates to take into account the signal-to-noise ratio further reduced the standard deviations to 0.05 and 0.21, respectively. These estimates are also robust to using a random effects specification (estimated using hierarchical linear models), as shown by the last two rows of Table 6.
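The shrinkage step can be illustrated with a minimal Python sketch of a standard empirical Bayes adjustment. This is an assumption about the general technique, not a reproduction of the study's exact procedure: each raw estimate is pulled toward the grand mean in proportion to its estimated reliability, defined here as signal variance divided by signal-plus-noise variance. The function name `shrink_effects` and the example numbers are hypothetical.

```python
from statistics import mean, pvariance, pstdev

def shrink_effects(estimates, sampling_vars):
    """Shrink raw effect estimates toward their mean.

    estimates:     raw instructor effect estimates
    sampling_vars: squared standard error of each estimate
    """
    grand_mean = mean(estimates)
    # Signal variance: total variance of the raw estimates minus the
    # average sampling (noise) variance, floored at zero.
    signal_var = max(pvariance(estimates) - mean(sampling_vars), 0.0)
    shrunk = []
    for est, noise_var in zip(estimates, sampling_vars):
        denom = signal_var + noise_var
        reliability = signal_var / denom if denom > 0 else 0.0
        shrunk.append(grand_mean + reliability * (est - grand_mean))
    return shrunk

# Hypothetical raw effects (in student-level standard deviations)
# and their sampling variances.
raw = [0.4, -0.4, 0.2, -0.2]
noise = [0.05, 0.05, 0.05, 0.05]
adjusted = shrink_effects(raw, noise)
print(pstdev(raw), pstdev(adjusted))
```

Because noisier estimates are pulled harder toward the mean, the spread of the shrunken effects is smaller than the spread of the raw effects, which matches the direction of the change reported in Table 6.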
The fact that the standard deviations of the estimated instructor effects remain substantial after adjusting for sampling variability in the fixed effects specifications is due in part to the relatively strong correlation between the estimated effects of the same instructor over time. Figure 3 shows the relationship between the effect estimate for each instructor and the estimate for the same instructor in the prior term that she taught, which have a correlation of r = 0.56.
The relative stability of instructor effects over time suggests that they are capturing something persistent about the quality of instruction being delivered. As a further check on these results, I estimated the relationship between student performance in algebra and the effectiveness of the instructor measured using data from all semesters other than the current one.^{14} This means that idiosyncratic variation in student performance specific to a student’s section will not be included in the estimated instructor effect. Table 7 shows that the instructor effect was a powerful predictor of student performance on the final exam, but not of the likelihood that the student would take the final or pass the course. An increase in the estimated instructor effect of one standard deviation (measured in student scores) is associated with a 0.95-standard-deviation increase in student scores. Given the standard error, I cannot reject the null hypothesis of a one-for-one relationship.
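The leave-out measure of instructor effectiveness can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's code: the average over other terms is unweighted, the term labels are made up, and the function name `leave_one_term_out` is hypothetical.

```python
def leave_one_term_out(effects_by_term):
    """For one instructor, map each term to the mean of that
    instructor's estimated effects in all OTHER terms.

    effects_by_term: dict of term label -> estimated effect.
    Returns None for a term when no other terms are available.
    """
    n = len(effects_by_term)
    total = sum(effects_by_term.values())
    result = {}
    for term, effect in effects_by_term.items():
        if n > 1:
            # Drop the current term, then average the remainder.
            result[term] = (total - effect) / (n - 1)
        else:
            result[term] = None
    return result

# Hypothetical effects for one instructor across three terms.
effects = {"fall_1": 0.3, "spring_1": 0.1, "fall_2": 0.2}
print(leave_one_term_out(effects))
```

A student enrolled in a given term is then matched to the value for that term, so the predictor never uses exam results from the student's own section.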
Table 7 also shows the relationship between the estimated effects of elementary instructors and student outcomes in the follow-on course (intermediate algebra). Students who had an elementary instructor with a larger estimated effect were no more likely to take intermediate algebra, but were more likely to complete the course successfully. These students are also predicted to score higher on the intermediate algebra common final, but this relationship was imprecisely estimated and its statistical significance was not robust to excluding the eight percent of test-takers who had the same instructor in both elementary and intermediate algebra.
Limitations
The results reported in this study suggest that individual instructor effects are potentially more important for student learning in developmental algebra courses than observed instructor characteristics. It is not surprising that variation in the quality of instruction related to both observed and unobserved characteristics is greater than the variation explained by observed characteristics on their own, but these findings should still be interpreted cautiously given that they are based on data from a relatively modest number of instructors (76) in a handful of courses at a single institution.
This limitation highlights a significant challenge of conducting research on student learning in higher education. Unlike in K–12 education, where it is possible to obtain administrative data on standardized test performance for an entire district or state, research efforts in higher education will usually be limited to a single course or handful of courses that use common assessments. This means that the total number of instructors will likely not be very large, and that the results will be based on a very specific context—no more than a handful of courses on a single campus.
This study examined data from developmental algebra courses on a community college campus in California that enrolls a unique mix of students (including a large Armenian population). Clearly this limits the extent to which the results can be extrapolated to other contexts, such as higher-level courses, campuses in other states, four-year campuses, and courses in other subjects, just to name a few potentially relevant dimensions. A college would have to make a systematic effort to encourage the use of common finals across a wide variety of courses, and incorporate the resulting data into its administrative records, for a more systematic single study to be possible.
Another limitation of this study is that the measure of instructional quality, performance on common assessments, does not capture instructor actions. Gathering any objective information on student learning that is consistent across different instructors is an important first step toward better understanding the role of instructional quality. Whether the quality of instruction can be measured using various non-test-based methods, such as student surveys and observations by faculty and administrators, is a ripe subject for future research. Some of the multisection validity studies discussed above indicate that there is a positive relationship between student course ratings and student achievement, and there is also some research from K–12 education documenting a relationship between certain types of student survey questions and performance on standardized tests (MET Project, 2012). Expanding this line of research to include modern methods and raters other than students has clear potential value.
Any single measure of student learning or instructional quality is going to be limited, both by the inherent limitations of the measure itself (e.g., the properties of a test) and by associated methodological challenges (e.g., missing values, data censoring, selection bias issues, etc.). Efforts to incorporate these measures into policy and practice should acknowledge these limitations and avoid the dangers of focusing too much on any single measure.
Conclusion
The limitations discussed above make clear that the results of this study should not be assumed to hold across a wider array of courses and institutions. At the same time, the findings raise interesting questions about teaching and learning in higher education. The finding that full-time instructors outperform part-time instructors, on average, suggests that providing full-time employment (and the associated compensation and benefits) either attracts better employees, retains them for longer, or enhances quality by enabling them to focus on their teaching responsibilities at a single institution rather than cobbling together work across multiple institutions.
The finding that instructors with master’s degrees are more effective in the classroom, on average, than instructors with doctorates is more difficult to explain. One speculative possibility is that PhD-level instructors are more interested in teaching upper-level math courses, or students with strong interest and ability in math, which is likely very different from teaching remedial algebra courses. Another possibility is that instructors with master’s degrees are more likely to have sought employment at a community college, whereas those with PhDs may have preferred other options (such as a more research-focused institution). Descriptive empirical work on the labor market for postsecondary instructors is a ripe area for future research.
More generally, despite the limitations of any analysis of data from a small number of courses, this article exemplifies the kind of work that can be done with data on student learning that are comparable across sections of a course taught by different instructors, data that are rarely available in American higher education. Importantly, it shows that examining only course completion rates can miss important variation in student learning. Students who complete a course vary widely in their mastery of the material, which influences their likelihood of success in follow-on courses. The absence of random assignment of students and teachers is likely to be a challenge in all research on this subject, although the GCC data offer preliminary evidence that this is not as important a concern as it is in other contexts.
The study does not examine instructional practices, and thus cannot offer guidance to individual instructors on how to improve their craft. But the fact that meaningful differences in the quality of instruction exist suggests that faculty and administrators might be able to improve student outcomes by evaluating instructors and providing targeted feedback based on data such as performance on common assessments. The new generation of K–12 teacher evaluation systems, which combine information on student growth on standardized tests with observation-based feedback from administrators and master teachers (Whitehurst, Chingos, & Lindquist, 2014), might serve as a starting point for similar efforts in higher education. The observation component can serve both to provide actionable feedback to instructors and to make possible research that links specific teacher behaviors to student learning.
Matthew M. Chingos is a Senior Fellow at the Urban Institute; mchingos@urban.org.
Notes
1. Cohen (1981) found no evidence of a statistically significant correlation between these study design features and the instructor/achievement correlation. However, standard errors are not reported and given the sample size (68) it seems unlikely that qualitatively meaningful relationships could be ruled out with a reasonable degree of confidence.
2. There are also studies that use student evaluations as a measure of the quality of teaching. For example, Umbach and Wawrzynski (2005) examined the correlation between student engagement (based on student surveys) and faculty self-reported practices at the institutional level. However, this line of work is of limited relevance given that the validity of survey-based measures of the quality of instruction is still a contested subject.
3. About GCC, http://glendale.edu/index.aspx?page=2
4. “Glendale Community College Math Sequence Chart,” April 2012, http://www.glendale.edu/Modules/ShowDocument.aspx?documentid=16187
5. Sample exams are available at http://www.glendale.edu/index.aspx?page=699
6. The grade assigned by the student’s instructor is the one that is factored into students’ course grades, but anecdotal evidence indicates that Glendale administrators used the results of the common final exam to discourage grade inflation by instructors. Conversations with administrators also indicated that the common final results were not used as part of any formal evaluation system, but were sometimes used to spot potential problems (other than grade inflation) with individual instructors, especially with part-time instructors.
7. Full-time instructors were more likely to have a doctoral degree than part-time instructors, but only by a margin of 25% vs. 10% (not shown in Table 1). In other words, the majority of full-time instructors did not have doctoral degrees.
8. For an earlier discussion of this issue, see Sheets and Topping (2000).
9. I obtained qualitatively similar results using unadjusted r-squared.
10. Results disaggregated by algebra level are available from the author upon request. These results were less precisely estimated and although the coefficients generally pointed in the same direction as the pooled results, some of the patterns observed in the pooled results were stronger in one course than in the other.
11. In separate models (not shown) I replaced instructor experience at GCC with experience teaching the specific course and did not find any consistent evidence of returns to this measure of experience.
12. The difference between new and returning students may reflect selection into taking math courses early vs. late in the college career, with students who took math courses early being more confident in their math ability and having taken math more recently (i.e. in high school).
13. The intermediate algebra taking and completing variables were defined for all students, whereas the final exam was only defined for students who took the final at some point in the period covered by my data. Additionally, I misclassified as nontakers (and noncompleters) students who took intermediate algebra during the summer or in a self-paced version (both of which were not included in my data extract).
14. Specifically, I averaged the instructor-by-term effects for all terms except for the one during which the student was enrolled.