The Case for Summative and Formative Use of Student Ratings of Teachers
by Jeff Koon, Ph.D., with the advice and consultation of Ann Hobbie, TEWG Member
How do we measure the full range of teacher contributions to student engagement/connection? Among the many important points in the case for student evaluations as the primary measure of student engagement/connection are the following:
1) Student evaluations are highly reliable statistically, because they involve 25-35 raters in each class. For the class as a whole, high “reliability” indicates that the students’ mean ratings accurately convey the class’s perceptions of their experiences on each item rated.
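The statistical point here can be illustrated with a short sketch. The standard error of a class-mean rating shrinks with the square root of the number of raters, which is why 25-35 raters yield a stable mean. The rating standard deviation of 0.8 points used below is an illustrative assumption, not a figure from the paper:

```python
import math

def sem(sd: float, n: int) -> float:
    """Standard error of a class-mean rating: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Assumed, illustrative spread of individual ratings on a 1-5 scale.
RATING_SD = 0.8

for n in (5, 25, 35):
    # With 25-35 raters the mean is pinned down to roughly +/- 0.14-0.16 points.
    print(n, round(sem(RATING_SD, n), 3))
```

With only 5 raters the standard error is more than twice as large as with 25-35, which is the intuition behind the reliability claim above.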
2) The student raters observe the teacher daily, sometimes for as many as 175 days. Contrast this with the amount of observation done by a trained evaluator or principal, whose far more limited observations risk being unrepresentative of a teacher’s work. The same problem arises if a teacher can prepare especially well for the class/hour during which a principal or trained evaluator is to observe or videotape the session. Further, the students are our primary stakeholders. If we truly respect our students, we ought to give credit to their systematically solicited input.
3) Student evaluations have been shown to be valid vis-à-vis measured achievement gains/value-added in math and reading, and may be better than trained observers’ evaluations of teachers at predicting this growth (both are findings from the MET Project, at grades 4 and 8). The validity of student ratings at grade levels K-3 is not as well researched or reported. In a recent e-mail, Mr. Ramsdell, the presenter of Tripod to TEWG, described their K-2 instrument’s responses as “reliable and valid,” but noted that they were also undertaking more research. The K-2 surveys were said to cover content similar to that of the Tripod surveys for elementary and secondary. Thus the K-2 surveys, too, are likely to have at least some validity vis-à-vis value-added achievement growth.
4) Student ratings can broadly encompass the full range of their own engagement/connection, including both what is seen and unseen. Thus student ratings are direct, whereas principals/evaluators/teachers have to infer and estimate teacher effects on student engagement/connection. Student ratings are also adaptable across disciplines, grade levels, the variety of possible teaching assignments, etc.
5) Student ratings provide important and specific feedback for teachers. Even when part of a summative assessment, most item ratings can be designed to be extremely useful as feedback for teachers, as a formative tool. Students can be asked to rate the effectiveness for them of the most important aspects of instructional delivery* as well as some outcomes (see #7). Each rating informs the teacher of how well his/her teaching is working relative to his/her comparison group (e.g., regular elementary classrooms at the same grade level, or within a disciplinary area in one of the two types of secondaries), so that teachers find out about their relative strengths and weaknesses as experienced by the students—exactly the information the teacher most needs to improve education for the students. Thus, more than anything else, student evaluations of teachers will lead to the desired improvements in teaching.
6) Student ratings dovetail perfectly with and extend the potential of the newer mentor support systems for teachers. Teachers receiving below-average results on their student ratings can be invited to ask for a mentor’s (or principal’s) help in interpreting them, which can then lead to conversation about how to remedy the problems identified and/or to requests for additional mentor observation/services.
7) Student ratings, unlike any other evaluative tool, are very important for recognizing the true range of outcomes/value delivered by teachers. For example, students can be asked about the extent of their learning (which they don’t limit to measured achievement gains), engagement in critical thinking in class discussions, whether their writing or speaking ability has improved, a teacher’s effects on their motivation (to work at learning, persist in school), etc.** Even when no item specifically addresses such outcomes, an “overall teaching effectiveness” rating captures them. For example, among beginning college students randomly assigned to introductory psychology classes, Koon and Murray (Journal of Higher Education, Vol. 66, No. 1, 1995, pp. 61-81) found that students’ ratings of “overall teacher effectiveness” were significantly related to motivational effects above and beyond those directly associated with achievement. Students whose classes had rated the introductory-course teacher higher in overall effectiveness had higher levels of achievement and subsequently took more courses in the field.
8) There are not many other good measurement tools for assessing student engagement/connection for individual teachers. Most of the 21 sets of measures of student engagement that were made available to the Subcommittee had a school-wide focus, and most of those were themselves based on surveys of students. Apart from one quick 3-item measure for teacher ratings of their own students, instruments other than student surveys that are designed to assess individual teacher contributions to student engagement/connection have very limited coverage of the topic, typically require trained observers, and are often very time-intensive and/or expensive.
9) By reliably informing teachers of what is and isn’t working in their instructional delivery, and about the extent of several kinds of student gains as students perceive and experience them, student feedback surveys have considerable potential to help close the achievement gap. One of the main reasons for the recent emphasis on student engagement/connection is the inability of almost all schooling to substantially reduce achievement gaps. Student evaluations of teachers have the potential to create a much more accurate picture for teachers of their instructional effectiveness and provide more specific feedback for better engaging learners. We know that if students are more engaged they learn more.
10) If enough courses are evaluated, or enough years of data are included, student evaluations can enable analyses by demographic categories, providing teachers with information about their relative success with various subgroups, such as ethnicity and gender.
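The disaggregation described above amounts to grouping survey responses by subgroup and comparing means. A minimal sketch follows; the subgroup labels and ratings are hypothetical, invented purely for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey records: (subgroup label, overall-effectiveness rating).
responses = [
    ("girls", 4.5), ("girls", 4.0), ("boys", 3.5),
    ("boys", 4.0), ("girls", 5.0), ("boys", 3.0),
]

# Collect ratings under each subgroup label.
by_group = defaultdict(list)
for group, rating in responses:
    by_group[group].append(rating)

# Mean rating per subgroup, rounded for reporting.
subgroup_means = {g: round(mean(rs), 2) for g, rs in by_group.items()}
print(subgroup_means)  # -> {'girls': 4.5, 'boys': 3.5}
```

In practice such comparisons would only be reported once each subgroup has enough raters (across courses or years) for the means to be reliable, per point #1.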
11) Good administrative procedures can virtually eliminate the possibility of overt abuse and collusion by students. For example, survey administration procedures would bar conversation between students while anyone is still responding to the survey. We need honest feedback from each student as an individual, and survey item writers know how to design for this. One of the great dreads of teachers seems to be the fear of unfair ratings by students who are thoroughly disconnected from the outset of the year, who act up in class, etc. Even when this occurs, the mathematics involved indicates that the presence of one or two such students will not make much difference. This footnote shows why.***
12) Given the many valuable potential contributions of student ratings described above, it is clear that student evaluations should be included among the measures that are formally weighted into the summative evaluation of teacher effectiveness. If student ratings are not assigned any weight in the evaluation of teachers, they will tend to be greatly under-utilized and shunted aside even if part of a recommended mandate, not only because they don’t really count (just as subjects other than math and reading tended to suffer under NCLB) but also because they represent “unnecessary” costs to a district.
13) Whenever “high stakes” decisions are involved, an appeals system that encompasses student ratings as well as every other part of the teacher evaluation process should be part of the whole package.
14) Quality surveys need not be costly. Based on decades of research on college student evaluations of teachers—research which has involved numerous student rating forms—it seems quite safe to conclude that it is not necessary to use an established system such as Tripod to obtain reliable and valid student ratings of engagement/connection. MDE could do quite well in developing its own student surveys (with follow-up research). Such surveys would have a common core of items for the elementaries, and for the secondaries, yet also some variation to accommodate different disciplines and assignments (e.g., elementary specialist, ELL, music); MDE would necessarily consult with the relevant teacher groups in developing them. These surveys would then be owned by Minnesota, avoiding the year-to-year cost of buying a system such as Tripod, which would have much higher costs per class. A good program of student evaluations won’t be cheap, but considering its value, and the vast amounts of money to be spent on peer mentors, trained evaluators, and ever-changing achievement tests, it is a bargain.
*Aspects of instructional delivery include such things as: adequacy of teacher explanations; the extent to which the teacher seems prepared; teacher’s enthusiasm for the subject; extent of teacher encouragement of in-depth discussion, or questions; teacher openness-availability-willingness to meet; timeliness and adequacy of feedback on work submitted; clarity of expectations; whether tests and grading seem fair; extent of challenge; extent of linkages to related subject matters; inclusiveness with respect to ethnicity (and gender); the extent to which the teacher seems respectful and/or caring toward the students; whether too much class time is wasted due to disruptive behavior and side conversations; how much the homework assignments contribute to student learning; etc.
**Per Sandy Christenson, Prof. of Educational Psychology, University of Minnesota, and perhaps the chief consultant to the Legislature on the matter of student engagement-connection, enhancing student motivation, in its many facets, is the key area of contribution by individual teachers. Her paper includes 11 bullets identifying a wide range of such contributions.
***Suppose a teacher tends to get “overall effectiveness” ratings of about 4.30 on a 1-5 scale. Suppose further that a 4.50 rating, based on comparisons with teacher ratings in the same disciplinary area, represents a very good teacher, bordering on excellent, and that the teacher’s class size is 30. Now suppose that 14 raters assign a “4” and 14 assign a “5” (producing a mean of 4.50 among those 28 raters), but that two “1” ratings are also given by two students who simply dislike school and everything to do with it. What happens to that teacher’s mean rating as a result? It becomes 4.27 (128/30). Not a big change; comparatively speaking, this teacher’s ratings remain between good and very good. (Note further that Special Education classes would need to be compared among themselves, not with other classes. See also #13.)
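The arithmetic in this footnote can be verified in a few lines, using exactly the numbers given above:

```python
from statistics import mean

# 28 engaged raters: 14 ratings of 4 and 14 ratings of 5.
ratings = [4] * 14 + [5] * 14
mean_without = round(mean(ratings), 2)   # mean of the 28 engaged raters

# Add the two "1" ratings from the disaffected students.
ratings += [1, 1]
mean_with = round(mean(ratings), 2)      # mean of all 30 raters

print(mean_without, mean_with)  # -> 4.5 4.27
```

Two hostile ratings out of 30 move the mean by less than a quarter of a point, which is the footnote's point about the limited leverage of one or two disconnected students.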