
Rigor mortis

The word rigor comes up a lot in teacher-evaluation systems. It’s akin to motherhood, apple pie and the American flag. What policymaker is going to take a stand against rigor? But the term is getting distorted almost beyond recognition.

In science, a rigorous study is one in which the scientific claims are supported by the evidence. Scientific rigor is primarily determined by the study’s design and data-analysis methods. It has nothing to do with the substance of the scientific claims. A study that concludes that an educational program or intervention is ineffective, for example, is not inherently more rigorous than one that concludes that a program works.

In the current discourse on teacher-evaluation systems, however, an evaluation system is deemed rigorous based either on how much of the evaluation rests on direct measures of student-learning outcomes, or the distribution of teachers into the various rating categories, or both. If an evaluation system relies heavily on NCLB-style state standardized tests in reading and mathematics—say, 40 percent of the overall evaluation or more—its proponents are likely to describe it as rigorous. Similarly, if an evaluation system has four performance categories—e.g., ineffective, developing, effective and highly effective—a system that classifies very few teachers as highly effective and many teachers as ineffective may be labeled rigorous.

In these instances, the word rigor obscures the subjectivity involved in the final composite rating assigned to teachers. The fraction of the overall evaluation based on student-learning outcomes is wholly a matter of judgment; and if you believe, as I do, that a teacher’s responsibility for advancing student learning extends well beyond the content that appears on standardized tests, you could conceivably argue that increasing the weight given to standardized tests in teacher evaluations makes these evaluations less rigorous. This is, however, a hard sell in the absence of other concrete measures of student-learning outcomes that could supplement the standardized-test results.

Even more importantly, describing a teacher-evaluation system as rigorous hides the fact that the criteria for assigning teachers to performance categories—either for subcomponents or for the overall composite evaluation—are arbitrary. There’s no scientific basis for saying, as New York has, that of the 20 points out of 100 allocated for student “growth” on New York’s state tests, a teacher needs to receive 18 to be rated “highly effective,” or that a teacher receiving 3 to 8 points will be classified as “developing.” In fact, the cut-off separating “developing” from “effective” changed last week as a result of an agreement reached between the New York State Education Department and the state teachers’ union—not because of science, mind you, but because of politics.

And it’s politics, and politics alone, that accounts for the fact that the rules for the overall composite evaluation say that any teacher who scores 0 to 64 points will be classified as ineffective, and that the two subcomponents for student “growth” and local assessments, each of which counts for 20 points, classify teachers who score 0 to 2 points on each component as ineffective. This means, as New York principal Carol Burris and others have pointed out, that if a teacher is classified as ineffective on both of these subcomponents, that teacher is automatically rated ineffective overall, even if that teacher is rated highly effective on the 60 points allocated for measures of a teacher’s professional practices: the most such a teacher can earn is 2 + 2 + 60 = 64 points, which still falls in the ineffective range. It certainly seems odd that two components accounting for 40 percent of a teacher’s overall rating can trump the remaining 60 percent, but this isn’t science, it’s politics.

Other states face the same challenge in assigning teachers’ value-added scores or student growth percentile scores to performance categories, and most of them have punted, issuing regulations that defer these difficult decisions until later. Illinois says that it’s “working diligently” on this. Georgia claims that its model will be identified soon. Michigan is counting on a rating system to be developed by the Governor’s Council on Educator Effectiveness. After a year of debate, Delaware concluded that it couldn’t figure out how to use students’ scores on the state assessment system in teachers’ summative ratings for the 2011-12 school year, and deferred implementation until the future.

It violates a basic principle of fairness for teachers to be held accountable for performance criteria that aren’t clearly specified in advance and that may be unattainable. These states, and many others, have their work cut out for them.

Nowhere is this more evident than with the mapping of teachers’ value-added or student growth percentile scores onto the ratings composing a teacher’s summative evaluation. The value-added or student growth percentile scores are measured with errors that can be substantial, especially when they are based on a single year’s worth of student achievement data. But the scoring bands for ratings categories such as “developing” or “effective” have strict cut-offs. What to do?

One way of reclaiming the concept of rigor in teacher-evaluation systems is to assign ratings that take into account the uncertainty or errors in the measures. This is consistent with a scientific conception of rigor: the assignment of teachers to rating categories should be consistent with the quality of the evidence for doing so. A teacher shouldn’t be assigned a rating of “ineffective” based on a value-added score, for example, if there’s a substantial probability that the teacher’s true rating is “developing.”

So here’s a challenge, and a proposal. The challenge is to state education policymakers across the country who have hitched their teacher-evaluation systems to measures that seek to isolate teachers’ contributions to their students’ learning: Develop clear and consistent guidelines for assigning teachers to rating categories that take into account the inherent uncertainty and errors in the value-added measures and their variants.

And here’s the proposal: A teacher should be assigned to the lower of two adjacent rating categories only if there is at least 90 percent confidence that the teacher is not in the higher category. Operationally, this involves a statistical test based on a cut score, a teacher’s score and the error associated with that score.

Suppose, for example, that the cut-off separating “ineffective” and “developing” is a teacher being in the 10th percentile across the state on a value-added or student growth percentile measure. Teacher A’s percentile rating is the eighth percentile, but the standard error for her rating is two percentile points. Given the uncertainty in the rating, and assuming the error is roughly normally distributed, there is a 16 percent probability that Teacher A’s true percentile rating is greater than the 10th percentile, and an 84 percent probability that her true percentile is lower than the 10th percentile. Because 84 percent falls short of the 90 percent threshold, under my proposal Teacher A should be classified as developing, not ineffective.

Conversely, Teacher B’s percentile rating is the fourth percentile, and the standard error for her rating is three percentile points. Given the uncertainty in the rating, there is only a 2 percent probability that Teacher B’s true percentile value is above 10, and a 98 percent probability that her true percentile rating is lower than the 10th percentile. Teacher B would therefore be classified as ineffective.
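
For readers who want to check the arithmetic, here is a minimal sketch of how such a test might be computed. It assumes the error in a teacher’s percentile score is roughly normal, which is the assumption that reproduces the 16 percent and 2 percent figures for Teachers A and B. The function names and the `confidence` parameter are illustrative only; they are not drawn from any state’s actual system.

```python
from statistics import NormalDist


def prob_true_score_below(score, std_error, cut):
    """Probability that the teacher's true score lies below the cut score.

    Assumption (not from any state's rules): measurement error is normal,
    centered on the observed score, with the reported standard error.
    """
    return NormalDist(mu=score, sigma=std_error).cdf(cut)


def assign_category(score, std_error, cut, lower, higher, confidence=0.90):
    """Assign the lower category only if we are at least `confidence` sure
    the true score is below the cut; otherwise assign the higher category."""
    if prob_true_score_below(score, std_error, cut) >= confidence:
        return lower
    return higher


# Teacher A: observed 8th percentile, standard error 2, cut score 10.
print(prob_true_score_below(8, 2, 10))   # about 0.84
print(assign_category(8, 2, 10, "ineffective", "developing"))   # developing

# Teacher B: observed 4th percentile, standard error 3, cut score 10.
print(prob_true_score_below(4, 3, 10))   # about 0.98
print(assign_category(4, 3, 10, "ineffective", "developing"))   # ineffective
```

The 90 percent default is the same arbitrary choice discussed in the text; passing a different `confidence` value shows how sensitive a given teacher’s classification is to that choice.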

Other approaches are certainly viable; the 90 percent confidence rating is arbitrary, but one that seems sensible to me. In most educational, social and medical research, a common standard is to trust an observed effect only if that effect could be observed by chance under 5 percent of the time, relative to the hypothesis that there’s no true effect in the population. The 90 percent standard I’m proposing is slightly more lenient. And of course this approach doesn’t address the arbitrariness in the New York scheme described above.

If policymakers aren’t willing to take measurement error into account in a defensible way in teacher-evaluation systems, don’t talk to me about rigor—rigor is dead.


Posted on February 20, 2012

Comments & Trackbacks (7)

Sally Kilgore

Excellent piece, Aaron. Am sharing with some involved in the very issues you write about!!!

Fred Smith

Thank you for this contribution, Aaron.

As you pointed out, various schemes have been/are being devised using test scores to evaluate teachers. On the surface the numbers provide a false sense that objectivity, quantitative reasoning and scientific precision are guiding the rating process. Peek behind the curtain and find the elaborate grading systems rest on arbitrary decisions born of numerology and political calculation.

But the politicians are clever. Just as they have misused flimsy data to serve their purposes, so too have they corrupted language and seized upon slogans to mislead.

The words “accountability,” “transparency,” “standards” and “value-added” come to mind. And who can argue with No Child Left Behind, Children First and Race to the Top? As you point out, add “rigor” to the list. Did someone say Orwell?

Let me just extend how “rigor” has been hijacked. Since the impossibly high 2009 New York State test results were released, the Board of Regents has conspicuously directed its State Education Department to administer more rigorous tests.

Sure enough, in 2010 and 2011 the statewide and citywide percentages of students deemed to be proficient in reading and math fell precipitously. These reversals were attributed to tougher exams when, in fact, the decreases reflected nothing more than an increase in the cutoff scores needed to reach the threshold of proficiency. The same decision could have been made in 2009 (an election year).

A closer examination of the data reveals that the items on the test hadn’t gotten any tougher. Raising the cutoff scores was sold to the media as a return to higher standards, but it’s hard to reconcile this claim in light of the easier items.

The scariest part is that the test instruments, their results and the way they are being weighted (without regard to measurement error) are the foundation upon which students, teachers and schools are being judged. Given how high the stakes have become, we cannot afford to have testing systems that are so compromised.

Lucy Scott

Great article. It should be forwarded to Colorado Department of Education, since this is the new eval system they want to implement for teachers.

Dave F. Brown

These “arbitrary and objective” methods of evaluating teachers are just that: pseudoscience, if not voodoo science. The education profession has been evaluating teachers based on the research of effective pedagogy for the past 20 years at least. As an educational researcher who has been disseminating accumulated data for 20 years, I can say that we have made great strides in determining what effective teachers do: the actual behaviors they demonstrate that reflect effective pedagogy, the science of teaching. These are not secrets to educational professionals; they are guidelines that educators and teacher evaluators have followed for decades, improved frequently through new cognitive research findings. No educator, principal or teacher, should EVER be evaluated based on students’ test scores. No researcher has ever been able to isolate the variables that affect a child’s or adolescent’s test scores; variables that include prenatal care, preschool learning opportunities, local environmental pollutants (from lead to poor air quality to poor water quality), parental support for learning, socioeconomic status, students’ attitudes and efforts, and community support for education from the tax support to the daily homework support. There is NOT a learning achievement gap created by teachers; there are funding gaps, community job support gaps, community resources gaps, parental responsibility gaps, and politicians’ tax support gaps. Let’s stop blaming teachers for the gaps created by local, state, and federal lack of support for changing the poor living conditions that exist in many rural and urban communities. Let’s face it: education is NOT a priority for our legislators who want to save tax dollars. As long as the U.S. continues to be in about 10th place among industrialized nations in economic support for education, the alleged “achievement gap” will continue and no effective pedagogy will fix the problems.
These federal guidelines and new state policies for evaluating teachers are absurd and unnecessary to the education professionals who know what effective teaching looks like. See the book, WHY AMERICA’S PUBLIC SCHOOLS ARE THE BEST PLACE FOR KIDS at http://www.rowman.com for the rest of the reasons educators don’t need federal, state, or Bill Gates’s suggestions for evaluating teachers!

Multiple Measured Madness « Opine I will

[...] the issue of using value-added scores in teacher evaluations is found in the Hechinger Report blog post of Columbia University sociologist Aaron Pallas. Pallas sensibly suggests that where value-added [...]

[...] The article is available. [...]

