Friday, March 25, 2011

Performance-based assessments: A brave, new world

Performance assessments. Performance-based assessments. Authentic assessments. Constructed response, open-ended, performance tasks. Our field has devised many terms to describe assessments in which examinees demonstrate some type of performance or create some type of product. Whatever you call them, performance-based assessments (PBAs) have a long history in educational measurement with cycles of ups and downs. And once again, PBAs are currently in vogue. Why?

To address the federal government’s requirements for assessment systems that represent “the full performance continuum,” the two consortia formed in response to Race to the Top funding have both publicized assessment plans that involve a heavy dose of performance-based tasks. Thus, PBAs are relevant to any discussion about the future of testing in America.

The old arguments in favor of PBAs are still appealing to today’s educators, parents, and policy-makers. Proponents claim these types of tests are more motivating to students. They provide a model for what teachers should be teaching and students should be learning. They serve as professional development opportunities for teachers involved in developing and scoring them. They constitute complex, extended performances that allow for evaluation of both process and product. Moreover, performance-based tasks provide more direct measures of student abilities than multiple-choice items. They are able to assess students’ knowledge and skills at deeper levels than traditional assessment approaches and are better suited to measuring certain skill types, such as writing and critical thinking. They are more meaningful because they are closer to criterion performances. (Or so the story goes; to be fair, these are all claims requiring empirical validation.)

Despite their recent renaissance, PBAs have well-known limitations: lower reliability and generalizability than selected-response items, primarily because of differences in efficiency between the two task types (one hour of testing time buys far fewer performance-based tasks than multiple-choice items). These limitations also arise because PBAs are frequently scored by humans, a process that introduces rater error. In exchange for greater depth of content coverage, PBAs compromise breadth of coverage. Generalizability studies of PBAs have found that significant proportions of measurement error are attributable to task sampling, manifested in both person-by-task interactions and person-by-task-by-occasion interactions (in designs incorporating occasion). Again, this is largely because any given test can include far fewer performance-based tasks.
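
The task-sampling problem can be made concrete with a back-of-the-envelope calculation. The sketch below uses a simple person-by-task generalizability design; the variance components are invented purely for illustration, but the shape of the relationship holds: with only a handful of tasks, the person-by-task interaction swamps the universe-score variance.

```python
# Hypothetical illustration of how task sampling limits reliability in a
# person-by-task (p x t) generalizability design. The variance components
# below are invented for this sketch, not drawn from any real G-study.

def g_coefficient(var_person, var_pt_residual, n_tasks):
    """Generalizability coefficient for relative decisions:
    E(rho^2) = var_p / (var_p + var_pt / n_t)."""
    return var_person / (var_person + var_pt_residual / n_tasks)

var_p = 0.30   # person (universe-score) variance -- hypothetical
var_pt = 0.70  # person-by-task interaction + residual -- hypothetical

# An hour of testing might buy ~40 multiple-choice items
# but only ~4 extended performance tasks.
for n_tasks in (4, 10, 40):
    print(n_tasks, round(g_coefficient(var_p, var_pt, n_tasks), 2))
```

With these invented components, four tasks yield a coefficient in the low .60s, while forty items push it above .90: the breadth-versus-depth trade-off in numerical form.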

PBAs are used in a variety of contexts, including summative, high-stakes contexts, such as certification and licensure, as well as employment and educational selection. PBAs are also used for formative or instructional purposes. When well-designed PBAs are administered and scored in the classroom, they can provide information for diagnosing student misconceptions, evaluating the success of instruction, and planning for differentiation because of their instructional alignment and because student performances offer windows into students’ thinking processes. In high-stakes contexts, strict standardization of task development, administration, and scoring is critical for promoting comparability, reliability, and generalizability. In classroom assessment contexts, such rigid standardization may be relaxed. Clearly, what makes a particular PBA useful for one context will make it less so for the other. For example, standardization of task development, administration, and scoring (which is impractical in classroom settings anyway) moves assessment further from instruction and makes it less amenable to organic adjustment by the teacher to meet student needs. In turn, the unstandardized procedures typically favored in classroom settings—extended administration time, student choice of tasks, student collaboration—introduce construct-irrelevant variance and diminish the comparability of tasks that is necessary for high-stakes contexts.

Bottom line: PBAs are here for the foreseeable future. The measurement community needs to revise its expectations about the reliability of individual assessment components. PBAs will prove less reliable than traditional assessment approaches! However, we should move forward with the expectation that this compromise in reliability buys greater construct validity for skills not easily assessed using traditional approaches. In addition, I suggest we focus on the reliability, comparability, and generalizability of scores and decisions emanating from assessment systems that incorporate multiple measures representing multiple assessment contexts taken over multiple testing occasions (e.g., through-course assessments).
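
One reason the system-level focus is reassuring: even modestly reliable components can combine into an acceptably reliable composite. The sketch below applies Mosier's classical formula for the reliability of an unweighted composite; all of the inputs (component reliabilities, variances, covariances) are hypothetical numbers chosen for illustration.

```python
# Hypothetical illustration: reliability of a composite score formed from
# multiple measures (e.g., a PBA plus traditional components), using
# Mosier's (1943) formula for an unweighted sum. All inputs are invented.

def composite_reliability(reliabilities, variances, covariance_sum):
    """r_c = 1 - sum(s_i^2 * (1 - r_i)) / var(composite), where
    var(composite) = sum(s_i^2) + 2 * (sum of pairwise covariances)."""
    error_var = sum(s * (1 - r) for r, s in zip(reliabilities, variances))
    total_var = sum(variances) + 2 * covariance_sum
    return 1 - error_var / total_var

# Three components with reliabilities .60, .70, .80, unit variances,
# and pairwise covariances of .50 (three pairs -> covariance_sum = 1.5).
print(round(composite_reliability([0.6, 0.7, 0.8], [1.0, 1.0, 1.0], 1.5), 2))
```

With these invented numbers the composite reaches .85 even though no single component exceeds .80, because positively correlated components pool true-score variance while their errors remain independent.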

Moreover, there are ways of making PBAs (even those administered in the classroom) more reliable and comparable. Although pre-service teacher training in measurement and assessment is notoriously weak (e.g., see Rick Stiggins’ work), teachers can be taught to design assessments according to measurement principles. For example, carefully crafted test specifications can go a long way toward creating comparable tasks. Although the measurement field has traditionally avoided classroom assessment, I suggest we consider participating in collaborative initiatives to create curricula with psychometrically sound, embedded PBAs (e.g., see Shavelson and company’s SEAL project).

Doing this well will require new assessment development models that incorporate close collaboration between curriculum designers and assessment developers to ensure tight alignment and seamless integration of assessment and instruction. Such models will also require closer collaboration between the content specialists who write the tasks and the psychometricians charged with collecting evidence to support overall assessment validity and reliability. Finally, such embedded assessments will need to be piloted—not only to investigate task performance, but also to obtain feedback from teachers regarding assessment functionality and usefulness.

It’s a brave, new world of assessment. To truly advance and sustain these developments, we need to start thinking in brave, new ways. Such an approach will help ensure that the current wave of performance assessment has more staying power than the last.

Emily Lai
Associate Research Scientist
Test, Measurement & Research Services
Assessment & Information

Monday, March 14, 2011

Teacher Effectiveness Measures: The Tortoise and the Hare

Several recent initiatives have fueled a firestorm of debate around measuring the effectiveness of our teachers. In the competition for billions of dollars through both Race to the Top and the Teacher Incentive Fund, responding states and districts were required to propose measures of teacher effectiveness that incorporate student growth data. Most of the applicants awarded these funds proposed weighting student data at up to 50% of the measure. In an effort to increase teacher accountability, high stakes have been proposed for these effectiveness measures: they may be used to make employment decisions, such as promotions and dismissals, as well as to award monetary bonuses.

In my opinion, there have been two kinds of responses to these teacher effectiveness measures. First, there’s been what I classify as the “sidelines” response where some researchers and teachers simply don’t want to take part in the development of these measures. Researchers claim that the value-added models that have been proposed to estimate teacher effects based on student data are flawed, whereas teachers claim that their work cannot be accurately and fairly reflected by student test scores alone. While these points are legitimate, those on the sidelines don’t tend to suggest alternative solutions.
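
For readers unfamiliar with the models under debate, here is a deliberately minimal caricature of the value-added idea, written only to show what is being estimated. Real value-added models incorporate covariates, multiple years of data, and shrinkage estimators, which is exactly where the methodological criticism concentrates; the data and the simple regression below are hypothetical.

```python
# Minimal caricature of a value-added estimate (all data hypothetical):
# a teacher's "effect" is the mean residual between students' actual
# scores and the scores predicted from their prior-year scores. Real
# VAMs are far more elaborate; this sketch only shows the core idea.
from collections import defaultdict

def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples."""
    xs = [p for _, p, _ in records]
    ys = [c for _, _, c in records]
    n = len(records)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Ordinary least squares fit of current ~ prior.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Average each teacher's residuals.
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))
    return {t: sum(r) / len(r) for t, r in residuals.items()}

toy = [("A", 50, 60), ("A", 60, 70), ("B", 50, 55), ("B", 60, 65)]
print(value_added(toy))  # A's students beat prediction; B's fall short
```

Even in this toy version, the estimate depends entirely on how the prediction is specified, which is the crux of the researchers' objection.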

In stark contrast to the sidelines response, there is also the response that I liken to the hare from the famous fable. Some policy-makers and vendors have ignored the debates and taken off in the race to provide a measure of teacher effectiveness. By moving directly to implementation of such measures, the policy-makers claim they will encourage reform within schools, and the vendors are happy to accept the new business.

I have followed both responses for nearly a year. Initially, I fell in line with many colleagues and wanted no part of using statistical models and student data to produce estimates of teacher effectiveness. I am now resigned to the fact that the sidelines response is unproductive. Refusing to participate does not impact the inevitability of teacher effectiveness measures and can be seen as a refusal to contribute to a solution.

So now that I’m ready to step off the sidelines and engage with teacher effectiveness measures, I realize that those who chose the hare response are so far ahead that they are no longer visible on the horizon. But perhaps they have paused to take a nap. By producing a measure of teacher effectiveness so quickly, this group has likely failed to provide the research and documentation needed to support and sustain it. In their haste, they probably did not engage stakeholders to elicit feedback about what effective teaching really means, and there has not been time to conduct validity studies that empirically link their measures to results in the classroom. While the hares in our industry may have gotten off to the fastest start, I suspect that their lead will not last.

I suggest that neither the sidelines nor the sprint is advisable. Rather, I advocate for what could be called the tortoise approach. It’s not a quick-fix solution, but with more time, a valuable and defensible measure of teacher effectiveness can be defined and established.

I suggest we tortoises follow comparable steps and standards to those used for assessment development:

  • The first step could be similar to a content review, where stakeholders convene to discuss and identify the essential components of teacher effectiveness.

  • Then content experts can partner with psychometricians to design and refine measures of these essential components.

  • Next, the measures could be piloted and validity studies could be performed. Standard setting could be used to establish the line that divides effective from ineffective teaching.

  • Custom reports could compare an individual teacher’s performance to that of other teachers in the same school or of teachers with similar student populations.

  • Professional development activities could be offered to help schools improve the skills of lower performing teachers.

Each of these steps takes time but also provides essential information needed to develop a valid and useful measure of teacher effectiveness.

Without sufficient evidence to support the defensibility and validity of the hares’ quick-fix measures, I wonder if they will even reach the halfway point, let alone the finish line. In contrast, we tortoises can continue moving forward, building measures with known procedures, valuable stakeholder input, and informative data analysis. Let’s heed the advice of The Tortoise and the Hare and work to establish a quality measure of teacher effectiveness rather than the fastest one.

Tracey Magda
Evaluation Systems
Assessment & Information