Monday, July 31, 2006

NCLB Testing: About Learning, Not Standings

The No Child Left Behind Act (NCLB) is perhaps the most sweeping and controversial educational reform ever enacted. As professionals working in the testing industry, we have both benefited and suffered from this legislation. To be sure, there are aspects of NCLB that are less than ideal, but sometimes criticisms of NCLB are taken too far beyond the facts.

Recently, a commentary highly critical of NCLB was published in the Wall Street Journal. The author, Charles Murray, claims that NCLB is “a disaster for federalism” and “holds good students hostage to the performance of the least talented.” Murray cites a report from the Civil Rights Project at Harvard University, which concludes that NCLB has not improved reading and math achievement as measured by the National Assessment of Educational Progress (NAEP). Murray further argues that although many state assessments show decreases in black–white achievement gaps, these decreases are meaningless because they are statistical artifacts based on changes in pass rate percentages rather than differences in test scores.

Murray’s criticism confounds the idea of measuring students against standards (criterion-referenced testing) with measuring students relative to a group (norm-referenced testing). In a norm-referenced system, if scores for white and black students increase the same amount, clearly the gap between the groups is not closing. But Murray fails to understand the basic tenets of standards-based assessment that provide the framework for NCLB. In each state, assessments are built to measure the state content standards, the very same content standards that schools are expected to use in their instruction. In addition, each state sets achievement standards that establish how well students must perform on the assessments to be considered “proficient” and “advanced.”

In a standards-based assessment, if black and white students improve equally, the percentage of blacks achieving proficient will, over time, increase more than the percentage of whites achieving proficient. Murray is correct to say this is “mathematically inevitable.” But, is it meaningless? Is it meaningless that more black students who were not proficient are proficient now? Is it meaningless that more minority students are learning fundamental reading and math skills that they weren't learning before?
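To make the "mathematically inevitable" point concrete, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that each group's scores are normally distributed with the same spread. The means, the cut score, and the size of the gain are hypothetical values on an arbitrary scale, not TAKS or NAEP data.

```python
from statistics import NormalDist

def pass_rate(mean, cut, sd=1.0):
    """Share of a normally distributed group scoring at or above the cut."""
    return 1.0 - NormalDist(mean, sd).cdf(cut)

# Hypothetical values on an arbitrary score scale (assumptions, not real data).
CUT = 0.0          # proficiency cut score
HIGH_MEAN = 1.0    # mean of the higher-scoring group
LOW_MEAN = 0.0     # mean of the lower-scoring group
GAIN = 0.5         # identical score gain applied to both groups

before_high = pass_rate(HIGH_MEAN, CUT)
before_low = pass_rate(LOW_MEAN, CUT)
after_high = pass_rate(HIGH_MEAN + GAIN, CUT)
after_low = pass_rate(LOW_MEAN + GAIN, CUT)

# Norm-referenced view: the gap in mean scores is unchanged.
score_gap_before = HIGH_MEAN - LOW_MEAN
score_gap_after = (HIGH_MEAN + GAIN) - (LOW_MEAN + GAIN)

# Standards-based view: the gap in pass rates shrinks.
pass_gap_before = before_high - before_low
pass_gap_after = after_high - after_low

print(f"score gap:     {score_gap_before:.2f} -> {score_gap_after:.2f}")
print(f"pass-rate gap: {pass_gap_before:.3f} -> {pass_gap_after:.3f}")
print(f"gain in pass rate, lower group:  {after_low - before_low:.3f}")
print(f"gain in pass rate, higher group: {after_high - before_high:.3f}")
```

Under these assumptions the score gap stays the same while the lower-scoring group gains more in pass rate, because more of its students sit close to the cut. With a different cut score or different starting means the size of the effect changes, which is exactly why the choice of standard matters.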

By disdaining measuring students against standards and embracing norm-referenced measurements, Murray’s arguments evoke stereotypical assumptions—notably the assumption that there is “a constant, meaningful difference between groups” (as if this is some natural law and not based on divisions of class, privilege, and access to resources) and the statement that “they cannot all even be proficient” (i.e., what’s the point of trying, they can’t learn anyway).

In Texas, standards-based assessment preceded NCLB by more than a decade. Over time, Texas policy makers have revised their assessments several times. Their most recent program, the Texas Assessment of Knowledge and Skills (TAKS), was introduced in 2003. It is based on tougher content standards and imposes tougher achievement standards than any prior Texas assessment. Texas also did something unusual when it introduced TAKS: it phased in the tougher achievement standards over several years, starting with standards that were two standard errors of measurement (SEM) below the level recommended by the standard setting panels. Table 1 shows the percentage of students passing the exit-level mathematics test between 2003 and 2006 based on different standards: 2 SEM below the panel recommendation, 1 SEM below the panel recommendation, and at the panel recommendation. The bold pass rates correspond to the standard that was used in a particular year.

Table 1: Percent of Students Passing TAKS Grade 11 Mathematics – Spring 2003 to Spring 2006


Murray would probably be delighted with Table 1, because just as he predicted, the difference between black and white pass rates depends on the standard that is applied. On the other hand, Murray would also have to admit that passing percentages of both blacks and whites improved each year, regardless of the performance standard one might care to apply. The rise in test performance indicates that students are being taught the necessary skills they weren’t learning before. This is far from meaningless.
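The same kind of sketch can show both halves of this argument under a phased-in standard. The panel cut score, the SEM, the group means, and the yearly gain below are all hypothetical values chosen for illustration (they are not the figures from Table 1), and normally distributed scores are again an assumption.

```python
from statistics import NormalDist

# Hypothetical values on an arbitrary score scale (assumptions, not TAKS data).
PANEL_CUT = 0.0    # cut score recommended by the standard setting panel
SEM = 0.3          # standard error of measurement

# The three standards described in the text, from most lenient to strictest.
STANDARDS = {
    "2 SEM below panel": PANEL_CUT - 2 * SEM,
    "1 SEM below panel": PANEL_CUT - SEM,
    "panel recommendation": PANEL_CUT,
}

def pass_rate(mean, cut, sd=1.0):
    """Share of a normally distributed group scoring at or above the cut."""
    return 1.0 - NormalDist(mean, sd).cdf(cut)

def rates_by_standard(mean):
    """Pass rate of one group under each of the three standards."""
    return {name: pass_rate(mean, cut) for name, cut in STANDARDS.items()}

HIGH_MEAN, LOW_MEAN = 1.0, 0.0   # hypothetical group means
YEARLY_GAIN = 0.2                # hypothetical improvement for both groups

year1 = {"high": rates_by_standard(HIGH_MEAN), "low": rates_by_standard(LOW_MEAN)}
year2 = {
    "high": rates_by_standard(HIGH_MEAN + YEARLY_GAIN),
    "low": rates_by_standard(LOW_MEAN + YEARLY_GAIN),
}

for name in STANDARDS:
    gap = year1["high"][name] - year1["low"][name]
    print(f"{name}: year-1 gap = {gap:.3f}, "
          f"low group {year1['low'][name]:.3f} -> {year2['low'][name]:.3f}, "
          f"high group {year1['high'][name]:.3f} -> {year2['high'][name]:.3f}")
```

In this sketch the size of the pass-rate gap does depend on which standard is applied, yet both groups' pass rates rise under every one of the three standards whenever scores improve.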

One aspect of Table 1 that Murray might take special note of is the rather astonishing increase in pass rates between 2003 and 2004. This increase did not surprise Texas educators at all. It turns out that the requirement to pass the exit-level TAKS tests did not apply in the first year of testing. Thus, the dramatic increase in passing rates between 2003 and 2004 is probably due in part to instructional changes and in part to changes in student motivation. It also provides some context for considering the use of NAEP scores as criteria for evaluating NCLB: NAEP does not measure any state’s content standards, and there is little or no incentive for students to give their best performance. As the only national test available, NAEP is a convenient yardstick, but it was not designed to evaluate state assessment systems, and its use for that purpose is of limited validity.

The politics of NCLB are complex and tend to encourage extreme positions. In calling NCLB “uninformative and deceptive,” Charles Murray has taken an extreme position that fails to recognize the rationale and merits of standards-based assessment. Irrespective of one’s views about NCLB, it is important that the public debate be an informed one, and the rhetoric of Charles Murray misses many of the issues that matter.

Monday, July 24, 2006

APA in the "Big Easy"

In an effort to continue the recovery following Hurricane Katrina, the American Psychological Association (APA) has scheduled its annual convention in New Orleans, August 10-13. There are still hotel rooms available, and it promises to be a good conference. (See the brochure on the APA website.)

This conference is not just for psychologists. In fact, the College Board's Dr. Kathleen Williams and yours truly will be presenting a paper on the affinity of human scoring relative to individualized assessment administration. This session will focus on the similarities between the large-scale scoring of writing assessments by human readers and the scoring of individualized assessments typically requiring a professionally trained examiner. For more information, search the online program for the session title "Challenges in Scoring Open-Ended Responses" on the APA Convention website.

Wednesday, July 19, 2006

Been Thinking About...

Dr. Joshua Aronson, an associate professor at NYU Steinhardt, was the keynote speaker at our last Iowa Educational Research and Evaluation Association (IEREA) conference, and I have been thinking about what he said ever since.

From his address, it is fair to say that Dr. Aronson does not believe that what psychometricians and test builders do to eliminate potential bias works. Rather, he talks about a bias that he labels "stereotype threat." That is, the elimination of stereotypes and biases is not enough—as long as the examinee thinks there might be bias, they behave as if there were bias. Dr. Aronson provided the following quote:

"I knew I was just as intelligent as everyone else...but for some
reason I didn't score well on tests. Maybe I was just nervous. There's
a lot of pressure on you, knowing that if you fail, you fail your race."

–Texas State Senator Rodney Ellis

In experimental settings, Dr. Aronson and his colleagues have shown that when stereotype threat is present, minority students underperform relative to when stereotype threat is not present. For example, African-American students doubled their performance solving problems on verbal tasks when they were not asked to indicate their race. When they were asked to provide their race, their performance was cut in half!

As measurement experts, we are diligent regarding our procedures to make assessments fair. We need to keep thinking about ways we can improve.

Tuesday, July 11, 2006

CCSSO Session Results

I found this year's CCSSO Large-Scale Assessment Conference particularly useful. I have been a critic of this conference in the past, believing that there were too many people pontificating about "how" assessment should be done without actually having done any themselves. This year, however, the conference was a very good mix of policy/political insight, applied measurement research, program advice, and empirically driven "theoretical" research. In addition, the sessions I attended were standing room only, indicating that the attendees' interest was piqued. That is all the more telling given that a venue as nice as San Francisco offered plenty of distraction.

Don't take my word for it. Look over the papers and presentations posted at the CCSSO website. The sessions I attended or that were reported to be most interesting included: Monday Session 74 - "Using Technology to Create Innovative State Science Assessments: Pilots and Policy"; Monday Session 125 - "Measures of Student Achievement, Vertical Articulation, and the Realities of Large-Scale Assessment"; and Tuesday Session 38 - "What's Next In Online Testing?".

Thanks to CCSSO for posting these papers. This is a service we should all be excited about.

Thursday, July 06, 2006

First "Bulletin" Now Available

Pearson Educational Measurement (PEM) is proud to announce the first issue of our newly created Pearson Educational Measurement Bulletin. Our intent is to further the understanding of our industry and our profession by providing real-world explanations on pertinent topics related to test development, psychometrics, and educational assessment.

The first issue describes, in a non-technical way, the facts surrounding the current best measurement practice known as universal design (see PEM Research Report 05-04 for more information). This document links the requirements of NCLB with the desire to build the "least restrictive assessment environment" such that all students can participate in educational assessments fairly. Through universal design, assessments (both paper-based and electronic) will become more valid (supporting stronger inferences from student assessment results) because non-construct-related variance will be reduced.

Written by researchers at PEM, this is a good introduction to universal design for those unfamiliar or needing more clarification, those needing a quick refresher, or those with familiarity who need a brief reference or resource.

New issues will be posted regularly and will cover a vast array of subjects. Some will be simple answers to frequently asked questions. Others will be more instructional with step-by-step guidance on your favorite measurement topics. You might even mistake some papers for empirical research! The only way you can tell is by dropping by our website from time to time to see what new topics have been posted. Or, you can make it easier by signing up to be notified of new releases.

Saturday, July 01, 2006

National Education Computing Conference (NECC) in Sunny San Diego!

One conference that gets some attention from trade organizations but really provides rich information regarding technology, assessment, and learning is the annual National Education Computing Conference (NECC). This year's conference is in San Diego, July 4-7.

A host of Pearson organizations will be present with information and demonstrations: Pearson Educational Measurement (PEM), Pearson School Systems (including PowerSchool and Chancery) and others.

PEM will be presenting our Perspective series of informative score reporting services, PASeries Writing and Algebra I formative assessments, and enhancements to other PASeries assessments. Certainly too much fun to pass up!