Monday, June 29, 2009

Pearson is Fulfilling the Goal to be the Nation’s Thought Leader in Assessment

One of the primary objectives of Pearson, as the leading provider of educational measurement research, is to lead the effort on effective educational policy discussion. Sometimes these efforts are clearly articulated in customer-facing actions (such as legally defensible setting of student performance standards), academic research publications, or conference presentations. Other times, policy and/or position papers are prepared to inform our customers and others about the direction Pearson is steering education. I was recently involved in the development of such a paper and want to share it with you in this post.

“Using Assessments to Improve Student Learning and Progress” is a very interesting paper that clarifies the roles of large-scale, high-stakes assessments as contrasted with classroom assessments. While I have made such comparisons in other TrueScores posts, this paper is much more comprehensive.

Here is a brief excerpt of the distinctions made in the paper:
“Assessments for learning provide the continuous feedback in the teach-and-learn cycle, which is not the intended mission of summative assessment systems. Teachers teach and often worry if they connected with their students. Students learn, but often misunderstand subtle points in the text or in the material presented. Without ongoing feedback, teachers lack qualitative insight to personalize learning for both advanced and struggling students, in some cases with students left to ponder whether they have or haven’t mastered the assigned content.”
This paper also contains links to other Pearson-related efforts to inform and shape public policy and opinion, as evidenced by the following excerpt:

“Assessments for learning are part of formative systems, where they not only provide information on gaps in learning, but inform actions that can be taken to personalize learning or differentiate instruction to help close those gaps. The feedback loop continues by assessing student progress after an instructional unit or intervention to verify that learning has taken place, and to guide next steps. As described by (Pearson authors) Nichols, Meyers, and Burling:
‘Assessments labeled as formative have been offered as a means to customize instruction to narrow the gap between students’ current state of achievement and the targeted state of achievement. The label formative is applied incorrectly when used as a label for an assessment instrument; reference to an assessment as formative is shorthand for the particular use of assessment information, whether coming from a formal assessment or teachers’ observations, to improve student achievement. As Wiliam and Black (1996) note: ‘To sum up, in order to serve a formative function, an assessment must yield evidence that…indicates the existence of a gap between actual and desired levels of performance, and suggests actions that are in fact successful in closing the gap.’”

This quote also shows that the Pearson themes are indeed consistent: personalized learning is supported through the Pearson "teach and learn" cycle as informed by assessment, one of Pearson's primary goals. So, go check it out!

Monday, June 22, 2009

Universal Design for Computer-Based Testing Guidelines

This was just posted on the Pearson website. Pearson’s Universal Design for Computer-Based Testing Guidelines examines the specific student challenges related to each test question construct and pinpoints question design solutions that can make test questions more accessible to all students. The study touts the value of digital technology and its ability to incorporate multiple representations, such as text, video and audio, into computer-based testing.

Thursday, June 04, 2009

Pearson Sessions at CCSSO (Updated)

Assessing Writing Online: The Benefits and Challenges

The transition to assessing student writing online presents both benefits and challenges to states and their students. This session will discuss the logistical, content, scoring, and political issues states face while implementing the transition to assessing student writing online. Louisiana will provide insight on implementing an online writing assessment and Minnesota will discuss the challenges it faced while attempting to develop an online writing component. In addition, research and content-based perspectives will be offered from two testing companies, Pearson and Pacific Metrics.

Presenters:
Jennifer Isaacs, Pacific Metrics Corporation
Denny Way, Pearson Education
Claudia Davis, Louisiana Department of Education
Dirk Mattson, Minnesota Department of Education

Three States’ Experiences in Implementing a Vertical Scale

In this session we describe the methods used by three states to implement vertical scales in reading and mathematics across grades 3 through 8. A vertical scale has become a desirable component of a state’s assessment program in recent years because schools that do not meet their AYP requirements in terms of the number of students meeting standards may still be counted as meeting those requirements if they can show that acceptable progress has been made. In Virginia, tests are administered online and on paper, so a vertical scale had to be created that is applicable to both modes. In Texas, tests are administered in English and Spanish, so two different vertical scales were developed. Mississippi is implementing a multi-year plan that includes developing vertical linking items reflecting the progression of content through the curriculum and monitoring student performance prior to reporting results on the vertical scale.

Presenters:
Steve Fitzpatrick, Pearson
Ahmet Turhan, Pearson
Kay Um, Pearson

Accommodations in a Computer-Based Testing Environment

States increasingly are delivering, or considering delivering, assessments via computer. Some states are working toward (or have achieved) a dual administration model, while others are exploring using only computer-based testing in the future. Use of the computer opens the door to technological solutions for accommodations. However, the use of technology for accommodations also raises a number of questions, such as ease of use and score comparability. This session will discuss the research and development of computer-based accommodations, such as text-to-speech and onscreen magnification, as well as the policy and practical issues, and potential solutions, surrounding their development and use in a secure testing environment.

Presenters:
John Poggio, University of Kansas
Bob Dolan, Pearson
Shelley Loving-Ryder, Virginia Department of Education
Todd Nielsen, Iowa Testing Programs
Discussant:
Sue Rigney, U.S. Department of Education

Ensuring Technical Quality of Formative Assessments

Reference to an assessment as formative is shorthand for the formative use of assessment data—whether coming from standardized tests, teacher observations, or intelligent tutoring systems—with the explicit goal of providing focused interventions to improve student learning. The technical quality of an assessment indicates the extent to which interpretations and decisions derived from assessment results are reasonable and appropriate. However, familiar technical requirements such as validity and reliability were developed with a focus on summative assessment and have not considered a coordinated system of instruction and assessment. For example, reliability has traditionally indicated the degree of consistency of test scores over replications, a definition that says little about whether a formative assessment consistently prescribes appropriate, targeted instruction over time and across conditions. This session will discuss different approaches to defining new indicators of technical quality appropriate for ensuring the effectiveness of formative assessment systems.

Presenters:
Bob Dolan, Pearson
Meg Litts, Onamia Public Schools, MN
John Poggio, University of Kansas
Jerry Tindal, University of Oregon
Discussant:
Tim Peters, New Jersey DOE

Application of Validity Studies for College Readiness: The American Diploma Project Algebra II End-of-Course Exam

Too many students graduate from high school unprepared for college—nearly one-fourth of first-year college students must take remedial courses in mathematics. An intended use of the American Diploma Project (ADP) Algebra II End-of-Course Exam is to serve as an indicator of readiness for first-year college credit-bearing courses—to ensure that students receive the preparation they need while they are still in high school. This session will discuss how the multistate ADP consortium gathered validity evidence for this use of the exam through studies involving: 1) content judgments of college instructors; 2) comparisons of college students’ ADP Algebra II End-of-Course Exam performance with subsequent grades in college level courses; and 3) empirical relationships between exam performance and ACT or SAT scores. Representatives from two ADP states, Achieve, and the exam vendor will share their experiences with collecting the data and how results of the exam will be used.

Presenters:
Julie A. Miles, Pearson
Nevin C. Brown, Achieve
Stan Heffner, Ohio Department of Education
Rich Maraschiello, Pennsylvania Department of Education

Keeping All Those Balls in the Air: Challenges and Approaches for Linking Test Scores Across Years in Multiple-Format Environments

In this era of requiring annual improvements in student test scores, valid test score linking is one of the most important components of a state testing program. To make things more challenging, many states are implementing computer-based testing programs, but few if any are able to completely shift to CBT platforms. Therefore, in addition to the usual challenges with test score linking, states are trying to ensure that comparable inferences are drawn from the same scale scores no matter the format of the test. This session brings together equating contractors and testing directors from two states that have successfully addressed these challenges to share their lessons learned and offer recommendations to other state leaders.

Presenters:
Scott Marion, NCIEA
Matt Trippe, HumRRO
Deborah Swensen, Utah State Office of Education
Tony Thompson, Pearson
Dirk Mattson, Minnesota Department of Education
Discussant:
Rich Hill, NCIEA


Revising the Standards for Educational and Psychological Testing

AERA, NCME and the APA have launched an effort to revise the Standards for Educational and Psychological Testing. Several members of the committee drafting this revision will describe issues being addressed and the process and timeline for completing our work. The charge to this committee specifies areas of focus including: (a) the increased use of technology in testing, (b) the increased use of tests for educational accountability, (c) access for all examinee populations, and (d) issues associated with work-place testing. The committee will also review the scope and formatting of the Standards. A state testing director will be invited to serve as discussant describing implications of the Test Standards for state assessments. A significant part of the session will be devoted to questions and comments from the audience.

Presenters:
Lauress Wise, HumRRO
Brian Gong, NCIEA
Linda Cook, ETS
Joan Herman, CRESST
Denny Way, Pearson
Discussant:
John Lawrence, California Department of Education


Lessons Learned and the Road Ahead for the ADP Algebra II Consortium

Students across 12 states took the American Diploma Project Algebra II End-of-Course Exam for the first time in spring 2008. In August 2008, the first annual report of the results and findings was released. This session will focus on the lessons learned from the first administration, progress toward the three common goals set for this endeavor by the participating states, and the challenges ahead for the multi-state consortium. Among the topics to be discussed are the validity studies being conducted to help establish the exam as an indicator of student readiness for first-year college credit-bearing courses, as well as how the states are using the assessment to improve high school Algebra II curriculum and instruction.

Presenters:
Laura Slover, Achieve
Shilpi Niyogi, Pearson
Tim Peters, New Jersey Department of Education
Gayle Potter, Arkansas Department of Education
Bernie Sandruck, Howard Community College

The Role of Technology in Improving Turnaround Time and Quality in Large Scale Assessments

To meet their ever-increasing needs for faster turnaround of test scores and more defensible scores, states are searching for ways to satisfy their constituents and to meet the demands of high-stakes testing. This panel will discuss implementations of programs that leverage technology. One test that is administered frequently and requires immediate turnaround is the ACCUPLACER, which is given to incoming freshmen at community colleges. The College Board uses an online testing environment to administer the tests and an automated intelligence engine to score them. North Carolina uses a distributed scoring model that engages large groups of scorers for its writing assessment and local district scoring of multiple-choice tests for its EOC and EOG tests. Virginia administers all multiple-choice tests online with rapid turnaround to score and deliver results. These three models meet the demands for later or on-demand testing and for faster or immediate test results.

Presenters:
Daisy Vickers, Pearson
Jim Kroenig, North Carolina DOE
Ed Hardin, The College Board
Shelley Loving-Ryder, Virginia Department of Education

Accurate and Time-Saving: Online Assessment of Oral Reading Fluency Using Advanced Speech Processing Technology

This panel will describe four studies conducted across eight states investigating the usability and impact of using an online, automated test delivery and scoring system to measure and track students’ oral reading fluency (ORF) performance. The ORF system produced words correct per minute (WCPM) scores for oral reading samples from hundreds of 1st through 5th graders. The session includes discussion of: 1) technical and practical challenges involved in large-scale test delivery, scoring, and data management; 2) the reliability of the automated scoring system, which produces scores that correlate highly (0.95-0.99) with teachers’ manual scores; 3) policy-related impacts of reliability data comparing machine scores with scores from expert test administrators and classroom teachers; 4) innovative methods for scoring other aspects of ORF (e.g., expressiveness, accuracy); and 5) teacher feedback on the value of the automated system, including how automated scoring enables reallocation of teacher time from test administration and scoring to instruction.

Presenters:
Ryan Downey, Pearson
David Rubin, Pearson
Jack Shaw, National DIBELS
DeAnna Pursai, San Jose Unified School District, CA

Developing Valid Alternate Assessments with Modified Achievement Standards: Three States' Approaches

Implicit in the design of an alternate assessment based on modified achievement standards (AA-MAS) is a validity argument that the assessment appropriately and accurately measures the grade-level academic achievement of students in a targeted sub-population of students with disabilities. In this session three states at different stages in the development process will present their approaches to developing valid AA-MAS. The states will discuss their test development models and factors influencing their choice of model. Presentations will cover the rationales and processes used for item development, research used to support test development, activities involving stakeholders, and the process of collecting validity evidence. A discussant will respond to the development models presented focusing on threats to validity, documenting validity evidence, and employing ongoing validity evaluations. The discussant’s presentation will be relevant to the test designs used in the three states and to AA-MAS test development more generally.

Presenters:
Kelly Burling, Pearson
Shelley Loving-Ryder, Virginia Department of Education
Cari Wieland, Texas Education Agency
Elizabeth Hanna, Pearson
Malissa Cook, Oklahoma Department of Education
Discussant:
Stuart Kahl, Measured Progress

Legislation in the One Corner, Implementation in the Other: And, It’s a Knock Out

Assessment legislation often gets passed before an implementation plan is fully vetted. In this session, Minnesota, Nebraska, and Texas square off against challenging assessment legislation. They will describe how they bob and weave as policy is thrown their way to develop implementation plans that are reasonable and in the best interest of their students. Presenters will share legislation that turned the local assessment system upside down, such as Texas legislation limiting field-testing at a time when 12 end-of-course assessments are to be rolled out. Join this session and listen as these three states explain how they knocked out the legislative challenges early instead of going the full 12 rounds.

Presenters:
Kimberly O'Malley, Pearson
Christy Hovanetz Lassila, Consultant
Pat Roschewski, Nebraska Department of Education
Gloria Zyskowski, Texas Education Agency
Discussant:
Roger Trent, Executive Director Emeritus for the Ohio Department of Education

Cognitive Interviews Applied to Test and Item Design and Development for AA-MAS (2 percent)

The session focuses on applying cognitive interviews (CI) in the development of alternate assessments judged against modified achievement standards (AA-MAS) and presents principles for using CI with AA-MAS that were formulated during a research symposium and published in a 2009 white paper. The symposium built on the work of Designing Accessible Reading Assessments and the Partnership for Accessible Reading Assessment, addressing recent research on think-aloud methods with students eligible for AA-MAS. White paper principles and four recent CI AA-MAS studies will be presented. Studies will be discussed in the context of the principles and how results can be used to inform test design and development for AA-MAS. Copies of the white paper will be available and audience interaction will be encouraged. Participants will be invited to ask questions, offer experiences, and discuss methods for interviewing students with limited communication, gathering reliable data, and applying CI to AA-MAS test development in their own settings.

Presenters:
Patricia Almond, University of Oregon
Caroline E. Parker, EDC
Chris Johnstone, NCEO
Shelley Loving-Ryder, Virginia Department of Education
Jennifer Stegman, Oklahoma Department of Education
Kelly Burling, Pearson
Discussant:
Phoebe Winter, Pacific Metrics Corporation

Friday, May 15, 2009

If David Beats Goliath, Just What Role Does the Psychometrician Play?

So I admit it—I, too, like a good, sensationalized, feel-good piece of literature, particularly one that has some application to what we do for a living. My very good but departed friend, Ed Slawski, used to say I was such a soft touch.

Regardless, I recently read a piece by that world-famous author Malcolm Gladwell (you know, the Tipping Point, Blink, and Outliers guy) that did indeed move me to write this blog. His article appeared in The New Yorker (yes, even an old curmudgeon like me actually subscribes to The New Yorker) and is titled "How David Beats Goliath." In this article, Mr. Gladwell describes how underdogs seem to win more often than they should because they change the rules of how the game is played. He uses the full-court press in basketball as an example of a strategy that a smaller, less talented team might use to beat a taller, more talented team. He actually cites statistics regarding the success of such a strategy.

Psychometricians, as you may well know, are very methodical people. They like specifications outlining what it is that they do. They like to follow procedures, are often meticulous, and believe in verification, transparency, and replicability. Simply stated, they follow the rules. So, the question at hand is: Will psychometricians be the David or the Goliath of the new assessment reform implicit in the new administration and more explicit as a goal of Secretary Duncan?

Certainly, the brave new world of assessments in the post-NCLB era will be unlike what we have seen to date. Reliability as a measure of internal consistency and validity as a correlation coefficient with existing measures are not likely to be the psychometric quality mantra moving forward. These new assessments are likely to be driven by needs for problem-solving measures, measures of critical thinking, assessments of our ability to manage large amounts of information (presumably coming from the Internet), and comparability with international benchmarks, all implemented and managed in an online and automated way. By definition, these new assessments will violate the rules and will be the David that defeats Goliath. I wonder how the greater psychometric community will react. "Just say no" comes to mind but will likely be a woefully inadequate response.
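For readers who want to see what that traditional mantra looks like in concrete terms, here is a minimal, purely illustrative sketch of the two familiar indices: coefficient alpha as a measure of internal consistency and a simple criterion correlation as a validity coefficient. The data and the external criterion below are simulated and hypothetical; the point is only to show how narrow these conventional quality indices are.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Classical internal-consistency estimate from an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item score
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Simulated data purely for illustration: 200 examinees, 20 dichotomous items.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
responses = (rng.normal(size=(200, 20)) < ability).astype(float)

alpha = cronbach_alpha(responses)

# "Validity as a correlation coefficient with an existing measure": correlate
# total scores with a hypothetical external criterion (again, simulated).
criterion = ability.ravel() + rng.normal(scale=0.5, size=200)
validity_r = np.corrcoef(responses.sum(axis=1), criterion)[0, 1]

print(f"coefficient alpha = {alpha:.2f}, criterion validity r = {validity_r:.2f}")
```

Neither number says anything about whether the assessment helped a teacher decide what to do next, which is precisely the gap this next generation of assessments will have to address.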

I certainly hope that, under my direction, my staff will embrace the need to see the world differently, lest we fall behind and find ourselves unable to support this next generation of assessments. Pearson plans to be the giant killer in this regard, changing the rules and leading the way into this new generation of learning. What is it that you plan to do?

Wednesday, April 15, 2009

Pearson at NCME & AERA

Pearson once again dominates presentations (well, we have a lot anyway) at the annual meetings of the National Council on Measurement in Education (NCME) and the American Educational Research Association (AERA).


NCME


Um, K., Way, W. D., Fitzpatrick, S. J., & Kreiman, C.
The Effects of Response Probability Criteria on the Scale Location Estimation and Impact Data in Standard Setting.

Wan, L., & Henly, G.
Measurement Properties of Innovative Item Formats in a Computer-Based Science Test

Arce-Ferrer, A.
An Investigation of Traditional and Alternative Approaches to Vertically Scale Modified Angoff Cut Scores

Seo, D., Shin, S., Taherbhai, H., & Sun, Y.
Exploring and Explaining Gender Format Differences in English as a Second Language Writing Assessment Using Logistic Mixed Models

Shin, C. D., Ho, T., Chien, Y., & Deng, H.
A Comparison of Person-Fit Statistics in Computerized Adaptive Test Using Empirical Data

Turhan, A., Courville, T., & Keng, L.
The Effects of Anchor Item Position on a Vertical Scale Design

Tong, Y. & Kolen, M.
A Further Look into Maintenance of Vertical Scales

Turhan, A., Lin, C., O’Malley, K. & Kolen, M.
Vertical Scaling for Paper and Online Assessments

Kahraman, N., & Thompson, T.
Relating Unidimensional IRT Parameters to a Multidimensional Response Space: A Comparison of Two Alternative Dimensionality Reduction Approaches

Meyers, J. L., Turhan, A., & Fitzpatrick, S. J.
Interaction of Calibration Procedure and Ability Estimation Method for Writing Assessments under Conditions of Multidimensionality

Shin, C. D, Chien, Y., & Way, W. D.
The Weighted Penalty Model and Conditional Randomesque Method for Item Selection in Computerized Adaptive Tests

Yi, Q.
The Impact of Ability Distribution Differences Between Beneficiaries and Non-Beneficiaries on Test Security Control in CAT

Mao, X. & Fitzpatrick, S. J.
An Investigation of the Linking of Mathematics Tests with and Without Linguistic Simplification

Ming, X., Wang, J., & Wu, S.
A Predictive Validity Study of an English Language Proficiency Test

Arce-Ferrer, A.
Anchor-Test Design Issues in Equating

Chien, Y., Shin, C. D, & Way, W. D.
Weighted Penalty Model for Content Balancing in CAT

Keng, L. & Dodd, B. G.
A Comparison of the Performance of Testlet-Based Computer Adaptive Tests and Multistage Tests

Song, T. & Arce-Ferrer, A.
Comparing IPD Detection Approaches in Common-Item Non-Equivalent Group Equating Design

Wang, C., Wei, H., & Gao, L.
Investigating the Effects of Speededness on Test Dimensionality

Wei, H.
The Effect of Test Speededness on Item and Ability Parameter Estimates in Multidimensional IRT Models

McClarty, K., Lin, C., & Kong, J.
How Many Students Do You Really Need? The Effect of Sample Size on the Matched Samples Comparability Analysis

Thompson, T.
Scale Construction and Conditional Standard Errors of Measurement

Ye, F., You, W., & Xu, T.
Multilevel Item Response Model for Longitudinal Data


NCME Training Sessions


Kolen, M. & Tong, Y.
Vertical Scaling Methodologies, Applications and Research


NCME Discussants


Shin, C. D.
Standard Errors of Equating

Way, W. D.
Current Practices in Licensure and Certification Testing

Tong, Y.
Test Equating with Constructed-Response Items and Mixed-Format Tests


NCME Moderators


Dolan, R.
Comparability of Paper-and-Pencil and Computer-Based Exams

Nichols, P. D.
Modifications of Traditional Methods of Setting Standards


AERA


Fulkerson, D., Nichols, P. D., Mislevy, R., Liu, M., Zalles, D., Fried, R., Villalba, S., Debarger, A., Cheng, B., Mitman, A., Haertel, G., & Cho, Y.
Research Findings: Leveraging ECD in Scenario-Based Science Assessments

Dolan, R., Way, W. D., & Nichols, P. D.
Technical Quality of Formative Assessments Within Online Instructional Tool

Fulkerson, D., Mittelholtz, D., & Nichols, P. D.
The Psychology of Writing Items: Improving Figural Response Item Writing

Lau, C., Zhang, L. & Jiang, X.
Using Pass/Fail Pattern to Predict Students’ Success for Standards: A Longitudinal Study with Large-Scale Assessment Data

Stephenson, A. & Song, T.
Using HLM to Investigate Longitudinal Growth of Students’ English Language Proficiency

Beretvas, S. N., Chung, H., & Meyers, J. L.
Modeling Rater Severity Using Multiple Membership, Cross-Classified, Random-Effects Models

Harrell, L. & Wolfe, E. W.
A Comparison of Global Fit Indices as Indicators of Multidimensionality in Multidimensional Rasch Analysis

Yue, J., Creamer, E., & Wolfe, E. W.
Measurement of Self-Authorship: A Validity Study Using Multidimensional Random Coefficients Multinomial Logit Model

Murphy, D. L.
Multilevel Growth-Curve Modeling: A Power Analysis of the Unstructured Covariance Matrix

Arce-Ferrer, A., Song, T., & Sullivan R.
Linking Strategies and Item Screening Approaches: A Study with Augmented Nationally Standardized Tests Informing NCLB

Shin, C. D. & Chien, Y.
Conditional Randomesque Method for Item Exposure Control in CAT

Thompson, T. & Way, W. D.
Using CAT to Achieve Comparability with a Paper Test

Yang, Z. & Wang, S.
Calibration Methods Comparison with the Rasch Model

Shin, C.D.
Using Bayesian Sequential Analyses in Evaluating the Prior Effect for Two Subscale Score Estimation Methods

Beimers, J.
Consistency of District Adequate Yearly Progress (AYP) Determinations Across Three Types of NCLB Growth Models

Wang, Z., Taherbhai, H., Xu, M. & Wu, S.
Modeling Growth in English Language Proficiency with Longitudinal Data Using the Latent Growth Curve Model

Way, W. D.
Increased Use of Technology in Testing

McGill, M. & Wolfe, E. W.
Assessing Unidimensionality in Item Response Data via Principal Component Analysis of Residuals from the Rasch Model

Ingrisone, J.
Modeling the Joint Distribution of Response Accuracy and Response Time

Ingrisone, S.
An Extended Item Response Model Incorporating Item Response Time

Sung, H.
Developing a Short Form of the Enright Forgiveness Inventory Using Item Response Theory

Arce-Ferrer, A., Wang, Z., & Xue, Q.
Applying Rasch Model and Generalizability Theory to Study Modified Angoff Cut Scores for Reporting with Vertical Scales

McGill, M., Wolfe, E. W., & Jarvinen, D.
Validation of Measures of the Quality of Mentoring Experiences of New Teachers

Jiao, H., Wang, S., Wan, L. & Lu, R.
An Investigation of Local Item Dependence in Scenario-Based Science Assessments

Tsai, T. & Shin, C. D.
Generalizability Analyses of a Case-Dependent Section in a Large-Scale Licensing Examination


AERA Discussants


Tong, Y.
Factors Influencing Equating Accuracy

Nichols, P. D.
Standards, Proficiency Judgments, and Norms


AERA Chairs


Mueller, C.
Studies Examining Achievement Gaps

Monday, March 09, 2009

NAEP: Love It or Leave It!

As the national debate about what to do with education reform rages, I hear repeatedly the need to have national standards, common core standards, or at the very least, a more standardized system of national guidelines in order for us to measure and, presumably then, improve education in America. Often this discussion leads to debate about the merits of a national test—which is often assumed to be NAEP, sometimes called the "nation's report card.”

NAEP is well researched, well documented, and seems to be well loved, if not revered, by most psychometricians—other than me and a few others who dare to challenge the status quo. I have questioned the usefulness of NAEP as a "check test" for NCLB at various times in my career, all based on the following premises:
  • Students who take NAEP are essentially unmotivated, while most students are highly motivated to pass mandated state assessments.
  • NAEP essentially measures a "consensus" national curriculum, while state assessments measure very specific content standards, which presumably align with or mirror instruction.
  • NAEP is administered to a specific student sample in which each student takes only a portion of the assessment, whereas all students take the complete statewide assessment.
  • NAEP was targeted to measure at a higher level of proficiency (for example, only 39% of all students were at or above proficient on NAEP Mathematics in 2007), whereas for most statewide assessments the percentages were much larger (a brief illustration follows this list).
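To make that last point concrete, here is a minimal sketch, using simulated scores and hypothetical cut-score percentiles (not actual NAEP or state values), of how two tests can rank the very same students identically and still report very different percentages of students as proficient, simply because the performance standards sit at different points on the scale.

```python
import numpy as np

# Illustrative only: one simulated cohort measured on a common latent scale.
rng = np.random.default_rng(42)
scores = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Hypothetical cut placements: a deliberately high "NAEP-style" proficient bar
# versus a lower bar aligned to a state's own standards.
naep_style_cut = np.percentile(scores, 61)
state_style_cut = np.percentile(scores, 30)

pct_naep_style = (scores >= naep_style_cut).mean() * 100
pct_state_style = (scores >= state_style_cut).mean() * 100

print(f"Percent proficient with a NAEP-style cut:  {pct_naep_style:.0f}%")
print(f"Percent proficient with a state-style cut: {pct_state_style:.0f}%")
# Same students, same rank order: roughly 39% versus 70% "proficient,"
# driven entirely by where the performance standard is set.
```

A discrepancy of that size, in other words, can arise without either test being wrong about the students.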
When I speak with my colleagues, most of them say things like: Lighten up! NAEP is great! Math is math. And so on. Oddly enough, for a group of well-trained colleagues who claim to be scientists, such indefensible positions leave me wanting. The good news is that there is some sensible research in the measurement literature. I am referring to the article by Dr. Andrew Ho, of the University of Iowa, published in Educational Measurement: Issues and Practice. In this article, entitled Discrepancies Between Score Trends from NAEP and State Tests: A Scale-Invariant Perspective, Dr. Ho provides a balanced and well-reasoned argument regarding the usefulness of NAEP for such comparisons. I provide quotes below (albeit taken out of context) so that your curiosity will be piqued and you will seek out and read his entire article.

“Given a perspective that NAEP and State tests are designed to assess proficiency along different content dimensions, State-NAEP discrepancies are not cause for controversy but a baseline expectation.”

“Trends for a high-stakes, or ‘focal’ test, may differ from trends on a low-stakes test like NAEP (an ‘audit’ test) for a number of reasons, including different "elements of performance" sampled by both tests, different examinee sampling frames, or differing changes in student motivation.”

“As NAEP adjusts to its confirmatory role, there must be an additive effort to temper expectations that NAEP and State results should be identical.”

My colleagues who disregard this last conclusion from Dr. Ho are doing the policy makers and implementers of the NCLB "law of the land" a disservice by suggesting that statewide assessments are somehow inferior simply because their results are not replicated on NAEP. Let's get back to speaking about the science of assessment and experimental comparison, and leave the passion and politics to someone else.