Monday, June 14, 2010

Pearson at CCSSO - June 20-23, 2010

I'm looking forward to seeing folks at the upcoming Council of Chief State School Officers (CCSSO) National Conference on Student Assessment (in Detroit, June 20-23, 2010). Pearson employees will be making the following presentations. We hope to see you there.

Multi-State American Diploma Project Assessment Consortium: The Lessons We’ve Learned and What Lies Ahead
Sunday, June 20, 2010: 1:30 PM-3:00 PM
Marquette (Detroit Marriott at the Renaissance Center)
Shilpi Niyogi

Theory and Research On Item Response Demands: What Makes Items Difficult? Construct-Relevant?
Sunday, June 20, 2010: 1:30 PM-3:00 PM
Mackinac Ballroom West (Detroit Marriott at the Renaissance Center)
Michael J. Young

Multiple Perspectives On Computer Adaptive Testing for K-12 Assessments
Sunday, June 20, 2010: 3:30 PM-5:00 PM
Nicolet (Detroit Marriott at the Renaissance Center)
Denny Way

The Evolution of Assessment: How We Can Use Technology to Fulfill the Promise of RTI
Monday, June 21, 2010: 3:30-4:30 PM
Cadillac B (Detroit Marriott at the Renaissance Center)
Christopher Camacho
Laura Kramer

Comparability: What, Why, When and the Changing Landscape of Computer-Based Testing
Tuesday, June 22, 2010: 8:30 AM-10:00 AM
Duluth (Detroit Marriott at the Renaissance Center)
Kelly Burling

Measuring College Readiness: Validity, Cut Scores, and Looking to the Future
Tuesday, June 22, 2010: 8:30 AM-10:00 AM
LaSalle (Detroit Marriott at the Renaissance Center)
Jon Twing & Denny Way

Best Assessment Practices
Tuesday, June 22, 2010: 10:30 AM-11:30 AM
Mackinac Ballroom West (Detroit Marriott at the Renaissance Center)
Jon Twing

Distributed Rater Training and Scoring
Tuesday, June 22, 2010: 4:00 PM-5:00 PM
Richard (Detroit Marriott at the Renaissance Center)
Laurie Davis, Kath Thomas, Daisy Vickers, Edward W. Wolfe

Identifying Extraneous Threats to Test Validity for Improving Tests and Use of Tests
Tuesday, June 22, 2010: 4:00 PM-5:00 PM
Joliet (Detroit Marriott at the Renaissance Center)
Allen Lau


--------------------------
Edward W. Wolfe, Ph.D.
Senior Research Scientist

Wednesday, June 09, 2010

Innovative Testing

Innovative testing refers to the use of novel methods to test students in richer ways than can be accomplished using traditional testing approaches. This generally means the use of technology like computers to deliver test questions that require students to watch or listen to multimedia stimuli, manipulate virtual objects in interactive situations, and/or construct rather than select their responses. The goal of innovative testing is to measure students’ knowledge and skills at deeper levels and to measure constructs not easily assessed, such as problem solving, critical analysis, and collaboration. This will help us better understand what students have and haven’t learned, and what misconceptions they might hold, and thus support decisions ranging from accountability to the choice of instructional interventions for individual students.

Educational testing has always involved innovative approaches. As hidden properties of students, knowledge and skill are generally impossible to measure directly and very difficult to measure indirectly, often requiring the use of complex validity arguments. However, to the extent that newer technologies may allow us to more directly assess students’ knowledge and skills—by asking students to accomplish tasks that more faithfully represent the underlying constructs they’re designed to measure—innovative testing holds the promise of more authentic methods of testing based upon simpler validity arguments. And as such, measurement of constructs that are “higher up” on taxonomies of depth of understanding, such as Bloom’s and Webb’s, should become more attainable.

Consider assessing a high school student’s ability to design an experimental study. Is this the same as his or her ability to identify one written description of a well-designed experiment amongst three written descriptions of poorly designed experiments? Certainly there will be a correlation between the two; the question is how strong that correlation is, or more bluntly, how artificial the context. And further, to what extent is such a correlation a self-fulfilling prophecy, in which students who might be good at thinking and doing science, but not at narrowly defined assessment tasks, do poorly in school as a result of poor test scores and the compounding impact of negative feedback?

Many will recall the promise of performance assessment in the 90’s to test students more authentically. Performance testing didn’t live up to its potential, in large part because of the challenges of standardized administration and accurate scoring. Enter innovative questions—performance testing riding the back of digital technologies and new media. Richer assessment scenarios and opportunities for response can be administered equitably and at scale. Comprehensive student interaction data can be collected and scored by humans in efficient, distributed settings, automatically by computer, or both. In short, the opportunity for both large-scale and small-scale testing of students using tasks that more closely resemble real-world application of learning standards is now available.

Without question, creating innovative test questions presents additional challenges over creating simpler, traditional ones. As with any performance task, validity arguments become more complex and reliability of scoring becomes a larger concern. Fortunately, there has been some excellent initial work in this area, including the development of taxonomies and rich descriptions for understanding innovative questions (e.g., Scalise; Zenisky). Most notable are two approaches that directly address validity. The first is evidence-centered design, an approach to creating educational assessments in terms of evidentiary arguments built upon intended constructs. The second is a preliminary set of guidelines for the appropriate use of technology in developing innovative questions through application of universal design principles that take into account how students interact with those questions as a function of their perceptual, linguistic, cognitive, motoric, executive, and affective skills and challenges. Approaches such as these are essential if we are to ensure that the needs of students with disabilities and English language learners are considered from the beginning in designing our tests.

Do we know that innovative questions will indeed allow us to test students to greater depths of knowledge and skill than traditional ones, and whether they will do so in a valid, reliable, and fair manner? And will the purported cost effectiveness be realized? These are all questions that need ongoing research.

As we solve the challenges of implementing innovative questions in technically sound ways, perhaps the most exciting aspect of innovative testing is the opportunity to integrate it with evolving innovative instructional approaches. Is it putting the cart before the horse to focus so much on innovation in assessment before we figure it out in instruction? I believe not. Improvements to instructional and assessment technologies must co-evolve. Our tests must be designed to pick up the types of learning gains our students will be making, especially when we consider 21st century skills, which will increasingly rely on innovative, technology-based learning tools. Our tests also have a direct opportunity to impact instruction: despite all our efforts, “teaching to the test” will occur, so why not have those tests become models of good learning? And even if an emphasis on assessment is the cart, at least the whole jalopy is going the correct way down the road. Speaking of roads, consider the co-evolution of automobiles and improved paving technologies: improvement in each couldn’t progress without improvement in the other.

Bob Dolan, Ph.D.
Senior Research Scientist

Wednesday, June 02, 2010

Where has gone the luxury of contemplation?

I ran across an Excel spreadsheet from some years ago that I had used to plan my trip to attend the 2004 NCME conference in San Diego. The weather was memorable that year. But I also attended a number of sessions during which interesting papers were presented and discussants and audience members made compelling comments.

My memories of the 2010 NCME conference are different. I am grateful that the weather was pleasant. But what I mostly remember is rushing from one responsibility to the next. I am sure the 2010 conference included interesting papers and compelling commentary, but my memories of them are overshadowed by a sense of haste and a feeling of urgency. This impression was of my own doing. First, I arrived several days late because of an already crowded travel schedule. Second, I participated in the conference in several roles: presenter, discussant, and co-author.

What I missed this year in Denver was the luxury of contemplation. I missed the luxury of sitting in the audience and reacting to the words and ideas as they rolled from the tongues of the presenters. I missed the luxury of mentally inspecting each comment from the discussants or the audience members and comparing them with my own reactions. I missed the luxury of chewing over the last session with a colleague as we walked through the hotel hallway and maybe grabbed lunch before the next session.

I can and did benefit from attending the NCME conference without the luxury of contemplation. But I missed the pleasure and comfort of indulging in a calm and thoughtful appreciation of the labors of my colleagues. These days we rarely indulge in the luxury of contemplation, and we are often impoverished because of it.

Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson

Thursday, April 29, 2010

Pearson’s participation at AERA and NCME

As I read the latest TMRS Newsletter, I was reminded of my first days at Pearson. Back then, the department, led by Jon Twing, consisted of five other staff members. We were called psychometricians, and we worked closely together across operational testing programs.

The department grew rapidly after that. As part of that growth, Jon and I created the Researcher-Practitioner model as an ideal. Under the Researcher-Practitioner Model, practice informs research and research supports practice. The role of the psychometrician combines research and fulfillment. Our department would not have separate groups of staff to perform research and fulfillment functions. Instead, our department would use the same staff members to perform both activities. Each psychometrician would dedicate the majority of their hours to contract fulfillment and the remaining hours to research.

Many things have changed since those first days, but some things remain the same. With more than 50 research scientists, we are still a close-knit group. But the label “psychometrician” has been replaced with “research scientist.” And we are still working toward the ideal of the Researcher-Practitioner. While we may not have achieved that ideal, the list of Pearson staff participating in the annual conferences of the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME) in the latest TMRS Newsletter is proof that we are still active researchers. In Denver, Pearson staff will give 22 presentations at AERA and 15 presentations at NCME. In addition, Pearson research scientists will make three presentations at the Council of Chief State School Officers’ (CCSSO) National Conference on Student Assessment and will present at the Society for Industrial & Organizational Psychology (SIOP) conference and the International Objective Measurement Workshop (IOMW).

Please review the research, listed in the newsletter, that Pearson research scientists will be presenting at these meetings. If you are interested in reading the conference papers, several are available under the conference reports tab on the Research & Resources page of the Pearson Assessment & Information website.


Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson

Wednesday, April 07, 2010

Performance-based Assessment Redux

Cycles in educational testing continue to repeat. The promotion and use of performance-based assessments is one such cycle. Performance-based assessment involves the observation of students performing authentic tasks in a domain. The assessments may be conducted in a more- or less-formal context. The performance may be live or may be reflected in artifacts such as essays or drawings. Generally, an explicit rubric is used to judge the quality of the performance.

An early phase of the performance-based assessment cycle was the move from the use of performance-based assessment to the use of multiple-choice tests as documented in Charles Odell’s 1928 book, Traditional Examinations and New-type Tests. The “traditional examinations” Odell referred to were performance-based assessments. The “new-type tests” Odell referred to were multiple-choice tests that were beginning to be widely adopted in education. These “new-type tests” were promoted as an improvement over the old performance-based examinations in efficiency and objectivity. However, Odell had doubts.

I am not old enough to remember the original movement from performance-based assessment to multiple-choice tests, but I am old enough to remember the performance-based assessment movement of the 1990s. As I remember it, performance-based assessment was promoted in reaction to the perceived impact of multiple-choice accountability tests on teaching. Critics worried that the use of multiple-choice tests in high-stakes accountability testing programs was influencing teachers to teach to the test, i.e., to focus on the content of the test rather than a broader curriculum. Teaching to the test would then lead to inflation of test scores that reflected rote memorization rather than learning in the broader curriculum domain. In contrast, performance-based testing was promoted as a solution that would lead to authentic student learning. Teachers who teach to a performance-based test would be teaching the actual performances that were the goals of the curriculum. An example of a testing program that attempted to incorporate performance-based assessment on a large scale was the Kentucky Instructional Results Information System.

It’s déjà vu all over again, as Yogi said, and I am living through another phase of the cycle. Currently, performance-based assessments are being promoted as a component of a balanced assessment system (Bulletin #11). Proponents claim that performance-based assessments administered by teachers in the classroom can provide both formative and summative information. As a source of formative information (Bulletin #5), the rich picture of student knowledge, skills, and abilities provided by performance-based assessment can be used by teachers to tailor instruction to address individual students’ needs. As a source of summative information, the scores collected by teachers using performance-based assessment can be combined with scores from large-scale standardized tests to provide a more balanced view of student achievement. In addition, proponents claim that performance-based assessments are able to assess 21st Century Skills whereas other assessment formats may not.

But current performance-based assessments still face the same technical challenges that they faced in the 1990s. A major technical challenge is achieving adequate score reliability. Variance attributable to both teachers’ ratings and task sampling may drive reliability unacceptably low for scores used for summative purposes.
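To make that concern concrete, a simplified generalizability-theory view (an illustrative framing, not a description of any particular program’s analysis) expresses the reliability of scores from a design in which each student (p) completes n_t tasks (t), each scored by n_r raters (r), as:

% Generalizability coefficient for a fully crossed person x task x rater design (illustrative)
% \sigma^2_p: true person variance; \sigma^2_{pt}, \sigma^2_{pr}, \sigma^2_{ptr,e}: rater- and task-related error components
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \frac{\sigma^2_{pt}}{n_t} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{ptr,e}}{n_t n_r}}

When the rating and task-sampling variance components are large and only a few tasks or raters are feasible, the coefficient drops, which is precisely the worry for summative uses of the scores.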

A second major challenge facing performance-based assessments is adequate evidence of validity. Remember that performance-based assessment scores are being asked to provide both formative and summative information. But validity evidence for formative assessment stresses consequences of test score use whereas validity evidence for summative assessment stresses more traditional sources of validity evidence.

A third major challenge facing performance-based assessments is the need for comparability of scores across administrations. In the past, the use of complex tasks and teacher judgments has made equating difficult.

Technology to the rescue! Technology can help address the many technical challenges facing performance-based assessment in the following ways:
  • Complex tasks and simulations can be presented in standardized formats using technology to improve standardization of administration and broaden task sampling;
  • Student responses can be objectively scored using artificial intelligence and computer algorithms to minimize unwanted variance in student scores (a minimal illustrative sketch follows this list);
  • Teacher training can be detailed and sustained using online tutorials so that teachers’ ratings are consistent within teachers across students and occasions, and across teachers; and,
  • Computers and hand-held devices can be used to collect teachers’ ratings across classrooms and across time so that scores can be collected without interrupting teaching and learning.
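As one small illustration of the second bullet, the sketch below shows the general shape of algorithmic scoring of constructed responses: a model is fit to responses that human raters have already scored and then applied to new responses. It is a toy example that assumes scikit-learn is available; the tiny data set and the TF-IDF-plus-ridge approach are hypothetical stand-ins, not a description of any operational scoring engine.

# Minimal sketch of algorithmic scoring of constructed responses (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical responses already scored by trained human raters on a 0-4 rubric.
train_responses = [
    "The plants with more sunlight grew taller because light drives photosynthesis.",
    "Plants need water.",
    "Sunlight powers photosynthesis, so the group with more light grew faster.",
    "I do not know.",
]
train_scores = [3, 1, 4, 0]

# Fit a simple text-features-to-score model: TF-IDF features followed by ridge regression.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(train_responses, train_scores)

# Score a new, unscored response and clamp the prediction to the rubric range.
new_response = ["More light meant more photosynthesis, so those plants grew more."]
predicted = model.predict(new_response)[0]
print(f"Predicted rubric score: {min(max(predicted, 0), 4):.1f}")

Operational scoring engines are far more sophisticated, but the pattern is the same: train against human-scored responses, then score new responses consistently and at scale.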

Save your dire prediction for others, George Santayana. We may not be doomed to repeat history, after all. Technology offers not just a response to our lessons from the past but a way to alter the future.

Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson