Friday, February 25, 2011

Old Dogs Need New Tricks

As soon as I finished graduate school, my husband and I got two puppies, small Italian Greyhounds named Cyclone and Thunder. We named them after the mascots from our undergraduate universities. We took them everywhere with us and spoiled them rotten until the kids arrived. Once we had a toddler walking around the house, it was obvious that these old dogs needed to learn some new tricks. We had to teach them not to jump up on the kids or lick their faces. We had gotten small dogs because we weren’t going to have a lot of time to really train them, but now we knew we were going to have to be creative in teaching them that it really wasn’t okay to treat a toddler like a lollipop. It was going to be a challenge, but we were motivated.

Recently I was able to attend an event related to one of my challenges at work. The Center for K-12 Assessment and Performance Management at ETS hosted a research symposium on Through-Course Summative Assessments. Attendance at the conference was by invitation only, and each organization could send only one or two participants. I was excited to represent Pearson among such a distinguished group of researchers and thinkers. Through-course summative assessment poses some incredible challenges for traditional psychometrics, and I was eager to hear recommendations from leaders in the field on issues such as how to handle reliability, validity, scoring, scaling, and growth measures in these types of assessments. Instead, the eight papers generally focused on identifying issues and challenges, resulting in many recommendations for further research. Very few solutions were proposed, and many of those that were did not seem viable for large-scale testing. It was clear to me that, just like my Italian Greyhounds, we old psychometricians will need some new tricks.

Although most of the papers focused on somewhat technical measurement topics, the audience at the symposium was really a mix of technical and policy experts. The tension between those viewpoints was evident throughout the conference. As set out in the presentations of the initial policy context, the next-generation assessment designs proposed by the Partnership for Assessment of Readiness for College and Careers (PARCC) and by the SMARTER Balanced Assessment Consortium (SBAC)—complete with formative, interim, through-course, summative, and performance components—will be used to:

  • signal and model good instruction,
  • evaluate teachers and schools,
  • show student growth and status toward college and career readiness,
  • diagnose student strengths and weaknesses to aid instructional decision making,
  • be accessible to all student populations regardless of language proficiency or disability status, and
  • allow the United States to compete with other nations in a global economy.

That’s a tall order! It’s like trying to teach a puppy to sit, stay, roll over, and fetch at the same time.

The policy goals and several of the desired policy uses of the assessments are clear. What is not as clear is what psychometric models can be used to support these claims. It was mentioned more than once at the conference that if a test has too many purposes, it is unlikely that any purpose will be well met. I think it’s clear, however, that the new assessments will be used for all those purposes, and the assessment community must find a way to support them.

Too often the psychometric mantra has been “Just say no.” If you recall, that was the slogan of the anti-drug advertising campaigns of the 1980s and ’90s. It’s time to move into the 21st century. Assessments will be used for more than identifying how much grade-level content a student has mastered. We may not have originally developed assessments to be used for evaluating teachers, but they are used for this and will continue to be. In the same way, high school assessments will be used to predict readiness for college and careers. Policy makers are asking for our help to design, and provide validity evidence for, assessments that will serve a variety of purposes. No, the assessments may not have the same level of standardization and tight controls, but they still can be better than an alternative design that excludes psychometrics entirely.

There is already a mistrust of testing and an overload of data. Moving forward, we need to work with teachers, campus leaders, parents, and the community to better involve them in the testing process and particularly in the processes for reporting and interpreting test results. Tests should not be administered simply to check off a requirement. The data produced from assessments should inform instruction, student progress, campus development, and more. The assessments are not isolated events, but rather part of a larger educational system of instruction and assessment with the goal of preparing students for college and careers. This is a worthy goal. As a trained psychometrician, I also struggle with determining how far we can push the boundaries in meeting this goal before we’ve stepped over the line. If I bathe my kids in lemon juice to keep the dogs from licking them, have I gone too far? It may seem like a crazy idea, but I can’t ignore the need to think differently.

Indeed, the next-generation assessments, including through-course summative assessments, will provide new challenges and opportunities for psychometrics and research. The research, however, must be focused on solving the practical challenges that the assessment consortia will face. States are looking to us to be creative and propose solutions, not to develop a laundry list of problems. There is no perfect solution. Instead, psychometrics must take steps forward to present innovative assessment solutions that balance the competing priorities and bring us closer to the goal of improving education for all students. We must continue to research and use that research to refine and update the assessment systems.

As Stan Heffner, Associate Superintendent for the Center for Curriculum and Assessment at the Ohio Department of Education, discussed in his presentation, “This is a time to be propelled by ‘what should be’ instead of limited by ‘what is’.”

He was too polite to really say it, but I think he meant that old dogs need new tricks.

Katie McClarty
Senior Research Scientist
Test, Measurement & Research Services
Assessment and Information

Tuesday, February 08, 2011

Commonality in Assessment Solutions

Why isn’t it simple to develop common (off-the-shelf) solutions for processing assessments? I have come to realize that this industry does have some very subtle (and some not so subtle) complexities. Is all of this complexity really necessary?

I see two levels of complexity in assessment processing:

1. Developing, processing, scoring, and reporting assessments themselves, and

2. Handling the variability of data and business rules to process the data across programs.

The first level of complexity is mostly unavoidable. Variability at this level should be entirely driven by the assessment design, purpose, and types of items being used. The second level of complexity, however, is mostly avoidable if we can agree on common structures and rules for defining and processing data.

Recent activities in Race to the Top and Common Core State Standards are exploring the use of “open industry standards” and “open source” to drive assessment solutions design and development. To that I say, “Outstanding!” However, are we thinking too narrowly? The standardization efforts have initially focused on learning standards, assessment content, and assessment results standards – again, all good. But what about all the data and processes surrounding the assessments? We will continue to develop custom data, business rules, and processes for each implementation of the “common” assessments to deal with the second level of variability.

Much of this variability is not a result of “adding value” to the assessment design or the outcomes of the assessment but has more to do with accommodating existing data management systems and processes, or in some cases simple preferences. While most of this may fall into the category of “just being flexible,” there are thousands of examples of variability in assessment solutions. These extreme levels of variability (or flexibility) have directly contributed to a very high level of custom software development in our industry. For an industry that is very schedule-, cost-, and quality-sensitive, this reality seems counterintuitive.

Here is a (non-comprehensive) list of potential opportunities for commonality:

  • Matching rules and assignment of state student identifiers

  • District and school identifiers, structures, and data

  • Scoring rules such as best form, attemptedness, and rules for scoring open-ended items

  • Multiple test attempt processing and status codes

  • Scannable form design

  • Rules and processes for packaging and distributing testing materials

  • Report designs
If the efforts associated with Common Core could consider much greater commonality across implementations, vendors could provide more consistent “off-the-shelf” solutions for all states. The solutions also become highly reliable with less customized code, data, and processing rules being injected into every customer’s deployment. Unique custom solutions – as experience tells us – can be more costly and more prone to schedule and quality issues.

Assessment industry examples that illustrate the potential of common capabilities include the American Diploma Project (ADP), ACT and SAT, Iowa Test of Basic Skills (ITBS), and Stanford Achievement Tests (SAT10), among many other catalogue programs. These programs provide highly common data, processes, and reporting engines for all states or districts using them. Some variability is possible, but within well-defined constraints. While these programs have felt the pressure to customize, they have been able to retain many of their core data and capabilities. For example, all members of the ADP consortium agreed on demographic reporting categories for the Algebra assessment; therefore, each state collects and reports demographic data in the same way for the same assessment.

So now you might be thinking, “Talk about stifling innovation!” Actually, I am suggesting that we stifle variability only where it does not add value to the assessment design or outcome. Instead, we can invest all of that energy in innovation for assessment instruments and in interpreting results to improve teaching and learning.

As a final note, I have been careful to use the term “common” versus “standard.” It is possible to obtain commonality without formal standards – fully recognizing that commonality can also be achieved through the implementation of formal standards. All the standards development efforts in this industry are moving us closer to commonality, but are they timely enough and can the multiple standards converge?

Work will begin very soon on the Common Core assessment content and supporting platform designs, most likely in advance of the formal industry standards being available to support all activities. By working with the vendors who provide assessment services, it may be possible to make common all of those processes which do not directly contribute to improving teaching and learning, thus simplifying solutions, reducing costs, and allowing that money to be saved or diverted to high-value innovations.

Complementary or contradictory points of view are welcomed.

Wayne Ostler
Vice President Technology Strategy
Data Solutions
Assessment & Information