Many presentations at the annual CCSSO conference on Large Scale Assessment, taking place here in sunny San Antonio, reference the need to expand testing to include more formative assessments.

Pearson Educational Measurement has our own set of formative assessments known as PASeries. A white paper describing how PASeries was developed and how it might be used to improve student learning is available here at the conference, along with other information regarding PASeries, and both are quite popular. Check out PASeries and see for yourself.

## Monday, June 20, 2005

## Wednesday, June 15, 2005

### Models, Measurement and Learning

For years, the measurement community has debated which of a virtually limitless number of mathematical models is most appropriate for a given measurement activity. I remember the very heated discussion between Ben Wright and Ron Hambleton at an AERA/NCME conference not too long ago. Ben spoke of "objective measurement." Ron spoke of "representing the integrity" of measurement practitioners. Both sides had their points and the debate still continues in some circles.

In this age of "standards-referenced" assessment, the selection of a measurement model might not be merely an academic debate. Think for a minute about a standards-referenced test where six of the 60 items come from a particular content domain (say, Mathematical Operations). For the teacher, student and accountability folks, this means that 10% of the emphasis of the test is on operations (6/60 = 10 percent). However, measurement practitioners know that by selecting an IRT measurement model (other than Rasch), each of these items is weighted based on its slope parameter. Pattern scoring will then essentially guarantee that the weight assigned to mathematical operations is something other than 10%. This is because the contributions of these items to the total measure (the resulting theta value) will be weighted by the discrimination (slope) parameter. So, if the operations items do not do a good job of discriminating between high and low overall test scorers, these items are likely to contribute far less than expected to the total ability measure. While this might not be as big a deal when number-correct scoring is used, the effects of this weighting are still present in research (equating and scaling).
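To put rough numbers on the weighting argument, here is a minimal sketch. It assumes a two-parameter logistic (2PL) model, where the quantity that drives the pattern-scored theta estimate is the slope-weighted sum of item scores, so a domain's effective share of the test is its share of the total slope. The slope values and the helper name `domain_weight_share` are made up for illustration; they do not come from any real test.

```python
def domain_weight_share(slopes, domain_indices):
    """Share of total 2PL slope (discrimination) carried by one content domain.

    Under the 2PL model the theta estimate from pattern scoring depends on the
    responses only through sum(a_i * x_i), so slopes act as item weights.
    """
    total_slope = sum(slopes)
    domain_slope = sum(slopes[i] for i in domain_indices)
    return domain_slope / total_slope


# Hypothetical slopes: six weakly discriminating operations items,
# followed by 54 items from the other domains (illustrative values only).
operations_slopes = [0.5] * 6
other_slopes = [1.2] * 54
slopes = operations_slopes + other_slopes
operations_indices = range(6)

nominal_share = 6 / 60  # what the test blueprint promises
effective_share = domain_weight_share(slopes, operations_indices)

# With equal slopes (the Rasch case) the blueprint share is preserved.
rasch_share = domain_weight_share([1.0] * 60, operations_indices)

print(f"Blueprint share:      {nominal_share:.1%}")
print(f"Effective 2PL share:  {effective_share:.1%}")
print(f"Equal-slope share:    {rasch_share:.1%}")
```

With these invented slopes the operations domain carries well under half of its blueprint weight, while the equal-slope case reproduces the 10 percent exactly. Real item parameters would of course give a different number.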

I think we as measurement experts need to be cognizant of some distinctions when debating psychometric issues. First there is the psychometric or mathematical aspect: Which model fits the data better? Which is most defensible and practical? And which is most parsimonious? Often, I fear, psychometricians decide before seeing the data (with almost religious zeal) which model is "correct." The second aspect is one of instruction: Are we measuring students in the way their cognitive processes function? Are we controlling for irrelevant variance? And do our measures make sense in the context? Often, I think, psychometricians are too quick to compromise on measures without fully understanding constructs. Finally, we need to consider the learning aspect: Are we measuring what is being taught in the way it is being taught or are we doing something else? Are we as psychometricians measuring what is being taught, the stated curriculum, or something else (speededness for example)? Without considering these aspects, at a minimum, we are likely to argue for mathematical models that might not be helpful for our mission of improved student learning.

Just one man's opinion...

## Thursday, June 02, 2005

### Testing by Computer

For the K-12 arena, testing by computer is the future and the future is now. Many schools are ready for and welcome this use of technology. Most kids consider taking a test by computer to be much less involved than downloading the latest iPod tune. But some schools aren’t ready at all. And some kids haven’t used the computer much at all. So traditional testing isn’t going away quite yet. True multiple choice testing.

Not that some sticky issues aren’t raised when high-stakes testing programs are offered online. Professional testing standards are pretty clear that computer tests and paper tests must be shown to be comparable if they are to be given together. But how do you show this? Does every test administration become an experiment? The trick is to remain faithful to the standards without creating barriers to a natural innovation. Training, good customer support, and creative data analysis go a long way. Design a strong experiment if you can; collect the most relevant data possible if you can’t. The current literature on comparability is a little mixed. Some studies show some effects; others do not. Most can be criticized for some design flaw or another. Not to worry, though. The schools will soon make clear what the most popular choice is.
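As one small piece of what such a data analysis might look like, here is a sketch that computes a standardized mean difference (a pooled-standard-deviation effect size) between paper and computer scores. It assumes the two groups are randomly equivalent, which is exactly the design question raised above, and the scores and the function name `standardized_mean_difference` are invented for illustration. A real comparability study would look at much more than this one number.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(paper_scores, computer_scores):
    """Effect size: (computer mean - paper mean) / pooled standard deviation."""
    n_paper, n_computer = len(paper_scores), len(computer_scores)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_variance = (
        (n_paper - 1) * variance(paper_scores)
        + (n_computer - 1) * variance(computer_scores)
    ) / (n_paper + n_computer - 2)
    return (mean(computer_scores) - mean(paper_scores)) / sqrt(pooled_variance)


# Hypothetical raw scores from two randomly equivalent groups (made-up data).
paper_scores = [31, 35, 38, 40, 42, 45]
computer_scores = [30, 34, 37, 41, 42, 44]

mode_effect = standardized_mean_difference(paper_scores, computer_scores)
print(f"Standardized mode effect (computer - paper): {mode_effect:.3f}")
```

A value near zero is consistent with comparable modes, but it does not establish comparability on its own, and how small is small enough is a judgment call for the testing program.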
