Monday, October 10, 2005

Are All Tests Valid?

I recently attended the William E. Coffman Lecture Series at the University of Iowa, which featured Dr. Linda Crocker, author and professor emeritus in educational psychology at the University of Florida. The title of her talk was “Developing Large-Scale Science Assessments Beneath Storm Clouds of Academic Controversy and Modern Culture Clashes,” which really means “Developing Large-Scale Science Assessments.” Linda has been a consistent proponent of content validity research, and her lecture was elegant, simple, and to the point (and, of course, based on content). You may recall a previous TrueScores post that reexamined the “content vs. consequential” validity arguments; it parallels the points made by Dr. Crocker, though she made a much more elegant and logical argument in three parts:

Assessments may survive many failings, but never a failing of quality content;

Evidence of “consequential validity” should not replace evidence of content validity; and

Bob Ebel was correct: psychometricians pay too little attention to test specifications.

First, we as a profession often engage in countless studies regarding equating, scaling, validity, reliability, and such. We perform field test analyses to help remove “flawed items” using statistical parameters. How often, however, do we suggest that “content must rule the day” and compromise our statistical criteria in order to get better measures of content? Should we? Second, do we place items on tests to “drive curriculum” (consequential validity evidence), or do we place items on an assessment to index attainment of content standards and benchmarks (content validity evidence)? If not for the sake of content validity, why not? Finally, how often have we done what Professor Ebel suggested with his notion of “duplicate construction experiments” (Ebel, R. L., 1961, Must All Tests be Valid? American Psychologist, 16, 640-647)? In other words, have we ever experimentally tested a table of specifications to see whether items constructed according to those specifications yield parallel measures? Why not?
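To make that last question concrete, here is a minimal sketch of the kind of check such an experiment might end with. It assumes two forms were built independently from the same table of specifications and given to the same examinees. The function names (`pearson`, `compare_forms`) and the scores are made up for illustration, and this is not Ebel's own procedure, only one simple way to look at whether two forms behave like parallel measures.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def compare_forms(form_a, form_b):
    """Summarize how closely two forms behave as parallel measures.

    Parallel forms should have similar means, similar spread,
    and a high correlation between examinees' scores.
    """
    return {
        "mean_diff": mean(form_a) - mean(form_b),
        "sd_ratio": pstdev(form_a) / pstdev(form_b),
        "correlation": pearson(form_a, form_b),
    }


# Hypothetical total scores for the same eight examinees on two forms
# written independently from one table of specifications (made-up data).
form_a = [12, 15, 9, 20, 17, 11, 14, 18]
form_b = [13, 14, 10, 19, 18, 10, 15, 17]

summary = compare_forms(form_a, form_b)
print(summary)
```

A mean difference near zero, a standard deviation ratio near one, and a high correlation would be consistent with the specifications doing their job. What counts as "near" or "high" is a judgment call that this sketch does not settle.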

It seems to me that we have taken “quality content” for granted. Doing so has added “fuel to the fire” regarding the perception that testing in general (and high-stakes testing in particular) is watered down and trivializes what is important. Perhaps I am wrong, but instead of spending hours writing technical manuals full of reliability data, item performance data, and scaling and equating data, we should spend at least as much time justifying the quality of the content on these assessments. Often we do, but do we do it consistently and accurately? I have seen 1,000-page technical manuals, but the documents justifying content always seem to be locked away or “on file.” Why?
