Thursday, October 27, 2005

True/False, Three of a Kind, Four or More?

I seldom publish my thoughts or reactions to research until after I have lived with, wrestled with, and clearly organized them. I have found that this helps protect my “incredibly average” intellect and keeps me from looking foolish. However, when it comes to the recent research of my good friend Dr. Michael Rodriguez (Three Options are Optimal for Multiple-Choice Items: A Meta-Analysis of 80 Years of Research. EM:IP, Summer 2005), I can’t help but share some incomplete thoughts.

Michael’s research is well conducted, well documented, well thought out, and clearly explained. Yet I still can’t help but think his conclusions are far stronger than his research evidence warrants. Forget about what you believe or don’t believe regarding meta-analytic research. Forget about the sorry test questions you have encountered in the past. Forget about Dr. Rodriguez being an expert in research regarding the construction of achievement tests. Instead, focus on the research presented, the evidence posted, and the conclusions made.

For example, consider “…the potential improvements in tests through the use of 3-option items enable the test developer and user to strengthen several aspects of validity-related arguments” (p. 4). This is strong stuff with several hidden implications. First is the assumption that a fourth or fifth option must not be contributing very much to the overall validity of the assessment. Perhaps, but how is this being controlled in the research? Second is the assumption that the time saved in moving to three options will allow for more test questions in the allotted time, leading to greater content coverage. I suspect the time savings are not nearly as large as the author anticipates. Third is the assumption that all distractors must function in an anticipated manner. For example, Dr. Rodriguez reviewed the literature and found that most definitions of functional distractors require certain levels of item-to-total correlation and a certain minimum level of endorsement (often at least five percent of the population), among other attributes. It is unlikely in a standards-referenced assessment that all of the content will support such distractor definitions. Ideally, as more of the standards and benchmarks are mastered, the proportion of the population choosing the incorrect response options (regardless of how many there are) will decrease, essentially destroying such operational definitions of “good distractors.”
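To make that operational definition concrete, here is a minimal sketch, in Python, of the kind of distractor check such definitions imply: flag an incorrect option as “functional” only if it draws at least five percent of examinees and its choice-total point-biserial correlation is negative. The function name, the toy response data, and the exact thresholds are my own illustrative assumptions, not Dr. Rodriguez’s procedure.

```python
import numpy as np

def distractor_report(responses, key, item, min_endorsement=0.05):
    """Summarize each incorrect option of one item against two common
    'functional distractor' criteria: it attracts at least min_endorsement
    of examinees, and its choice-total point-biserial is negative
    (lower scorers choose it more often)."""
    responses = np.asarray(responses)                 # (n_examinees, n_items) option labels
    key = np.asarray(key)
    scored = (responses == key).astype(float)         # 0/1 keyed-response matrix
    rest_score = scored.sum(axis=1) - scored[:, item] # total score excluding this item
    report = {}
    for option in np.unique(responses[:, item]):
        if option == key[item]:
            continue                                  # skip the keyed answer
        chose = (responses[:, item] == option).astype(float)
        endorsement = chose.mean()
        r = np.corrcoef(chose, rest_score)[0, 1] if 0 < endorsement < 1 else np.nan
        report[option] = {
            "endorsement": round(float(endorsement), 3),
            "point_biserial": None if np.isnan(r) else round(float(r), 3),
            "functional": bool(endorsement >= min_endorsement
                               and not np.isnan(r) and r < 0),
        }
    return report

# Hypothetical data: six examinees, three four-option items keyed B, A, D.
responses = [["B", "A", "D"],
             ["B", "A", "C"],
             ["A", "A", "D"],
             ["B", "C", "D"],
             ["C", "A", "B"],
             ["B", "B", "D"]]
key = ["B", "A", "D"]
print(distractor_report(responses, key, item=0))
```

Notice what happens in a standards-referenced setting where most examinees have mastered the content: endorsement of every distractor shrinks, and a check like this one flags nearly everything as non-functional, which is precisely the concern raised above.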

Finally, another concern I have with the strong conclusion that three options are optimal is that this was a meta-analytic study: all of the data came from existing assessments. I agree with the citation from Haladyna and Downing (Validity of a Taxonomy of Multiple-Choice Item-Writing Rules, Applied Measurement in Education, 2, 51-78) that the key is not the number of options but the quality of the options. Does the current research mean to imply that, given the lack of good distractors beyond three, three options are best? Or does it mean that, given five equally valuable distractors, three are best? If the former, would it not make more sense to write better test questions? If the latter, is controlled experimentation not required?

Please do not mistake my discussion for a criticism of the research. On the contrary, this research has motivated me to pay more attention to things I learned long ago and that I put into practice almost daily. This is exactly what research should do: generate discussion. I will continue to read and study the research, and perhaps, in the near future, you will see another post from me in this regard.

Monday, October 10, 2005

Are All Tests Valid?

I recently attended the William E. Coffman Lecture Series at the University of Iowa, which featured Dr. Linda Crocker, author and professor emerita of educational psychology at the University of Florida. The title of her talk was "Developing Large-Scale Science Assessments Beneath Storm Clouds of Academic Controversy and Modern Culture Clashes," which really means “Developing Large-Scale Science Assessments.” Linda has been a consistent proponent of content validity research, and her lecture was elegant, simple, and to the point—and, of course, based on content. You may recall a previous TrueScores post that reexamined the “content vs. consequential” validity arguments and parallels the points made by Dr. Crocker, though she made a much more elegant and logical argument in three parts:

Assessments may survive many failings, but never a failing of quality content;

Evidence of “consequential validity” should not replace evidence of content validity; and

Bob Ebel was correct: psychometricians pay too little attention to test specifications.


First, we as a profession often engage in countless studies regarding equating, scaling, validity, reliability, and such. We perform field test analyses to help remove “flawed items” using statistical parameters. How often, however, do we suggest that “content must rule the day” and compromise our statistical criteria in order to get better measures of content? Should we? Second, do we place items on tests to “drive curriculum” (consequential validity evidence), or do we place items on an assessment to index attainment of content standards and benchmarks (content validity evidence)? If not for content validity, why not? Finally, how often have we done what Professor Ebel suggested with his notion of “duplicate construction experiments” (Ebel, R. L., 1961, Must All Tests be Valid? American Psychologist, 16, 640-647)? In other words, have we ever experimentally tested a table of specifications to see if the items constructed according to these specifications yield parallel measures? Why not?
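For what a “duplicate construction experiment” might look like in practice, here is a minimal sketch assuming simulated scores: two forms built independently from the same table of specifications, administered to the same examinees, then compared on their means, spreads, and alternate-forms correlation. The simulated numbers below are placeholders standing in for real administrations; Ebel’s proposal is the design, not this particular code.

```python
import numpy as np

# Simulated stand-ins for scores on two forms that two teams built
# independently from the same table of specifications and that the
# same group of examinees then took.
rng = np.random.default_rng(0)
n_examinees = 500
true_ability = rng.normal(0.0, 1.0, n_examinees)
form_a = true_ability + rng.normal(0.0, 0.5, n_examinees)  # hypothetical Form A scores
form_b = true_ability + rng.normal(0.0, 0.5, n_examinees)  # hypothetical Form B scores

r_ab = np.corrcoef(form_a, form_b)[0, 1]
print(f"Form A: mean = {form_a.mean():.2f}, SD = {form_a.std(ddof=1):.2f}")
print(f"Form B: mean = {form_b.mean():.2f}, SD = {form_b.std(ddof=1):.2f}")
print(f"Alternate-forms correlation: {r_ab:.2f}")
```

If the table of specifications truly constrains construction, the two forms should show comparable means and standard deviations and a correlation approaching each form’s own reliability; large gaps would suggest the specifications under-determine the content being measured.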

It seems to me that we have taken “quality content” for granted. This has also added “fuel to the fire” regarding the perception that testing in general (and high-stakes testing in particular) is watered down and trivializes what is important. Perhaps I am wrong, but in addition to spending hours writing technical manuals full of reliability data, item performance data, and scaling and equating data, we should spend at least as much time justifying the quality of the content on these assessments. Often we do, but do we do it consistently and accurately? I have seen 1,000-page technical manuals, yet the documents justifying content always seem to be locked away or “on file.” Why?

Tuesday, October 04, 2005

University of Maryland Conference on Value Added Models

Last year, the University of Maryland held a conference on value-added assessment models that was timely, relevant, and informative for educators contemplating both growth modeling and value-added assessment. In fact, the conference presentations were collected into a book that memorializes the conference and the wisdom shared there (Conference Proceedings 2004), edited by the conference organizer, Dr. Robert Lissitz.

Because of the success of last year’s conference and the continued discussion regarding growth modeling, value-added assessment, and AYP (even the federal government is considering the use of growth models in NCLB accountability), the conference is on again. While I doubt many of you plan to attend, please see the conference agenda so you can keep up on the research associated with growth models (2005 Maryland Conference on Value Added Models), and look for the proceedings of this 2005 conference.