I spend a lot of time thinking about ratings. One reason I spend so much time thinking about ratings is that I’ve either assigned or been subjected to ratings many times during my life. For example, I review numerous research proposals and journal manuscripts each year, and I assign ratings that help determine whether the proposed project is funded or manuscript is published. I have entered ratings for over 1,000 movies into my Netflix database, and in return, I receive recommendations for other movies that I might enjoy. My wife is a photographer, and one of my sons is an artist, and they enter competitions and receive ratings through that process with hopes of winning a prize. My family uses rating scales to help us decide what activities we’ll do together—so much so that my sons always ask me to define a one and a ten when I ask them to rate their preferences on a scale of one to ten.
In large-scale assessment contexts, the potential consequences associated with ratings are much more serious than these examples, so I’m surprised at the relatively limited amount of research that has been dedicated to studying the process and quality of those ratings over the last 20 years. While writing this, I leafed through a recent year of two measurement journals, and I found only three articles (out of over 60 published articles) relating to the analysis of ratings. I’ve tried to conduct literature reviews on some topics relating to large-scale assessment ratings for which I have found few, if any, journal articles. This dearth of research relating to ratings troubles me when I think about the gravity of some of the decisions that are made based on ratings in large-scale assessment contexts and the difficulty of obtaining highly reliable measures from ratings (not to mention the fact that scoring performance-based items is an expensive undertaking).
Even more troubling is the abandonment, by some, of the entire notion of using assessment formats that require ratings because of these difficulties. This is an unfortunate trend in large-scale assessment, because there are many areas of human performance that simply cannot be adequately measured with objectively scored items. The idea of evaluating written compositions skills, speaking skills, artistic abilities, and athletic performance with a multiple-choice test seems downright silly. Yet, that’s what we would be doing if the objective of the measurement process was to obtain the most reliable measures. Clearly, in contexts like this, the authenticity of the measurement process is an important consideration—arguably as important as the reliability of the measures.
So, what kinds of research need to be done relating to the analysis of ratings in large-scale assessment contexts? There are numerous studies of psychometric models and statistical indices that can be utilized to scale ratings data and to identify rater effects. In fact, all three of the articles that I mentioned above focused on such applications. However, studies such as those do little to contribute to the basic problems associated with ratings. For example, very few studies exist that examine the decision making process that raters utilize when making rating decisions. There are also very few studies of the effectiveness of various processes for training raters in large-scale assessment projects—see these three Pearson research reports for examples of what I mean: Effects of Different Training and Scoring Approaches on Human Constructed Response Scoring, A Comparison of Training & Scoring in Distributed & Regional Contexts - Writing , A Comparison of Training & Scoring in Distributed & Regional Contexts - Reading. Finally, there are almost no studies of the characteristics of raters that make them good candidates for large-scale assessment scoring projects. Yet, the basis of most of the decisions that are made by those who run scoring projects focus on these three issues: Who should score, how should they be trained, and how should they score? It sure would be nice to make better progress toward answering these three questions over the next 20 years than we have during the past 20.
Edward W. Wolfe, Ph.D.
Senior Research Scientist
Assessment & Information