TrueScores: March 2010

Wednesday, March 10, 2010

Some Thoughts About Ratings…

I spend a lot of time thinking about ratings. One reason is that I’ve either assigned or been subjected to ratings many times during my life. For example, I review numerous research proposals and journal manuscripts each year, and I assign ratings that help determine whether the proposed project is funded or the manuscript is published. I have entered ratings for over 1,000 movies into my Netflix database, and in return, I receive recommendations for other movies that I might enjoy. My wife is a photographer, and one of my sons is an artist, and they enter competitions and receive ratings through that process with hopes of winning a prize. My family uses rating scales to help us decide what activities we’ll do together, so much so that my sons always ask me to define a one and a ten when I ask them to rate their preferences on a scale of one to ten.

In large-scale assessment contexts, the potential consequences associated with ratings are much more serious than these examples, so I’m surprised at the relatively limited amount of research that has been dedicated to studying the process and quality of those ratings over the last 20 years. While writing this, I leafed through a recent year of two measurement journals, and I found only three articles (out of over 60 published articles) relating to the analysis of ratings. When I’ve tried to conduct literature reviews on some topics relating to large-scale assessment ratings, I have found few, if any, journal articles. This dearth of research relating to ratings troubles me when I think about the gravity of some of the decisions that are made based on ratings in large-scale assessment contexts and the difficulty of obtaining highly reliable measures from ratings (not to mention the fact that scoring performance-based items is an expensive undertaking).

Even more troubling is the abandonment, by some, of the entire notion of using assessment formats that require ratings because of these difficulties. This is an unfortunate trend in large-scale assessment, because there are many areas of human performance that simply cannot be adequately measured with objectively scored items. The idea of evaluating written composition skills, speaking skills, artistic abilities, and athletic performance with a multiple-choice test seems downright silly. Yet, that’s what we would be doing if the sole objective of the measurement process were to obtain the most reliable measures. Clearly, in contexts like this, the authenticity of the measurement process is an important consideration, arguably as important as the reliability of the measures.

So, what kinds of research need to be done relating to the analysis of ratings in large-scale assessment contexts? There are numerous studies of psychometric models and statistical indices that can be utilized to scale ratings data and to identify rater effects. In fact, all three of the articles that I mentioned above focused on such applications. However, studies such as those do little to address the basic problems associated with ratings. For example, very few studies exist that examine the decision-making process that raters utilize when making rating decisions. There are also very few studies of the effectiveness of various processes for training raters in large-scale assessment projects. See these three Pearson research reports for examples of what I mean: Effects of Different Training and Scoring Approaches on Human Constructed Response Scoring; A Comparison of Training & Scoring in Distributed & Regional Contexts - Writing; and A Comparison of Training & Scoring in Distributed & Regional Contexts - Reading. Finally, there are almost no studies of the characteristics of raters that make them good candidates for large-scale assessment scoring projects. Yet most of the decisions that are made by those who run scoring projects turn on these three issues: Who should score, how should they be trained, and how should they score? It sure would be nice to make better progress toward answering these three questions over the next 20 years than we have during the past 20.

Edward W. Wolfe, Ph.D.
Senior Research Scientist
Assessment & Information
Pearson

Wednesday, March 03, 2010

An ATP Newbie Reflects…

I walked into the annual meeting of the Association of Test Publishers (ATP) opening shindig (appropriately Super Bowl-themed on 2/7/10 – congrats Saints!) and was struck by déjà vu. I eerily felt the same trepidation and bemusement as at my first educational conference back in 2000. Despite many years in assessment, I knew very few people. It was only later that I realized who the players were and that these were influential industry leaders: professors I had studied in college, textbook authors I was required to read, people I had observed giving presentations across the country, competitors and colleagues. It occurred to me that they had much in common with me and I began to relax.

The Opening Session introduced Scott Berkun, author of “The Myths of Innovation”, who challenged attendees -- What is innovation and how does it REALLY happen? I thought of Edison’s “Genius is 1% inspiration and 99% perspiration”. Berkun’s message (chapter 7 of his book) was: throughout history there were few “epiphany” (“ah ha”) moments -- more often, people tried ideas and doggedly pursued them until success was achieved. Failures are buried in the annals of time, like Roman architecture other than the Coliseum... Keep asking -- What is innovation and is this it?

I enjoyed sessions on innovative items in assessment. “Assessing the Hard Stuff with Innovative Items” covered approaches from Medical Examiners, Certified Public Accountants, Medical Sonographers and Architects. The simulation-rich examples and expanded item types (e.g., interactive tasks; expanded response options like drop down lists, forms/notes/orders, drawing/annotation tools; and interactive response options like hotspots and drag-and-drops) were interesting to consider. “Are You Ready for Innovative Items” was a how-to on considerations for implementing innovative items and really outlined the potential pitfalls in innovation. The first was more intellectually interesting but the second was a good overview for those of you new to innovative item formats.

The Education division meeting was another interesting event. As newly appointed Secretary, I was, to my surprise, asked to step into the Vice Chair role. WOW, nothing like a promotion when you attend your first conference -- or a foreshadowing of how much work we need to do as a group. Steve Lazer from ETS accepted the Chair role and Jim Brinton of Certification Management Services volunteered for Secretary. Now we have a full slate of officers ready to serve!

Despite our commitment to service, the Education division appears to suffer from an identity crisis. We discussed how to increase ATP membership and conference attendance but I failed to see the value proposition of membership for all groups. This is a trade association that should be working for us -- its members. I am puzzled and concerned by the discussion about the inability of state government entities (acting as publishers) to join -- since this is a trade organization. However, moving forward I hope to better understand the mission and goals of the Education division so I can help resolve this identity crisis!

Respectfully submitted,
Karen Squires Foelsch
VP, Content Support Services
(ATP neophyte and new TrueScores blogger)

Copyright © 2010 Pearson Education, Inc. or its affiliate(s). All rights reserved.