Monday, January 28, 2008

If He is Correct, Then I Must Have Been Wrrrrrrrong(?)

I have had the occasion to know Dr. James Popham for many years and in many contexts. You might recall that Dr. Popham was the keynote speaker at the last ACT-CASMA conference and that I had devoted some space to the conference in a previous post.

Now please understand, Dr. Popham has worked in measurement for many years and describes himself as a "reformed test builder," presumably implying some sort of 12-step program. Despite this, or at least as a prelude to this, Dr. Popham has been very influential in assessment. He was an expert witness in the landmark "Debra P." case in Florida, and was involved in the early days of teacher certification in Texas and elsewhere. He is also the author of numerous publications.

Over the years I have listened to Jim say some outrageous things. For those of you who know Jim, this is no surprise. He is quite a presenter and, I suspect, basks a little too much in the glow of his own outrageousness. However, many of the things I have heard him say (at the Florida Educational Research Association (FERA) meeting, for example) were just plain incorrect. I won't bother you with the specifics as I am sure Dr. Popham would claim he is correct. Yet, it does put me in a quandary. Despite his recent statements, I actually have to agree with what Dr. Popham said at the ACT-CASMA conference back in November.

Jim's theme—one he has articulated in multiple venues—regarded what he calls "instructional sensitivity." Here are the basic tenets of his argument:

"A cornerstone of test-based educational accountability:
Higher scores indicate effective instruction; lower scores indicate the opposite."

"Almost all of today's accountability tests are unable to ascertain instructional quality. That is, they are instructionally insensitive."

"If accountability tests can't distinguish among varied levels of instructional quality, then schools and districts are inaccurately evaluated, and bad educational things happen in classrooms."

I keep returning to this theme. While I make a living building assessments of all types, recently most of my efforts and those of my colleagues have been with assessments supporting NCLB, which are "instructionally insensitive" according to Dr. Popham. It is hard to believe that any assessment that asks three or four questions regarding a specific aspect of the content standards or benchmarks (and by the way does so only once a year) can be very sensitive to changes in student behavior due to instruction on that content. At the same time, having some experience teaching, testing, and improving student learning, I have seen the power that measures just like these have for teachers who know what to do with the data and have a plan to improve instruction.

Hence my dilemma: why do I keep returning to Dr. Popham's argument? While I am not ready to admit I might have been wrong to dismiss Jim as a "reformed test builder" and to ignore his rants, I do admit he has a valid point to some extent regarding instructional sensitivity. I suppose I would have called his argument "the instructional insensitivity of large-scale assessments," but who am I to quibble with vocabulary?

Dr. W. James Popham, Professor Emeritus from UCLA, welcomes all "suggestions, observations, or castigations regarding this topic...." Contact him at Or send an email to, and I will forward it to him.

Friday, January 18, 2008

IQ and the Flynn Effect

Back in the 1980s when I worked on the development of the Wechsler Intelligence Scale for Children, Third Edition (WISC-III), I was fascinated with a process commonly referred to at the time as "continuous norming." Applied by Dr. Gale Roid as developed by Professor Richard Gorsuch, continuous norming was a slick way to improve the precision of empirical norms. While things seemed to get in the way of any in-depth analysis of the procedure, and while I did stay in contact with Professor Gorsuch occasionally, I did nothing to understand or apply the process anew and simply moved on.

Over the winter holidays, I was reading The New Yorker (yes, even people who live in Iowa read The New Yorker) and discovered, much to my surprise, a story about IQ written by Malcolm Gladwell, titled "None of the Above: What IQ doesn’t tell you about race" (December 17, 2007, pp. 92-96). As you may recall, Malcolm Gladwell is the author of both The Tipping Point and Blink. Both books interested me, so I read what he had to say about IQ.

Gladwell references something he (and apparently others) calls “The Flynn Effect.” The Flynn Effect comes from James Flynn, author of What is Intelligence?, and is essentially the term used to describe what Flynn claims to have discovered: that all humans are getting smarter. As Gladwell points out, Flynn looked at years of IQ assessment data from all over the world and concluded that humans gain three IQ points per decade. Gladwell then tries to put this in context. For example, if Americans' average IQ in 2000 was 100, then in 1990 it was 97, in 1980 it was 94, in 1970 it was 91, and so on. If true, this implies that my grandfather (and yours) was a “dull normal” at best, but was most likely mentally retarded. Flynn claims that this is due more to the way we measure intelligence than anything else. He states, as Gladwell points out:

“An IQ, in other words, measures not so much how smart we are as how modern we are.”

For example, when members of the Kpelle tribe in Liberia were asked to associate objects such as a potato and a knife, they linked them together according to function. As Gladwell points out, after all, you use a knife to cut a potato. Most IQ assessments would expect the potato to be linked to other vegetables and the knife to be linked to other tools. Flynn claims modern culture has “taught” us to think in the way the IQ assessment measures and, while this is different from how the Kpelle thought, there is no reason to believe that their thinking represents anything less intelligent.

Gladwell then tries to articulate the point that Flynn makes regarding intelligence test norms. He observes that if the center of each new edition of the WISC is 100, and everyone is getting smarter by three IQ points per decade, then each subsequent form of the WISC (the first WISC was standardized in the 1940s) must be getting harder. Very interesting. I need to dig up references on the “continuous norming” process used for the WISC and see what impact, if any, such a process might have on "The Flynn Effect."
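The arithmetic behind Gladwell's example can be sketched in a few lines of Python. This is only an illustration under a strong assumption, namely that the gain is exactly linear at three points per decade. The function name, the default reference year of 2000, and the default mean of 100 are hypothetical choices made for the example, not anything Flynn or Gladwell specify.

```python
def mean_iq_on_reference_norms(year, reference_year=2000,
                               reference_mean=100.0, gain_per_decade=3.0):
    """Project the average IQ of people tested in `year`, scored against
    norms centered at `reference_mean` in `reference_year`.

    Assumption (for illustration only): a constant, linear gain of
    `gain_per_decade` points every ten years.
    """
    decades_elapsed = (year - reference_year) / 10.0
    return reference_mean + gain_per_decade * decades_elapsed


# Reproduce the sequence quoted from Gladwell: 100, 97, 94, 91.
for year in (2000, 1990, 1980, 1970):
    print(year, mean_iq_on_reference_norms(year))
```

On these assumptions, a group that averaged 100 against norms set in 1950 would average about 85 against norms set in 2000, which is the sense in which each re-centered edition must be getting harder.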
