TrueScores: 2005

Wednesday, December 07, 2005

What If All Children Can’t Learn?

The fundamental assumptions of NCLB are: (1) all children can learn, (2) all children can learn to the same “rigorous” academic achievement standard, and (3) all students can learn to the same standard in the same arbitrary timeline.

I recently attended the CASMA conference held at ACT in Iowa City. One of the important topics was “efficacy” of K-12 testing and NCLB. During this discussion, Bob Linn provided the “downside” of NCLB testing, while a “rejoinder” was offered by Kerri Briggs of the USDOE. Dr. Linn presented a slide show regarding AYP as required by NCLB and “six major problems of the NCLB accountability system.” Ms. Briggs provided defense of NCLB. Most interesting, at least from my seat, were not the points raised by Dr. Linn or Ms. Briggs, but the fact that no one was talking about or debating the fundamental premise of NCLB that all children can learn. In fact, I seemed to be the only one on the discussant panel who raised the issue. As such, I wanted to repeat here.

First, I have pondered long and hard about the first premise—all students can learn. I have pretty much resolved this issue in my own mind, and I do believe all students can learn—something—given enough time. I have doubts (grave doubts) that all students can learn to the same “rigorous” performance standard, and I know they will not be able to do this on the same arbitrary timeline.

That said, I find two additional things curious about these fundamental assumptions of NCLB. It would be hard to win a debate opposing the first assumption. How could you NOT believe all students can learn? I can see the discussion now: “You mean you call yourself an educator, but you admit you cannot teach ALL of the children?” To put it another way, is it O.K. to leave say one child behind? If not, then we MUST be able to teach all students.

The second thing I find curious would be the logical conclusion that NCLB denies the existence of individual differences. I see individual differences all the time: differences in motivation, differences in levels of preparation, and differences in levels of achievement. If all students can learn to the same rigorous academic standards in the same time frame, why are there still noticeable differences in student achievement? What if all children can’t learn under these assumptions?

Thursday, October 27, 2005

True/False, Three of a Kind, Four or More?

I seldom publish my thoughts or reactions to research until after I have lived with, wrestled with, and clearly organized them. I have found that this helps protect my “incredibly average” intellect and keeps me from looking foolish. However, when it comes to the recent research of my good friend Dr. Michael Rodriguez (Three Options are Optimal for Multiple-Choice Items: A Meta-Analysis of 80 Years of Research. EM:IP, Summer 2005), I can’t help but share some incomplete thoughts.

Michael’s research is well conducted, well documented, well thought out, and clearly explained. Yet, I still can’t help but think his conclusions are far too strong based on his research evidence. Forget about what you believe or don’t believe regarding meta-analytic research. Forget about the sorry test questions you have encountered in the past. Forget about Dr. Rodriguez being an expert in research regarding the construction of achievement tests. Instead, focus on the research presented, the evidence posted, and the conclusions made.

For example, consider “…the potential improvements in tests through the use of 3-option items enable the test developer and user to strengthen several aspects of validity-related arguments.” (pg. 4). This is some strong stuff and has several hidden implications. First, is the assumption that a fourth or fifth option must not be contributing very much to the overall validity of the assessment. Perhaps—but how is this being controlled in the research? Second is the assumption that the time saved in moving to three options will allow for more test questions in the allotted time, leading to more content coverage. I speculate that there might not be nearly the time savings the author thinks. Third is the assumption that all distractors must function in an anticipated manner. For example, Dr. Rodriguez reviewed the literature and found that most definitions of functional distractors required certain levels of item-to-total correlation and a certain minimum level of endorsement (often at least five percent of the population), among other attributes. It is unlikely in a standards-referenced assessment that all of the content will allow such distractor definitions. Hopefully, as more of the standards and benchmarks are mastered, more and more of the population choosing the incorrect response options (regardless of how many) will decrease, essentially destroying such operational definitions of “good distractors.”

Finally, another area of concern I have with the strong conclusion that three options are optimal is the fact that this was a meta-analytic study. As such, all of the data came from existing assessments. I agree with the citation from Haladyna and Downing (Validity of a Taxonomy of Multiple-choice Item Writing Rules. APM, 2, 51-78) stating that the key is not the number of options, but the quality of options. As such, does the current research mean to imply that given the lack of good distractors beyond three that three distractors are best? Or, does it mean that given five equally valuable distractors that three are best? If the former, would it not make more sense to write better test questions? If the later, then is not controlled experimentation required?

Please do not mistake my discussion as a criticism of the research. On the contrary, this research has motivated me to pay more attention to things I have learned long ago and that I put into practice almost daily. This is exactly what research should do, generate discussion. I will continue to read and study the research and perhaps, in the near future, you will see another post from me in this regard.

Monday, October 10, 2005

Are All Tests Valid?

I recently attended the William E. Coffman Lecture Series at the University of Iowa that featured Dr. Linda Crocker author and professor emeritus in educational psychology at the University of Florida. The title of her talk was "Developing Large-Scale Science Assessments Beneath Storm Clouds of Academic Controversy and Modern Culture Clashes," which really means “Developing Large Scale Science Assessments.” Linda has been a consistent proponent of content validity research and her lecture was elegant, simple, and to the point—and of course, based on content. You may recall a previous TrueScores post that reexamined the “content vs. consequential” validity arguments that parallel the points made by Dr. Crocker, though she made a much more elegant and logical argument in three parts:

Assessments may survive many failings, but never a failing of quality content;

Evidence of “consequential validity” should not replace evidence of content validity; and

Bob Ebel was correct: psychometricians pay too little attention to test specifications.

First, we as a profession often engage in countless studies regarding equating, scaling, validity, reliability, and such. We perform field test analyses to help remove “flawed items” using statistical parameters. How often, however, do we suggest that “content must rule the day” and compromise our statistical criteria in order to get better measures of content? Should we? Second, do we place items on tests to “drive curriculum” (consequential validity evidence), or do we place items on an assessment to index attainment of content standards and benchmarks (content validity evidence)? If not for content validity, why not? Finally, how often have we done what Professor Ebel suggested with his notion of “duplicate construction experiments” (Ebel, R. L., 1961, Must All Tests be Valid? American Psychologist, 16, 640-647)? In other words, have we ever experimentally tested a table of specifications to see if the items constructed according to these specifications yield parallel measures? Why not?

It seems to me that we have taken “quality content” for granted. This has also added “fuel to the fire” regarding the perception of testing in general (and high stakes testing in particular) being watered down and trivializing what is important. Perhaps I am wrong, but instead of spending hours writing technical manuals full of reliability data, item performance data, scaling and equating data, we should spend at least as much time justifying the quality of the content on these assessments. Often we do, but do we do it consistently and accurately? I have seen 1,000-page technical manuals, but such documents justifying content always seem to be locked away or “on file.” Why?

Tuesday, October 04, 2005

University of Maryland Conference on Value Added Models

Last year, the University of Maryland held a conference on Value Added assessment models which was timely, relevant and informative for educators contemplating both growth modeling as well as value added assessment. In fact, the conference presentations were collected together and put into a book to memorialize the conference as well as the shared wisdom (Conference Proceedings 2004) as edited by the conference organizer Dr. Robert Lissitz.

Because of the success of the conference last year and due to the continued discussion regarding growth modeling, value added assessment and AYP (even the FED are considering the use of growth models in NCLB accountability), the conference is on again. While I doubt many of you will plan to attend, please see the conference agenda so you can keep up on the research associated with growth models (2005 Maryland Conference on Value Added Models) and look for the proceedings of this 2005 conference.

Monday, September 12, 2005

In The News: Electronic Testing

I read with interest the claims made in Oregon recently that online testing actually caused a raise in student test scores: Online Testing Helped Raise Scores. “Oregon students of all ages showed across-the-board improvements on state tests in core subjects....” Good for Oregon and good for electronic testing! I am a big believer in online assessment. Further reading, however, caused me to pause. I began wondering about the security of the online assessments in Oregon. An Oregon official is quoted as claiming, “Web-based testing is more secure…” and “…students and teachers get immediate results…delivered automatically....” All of this raised concerns about security. Surely, someone was looking at the results before they were returned? What evidence does Oregon offer to support their claims of security? Apparently, teachers had multiple opportunities to test their students “throughout the school year!” “Often, teachers will space out the testing periods over a school year and use early test results to evaluate which material a student needs to focus on in order to reach grade-level proficiency.” Does this mean the forms are exposed the entire school year?

Being a skeptical reader, as you should be, and looking for evidence to support their claims, I went to the Oregon Department of Education website for information regarding the security of the system and perhaps some technical information. After nearly 30 minutes I gave up. It is not that I don’t believe press releases, newspapers, or other non-peer reviewed information—well, I guess it is because I don’t—but it is not that I don’t believe Oregon educators. I just wish we could have some factual evidence associated with such claims. Only this way will we truly be able to judge for ourselves what works and what does not.

Tuesday, August 23, 2005

Reading IS Fundamental

The Second Annual Lexile© National Reading Conference is in the books...no pun intended. I attended, as I did last year, and was once again impressed by what I discovered. Sure, Greg Cizek's presentation about "Testing Myths" was enjoyable and informative. My presentation criticizing NCLB not allowing "off level" reading assessment was novel if not interesting (though it was well attended). Quality Quinn, Lou Fabrizio and Malbert Smith all provided very informative and instructional presentations. All of these were worth the price of admission alone. However, what impressed me the most was the desire of the attendees to read! Teachers were buying books, with their own money, to give to "troubled readers" in their classrooms. Malbert Smith talked about that "parasite" we have in our homes, the television, that robs us of intellect. Reading teachers agreed that the best way to teach reading was to get children to read. Assessment developers understood the needs of the reading specialists! In short, it was utopia. Well, short of utopia, it was very exciting to see people paying attention to reading and reading instruction. I hope you can attend next year, but in the mean time...pay attention to reading.

Thursday, August 04, 2005

Is It Fact or Process?

A friend of mine and I were recently discussing some aspect of mathematics instruction. I wanted to talk about the "number line" and he wanted to talk about "math facts". Perhaps I was being a bit ornery (quite contrary actually), but I looked at him with a blank stare and asked what he meant by "facts." He said, "You know...facts, like two times two is equal to four." Since I started down this path, I continued. So I replied, "Well, actually, two times two is really a concept. The concept of the successive addition of two for a total of two cycles." He became quite agitated and said, "No! It is a fact, you either know it or you don't." That is when I drew a matrix of 1-9 across the top of a piece of paper and 1-9 down the side and showed him how this matrix provided the "facts" he claimed without really "knowing" anything (other than how to draw the matrix). My friend then noticed that I was trying to teach him about process and he was trying to teach me about facts, and that we were getting no where fast. As such, he changed the topic to history, which I am sure he thought was a safe subject. "Math is no different than history," he said. "It's all about knowing the facts and sequencing them correctly." I said, "Really? Then if you list the League of Nations before the United Nations on some timeline you have demonstrated knowledge of history?" My friend was skeptical (and annoyed) and did not answer. I told him that in reality, it might very well be important to understand the impact the League of Nations had on the development of the United Nations if you were going to use history to help understand current events and/or future events.

At this point we decided to end the conversation before anyone got really mad. In departing, he did take one last shot. He said, "It's just like with testing...all you have to do is figure out if the kids know the facts." And I asked him, "Perhaps, but what process do you want to use? Multiple-choice, short answer, essay...?"

Perhaps the next time I see my friend we will talk religion...it will likely lead to a simpler discussion!

Monday, July 25, 2005

The Consequences of Consequential Validity

Recently I was in a meeting with some of our leaders in educational measurement when a discussion regarding how to present consequential validity evidence ensued. Having received most of my instruction, readings and otherwise general nurturing from several "disciples" of Bob Ebel, I was fascinated. I recalled the 1997 Educational Measurement: Issues and Practice volume (Vol. 16, No. 2, Summer) dedicated to this topic. An oversimplification of the debate, both then and now, seems to be around the definition of consequential validity itself. Bill Mehrens, in his 1997 article, articulates:

"I suggest that the psychometric community narrow the use of the term validity rather than expand it. Let us reserve the term for determining the accuracy of inferences about (and understanding of) the characteristic being assessed, not the efficacy of actions following assessment."

Professor Mehrens continues:

"The consequences of a particular use do not necessarily inform us regarding either the meaning of a construct or the adequacy of a particular assessment process in measuring that construct."

This group of measurement experts, to which I referred in the opening, debated such statements (and others) for quite some time. Having no resolution, but acknowledging that the consequences of the use of scores resulting from an assessment are important, they recommended that we follow the Standards regarding presentation of consequential validity evidence.

For the life of me, I can find no reference to general "consequential validity" in the Standards at all. I can find some references to "unintended consequences," but the notion of general consequences for test score use is not specifically addressed. Perhaps, even if it is only a memorialization of the debate, the next edition of the Standards will include it.

Monday, June 20, 2005

Formative Assessments Link Measurement and Instruction

Many presentations at the annual CCSSO conference on Large Scale Assessment, taking place here in sunny San Antonio, reference the need to expand testing to include more formative assessments.

Pearson Educational Measurement has our own set of formative assessments know as PASeries. A white paper describing how PASeries was developed and how it might be used to improve student learning, as well as other information regarding PASeries, are available here at the conference and are quite popular. Check out PASeries and see for yourself.

Wednesday, June 15, 2005

Models, Measurement and Learning

For years, the measurement community has debated which of a virtually limitless number of mathematical models is most appropriate for a given measurement activity. I remember the very heated discussion between Ben Wright and Ron Hambleton at an AERA/NCME conference not too long ago. Ben spoke of "objective measurement." Ron spoke of "representing the integrity" of measurement practitioners. Both sides had their points and the debate still continues in some circles.

In this age of "standards referenced" assessment, the selection of a measurement model might not be an academic debate only. Think for a minute about a standards referenced test where six of the 60 items come from a particular content domain (say Mathematical Operations). For the teacher, student and accountability folks, this means that 10% of the emphasis of the test is on operations (6/60 = 10 percent). However, measurement practitioners know that by selecting an IRT measurement model (other than Rasch), each of these items are weighted based on the slope parameter. Pattern scoring will then essentially guarantee that 10% of the test is not assigned to mathematical operations. This is because the contributions of these items to the total measure (the resulting theta value) will be weighted by the discrimination (slope) parameter. So, if the operation items do not do a good job in discriminating between high and low overall test scorers, these items are likely to contribute far less than expected to the total ability measure. While this might not be as big a deal when number correct scoring is used, the effects of this weighting are still present in research (equating and scaling).

I think we as measurement experts need to be cognizant of some distinctions when debating psychometric issues. First there is the psychometric or mathematical aspect: Which model fits the data better? Which is most defensible and practical? And which is most parsimonious? Often, I fear, psychometricians decide before seeing the data (with almost religious zeal) which model is "correct." The second aspect is one of instruction: Are we measuring students in the way their cognitive processes function? Are we controlling for irrelevant variance? And do our measures make sense in the context? Often, I think, psychometricians are too quick to compromise on measures without fully understanding constructs. Finally, we need to consider the learning aspect: Are we measuring what is being taught in the way it is being taught or are we doing something else? Are we as psychometricians measuring what is being taught, the stated curriculum, or something else (speededness for example)? Without considering these aspects, at a minimum, we are likely to argue for mathematical models that might not be helpful for our mission of improved student learning.

Just one man's opinion...

Wednesday, May 25, 2005

Those Pesky Performance Standards

I am never failed to be amazed by the discussions my colleagues and I engage in regarding what psychometricians call "standard setting". The essence of standard setting is to determine "how much is enough" regarding the performance on some measure, and to do so in a less than capricious manner (still arbitrary, but not capricious).

Nevertheless, rooms filled with content experts, testing experts, psychometricians (some of whom are experts), standard setting experts, and others engage in countless banter regarding how to plan for, control, and analyze the data resulting from (or going into) a standard setting as if the data was anything less than an arbitrary (though often not capricious) judgment.

Perhaps I am finally too old to enjoy such arbitrary distinctions anymore. Understand that I am not saying that standard setting is not important, that the established procedures should not be used or that we should not carefully plan and implement the standard setting in the best way possible following standards of best practice. I think all of this should be done. I am just not sure all of the research and rhetoric using the results or outcomes of such judgmental procedures are worth the efforts they require to discuss.

One person's opinion...of course.

Friday, May 20, 2005

Life Long Learning

Over the years, I can recall various conversations regarding student growth, preparedness and remediation. They go something like the following:

First Grade Teacher: These kids have no social skills at all. Why can't the parents do more to get their kids ready for school?

Third Grade Teacher: These kids don't know the alphabet or their math facts. Why can' t the earlier grade teachers do more?

High School Teacher: These young people don't have any of the prerequisite math skills. Why can't the middle school teachers do more?

College Instructor: Half of our entering freshmen are in remediation. Why can't the high school teachers do more?

Educators and the public alike often talk about a K-16 or K-20 system of education in this country. In fact, just last week a retired professor of mine talked about being a "life long student" and how the biggest pleasure he gets in life is the fun in finding things out. Yet, our educational systems are quick to "pass the blame" onto what has gone before. It seems to me that a more integrated system of learning, including measurement of skills from K-20 might make it easier to debunk (or at least put into perspective) the gaps students have in their pre-requisite skills as they move from kindergarten to college.

One interesting step in this area is the use of "college readiness" indicators as part of the state mandated assessment system. Texas has recently required such an indicator.

Preliminary results of the research supporting this effort (as conducted by Pearson Educational Measurement in coordination with the Texas Education Agency) is also presented.

Tuesday, May 03, 2005

How It All Started

T = X - E

Recall that one of the fundamental derivations of "strong true score" or "classical" measurement theory is that an examinee's unknown and unseen "true score" (T) is really their observed score (X) on an assessment minus error (E). Since the development of this concept (and even before) measurement practitioners and theorists to boot have been trying to estimate a student's true score with greater and greater precision. This maximization effort typically focuses on ways to partition the error (i.e., to better understand what is causing error) and ultimately reduce it such that observed student performance is a better indicator of underlying achievement or ability.

So, what does all this have to do with the TrueScores blog? Only in that it serves to mention that Pearson Educational Measurement has recently expanded our research efforts and intend to use the TrueScores blog as one of the forums for dissemination and debate. The last thing the world needs is another forum for a pompous psychometrician to pontificate about how the world would be a better place if y'all would only buy their solution. To this end, the TrueScores blog is dedicated to honest, respectful, scientifically based and open debate about the "hot" topics in today's measurement world. Some of these topics include:

Establishing comparability between paper-and-pencil assessments and their online or electronic counterparts.
Automated essay scoring: Is it practical, reliable and valid?
Is Computer Adaptive Testing (CAT) a potential solution for the age old question of testing time versus instructional time?

Background information related to these topics can be found at our web site on the research pages. Additional publications related to a host of topics in educational measurement can also be found at our web site. Future research will be added periodically and we will use this blog to communicate these additions.

We will be updating this blog so that a new discussion topic will be posted regularly. This will add to the debate shaping our educational policy and will provide practical and applied insights into not only classical measurement but other aspects of educational measurement including Item Response Theory, Growth Modeling (Value Added Models), Equating, Scaling and legal defensibility. As such, we hope you return.

In the mean time, if you have questions about Pearson Educational Measurement, or our parent company Pearson Education, start by visiting our home page.

TrueScores