Wednesday, December 07, 2005

What If All Children Can’t Learn?

The fundamental assumptions of NCLB are: (1) all children can learn, (2) all children can learn to the same “rigorous” academic achievement standard, and (3) all students can learn to the same standard in the same arbitrary timeline.

I recently attended the CASMA conference held at ACT in Iowa City. One of the important topics was the “efficacy” of K-12 testing and NCLB. During this discussion, Bob Linn provided the “downside” of NCLB testing, while a “rejoinder” was offered by Kerri Briggs of the USDOE. Dr. Linn presented a slide show regarding AYP as required by NCLB and “six major problems of the NCLB accountability system.” Ms. Briggs provided a defense of NCLB. Most interesting, at least from my seat, were not the points raised by Dr. Linn or Ms. Briggs, but the fact that no one was talking about or debating the fundamental premise of NCLB that all children can learn. In fact, I seemed to be the only one on the discussant panel who raised the issue. As such, I wanted to repeat it here.

First, I have pondered long and hard about the first premise—all students can learn. I have pretty much resolved this issue in my own mind, and I do believe all students can learn—something—given enough time. I have doubts (grave doubts) that all students can learn to the same “rigorous” performance standard, and I know they will not be able to do this on the same arbitrary timeline.

That said, I find two additional things curious about these fundamental assumptions of NCLB. It would be hard to win a debate opposing the first assumption. How could you NOT believe all students can learn? I can see the discussion now: “You mean you call yourself an educator, but you admit you cannot teach ALL of the children?” To put it another way, is it O.K. to leave, say, one child behind? If not, then we MUST be able to teach all students.

The second thing I find curious would be the logical conclusion that NCLB denies the existence of individual differences. I see individual differences all the time: differences in motivation, differences in levels of preparation, and differences in levels of achievement. If all students can learn to the same rigorous academic standards in the same time frame, why are there still noticeable differences in student achievement? What if all children can’t learn under these assumptions?

Thursday, October 27, 2005

True/False, Three of a Kind, Four or More?

I seldom publish my thoughts or reactions to research until after I have lived with, wrestled with, and clearly organized them. I have found that this helps protect my “incredibly average” intellect and keeps me from looking foolish. However, when it comes to the recent research of my good friend Dr. Michael Rodriguez (Three Options are Optimal for Multiple-Choice Items: A Meta-Analysis of 80 Years of Research. EM:IP, Summer 2005), I can’t help but share some incomplete thoughts.

Michael’s research is well conducted, well documented, well thought out, and clearly explained. Yet, I still can’t help but think his conclusions are far too strong based on his research evidence. Forget about what you believe or don’t believe regarding meta-analytic research. Forget about the sorry test questions you have encountered in the past. Forget about Dr. Rodriguez being an expert in research regarding the construction of achievement tests. Instead, focus on the research presented, the evidence posted, and the conclusions made.

For example, consider “…the potential improvements in tests through the use of 3-option items enable the test developer and user to strengthen several aspects of validity-related arguments.” (pg. 4). This is some strong stuff and has several hidden implications. First is the assumption that a fourth or fifth option must not be contributing very much to the overall validity of the assessment. Perhaps, but how is this being controlled in the research? Second is the assumption that the time saved in moving to three options will allow for more test questions in the allotted time, leading to more content coverage. I speculate that there might not be nearly the time savings the author thinks. Third is the assumption that all distractors must function in an anticipated manner. For example, Dr. Rodriguez reviewed the literature and found that most definitions of functional distractors required certain levels of item-to-total correlation and a certain minimum level of endorsement (often at least five percent of the population), among other attributes. It is unlikely in a standards-referenced assessment that all of the content will allow such distractor definitions. Hopefully, as more of the standards and benchmarks are mastered, the share of the population choosing the incorrect response options (regardless of how many) will decrease, essentially destroying such operational definitions of “good distractors.”
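To make the operational definition of a “functional distractor” concrete, here is a minimal sketch in Python. It assumes two illustrative criteria only: at least five percent of examinees choose the option, and choosing it correlates negatively with total score. The function names, the exact thresholds, and the sample data are my own assumptions for illustration and are not taken from the Rodriguez article.

```python
from statistics import mean, pstdev

def point_biserial(flags, totals):
    """Correlation between a 0/1 choice flag and examinees' total scores."""
    sd = pstdev(totals)
    p = mean(flags)
    if sd == 0 or p in (0, 1):
        return 0.0  # no variation, so the correlation is undefined; treat as 0
    chose = [t for f, t in zip(flags, totals) if f]
    others = [t for f, t in zip(flags, totals) if not f]
    return (mean(chose) - mean(others)) * (p * (1 - p)) ** 0.5 / sd

def functional_distractors(responses, totals, key, min_share=0.05):
    """Flag each observed distractor as functional (True) or not (False).

    Assumed rule: endorsed by at least `min_share` of examinees AND
    negatively related to total score.
    """
    result = {}
    for option in sorted(set(responses) - {key}):
        flags = [1 if r == option else 0 for r in responses]
        share = mean(flags)
        result[option] = share >= min_share and point_biserial(flags, totals) < 0
    return result

# Made-up data: 25 examinees, keyed answer "A".
responses = ["A"] * 15 + ["B"] * 9 + ["C"]
totals = [30] * 15 + [15] * 9 + [10]
flags_by_option = functional_distractors(responses, totals, key="A")

# Near-complete mastery: almost no one picks a distractor.
mastery = functional_distractors(["A"] * 24 + ["B"], [30] * 24 + [12], key="A")
```

In the made-up data, option B passes both criteria while option C fails on endorsement alone. In the mastery example, option B fails for the same reason, which is the point made above: as mastery rises, a perfectly sensible distractor stops meeting the definition.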

Finally, another area of concern I have with the strong conclusion that three options are optimal is the fact that this was a meta-analytic study. As such, all of the data came from existing assessments. I agree with the citation from Haladyna and Downing (Validity of a Taxonomy of Multiple-choice Item Writing Rules. APM, 2, 51-78) stating that the key is not the number of options, but the quality of options. As such, does the current research mean to imply that, given the lack of good distractors beyond three, three distractors are best? Or does it mean that, given five equally valuable distractors, three are best? If the former, would it not make more sense to write better test questions? If the latter, then is not controlled experimentation required?

Please do not mistake my discussion as a criticism of the research. On the contrary, this research has motivated me to pay more attention to things I have learned long ago and that I put into practice almost daily. This is exactly what research should do, generate discussion. I will continue to read and study the research and perhaps, in the near future, you will see another post from me in this regard.

Monday, October 10, 2005

Are All Tests Valid?

I recently attended the William E. Coffman Lecture Series at the University of Iowa that featured Dr. Linda Crocker, author and professor emeritus in educational psychology at the University of Florida. The title of her talk was "Developing Large-Scale Science Assessments Beneath Storm Clouds of Academic Controversy and Modern Culture Clashes," which really means “Developing Large-Scale Science Assessments.” Linda has been a consistent proponent of content validity research and her lecture was elegant, simple, and to the point, and of course, based on content. You may recall a previous TrueScores post that reexamined the “content vs. consequential” validity arguments that parallel the points made by Dr. Crocker, though she made a much more elegant and logical argument in three parts:

Assessments may survive many failings, but never a failing of quality content;

Evidence of “consequential validity” should not replace evidence of content validity; and

Bob Ebel was correct: psychometricians pay too little attention to test specifications.


First, we as a profession often engage in countless studies regarding equating, scaling, validity, reliability, and such. We perform field test analyses to help remove “flawed items” using statistical parameters. How often, however, do we suggest that “content must rule the day” and compromise our statistical criteria in order to get better measures of content? Should we? Second, do we place items on tests to “drive curriculum” (consequential validity evidence), or do we place items on an assessment to index attainment of content standards and benchmarks (content validity evidence)? If not for content validity, why not? Finally, how often have we done what Professor Ebel suggested with his notion of “duplicate construction experiments” (Ebel, R. L., 1961, Must All Tests be Valid? American Psychologist, 16, 640-647)? In other words, have we ever experimentally tested a table of specifications to see if the items constructed according to these specifications yield parallel measures? Why not?

It seems to me that we have taken “quality content” for granted. This has also added “fuel to the fire” regarding the perception of testing in general (and high stakes testing in particular) being watered down and trivializing what is important. Perhaps I am wrong, but instead of spending hours writing technical manuals full of reliability data, item performance data, scaling and equating data, we should spend at least as much time justifying the quality of the content on these assessments. Often we do, but do we do it consistently and accurately? I have seen 1,000-page technical manuals, but such documents justifying content always seem to be locked away or “on file.” Why?

Tuesday, October 04, 2005

University of Maryland Conference on Value Added Models

Last year, the University of Maryland held a conference on Value Added assessment models which was timely, relevant and informative for educators contemplating both growth modeling as well as value added assessment. In fact, the conference presentations were collected together and put into a book to memorialize the conference as well as the shared wisdom (Conference Proceedings 2004) as edited by the conference organizer Dr. Robert Lissitz.

Because of the success of the conference last year and due to the continued discussion regarding growth modeling, value added assessment and AYP (even the Feds are considering the use of growth models in NCLB accountability), the conference is on again. While I doubt many of you will plan to attend, please see the conference agenda so you can keep up on the research associated with growth models (2005 Maryland Conference on Value Added Models) and look for the proceedings of this 2005 conference.

Wednesday, September 21, 2005

The Next Revolution in Item Format

The last revolution in the format of how Americans test students may have been during and just after World War I. That’s when the multiple-choice format replaced the essay format as the most prevalent item type in educational testing. In the decades that followed, the techniques around multiple-choice testing were developed and refined.

Let me put on my futurist hat for a moment and predict that the next revolution in the format of how Americans test students will happen when the computer-delivered simulation replaces the multiple-choice format as the most prevalent item type in educational testing. I’m not going too far out on a limb because the computer-delivered simulation is already well established and widely available on the web. For example, the Rice Virtual Lab in Statistics offers a set of simulations to demonstrate various statistical concepts. Or visit the business simulations offered by Forio Business Simulations. Some nice simulations in biology are offered at Biology Labs Online, by Benjamin Cummings, an imprint of Pearson Education.

Simulations can be used as items in at least two ways. First, simulations could be used like reading passages are currently used, as the stimulus preceding multiple-choice, short- and extended-response questions. Second, simulations can be used as the test “items.” Under this use of simulations, the term item is interpreted broadly to be a situation contrived to produce student behaviors or performance that are revealing of the constructs we are trying to assess. A conventional multiple-choice item is written to produce a student marking a bubble on an answer sheet in such a way as to reveal something about the student’s understanding. A simulation would be constructed to produce a student interacting with the software in such a way as to reveal something about their declarative and procedural knowledge.


To me, the assessment possibilities are in the use of simulations as the test “items.” Using simulations similar to those found in the Rice Virtual Lab in Statistics, I could ask students to create a distribution of cases and demonstrate the influence of outliers on the mean. By capturing mouse clicks, I can collect information about students’ understandings of the concepts of distributions and means. Alternatively, CardioLab, part of Biology Labs Online, allows students to measure arterial pressure and to manipulate five variables that affect arterial pressure. Vessel radius and heart rate are two of these variables. A test question would be: “Take as many measures as you need of arterial pressure under conditions that demonstrate the interaction of vessel radius and heart rate.” I can capture mouse movements to determine the level of understanding of the concepts of variable interaction and experimental control.
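As a rough illustration of what the outlier task asks a student to demonstrate, the following Python sketch compares the mean and median of a small distribution before and after one extreme case is added. The function name and the numbers are hypothetical. The sketch shows the statistical idea only, not how any of the simulations named above actually work or score responses.

```python
from statistics import mean, median

def outlier_shift(cases, outlier):
    """Return how far the mean and the median move when one case is added."""
    with_outlier = cases + [outlier]
    return {
        "mean_shift": mean(with_outlier) - mean(cases),
        "median_shift": median(with_outlier) - median(cases),
    }

# A student-built distribution (made up) plus a single extreme case.
student_cases = [4, 5, 5, 6, 5]
shift = outlier_shift(student_cases, outlier=35)
```

With these numbers the mean moves from 5 to 10 while the median stays at 5. A student who adds an extreme case and then compares the two statistics has shown the sensitivity of the mean that the task is after.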

Even with my futurist hat on, I can’t foresee the psychometrics that will develop and be refined around simulations. Some techniques are under development, but that's a topic for another entry.

Monday, September 12, 2005

In The News: Electronic Testing

I read with interest the claims made in Oregon recently that online testing actually caused a rise in student test scores: Online Testing Helped Raise Scores. “Oregon students of all ages showed across-the-board improvements on state tests in core subjects....” Good for Oregon and good for electronic testing! I am a big believer in online assessment. Further reading, however, caused me to pause. I began wondering about the security of the online assessments in Oregon. An Oregon official is quoted as claiming, “Web-based testing is more secure…” and “…students and teachers get immediate results…delivered automatically....” All of this raised concerns about security. Surely, someone was looking at the results before they were returned? What evidence does Oregon offer to support their claims of security? Apparently, teachers had multiple opportunities to test their students “throughout the school year!” “Often, teachers will space out the testing periods over a school year and use early test results to evaluate which material a student needs to focus on in order to reach grade-level proficiency.” Does this mean the forms are exposed the entire school year?

Being a skeptical reader, as you should be, and looking for evidence to support their claims, I went to the Oregon Department of Education website for information regarding the security of the system and perhaps some technical information. After nearly 30 minutes I gave up. It is not that I don’t believe press releases, newspapers, or other non-peer reviewed information—well, I guess it is because I don’t—but it is not that I don’t believe Oregon educators. I just wish we could have some factual evidence associated with such claims. Only this way will we truly be able to judge for ourselves what works and what does not.

Thursday, August 25, 2005

Designing Assessment Information to Support Teachers

A few years ago at a CASMA conference, Bob Linn made a presentation in which he noted that states faced a challenge in meeting their NCLB goals. The states have accepted that challenge, and they are rolling up their sleeves and getting to work. And they are looking to the educational measurement community to help them in this task. States are asking the educational measurement community to provide teachers the assessment information they need to achieve NCLB targets. In response, PEM has offered the PASeries to help states and districts meet their NCLB goals.

Before we can better help states meet their AYP goals, we must ourselves be able to answer two questions. What is the information teachers can use to meet NCLB achievement targets? And how do we effectively communicate that information? The design and configuration of assessment information that works well for teachers and helps support their work in the classroom, rather than make it more complicated, should be tackled systematically. But design ideas for assessment information are neither obvious nor effective when they are based on psychometric considerations alone. The design and configuration of assessment information that works well for teachers requires understanding how teachers work and what kind of results instructional practices obtain. Neither classical test theory nor IRT addresses teaching and instructional practices. But socio-cultural theory may provide a framework to work out answers to these questions.

What is socio-cultural theory? Typically, socio-cultural theory is associated with Vygotsky and individual learning. Vygotsky maintained the child follows the adult's example and gradually develops the ability to do certain tasks without help or assistance. He called the difference between what a child can do with help and what he or she can do without guidance the "zone of proximal development." However, socio-cultural theory has been transferred and extended to industrial design and product development. For example, the paper by Aula, Pekkala, and Romppainen outlines a research approach to designing successful products by recognizing the end users’ needs and expectations. This approach is part of the National Science Foundation’s research into the implementation of design theory to advance the product realization process. Some of this research is being funded by the Division of Design, Manufacture and Industrial Innovation.

In educational measurement, we can extend socio-cultural theory to the design of assessment information that supports teachers’ work in the classroom and helps teachers meet NCLB achievement targets. Our charge would be to develop theories and produce findings that are pertinent to understanding the design, development and implementation of usable assessment information systems. Such theories and findings would answer questions such as:

  • What kind of instructional practices are best able to take advantage of what kind of assessment information? For example, teachers whose only instructional strategy is to reteach a unit are able to use a different kind of assessment information than teachers who have available different instructional strategies for students with different misconceptions.
  • What kind of assessment information is best suited to inform instruction on what kind of learning? For example, different assessment information might be better suited to inform instruction of a procedure, such as the subtraction of multi-digit numbers, than the assessment information that is better suited to inform the instruction of conceptual understanding, such as the structure of the U.S. government.

We as educational measurement professionals have much work to do before we can identify a teacher’s “zone of instructional development” for assessment information. But we cannot give educators the same kind of response as Henry Ford gave to car buyers when asked what color cars were available: “Any color – so long as it’s black.” If we hope to help educators improve children’s learning, we must be able to design assessment information by recognizing teachers’ needs and expectations. And educators are pleading for, even demanding, this kind of information.

Tuesday, August 23, 2005

Reading IS Fundamental

The Second Annual Lexile© National Reading Conference is in the books...no pun intended. I attended, as I did last year, and was once again impressed by what I discovered. Sure, Greg Cizek's presentation about "Testing Myths" was enjoyable and informative. My presentation criticizing NCLB not allowing "off level" reading assessment was novel if not interesting (though it was well attended). Quality Quinn, Lou Fabrizio and Malbert Smith all provided very informative and instructional presentations. All of these were worth the price of admission alone. However, what impressed me the most was the desire of the attendees to read! Teachers were buying books, with their own money, to give to "troubled readers" in their classrooms. Malbert Smith talked about that "parasite" we have in our homes, the television, that robs us of intellect. Reading teachers agreed that the best way to teach reading was to get children to read. Assessment developers understood the needs of the reading specialists! In short, it was utopia. Well, short of utopia, it was very exciting to see people paying attention to reading and reading instruction. I hope you can attend next year, but in the meantime...pay attention to reading.

Monday, August 08, 2005

Is AERA Too Big to be Useful?

Recently, I attended the CCSSO Large-Scale Assessment (LSA) Conference held this June in San Antonio, Texas. On the airplane ride home, I had a moment to compare the experiences of the two conferences I have attended in 2005: the CCSSO LSA conference and the AERA conference held this April in Montreal, Canada.

Take, for example, the access to sessions. The CCSSO LSA conference was held in one hotel, and all the presentation rooms were on one floor. I was able to leisurely stroll from session to session, and rooms were easy to find. I was able to attend nearly all the sessions that attracted my interest. And, I didn’t once have to leave the air-conditioned comfort of the hotel.

The AERA conference was spread across a handful of hotels, located blocks away from one another, each with multiple conference rooms on different floors. I had to run from hotel to hotel, and struggle to learn multiple layouts of floors and rooms. I missed many of the sessions I wanted to attend because they were too far from the last session I attended, or because they were scheduled at the same time as another session. On top of that, I walked blocks and blocks in the cold and the rain.

As another example, consider the interaction with colleagues. The CCSSO LSA conference had a number of opportunities to talk with colleagues. Because the conference was held in one hotel, I constantly crossed paths with colleagues between sessions. Furthermore, I could easily arrange to meet friends and colleagues in the mornings or evenings because nearly all of us were staying in the conference hotel. In addition, at least one reception was held every night, providing a relaxing atmosphere in which to meet and talk.

The AERA conference was attended by more of my colleagues but I crossed paths with fewer of them. I rarely crossed paths with colleagues and often, when I did, it was in the middle of a crosswalk as I ran from session to session. Friends and colleagues were scattered across the city at different hotels sometimes miles apart. It was difficult to find people and more difficult to arrange meetings.

Not surprisingly, I enjoyed the CCSSO LSA conference more than I did the AERA conference. I saw more of who and what I wanted to see. I did so in a relaxed and comfortable environment. I don’t intend this as a rap against either Montreal or the AERA staff. Montreal is a great city, and the AERA staff are always pleasant and hard working. But the AERA conference has grown so large that it has outgrown being a meeting for professional growth and exchange. An alternative should be considered that is more intimate, perhaps more like the CCSSO LSA conference in size and format.

Thursday, August 04, 2005

Is It Fact or Process?

A friend of mine and I were recently discussing some aspect of mathematics instruction. I wanted to talk about the "number line" and he wanted to talk about "math facts". Perhaps I was being a bit ornery (quite contrary actually), but I looked at him with a blank stare and asked what he meant by "facts." He said, "You know...facts, like two times two is equal to four." Since I started down this path, I continued. So I replied, "Well, actually, two times two is really a concept. The concept of the successive addition of two for a total of two cycles." He became quite agitated and said, "No! It is a fact, you either know it or you don't." That is when I drew a matrix of 1-9 across the top of a piece of paper and 1-9 down the side and showed him how this matrix provided the "facts" he claimed without really "knowing" anything (other than how to draw the matrix). My friend then noticed that I was trying to teach him about process and he was trying to teach me about facts, and that we were getting nowhere fast. As such, he changed the topic to history, which I am sure he thought was a safe subject. "Math is no different than history," he said. "It's all about knowing the facts and sequencing them correctly." I said, "Really? Then if you list the League of Nations before the United Nations on some timeline you have demonstrated knowledge of history?" My friend was skeptical (and annoyed) and did not answer. I told him that in reality, it might very well be important to understand the impact the League of Nations had on the development of the United Nations if you were going to use history to help understand current events and/or future events.

At this point we decided to end the conversation before anyone got really mad. In departing, he did take one last shot. He said, "It's just like with testing...all you have to do is figure out if the kids know the facts." And I asked him, "Perhaps, but what process do you want to use? Multiple-choice, short answer, essay...?"

Perhaps the next time I see my friend we will talk religion...it will likely lead to a simpler discussion!

Monday, August 01, 2005

Why not make standard setting scientific?

The number of different standard setting methods has proliferated over the years. Research has focused on evaluating current standard-setting methods, improving these methods, and discovering which methods are suitable for different situations. See the 1996 book, Setting Performance Standards by Greg Cizek for a review of most of these methods.

In the fourteenth century, William of Ockham noted: "Pluralitas non est ponenda sine necessitate," which translates as "entities should not be multiplied unnecessarily." Any casual observer would certainly note the multiplication of standard setting methods, though the judgment of being unnecessary remains to be made.

I have noticed that both proponents and opponents of standard setting approaches have used intuitive explanatory models of how standard setting judges think as an Ockham’s razor to appraise standard setting methods. Critics have used intuitive explanatory models of judges’ thinking to argue against the use of some standard setting methods. For example, the National Academy of Education used intuitive models of judges’ thinking to argue against the use of the modified Angoff method. The NAE made the claim, which can only be based on a model of how judges think during standard setting, that estimating the probability that a borderline test taker will answer an item correctly is a task that is too difficult for judges to do effectively. This claim was one source of support for the conclusion that the modified Angoff method was fundamentally flawed.

Alternatively, proponents have used intuitive explanatory models of judges’ thinking to argue in favor of the use of other standard setting methods. For example, Impara and Plake in the Journal of Educational Measurement made the claim that 1) judges may have difficulty conceptualizing hypothetical test takers, and 2) judges may have difficulty estimating proportion correct. Like the NAE’s claims, these claims can only be based on assumptions of how judges are thinking. These claims were used as rationale for proposing and testing two variations in the way the Angoff method is typically applied.

Arguments around standard setting methods seem always to depend on intuition because no formal explanatory model of judges’ thinking exists. So, someone arguing for or against any standard setting approach has no shared, public criterion that might serve as a foundation for criticisms or acclamations. Standard setting research and practice have no Ockham’s razor against which to judge standard setting methods. Why not?

An earlier post to this blog (Those Pesky Performance Standards, Friday, May 20, 2005) noted that standard setting resulted in arbitrary, but not capricious, judgments, and mused that all of the research and rhetoric using the results or outcomes of such judgmental procedures may not be worth the efforts. But the validity of standard setting results is in the procedure, not the results. We should understand how judges think during these procedures and not just go on hunches.

Why do educational researchers keep making these claims about how standard setting judges think, but fail to do the research to support a scientifically based model of how standard setting judges really think? Models of how people think have been proposed and tested in many other areas; why not standard setting?

Monday, July 25, 2005

The Consequences of Consequential Validity

Recently I was in a meeting with some of our leaders in educational measurement when a discussion regarding how to present consequential validity evidence ensued. Having received most of my instruction, readings and otherwise general nurturing from several "disciples" of Bob Ebel, I was fascinated. I recalled the 1997 Educational Measurement: Issues and Practice volume (Vol. 16, No. 2, Summer) dedicated to this topic. An oversimplification of the debate, both then and now, seems to be around the definition of consequential validity itself. Bill Mehrens, in his 1997 article, articulates:

"I suggest that the psychometric community narrow the use of the term validity rather than expand it. Let us reserve the term for determining the accuracy of inferences about (and understanding of) the characteristic being assessed, not the efficacy of actions following assessment."

Professor Mehrens continues:

"The consequences of a particular use do not necessarily inform us regarding either the meaning of a construct or the adequacy of a particular assessment process in measuring that construct."

This group of measurement experts, to which I referred in the opening, debated such statements (and others) for quite some time. Having no resolution, but acknowledging that the consequences of the use of scores resulting from an assessment are important, they recommended that we follow the Standards regarding presentation of consequential validity evidence.

For the life of me, I can find no reference to general "consequential validity" in the Standards at all. I can find some references to "unintended consequences," but the notion of general consequences for test score use is not specifically addressed. Perhaps, even if it is only a memorialization of the debate, the next edition of the Standards will include it.

Monday, June 20, 2005

Formative Assessments Link Measurement and Instruction

Many presentations at the annual CCSSO conference on Large Scale Assessment, taking place here in sunny San Antonio, reference the need to expand testing to include more formative assessments.

Pearson Educational Measurement has its own set of formative assessments known as PASeries. A white paper describing how PASeries was developed and how it might be used to improve student learning, as well as other information regarding PASeries, are available here at the conference and are quite popular. Check out PASeries and see for yourself.

Wednesday, June 15, 2005

Models, Measurement and Learning

For years, the measurement community has debated which of a virtually limitless number of mathematical models is most appropriate for a given measurement activity. I remember the very heated discussion between Ben Wright and Ron Hambleton at an AERA/NCME conference not too long ago. Ben spoke of "objective measurement." Ron spoke of "representing the integrity" of measurement practitioners. Both sides had their points and the debate still continues in some circles.

In this age of "standards referenced" assessment, the selection of a measurement model might not be an academic debate only. Think for a minute about a standards referenced test where six of the 60 items come from a particular content domain (say Mathematical Operations). For the teacher, student and accountability folks, this means that 10% of the emphasis of the test is on operations (6/60 = 10 percent). However, measurement practitioners know that by selecting an IRT measurement model (other than Rasch), each of these items is weighted based on the slope parameter. Pattern scoring will then essentially guarantee that 10% of the test is not assigned to mathematical operations. This is because the contributions of these items to the total measure (the resulting theta value) will be weighted by the discrimination (slope) parameter. So, if the operation items do not do a good job in discriminating between high and low overall test scorers, these items are likely to contribute far less than expected to the total ability measure. While this might not be as big a deal when number correct scoring is used, the effects of this weighting are still present in research (equating and scaling).
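A small Python sketch may help show the size of this effect. It assumes a two-parameter logistic model, under which the slope-weighted sum of item scores is what carries the information about theta, so a domain's effective share of the measure can be approximated by its share of the total slope. The slope values below are invented for illustration and the function name is hypothetical. Real pattern scoring is more involved than this ratio.

```python
def domain_weights(items):
    """Compare nominal and slope-weighted emphasis for each content domain.

    `items` is a list of (domain, slope) pairs, one pair per test item.
    """
    n_items = len(items)
    total_slope = sum(slope for _, slope in items)
    weights = {}
    for domain in sorted({d for d, _ in items}):
        slopes = [slope for d, slope in items if d == domain]
        weights[domain] = {
            "nominal": len(slopes) / n_items,         # share of item count
            "effective": sum(slopes) / total_slope,   # share of total slope
        }
    return weights

# Invented example: six weakly discriminating operations items (slope 0.5)
# on a 60-item test whose other items all have slope 1.0.
items = [("operations", 0.5)] * 6 + [("other", 1.0)] * 54
weights = domain_weights(items)

# Rasch-like case: equal slopes, so effective emphasis equals nominal emphasis.
rasch_like = domain_weights([("operations", 1.0)] * 6 + [("other", 1.0)] * 54)
```

Under these assumed slopes, operations keeps its nominal 10 percent of the items but carries only 3/57, or roughly 5 percent, of the slope-weighted measure. With equal slopes the two shares match, which is why the Rasch model is the exception noted above.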

I think we as measurement experts need to be cognizant of some distinctions when debating psychometric issues. First there is the psychometric or mathematical aspect: Which model fits the data better? Which is most defensible and practical? And which is most parsimonious? Often, I fear, psychometricians decide before seeing the data (with almost religious zeal) which model is "correct." The second aspect is one of instruction: Are we measuring students in the way their cognitive processes function? Are we controlling for irrelevant variance? And do our measures make sense in the context? Often, I think, psychometricians are too quick to compromise on measures without fully understanding constructs. Finally, we need to consider the learning aspect: Are we measuring what is being taught in the way it is being taught or are we doing something else? Are we as psychometricians measuring what is being taught, the stated curriculum, or something else (speededness for example)? Without considering these aspects, at a minimum, we are likely to argue for mathematical models that might not be helpful for our mission of improved student learning.

Just one man's opinion...

Thursday, June 02, 2005

Testing by Computer

For the K-12 arena, testing by computer is the future and the future is now. Many schools are ready for and welcome this use of technology. Most kids consider taking a test by computer to be much less involved than downloading the latest iPod tune. But some schools aren’t ready at all. And some kids haven’t used the computer much at all. So traditional testing isn’t going away quite yet. True multiple choice testing.

Not that some sticky issues aren’t raised when high stakes testing programs are offered online. Professional testing standards are pretty clear that computer tests and paper tests must be shown to be comparable if they are to be given together. But how do you show this? Does every test administration become an experiment? The trick is to remain faithful to the standards without creating barriers to a natural innovation. Training, good customer support, and creative data analysis go a long way. Design a strong experiment if you can, collect the most relevant data possible if you can’t. The current literature on comparability is a little mixed. Some studies show some effects; others not. Most can be criticized for some design flaw or another. Not to worry, though. The schools will soon make clear what the most popular choice is.

Wednesday, May 25, 2005

Those Pesky Performance Standards

I never fail to be amazed by the discussions my colleagues and I engage in regarding what psychometricians call "standard setting". The essence of standard setting is to determine "how much is enough" regarding the performance on some measure, and to do so in a less than capricious manner (still arbitrary, but not capricious).

Nevertheless, rooms filled with content experts, testing experts, psychometricians (some of whom are experts), standard setting experts, and others engage in endless banter regarding how to plan for, control, and analyze the data resulting from (or going into) a standard setting as if the data were anything other than an arbitrary (though often not capricious) judgment.

Perhaps I am finally too old to enjoy such arbitrary distinctions anymore. Understand that I am not saying that standard setting is not important, that the established procedures should not be used or that we should not carefully plan and implement the standard setting in the best way possible following standards of best practice. I think all of this should be done. I am just not sure all of the research and rhetoric using the results or outcomes of such judgmental procedures are worth the efforts they require to discuss.

One person's opinion...of course.

Friday, May 20, 2005

Life Long Learning

Over the years, I can recall various conversations regarding student growth, preparedness and remediation. They go something like the following:

First Grade Teacher: These kids have no social skills at all. Why can't the parents do more to get their kids ready for school?

Third Grade Teacher: These kids don't know the alphabet or their math facts. Why can't the earlier grade teachers do more?

High School Teacher: These young people don't have any of the prerequisite math skills. Why can't the middle school teachers do more?

College Instructor: Half of our entering freshmen are in remediation. Why can't the high school teachers do more?

Educators and the public alike often talk about a K-16 or K-20 system of education in this country. In fact, just last week a retired professor of mine talked about being a "life long student" and how the biggest pleasure he gets in life is the fun in finding things out. Yet, our educational systems are quick to "pass the blame" onto what has gone before. It seems to me that a more integrated system of learning, including measurement of skills from K-20, might make it easier to debunk (or at least put into perspective) the gaps students have in their prerequisite skills as they move from kindergarten to college.

One interesting step in this area is the use of "college readiness" indicators as part of the state mandated assessment system. Texas has recently required such an indicator.

Preliminary results of the research supporting this effort (as conducted by Pearson Educational Measurement in coordination with the Texas Education Agency) are also presented.

Tuesday, May 03, 2005

How It All Started

T = X - E


Recall that one of the fundamental premises of "weak true score" or "classical" measurement theory is that an examinee's unknown and unseen "true score" (T) is really their observed score (X) on an assessment minus error (E). Since the development of this concept (and even before) measurement practitioners and theorists to boot have been trying to estimate a student's true score with greater and greater precision. This maximization effort typically focuses on ways to partition the error (i.e., to better understand what is causing error) and ultimately reduce it such that observed student performance is a better indicator of underlying achievement or ability.
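For readers who like to see the idea in numbers, here is a small simulation sketch in Python. It assumes true scores and errors are independent and normally distributed, which is a convenience for illustration and not a requirement of classical theory. The function name, the score scale (mean 50, standard deviation 10) and the two error sizes are all hypothetical. Reliability is estimated as the ratio of true score variance to observed score variance.

```python
import random
import statistics


def simulate_reliability(error_sd, n_examinees=5000,
                         true_mean=50.0, true_sd=10.0, seed=0):
    """Simulate X = T + E and return var(T) / var(X).

    T and E are drawn independently from normal distributions; this is
    an illustrative assumption, not part of the classical model itself.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(true_mean, true_sd) for _ in range(n_examinees)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


# Less error means observed scores track true scores more closely.
low_error = simulate_reliability(error_sd=3.0)
high_error = simulate_reliability(error_sd=8.0)

print(f"Reliability with small error (SD 3): {low_error:.2f}")
print(f"Reliability with large error (SD 8): {high_error:.2f}")
```

Shrinking the error standard deviation pushes the ratio toward 1, which is the sense in which reducing error makes the observed score a better indicator of the true score.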

So, what does all this have to do with the TrueScores blog? Only in that it serves to mention that Pearson Educational Measurement has recently expanded our research efforts and intends to use the TrueScores blog as one of the forums for dissemination and debate. The last thing the world needs is another forum for a pompous psychometrician to pontificate about how the world would be a better place if y'all would only buy their solution. Instead, the TrueScores blog is dedicated to honest, respectful, scientifically based and open debate about the "hot" topics in today's measurement world. Some of these topics include:

  • Establishing comparability between paper-and-pencil assessments and their online or electronic counterparts.
  • Automated essay scoring: Is it practical, reliable and valid?
  • Is Computer Adaptive Testing (CAT) a potential solution for the age-old question of testing time versus instructional time?

Background information related to these topics can be found at our web site on the research pages. Additional publications related to a host of topics in educational measurement can also be found at our web site. Future research will be added periodically and we will use this blog to communicate these additions.

We will be updating this blog so that a new discussion topic will be posted regularly. This will add to the debate shaping our educational policy and will provide practical and applied insights into not only classical measurement but other aspects of educational measurement including Item Response Theory, Growth Modeling (Value Added Models), Equating, Scaling and legal defensibility. As such, we hope you return.

In the meantime, if you have questions about Pearson Educational Measurement, or our parent company Pearson Education, start by visiting our home page.