Wednesday, November 22, 2006

What Are We Trying to Measure? IPC and Compensatory Models

Two debates were of note at the Texas State Board of Education (SBOE) this week. There were more, but I only want to talk about two.

The first was the discussion about curriculum and course sequencing. The SBOE increased the rigor of both the mathematics and science curricula by requiring four years of math (including Algebra II) and advanced science courses for graduation. I applaud these efforts, as research is quite clear that college readiness depends upon taking a rigorous set of high school courses. (For example, see the ACT Policy Paper: Courses Count.) Go figure! What I don’t understand is a science course called “Integrated Physics and Chemistry” (IPC). This is a “survey” or “introduction” course that teaches both physics and chemistry in one year, the idea being that students take a biology course, then the integrated course, and then either chemistry or physics. This sounded odd to me, so I asked an expert—a teacher. She told me that physics and chemistry are really difficult (not surprising; I remember them being difficult when I was in school) and that IPC serves as an instructional "warm-up" or bridge so that students are better prepared when they later take physics and/or chemistry. This logic has a degree of sense to it, but I wonder: Would it not be better to spend the time giving students the prerequisite skills needed to be successful in physics and chemistry throughout their school career, instead of defining a “prep course” to take beforehand? Perhaps, but I am smart enough to listen to a teacher when she tells me about instruction, so I will take a wait-and-see approach. Stay tuned.

The second debate was really an old issue regarding compensatory models. The argument is simple from a “passing rates” perspective. Namely, students are required to pass four subjects by performing at or above the proficiency standard for each subject. If you allow a really good score in one area to compensate for a low (but not disastrous) score in another area, more students will pass. There were passionate and sound (and not so sound) arguments to use a compensatory model for this purpose (i.e., increased passing rates). As a psychometrician, however, I find it hard to reconcile this logic. Why should, on a standards-referenced assessment, good performance in, say, social studies compensate for poor performance in, say, mathematics? I claim that, if increased passing rates are the goal, we should do something to enhance learning or lower the standards rather than muddy the waters with respect to the construct being measured. I am sure this last claim would get me run out of town on a rail. Yet the notion of compensating for poor math performance via one of the other subject areas was a legitimate agenda item at the SBOE meeting. Again, go figure!
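To make the distinction concrete, here is a minimal sketch in Python. The cut score, the subject scores, and the use of a simple average as the compensatory rule are all invented for illustration; they are not Texas's actual rules, and real compensatory models can use weighted composites or bounded trade-offs instead of a plain average.

# Hypothetical illustration of conjunctive vs. compensatory passing rules.
# The cut score of 70 and the student's scores are made up for the example.
CUT = 70

def passes_conjunctive(scores):
    # Student must meet the standard in every subject.
    return all(score >= CUT for score in scores.values())

def passes_compensatory(scores):
    # A strong score in one subject can offset a weaker one,
    # as long as the average clears the same cut.
    return sum(scores.values()) / len(scores) >= CUT

student = {"reading": 92, "math": 61, "science": 78, "social_studies": 85}

print(passes_conjunctive(student))   # False: math is below the cut
print(passes_compensatory(student))  # True: the average is 79

The same student fails under the conjunctive rule and passes under the compensatory one, which is exactly why the compensatory model raises passing rates and exactly why it muddies what the mathematics result means.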

Thursday, November 09, 2006

Educational "Case Control" Studies?

I attended the American Evaluation Association conference last week in Portland, Oregon. The weather was typical—raining again—but the conference inspired some creative thoughts, particularly about research using large-scale data. My colleague from Pearson Education, Dr. Seth Reichlin, presented on efficacy research regarding instructional programs (e.g., online homework tutorials) and suggested that, perhaps, a better way to conduct such research was via "epidemiological studies." He went on to reference the usefulness of the massive warehouse of educational data maintained by the University System of Georgia:

Integrated data warehouses have the potential to support epidemiological studies to evaluate education programs. The University System of Georgia has the best of these education data warehouses, combining all relevant instructional and administrative data on all 240,000 students in all public colleges in the state for the past five years. These massive data sets will allow researchers to control statistically for instructor variability, and for differences in student background, preparation, and effort. Even better from a research standpoint, the data warehouse is archived so that researchers can study the impact of programs on individuals and institutions over time.
The easiest way to think about this is via the "case-control" methodology often used in cancer research. Think about a huge database full of information. Start with all the records and match them on values of relevant background factors, dropping the records where no match occurs. Match on things like age, course grades, course sequences, achievement, etc. Match these records on every variable possible, dropping more and more records in the process, until the only variables left unmatched are the ones you are interested in researching—for this example, let's say the use of graphing vs. non-graphing calculators in Algebra. Perhaps, after all this matching and dropping of records is done, there are as many as 100 cases left (if you are lucky enough to have that many). These cases are matched on all the other variables in the database, with some number (let's say 45) using graphing calculators and the others (the remaining 55) taking Algebra without them. The difference in performance on some criterion measure—like an achievement test score—should be a direct indicator of the "calculator effect" for these 100 cases.
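Here is a rough Python sketch of that matching-and-comparing logic. The record fields, the background variables, and the criterion measure are all hypothetical stand-ins for whatever a real data warehouse would contain.

# Hypothetical sketch of the matching-and-comparing idea described above.
# 'records' stands in for rows pulled from an integrated data warehouse.
from collections import defaultdict

def mean(xs):
    return sum(xs) / len(xs)

def calculator_effect(records, background_keys):
    # Group records by their values on every background variable we match on.
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[k] for k in background_keys)
        groups[key].append(r)

    treated, control = [], []
    for members in groups.values():
        used = [r for r in members if r["graphing_calculator"]]
        not_used = [r for r in members if not r["graphing_calculator"]]
        # Keep only groups where both conditions are represented;
        # unmatched records are dropped, just as described above.
        if used and not_used:
            treated.extend(used)
            control.extend(not_used)

    diff = mean([r["algebra_score"] for r in treated]) - \
           mean([r["algebra_score"] for r in control])
    return diff, len(treated), len(control)

records = [
    {"age": 15, "prior_grade": "B", "graphing_calculator": True,  "algebra_score": 82},
    {"age": 15, "prior_grade": "B", "graphing_calculator": False, "algebra_score": 75},
    {"age": 16, "prior_grade": "A", "graphing_calculator": True,  "algebra_score": 90},
    {"age": 16, "prior_grade": "C", "graphing_calculator": False, "algebra_score": 68},  # dropped: no match
]

print(calculator_effect(records, ["age", "prior_grade"]))

With a real warehouse the matching would involve many more variables (and more defensible matching rules), but the logic is the same: the comparison is only made within groups that are identical on everything except the variable of interest.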

The power of such research is clear. First, it involves no expensive additional testing, data collection, or messy experimental designs. Second, it can be done quickly and efficiently because the data already exist in an electronic database and are easily accessible. Third, it can be done over and over again with different matches and different variables at essentially no additional cost, and it can be repeated year after year such that true longitudinal research can be conducted.

In the past, databases may not have been in place to support such intense data manipulation. With improvements in technology, we see data warehouses and integrated data systems becoming more and more sophisticated and much larger. Perhaps now is the time to re-evaluate the usefulness of such studies.

Thursday, November 02, 2006

Growth Models Are Still the Rage

I was fortunate enough to attend a Senate Education meeting this past month, where I heard Bill Sanders articulate the virtues of value-added models and how they differ from growth models. While I did not agree with everything the good Dr. Sanders reported—I found his arguments over-simplified—I have no issues with either growth models or value-added models. I do worry, though, that few people, particularly politicians, lobbyists, and legislators, have had a chance to really think about and understand the differences and/or the implications of selecting and using such models, particularly in large-scale accountability programs. For example, Lynn Olson reported recently in Education Week that growth models, via the USDOE pilot program in 2006, are not helping much:

“But so far, said Louis M. Fabrizio, the director of the division of accountability services in the state department of education, ‘it's not helping much at all.’

‘I think many felt that this was going to be the magic bullet to make this whole thing better, and it doesn't,’ he said of using a growth model, ‘or at least it doesn't from what we have seen so far.’”
In Tennessee, the home of value-added models, the picture is not much different according to Olson's report:

“In Tennessee, only eight schools' achievement of AYP was attributable to the growth model said Connie J. Smith, the director of accountability for the state education department. Tennessee uses individual student data to project whether students will be proficient three years into the future.

‘I was not surprised,’ Ms. Smith said. ‘It's a stringent application of the projection model.’

Despite the few schools affected, she said, ‘it's always worth doing and using a growth model, even if it helps one school.’”

Understanding complicated mathematical or measurement models is a long row to hoe. General confusion is only compounded by the great volume of research, reviews, and press coverage discussing gain scores, growth models, vertical scales, value-added models, etc. Hence, PEM has been trying to clear up some of the misunderstandings regarding such models via our own research and publications. Check out a recent addition to our website, An Empirical Investigation of Growth Models, for a simple empirical comparison of some common models.

“This paper empirically compared five growth models in the attempt to inform practitioners about relative strengths and weaknesses of the models. Using simulated data where the true growth is assumed known a priori, the research question was to investigate whether the various growth models were able to recover the true ranking of schools.”
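The evaluation criterion the paper describes (recovering a known ranking of schools) can be illustrated with a minimal sketch. The "estimated" growth below is simply true growth plus noise, standing in for whatever a particular growth model would produce; the number of schools, effect sizes, and noise level are all invented for illustration.

# Minimal sketch of the rank-recovery idea: simulate schools with known (true)
# growth, add noise to mimic a model's estimates, and check how well the
# estimated ranking matches the true ranking via a Spearman rank correlation.
import random

random.seed(1)

n_schools = 50
true_growth = [random.gauss(5.0, 2.0) for _ in range(n_schools)]
estimated_growth = [g + random.gauss(0.0, 1.5) for g in true_growth]

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman's rho via the difference-of-ranks formula (no ties here).
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(round(spearman(true_growth, estimated_growth), 3))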

Tuesday, October 24, 2006

Discovery: Air Travel to Blame for Being Left Behind!

I have discovered the cause of declining achievement in the United States: it's air travel, and I have proof!

Observation:
Despite the warnings to arrive at the airport early, and despite the schedule updates and the itineraries provided electronically by the airlines, I observe many people running to the gate.
Conclusion:
People who are 30 minutes late claim it was due to the longer-than-expected check-in line (maybe five minutes longer), the longer-than-expected security line (maybe 10 minutes longer), or anything other than the truth—people who travel by air can no longer tell time! It must be true, as so many are so late.
Observation:
Despite several signs posted near security, despite references listed on travel agent and airline websites, and despite ample newspaper coverage, people still leave their shoes on, attempt to carry "liquids or gels" through security, don't take their laptop out of their carry-on bag, forget to hold their boarding pass, leave their cell phone in their pocket, etc.
Conclusion:
People who fly can no longer read! How else do you explain it?
Observation:
A "little elderly lady" schleps a huge roller bag on board, plus one large carry-on bag and one gigantic purse. Said lady has a dickens of a time getting it into the overhead bin. When the flight attendant helps her and asks her why she didn't check the biggest and heaviest bag, she replies that she hates to wait so long at baggage claim to retrieve it.
Conclusion:
People who fly have lost their common sense, patience, and suffer from diminished intellectual capacity! They perhaps also suffer from the delusion that others will help them with their bag and that the space above or below their seat really belongs to them!
Observation:
A group of passengers waits to board the plane or waits for their baggage after the flight. The closer to departure the closer they creep to the door; or once the buzzer sounds the closer they inch toward the carousel. When it is time to depart or time to get their bags, no one can get in efficiently because of the crowd.
Conclusion:

Air travel causes humans to behave like livestock, each following the leader so as not to be left out of whatever the leader sees or gets!

Think I'm wrong? Check it out the next time you get to the airport. Chances are you have displayed one or several of these behaviors yourself. But I guess if that were true, you would not have the time to read this post, nor the ability to really read the words, nor the patience to reach the end without first becoming distracted thinking about what others who have already read it are doing.

Hmm...I wonder if I can get to LA by way of Omaha?

Monday, October 16, 2006

Big Government

I am a big believer in free enterprise. I thought, until recently, that the one thing that is consistent with Republican theology is limited government. Then I read what Secretary of Education Margaret Spellings said about her views on American colleges in Newsweek. Basically, the Secretary is calling for making:
"...higher education more accountable by opening up the ivory towers and putting information at the fingertips of students and families."
The Secretary claims it is all but impossible for families to compare schools and make an intelligent choice because there is a lack of comparable information. According to the Newsweek article:
"Spellings and others would like a national database that discloses things like graduation rates, how well students are educated, and how much they earn afterward."
This sounds like "Big Government" to me. Only the indicators deemed "appropriate" would be used to "rank" the colleges and, all of a sudden, free enterprise (defined as my ability to choose from a mix of perceived value via offsetting quality and price) becomes forced choice. If, as the Secretary points out, "consumer demand is a big part of this" (i.e., her desire to make information available), then let the free market dictate which colleges parents and students choose based on what the market says, not what big government says.

Monday, October 09, 2006

Reading First: A "Scorching" Review by the US Inspector General

An internal review by the US Inspector General found little to celebrate regarding the USDOE's handling of the Reading First program. The report makes some very strong claims:


The selection of the review panels violated the law because not every application was reviewed by the appropriate panel.
The USDOE substituted a department-created report in place of the panels' comments.
State applications were forced to meet standards not required by law.
The review panels were not representative of the agencies authorized to do the review; and the majority of the reviewers were actually nominated by the USDOE.

While the report is very critical of the USDOE in many regards, it also finds (among other issues) that the USDOE did not follow its own guidance for the Peer Review process.

Now, I don't know about you, but I have been through several statewide Peer Reviews for AYP/NCLB assessment compliance and found them to be very inconsistent. We followed a "black box" process in which we provided information going in (with little knowledge of how it was going to be used). Coming out of the "box" was a finding of compliance or noncompliance. I have often wondered what went on inside that black box, and the Reading First review makes me even more curious.

Perhaps the Inspector General should consider auditing or reviewing the AYP/NCLB (Title I) Peer Review process, if for no other reason than to open this black box. Or is that Pandora's box?

Friday, September 29, 2006

Learning, Spending, Local Control & National Standards

William J. Bennett and Rod Paige, both former education secretaries, wrote an editorial for the Washington Post (Thursday, September 21, 2006) that seemed to be a bipartisan call for a national achievement test. It may have been intended as a mea culpa, but in reality it is direct evidence that some politicians do not understand the needs of education.

Let me be specific. They start with the premise that Americans "ultimately educate themselves" (presumably regardless of schooling, or based on the assumption that they learned little if anything in school):

"Americans do ultimately get themselves educated—at work, after school, online, in adulthood—but a lot of time and money are wasted in the process."

I was just thinking about this when I could not recall the trigonometric formula for calculating the correlation coefficient from the angle between two vectors on a plot. So I asked my engineering friend, who does this sort of thing for a living, but he did not know. I looked it up (in a textbook from a course I took in school) and later discovered that the only people I asked who seemed to know were mathematics educators.
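For the curious, the relationship turns out to be compact: center each variable by subtracting its mean, and the Pearson correlation is simply the cosine of the angle between the two resulting data vectors, r = cos(theta). Centered vectors at a 60-degree angle correspond to a correlation of 0.5, and orthogonal centered vectors correspond to a correlation of zero.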

I doubt most people are likely to learn such a thing "at work, after school, online, in adulthood," because it is hard, multidimensional, and comes with a complicated context. Yet people speak of correlations all the time, with authority, as if they know what they are talking about. Suggesting people can learn the skills needed to compete in our new technology age without schooling is preposterous and is likely to lead to the "dumbing down" of the very deep understandings Bennett and Paige say they want.

Bennett and Paige also claim that:

"Ever since the Commission on Excellence in Education declared in 1983 that America is 'at risk' because of the lagging performance of its schools, this country has been struggling to reform its K-12 system. The education 'establishment' has wrongly insisted that more money (or more teachers, more computers, more everything) would yield better schools and smarter kids; that financial inputs would lead to cognitive outputs. This is not so."
Well, my grandmammy (who taught farm kids during the Depression) constantly reminded me growing up that the primary purpose of school is to "socialize youngsters" so that they can fit into society. I maintain that education in the U.S. has been evolving and struggling forever, and that things like Sputnik, "A Nation at Risk," and NCLB are just mileposts on the journey to evolve and improve education as our surroundings change.

Medical doctors spend money on training, technology, research, and improved facilities. Auto manufacturers, McDonald's, and others do the same. Why do the former secretaries not think investments in education are important given our changing environment, particularly with fast-paced changes in technology and the lack of respect and pay that are driving many of our teachers away? The notion that education has no need for more teachers, more computers, more money, more everything creates an alternate reality where fantasy reigns.

I disagree with their next points, but clearly they have inside information, so perhaps I am wrong:

"But there's a problem. Out of respect for federalism and mistrust of Washington, much of the GOP has expected individual states to set their own academic standards and devise their own tests and accountability systems."
I thought education was a local control issue. I thought taxpayers in the states provided the funds (including federal, state and local) to support education and, therefore, wanted control of these decisions locally. As such, I thought taxpayers in Minnesota might want to tailor their education plans to fit the needs of their state (since they were paying for it), and Iowa might want to do the same. Perhaps you think Iowa and Minnesota have good education programs but kids in Mississippi and Alabama don't get a good education. Perhaps, but this is why the people in these latter two states pay taxes and elect officials—so that they control what is done to improve. This is why I thought states were setting their own standards (both content and performance).

Finally, the former secretaries would have us believe that the states have subverted real academic content and performance standards in order to keep the passing rates high. They suggest the differences between statewide passing rates and NAEP passing rates are the evidence. They draw the conclusion that "Washington should set sound national academic standards and administer a high-quality national test."

I simply reiterate some of the points I have made in this venue and elsewhere. First, why would we expect two tests (state NCLB assessments and NAEP) constructed for different purposes to yield the same results? They measure different content, they sample different students, and they elicit different levels of student motivation. Second, who in the world thinks Washington can set standards and maintain high-quality assessments? Talk about federalism. The standards and the assessments would be at risk of being compromised to fulfill political agendas, and the rhetoric would be the same.

One man's opinion.

Thursday, September 21, 2006

Educational Measurement, Fourth Edition

Educational Measurement, Fourth Edition (2006). National Council on Measurement in Education, Robert L. Brennan (ed.).

From the publisher:
"Educational Measurement has been the bible in its field since the first edition was published by ACE in 1951. The importance of this fourth edition of Educational Measurement is to extensively update and extend the topics treated in the previous three editions. As such, the fourth edition documents progress in the field and provides critical guidance to the efforts of new generations of researchers and practitioners. Edited by Robert Brennan and jointly sponsored by the American Council on Education (ACE) and the National Council on Measurement in Education, the fourth edition provides in-depth treatments of critical measurement topics, and the chapter authors are acknowledged experts in their respective fields.

Educational measurement researchers and practitioners will find this text essential, and those interested in statistics, psychology, business, and economics should also find this work to be of very strong interest."
Seriously, while this edition is a bit pricey, it is likely to be even more important to the field of psychometrics and educational measurement than the previous editions. To order, visit the publisher online.

Thursday, September 14, 2006

NCLB & College Readiness: Aligning Standards, Aligning Goals

Given all this, I find it surprising that the emphasis of the national debate and rhetoric regarding assessment has now turned to “college readiness” and “high school reform.” In fact, many prominent national policy people I speak with have stated that being successful in college is the “end game.” I don’t disagree that college readiness is important, and I admit there is good research showing that students are not, in general, prepared for success in college (Crisis at the Core). Yet it seems that we have “thrown out the baby with the bath water” and forgotten our purpose.

Prior to NCLB, most statewide assessment programs referenced core curriculum standards (also called content standards and benchmarks). This was regardless of whether the assessment instrument was a norm-referenced test (NRT) or a criterion-referenced test (CRT). The “standards reform movement,” and to a large extent NCLB itself, refocused these assessments on measuring the content standards and benchmarks each state deemed important for instruction. As such, the standards-referenced test (SRT) was defined.

Obviously, there has been much debate over the artifacts of these assessments: their rigor, how to establish the passing or performance standards, how states are going to get all of their students over the proficiency standard, etc. However, there has not been much debate about whether the assessments must measure what is being taught in the schools, as manifested in the content standards and benchmarks with demonstrated alignment—which, in fact, is required under the Peer Review process.

I am sure the very content standards and benchmarks we are measuring were not generated with the notion that college readiness was the intended outcome. If they were, then why are advanced mathematics courses like Algebra II, Trigonometry, or Calculus rarely, if ever, required? Even if we could revamp the core curriculum standards, philosophically I am not so sure that the stated purpose of public education should be college readiness.

It was best described to me by a disgruntled parent at the local football game last Friday night: “A high school diploma means nothing because everyone gets one regardless of their skill. Soon, a BS degree will be analogous to the high school diploma we earned when we were in school and its meaning, too, will be devalued.” So…is the goal of making all students college ready a good thing? I could not help but think that this attitude is exactly what one might expect if people actually believe that all students are the same and deny the possibility that measurable individual differences exist. Is this where NCLB is leading us? Perhaps that is a topic for another time.

Monday, August 28, 2006

IEREA Poster Submission Deadline Approaching

A reminder that our Iowa Educational Research and Evaluation Association (IEREA) conference is fast approaching and will be upon us before we know it. One of the popular features of the conference is our poster presentations and paper contest. We need lots of poster and paper submissions to make this part of our conference a success. Please support this part of the conference by forwarding a link to this message to potential presenters (graduate students and faculty). The conference theme this year is: Does High School Need to be Reformed? Research Behind the Headlines. The conference itself will take place on Friday, December 8, 2006 at the Sheraton Hotel in downtown Iowa City. However, the deadline for submitting poster/paper proposals is Monday, September 11th!

Iowa Educational Research and Evaluation Association: 2006 Call for Proposals

Iowa educators are invited to submit proposals to present their research at IEREA's annual conference in Iowa City, IA. Faculty members, graduate students, and education professionals conducting research related to education, specifically this year's theme, are invited to submit proposals. Additionally, individuals involved in school-based or university-school collaborative action research studies, innovative program evaluations, and work related to technical issues of assessment are also encouraged to submit proposals. IEREA utilizes a poster presentation format, designed to foster dialogue among presenters and conference attendees. To maximize interaction during the poster sessions, posters will be displayed in an open space with sufficient room to congregate, browse, and discuss. Refreshments will also be provided during poster sessions. Instructions for displaying research in a poster format will be sent to presenters of all accepted posters. At least one presenter per poster must register to attend the IEREA Conference, and all poster presenters qualify for reduced conference registration fees. Details are provided upon acceptance of the proposal.

The deadline to submit poster proposals is 5:00 pm, Monday, September 11, 2006. Submissions must include two copies of the proposal. One copy should contain author name(s), institutional affiliation(s), and complete contact information for the coordinating presenter, all on a separate cover sheet. The second copy should contain no author names, titles, or contact information in order to facilitate blind review of all proposals. The poster proposal itself should be no more than three (3) double-spaced pages (excluding references) with reasonable margins and minimum 11-point type. Each proposal must include the following: Title of Poster, Abstract (maximum 50 words), Goals/Objectives, Design and Methods, Results, Significance/Impact, and References. E-mail submissions are strongly encouraged (please type IEREA Proposal in the email Subject line), and receipt of proposals will be acknowledged via return e-mail. Send all poster proposals to:

IEREA Conference Planning Committee
ATTN: Dr. Frank Hernandez
N229B Lagomarcino Hall
Iowa State University
Ames, IA 50011
(515) 294-4871
fhernand@iastate.edu

Wednesday, August 16, 2006

New Orleans is Alive and Well

It is always dangerous for a "research scientist" to report/comment on non-research topics, but that is exactly what I am doing. The most often asked questions I have received about my recent trip to the annual APA conference in New Orleans are about the condition of the city following Katrina. Most people who ask know little about the Ninth Ward or any of the other areas most heavily hit by the hurricane. But they do know the French Quarter and the Warehouse District with its museums, so these are the areas on which I will comment.

My first impression was simply the lack of people. Even with a convention reportedly 10,000 strong, there were far too few people anywhere. The airport was all but empty (both coming and going), many gates obviously unused. There were no lines for cabs, no lines for check-in at the hotel, no lines at dinner (without reservations, I might add). The National D-Day Museum (since named something else by Congress) was practically empty, on a Saturday no less. I noticed these things because my last trip to New Orleans, prior to Katrina, was radically different. Even the bars of Bourbon Street were resorting to that old college-town trick of "three drinks for the price of one," hawked by aggressive tub-thumpers with just a bit too much eagerness in their voices.

There were other reminders of the most recent disaster too, subtle perhaps, but eerily present nonetheless. Wendy's on Canal Street had a sign in the window that read "Now Open Every Day." Perhaps they have been open every day for quite some time, but not so long as to warrant removal of the sign. I went to Radio Shack to see about a new battery for my cell phone (foolish as that might sound) and noticed that merchandise was only now returning to the shelf.

Despite these changes since Katrina, there were many things that reminded me of the old New Orleans. Mother's Restaurant, for example, was the spitting image of what I remembered. Arguably offering the best blackened ham in the U.S., Mother's seemed like the same place I have visited hundreds of times before (over a 20-year span): wonderful chicory coffee, long lines with loud short orders, and hometown folk. The decadence of Bourbon Street is still there, if anything even more expansive, as is the Old Absinthe House across from the Royal Sonesta Hotel—all stops I make when I am in town.

People wonder aloud if New Orleans is ready to resume its tourist trade and rejoin the convention circuit. Well, I have little pull with AERA, NCME or CCSSO (and perhaps even less influence on the readers of this blog) but I do know these conventions would succeed in New Orleans. The new convention center was a wonderful venue for APA and should be suitable for others as well. New Orleans is still a fun place to visit and the timing is ripe before all the tourists get back to town. I encourage y'all to visit and I will continue to lobby the conference planners as well.

Monday, August 07, 2006

Putting Context in Context

Not long ago, Dr. Robert L. Brennan, a professor and friend of mine, identified what he called the "context" of "context effects." He later published the research in Applied Measurement in Education. Ever since then, I have been intrigued with the notion that it is not the context effects themselves that play havoc with measurement, but rather it is the changing context in which the effects occur that seems to cause the problem.

Recently, a friend forwarded me an email that again piqued my interest in the role of context in learning and assessment. Here is part of the email:
Can you raed tihs?

i cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it dseno't mtaetr in waht oerdr the ltteres in a wrod are, the olny iproamtnt tihng is taht the frsit and lsat ltteer be in the rghit pclae.
I am told that this is the "trick" to speed reading—namely, chunking the information at a level of aggregation higher than individual letters (i.e., whole words). This phenomenon—being able to read such gibberish—makes sense when you stop and think about it. Context provides much of what we claim to be "understanding."
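If you want to reproduce the demonstration yourself, here is a small Python sketch; it assumes simple whitespace-delimited words and handles punctuation crudely, which is fine for illustration.

# Scramble the interior letters of each word, keeping the first and last
# letters in place, to reproduce the "can you read this" demonstration.
import random

def scramble_word(word):
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text):
    return " ".join(scramble_word(w) for w in text.split())

print(scramble_text("The phenomenal power of the human mind according to a researcher"))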

Take, for example, the following sentence segments: "You have a hot car." "It is very hot today." Technically, you can't tell what the definition of "hot" is without the context. As such, knowing the primary definition of the word "hot" will do very little, if anything, to help you understand the meaning behind these sentences. This dilemma is one of the reasons that reading is difficult to teach. It is also what causes most automated/computerized "readers" to fail.

However, Latent Semantic Analysis (LSA) is a promising, if not entirely new, way to use context to allow the computer to actually infer meaning from sentences or other snippets of text. As such, the computerized "readers" used in engines like the Intelligent Essay Assessor (IEA) provided by Pearson Knowledge Technologies (PKT) are likely to finally allow the computer to infer meaning from text (student generated or otherwise) at accuracy rates acceptable to the measurement community.
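For those curious about the machinery, here is a minimal sketch of the LSA idea using a tiny hand-built term-document matrix and a truncated singular value decomposition. The documents and the choice of two dimensions are invented for illustration; real engines such as the IEA obviously involve far larger corpora and additional processing.

# Minimal LSA sketch: term-document matrix -> truncated SVD -> compare
# documents in the reduced "semantic" space via cosine similarity.
import numpy as np

docs = [
    "the car engine was hot",
    "the weather today is hot",
    "the automobile engine overheated",
]

vocab = sorted({w for d in docs for w in d.split()})
# Rows = terms, columns = documents, entries = raw term counts.
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the two largest singular values
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T    # each row: a document in LSA space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare each pair of documents based on their reduced-dimension vectors.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine(doc_vecs[i], doc_vecs[j]), 3))

The point of the dimension reduction is that documents end up being compared by their patterns of word co-occurrence rather than by exact word overlap, which is how "car" and "automobile" can end up looking related even though they never appear together.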

But don't take my word for it; check out the research yourself: Reliability and Validity of the KAT™ Engine.

Monday, July 31, 2006

NCLB Testing: About Learning, Not Standings

The No Child Left Behind Act (NCLB) is perhaps the most sweeping and controversial educational reform ever enacted. As professionals working in the testing industry, we have both benefited and suffered from this legislation. To be sure, there are aspects of NCLB that are less than ideal, but sometimes criticisms of NCLB are taken too far beyond the facts.

Recently, a commentary highly critical of NCLB was published in the Wall Street Journal. The author, Charles Murray, claims that NCLB is “a disaster for federalism” and “holds good students hostage to the performance of the least talented.” Murray cites a report from the Civil Rights Project at Harvard University, which concludes that NCLB has not improved reading and math achievement as measured by the National Assessment of Educational Progress (NAEP). Murray further argues that although many state assessments show decreases in black–white achievement gaps, these decreases are meaningless because they are statistical artifacts based on changes in pass rate percentages rather than differences in test scores.

Murray’s criticism confounds the idea of measuring students against standards (criterion-referenced testing) with measuring students relative to a group (norm-referenced testing). In a norm-referenced system, if scores for white and black students increase the same amount, clearly the gap between the groups is not closing. But Murray fails to understand the basic tenets of standards-based assessment that provide the framework for NCLB. In each state, assessments are built to measure the state content standards, the very same content standards that schools are expected to use in their instruction. In addition, each state sets achievement standards that establish how well students must perform on the assessments to be considered “proficient” and “advanced.”

In a standards-based assessment, if black and white students improve equally, the percentage of blacks achieving proficient will, over time, increase more than the percentage of whites achieving proficient. Murray is correct to say this is “mathematically inevitable.” But, is it meaningless? Is it meaningless that more black students who were not proficient are proficient now? Is it meaningless that more minority students are learning fundamental reading and math skills that they weren't learning before?
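To see how both claims can be true at once, consider a minimal sketch with invented numbers: two normally distributed groups that start at different points relative to a fixed cut score and improve by exactly the same amount on the score scale.

# With a fixed cut score, equal improvement on the score scale produces a
# larger percentage-point gain in "proficient" for the group that starts
# with more of its students just below the cut. All numbers are invented.
from math import erf, sqrt

def pct_proficient(mean, cut=0.0, sd=1.0):
    # Percent of a normal(mean, sd) score distribution at or above the cut.
    return 100 * (1 - 0.5 * (1 + erf((cut - mean) / (sd * sqrt(2)))))

gain = 0.5  # identical improvement for both groups
groups = {"Group starting higher": 0.8, "Group starting lower": -0.2}

for name, mean in groups.items():
    before = pct_proficient(mean)
    after = pct_proficient(mean + gain)
    print(f"{name}: {before:.1f}% -> {after:.1f}% proficient "
          f"(gain of {after - before:.1f} points)")

The score gap between the groups has not changed, so the narrowing of the pass-rate gap is indeed "mathematically inevitable"; but the additional students clearing the standard in the lower-scoring group are real students demonstrating real proficiency.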

In disdaining measurement against standards and embracing norm-referenced comparisons, Murray's arguments evoke stereotypical assumptions—notably the assumption that there is “a constant, meaningful difference between groups” (as if this is some natural law and not based on divisions of class, privilege, and access to resources) and the statement that “they cannot all even be proficient” (i.e., what’s the point of trying, they can’t learn anyway).

In Texas, standards-based assessment preceded NCLB by more than a decade. Over time, Texas policy makers have revised their assessments several times. Their most recent program, the Texas Assessment of Knowledge and Skills (TAKS), was introduced in 2003. It is based on tougher content standards and imposes tougher achievement standards than any prior Texas assessment. Texas also did something unusual when introducing TAKS: the tougher achievement standards were phased in over several years, starting with standards that were two standard errors of measurement (SEM) below the level recommended by the standard-setting panels. Table 1 shows the percentage of students passing the exit-level mathematics test between 2003 and 2006 based on different standards: 2 SEM below the panel recommendation, 1 SEM below the panel recommendation, and at the panel recommendation. The bold pass rates correspond to the standard that was used in a particular year.

Table 1: Percent of Students Passing TAKS Grade 11 Mathematics – Spring 2003 to Spring 2006

[Table 1 not reproduced here.]

Murray would probably be delighted with Table 1, because just as he predicted, the difference between black and white pass rates depends on the standard that is applied. On the other hand, Murray would also have to admit that passing percentages of both blacks and whites improved each year, regardless of the performance standard one might care to apply. The rise in test performance indicates that students are being taught the necessary skills they weren’t learning before. This is far from meaningless.

One aspect of Table 1 that Murray might take special note of is the rather astonishing increase in pass rates between 2003 and 2004. This increase did not surprise Texas educators at all. It turns out that the requirement to pass the exit-level TAKS tests did not apply in the first year of testing. Thus, the dramatic increase in passing rates between 2003 and 2004 is probably due in part to instructional changes and in part to changes in student motivation. It also provides some context for considering the use of NAEP scores as criteria for evaluating NCLB: NAEP does not measure any state’s content standards and there is little or no incentive for students to give their best performance. As the only national test available, NAEP is a convenient and available yardstick, but it was not designed to evaluate state assessment systems and its use for that purpose is of limited validity.

The politics of NCLB are complex and tend to encourage extreme positions. In calling NCLB “uninformative and deceptive,” Charles Murray has taken an extreme position that fails to recognize the rationale and merits of standards-based assessment. Irrespective of one’s views about NCLB, it is important that the public debate be an informed one, and Charles Murray's rhetoric misses many of the issues that matter.

Monday, July 24, 2006

APA in the "Big Easy"

In an effort to continue the recovery following Hurricane Katrina, the American Psychological Association (APA) has scheduled its annual convention in New Orleans, August 10-13. There are still hotel rooms available, and it promises to be a good conference. (See the brochure on the APA website.)

This conference is not just for psychologists. In fact, the College Board's Dr. Kathleen Williams and yours truly will be presenting a paper on the affinity between human scoring and individualized assessment administration. This session will focus on the similarities between the large-scale scoring of writing assessments by human readers and the scoring of individualized assessments, which typically require a professionally trained examiner. For more information, search the online program for the session title "Challenges in Scoring Open-Ended Responses" on the APA Convention website.

Wednesday, July 19, 2006

Been Thinking About...

Dr. Joshua Aronson, an associate professor at NYU Steinhardt, was the keynote speaker at our last Iowa Educational Research and Evaluation Association (IEREA) conference, and I have been thinking about what he said ever since.

From his address, it is fair to say that Dr. Aronson does not believe that what psychometricians and test builders do to eliminate potential bias works. Rather, he talks about a bias that he labels "stereotype threat." That is, eliminating stereotypes and biases from the test itself is not enough—as long as examinees think there might be bias, they behave as if there were bias. Dr. Aronson provided the following quote:

"I knew I was just as intelligent as everyone else...but for some
reason I didn't score well on tests. Maybe I was just nervous. There's
a lot of pressure on you, knowing that if you fail, you fail your race."

–Texas State Senator Rodney Ellis

In experimental settings, Dr. Aronson and his colleagues have shown that when stereotype threat is present, minority students underperform relative to when stereotype threat is not present. For example, African-American students doubled their performance solving problems on verbal tasks when they were not asked to indicate their race. When they were asked to provide their race, their performance was cut in half!

As measurement experts, we are diligent regarding our procedures to make assessments fair. We need to keep thinking about ways we can improve.

Tuesday, July 11, 2006

CCSSO Session Results

I found this year's CCSSO Large-Scale Assessment Conference particularly useful. I have been a critic of this conference in the past, believing that there were too many people pontificating about "how" assessment should be done without actually having done any themselves. This year, however, the conference was a very good mix of policy/political insight, applied measurement research, program advice, and empirically driven "theoretical" research. In addition, the sessions I attended were standing room only, indicating that the attendees' interest was piqued. This must have been particularly true given the distraction of a venue as nice as San Francisco.

Don't take my word for it. Look over the papers and presentations posted at the CCSSO website. The sessions I attended or that were reported to be most interesting included: Monday Session 74 - "Using Technology to Create Innovative State Science Assessments: Pilots and Policy"; Monday Session 125 - "Measures of Student Achievement, Vertical Articulation, and the Realities of Large-Scale Assessment"; and Tuesday Session 38 - "What's Next In Online Testing?".

Thanks to CCSSO for posting these papers. This is a service we should all be excited about.

Thursday, July 06, 2006

First "Bulletin" Now Available

Pearson Educational Measurement (PEM) is proud to announce the first issue of our newly created Pearson Educational Measurement Bulletin. Our intent is to further the understanding of our industry and our profession by providing real-world explanations on pertinent topics related to test development, psychometrics, and educational assessment.

The first issue describes, in a non-technical way, the facts surrounding the current best measurement practice known as universal design (see PEM Research Report 05-04 for more information). This document links the requirements of NCLB with the desire to build the "least restrictive assessment environment" so that all students can participate in educational assessments fairly. Through universal design, assessments (both paper-based and electronic) will become more valid (supporting stronger inferences from student assessment results) because non-construct-related variance will be reduced.

Written by researchers at PEM, this is a good introduction to universal design for those unfamiliar with it or needing clarification, those needing a quick refresher, or those already familiar who want a brief reference or resource.

New issues will be posted regularly and will cover a vast array of subjects. Some will be simple answers to frequently asked questions. Others will be more instructional with step-by-step guidance on your favorite measurement topics. You might even mistake some papers for empirical research! The only way you can tell is by dropping by our website from time to time to see what new topics have been posted. Or, you can make it easier by signing up to be notified of new releases.

Saturday, July 01, 2006

National Education Computing Conference (NECC) in Sunny San Diego!

One conference that gets some attention from trade organizations but really provides rich information regarding technology, assessment, and learning is the annual National Education Computing Conference (NECC). This year's conference is in San Diego, July 4-7.

A host of Pearson organizations will be present with information and demonstrations: Pearson Educational Measurement (PEM), Pearson School Systems (including PowerSchool and Chancery) and others.

PEM will be presenting our Perspective series of informative score reporting services, PASeries Writing and Algebra I formative assessments, and enhancements to other PASeries assessments. Certainly too much fun to pass up!

Thursday, June 15, 2006

A Point of Interest at the CCSSO Large-Scale Conference

For those of you attending the CCSSO National Conference on Large-Scale Assessment, consider attending one of the presentations organized by Pearson Educational Measurement.

To all those contemplating growth scales, the session “Measures of Student Achievement, Vertical Articulation and Realities of Large Scale Assessment” should be of interest. The presentation is Monday, June 26th at 1:30 p.m. in the Elizabethan D Room.

This session addresses issues encountered when implementing growth scales across several separate statewide programs simultaneously. Implications for both state and federal policy will be discussed by two distinguished speakers: Dr. Edward Roeber from the Michigan Department of Education and Dr. Patricia L. Olson from the Minnesota Department of Education.

Drop by and learn about the challenges of applying growth models under NCLB.

Friday, June 02, 2006

New Measurement Resource

Inspired by the creative effort that has gone into each entry, Dr. Jon Twing and his staff have been looking into other formats to communicate the important issues facing educational measurement. So, coming this month, we will be announcing the Pearson Educational Measurement Bulletin, a regular examination of assessment and measurement topics. Look for the announcement on our website, www.PearsonEdMeasurement.com.

Monday, March 13, 2006

Basketball, Lesson Plans and Remembrance

I was fascinated with the article in the March 1, 2006 edition of Education Week titled “What John Wooden Can Teach Us,” written by Ron Gallimore (subscription required).

My fascination has many facets. I too remember my high school basketball coach. I remember the painstaking care he took in teaching us everything from the discipline of the free throw, to the intricacies of breaking a press, to the simple need to wear a hat after practice when walking home in the cold winter darkness. What is more impressive, however, is the validation of his efforts by Dr. Gallimore’s points. Practice did not just happen; instruction did not just happen; even drills did not just happen. These were planned, implemented, monitored, evaluated and adjusted by not only my coach but by his assistant coaches as well. Dr. Gallimore points out that these same procedures are the steps “master teachers” may take when designing an effective lesson plan. It’s funny how much we enjoy basketball but seldom enjoy lesson planning.

More information about/from Ron can be found at http://www.gseis.ucla.edu/faculty/pages/gallimore.htmll and http://www.llri.org/html/staff_gallimore.htm.

Tuesday, January 31, 2006

Where are all the “Chicken Littles” now?

It was not all that long ago that I recall suffering all the jokes from the media, professors, and colleagues alike regarding the meaning of No Child Left Behind, or NCLB. One professor joked that it should really be called “No Testing Company Left Behind.” One reporter suggested to me that it was really “No Psychometrician Left Unemployed.” It seems there was no dearth of wit or “the sky is falling” doomsday reckoning regarding my industry’s capacity to meet the requirements of NCLB. With all fifty states required to implement assessments in grades 3–8 in reading and mathematics—not to mention all the other requirements for ELL students, special needs students, and Peer Review—many speculated that there simply were not enough psychometricians, content developers, and other measurement experts to fulfill the need.

Well, where are these “Chicken Littles” now, especially in light of the recent announcements by both Harcourt and CTB/McGraw-Hill regarding substantial layoffs in their assessment divisions? While this unfortunate news was reported, I have not seen even one headline like the following: “Nay-sayers Were Wrong, Excess Capacity in Assessment Industry Confirmed.” I find our collective short-term memory and lack of dissonance over inconsistencies reported in our news media and research reports to be very disconcerting. But then, I guess I have always had a different perspective.