Wednesday, April 15, 2009

Pearson at NCME & AERA

Pearson once again dominates presentations (well, we have a lot anyway) at the annual meetings of the National Council on Measurement in Education (NCME) and the American Educational Research Association (AERA).


NCME


Um, K., Way, W. D., Fitzpatrick, S. J., & Kreiman, C.
The Effects of Response Probability Criteria on the Scale Location Estimation and Impact Data in Standard Setting

Wan, L., & Henly, G.
Measurement Properties of Innovative Item Formats in a Computer-Based Science Test

Arce-Ferrer, A.
An Investigation of Traditional and Alternative Approaches to Vertically Scale Modified Angoff Cut Scores

Seo, D., Shin, S., Taherbhai, H., & Sun, Y.
Exploring and Explaining Gender Format Differences in English as a Second Language Writing Assessment Using Logistic Mixed Models

Shin, C. D., Ho, T., Chien, Y., & Deng, H.
A Comparison of Person-Fit Statistics in Computerized Adaptive Test Using Empirical Data

Turhan, A., Courville, T., & Keng, L.
The Effects of Anchor Item Position on a Vertical Scale Design

Tong, Y. & Kolen, M.
A Further Look into Maintenance of Vertical Scales

Turhan, A., Lin, C., O’Malley, K. & Kolen, M.
Vertical Scaling for Paper and Online Assessments

Kahraman, N., & Thompson, T.
Relating Unidimensional IRT Parameters to a Multidimensional Response Space: A Comparison of Two Alternative Dimensionality Reduction Approaches

Meyers, J. L., Turhan, A., & Fitzpatrick, S. J.
Interaction of Calibration Procedure and Ability Estimation Method for Writing Assessments under Conditions of Multidimensionality

Shin, C. D., Chien, Y., & Way, W. D.
The Weighted Penalty Model and Conditional Randomesque Method for Item Selection in Computerized Adaptive Tests

Yi, Q.
The Impact of Ability Distribution Differences Between Beneficiaries and Non-Beneficiaries on Test Security Control in CAT

Mao, X. & Fitzpatrick, S. J.
An Investigation of the Linking of Mathematics Tests With and Without Linguistic Simplification

Ming, X., Wang, J., & Wu, S.
A Predictive Validity Study of an English Language Proficiency Test

Arce-Ferrer, A.
Anchor-Test Design Issues in Equating

Chien, Y., Shin, C. D., & Way, W. D.
Weighted Penalty Model for Content Balancing in CAT

Keng, L. & Dodd, B. G.
A Comparison of the Performance of Testlet-Based Computer Adaptive Tests and Multistage Tests

Song, T. & Arce-Ferrer, A.
Comparing IPD Detection Approaches in Common-Item Non-Equivalent Group Equating Design

Wang, C., Wei, H., & Gao, L.
Investigating the Effects of Speededness on Test Dimensionality

Wei, H.
The Effect of Test Speededness on Item and Ability Parameter Estimates in Multidimensional IRT Models

McClarty, K., Lin, C., & Kong, J.
How Many Students Do You Really Need? The Effect of Sample Size on the Matched Samples Comparability Analysis

Thompson, T.
Scale Construction and Conditional Standard Errors of Measurement

Ye, F., You, W., & Xu, T.
Multilevel Item Response Model for Longitudinal Data


NCME Training Sessions


Kolen, M. & Tong, Y.
Vertical Scaling Methodologies, Applications and Research


NCME Discussants


Shin, C. D.
Standard Errors of Equating

Way, W. D.
Current Practices in Licensure and Certification Testing

Tong, Y.
Test Equating with Constructed-Response Items and Mixed-Format Tests


NCME Moderators


Dolan, R.
Comparability of Paper-and-Pencil and Computer-Based Exams

Nichols, P. D.
Modifications of Traditional Methods of Setting Standards


AERA


Fulkerson, D., Nichols, P. D., Mislevy, R., Liu, M., Zalles, D., Fried, R., Villalba, S., Debarger, A., Cheng, B., Mitman, A., Haertel, G., & Cho, Y.
Research Findings: Leveraging ECD in Scenario-Based Science Assessments

Dolan, R., Way, W. D., & Nichols, P. D.
Technical Quality of Formative Assessments Within an Online Instructional Tool

Fulkerson, D., Mittelholtz, D., & Nichols, P. D.
The Psychology of Writing Items: Improving Figural Response Item Writing

Lau, C., Zhang, L. & Jiang, X.
Using Pass/Fail Pattern to Predict Students’ Success for Standards: A Longitudinal Study with Large-Scale Assessment Data

Stephenson, A. & Song, T.
Using HLM to Investigate Longitudinal Growth of Students’ English Language Proficiency

Beretvas, S. N., Chung, H., & Meyers, J. L.
Modeling Rater Severity Using Multiple Membership, Cross-Classified, Random-Effects Models

Harrell, L. & Wolfe, E. W.
A Comparison of Global Fit Indices as Indicators of Multidimensionality in Multidimensional Rasch Analysis

Yue, J., Creamer, E., & Wolfe, E. W.
Measurement of Self-Authorship: A Validity Study Using Multidimensional Random Coefficients Multinomial Logit Model

Murphy, D. L.
Multilevel Growth-Curve Modeling: A Power Analysis of the Unstructured Covariance Matrix

Arce-Ferrer, A., Song, T., & Sullivan R.
Linking Strategies and Item Screening Approaches: A Study with Augmented Nationally Standardized Tests Informing NCLB

Shin, C. D. & Chien, Y.
Conditional Randomesque Method for Item Exposure Control in CAT

Thompson, T. & Way, W. D.
Using CAT to Achieve Comparability with a Paper Test

Yang, Z. & Wang, S.
Calibration Methods Comparison with the Rasch Model

Shin, C. D.
Using Bayesian Sequential Analyses in Evaluating the Prior Effect for Two Subscale Score Estimation Methods

Beimers, J.
Consistency of District Adequate Yearly Progress (AYP) Determinations Across Three Types of NCLB Growth Models

Wang, Z., Taherbhai, H., Xu, M. & Wu, S.
Modeling Growth in English Language Proficiency with Longitudinal Data Using the Latent Growth Curve Model

Way, W. D.
Increased Use of Technology in Testing

McGill, M. & Wolfe, E. W.
Assessing Unidimensionality in Item Response Data via Principal Component Analysis of Residuals from the Rasch Model

Ingrisone, J.
Modeling the Joint Distribution of Response Accuracy and Response Time

Ingrisone, S.
An Extended Item Response Model Incorporating Item Response Time

Sung, H.
Developing a Short Form of the Enright Forgiveness Inventory Using Item Response Theory

Arce-Ferrer, A., Wang, Z., & Xue, Q.
Applying Rasch Model and Generalizability Theory to Study Modified Angoff Cut Scores for Reporting with Vertical Scales

McGill, M., Wolfe, E. W., & Jarvinen, D.
Validation of Measures of the Quality of Mentoring Experiences of New Teachers

Jiao, H., Wang, S., Wan, L. & Lu, R.
An Investigation of Local Item Dependence in Scenario-Based Science Assessments

Tsai, T. & Shin, C. D.
Generalizability Analyses of a Case-Dependent Section in a Large-Scale Licensing Examination


AERA Discussants


Tong, Y.
Factors Influencing Equating Accuracy

Nichols, P. D.
Standards, Proficiency Judgments, and Norms


AERA Chairs


Mueller, C.
Studies Examining Achievement Gaps

Monday, March 09, 2009

NAEP: Love It or Leave It!

As the national debate about what to do with education reform rages, I repeatedly hear calls for national standards, common core standards, or at the very least a more standardized system of national guidelines so that we can measure and, presumably then, improve education in America. Often this discussion leads to debate about the merits of a national test, which is often assumed to be NAEP, sometimes called the "nation's report card."

NAEP is well researched, well documented, and seems to be well loved, if not revered, by most psychometricians, other than me and a few others who dare to challenge the status quo. I have questioned the usefulness of NAEP as a "check test" for NCLB at various times in my career, all based on the following premises:
  • Students who take NAEP are essentially unmotivated, whereas most students are highly motivated to pass mandated state assessments.
  • NAEP essentially measures a "consensus" national curriculum, whereas state assessments measure very specific content standards, which presumably align with or mirror instruction.
  • NAEP is administered to a specific sample of students, each of whom takes only a portion of the assessment, whereas all students take the complete statewide assessment.
  • NAEP was targeted to measure at a higher level of proficiency (for example, 39% of all students were at or above proficient on NAEP Mathematics in 2007), whereas for most statewide assessments the percentages were much larger. (A small numerical sketch of this point follows the list.)
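To make that last bullet concrete, here is a minimal numerical sketch of my own; it is not taken from Dr. Ho's article, and every number in it is hypothetical. It assumes student achievement follows a normal distribution and shows, in a few lines of Python, how the same underlying growth translates into different changes in "percent proficient" depending on where a test's cut score sits.

    # A minimal sketch, not from Dr. Ho's article; all numbers are hypothetical.
    # It shows how identical growth in the underlying distribution produces
    # different changes in "percent proficient" when two tests place their cut
    # scores at different points of that distribution.
    from scipy.stats import norm

    mean_y1, mean_y2, sd = 0.0, 0.10, 1.0   # same true growth of 0.10 SD for the cohort

    # Back out hypothetical cut scores from assumed year-1 proficiency rates.
    cut_naep_like = norm.ppf(1 - 0.39)      # high cut: about 39% proficient in year 1
    cut_state_like = norm.ppf(1 - 0.85)     # lower cut: about 85% proficient in year 1

    for label, cut in [("NAEP-like", cut_naep_like), ("state-like", cut_state_like)]:
        p1 = 1 - norm.cdf(cut, loc=mean_y1, scale=sd)
        p2 = 1 - norm.cdf(cut, loc=mean_y2, scale=sd)
        print(f"{label:10s} {p1:5.1%} -> {p2:5.1%} proficient (change {p2 - p1:+.1%})")

Under these made-up numbers, the same 0.10 standard deviation of growth shows up as roughly a four-point gain in percent proficient on the high-cut test but only about a two-point gain on the low-cut test, which is one reason percent-proficient trends on NAEP and on a state test need not move in lockstep.
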
When I speak with my colleagues, most of them say things like: Lighten up! NAEP is great! Math is math. And so on. Oddly enough, for a group of well-trained colleagues who claim to be scientists, such indefensible positions leave me wanting. The good news is that there is some sensible research in the measurement literature. I am referring to the article by Dr. Andrew Ho, from the University of Iowa, published in Educational Measurement: Issues and Practice. In this article, entitled Discrepancies Between Score Trends from NAEP and State Tests: A Scale-Invariant Perspective, Dr. Ho provides a balanced and well-reasoned argument regarding the usefulness of NAEP for such comparisons. I provide a few quotes (albeit taken out of context) so that your curiosity will be piqued and you will seek out and read his entire article.

“Given a perspective that NAEP and State tests are designed to assess proficiency along different content dimensions, State-NAEP discrepancies are not cause for controversy but a baseline expectation.”

“Trends for a high-stakes, or ‘focal’ test, may differ from trends on a low-stakes test like NAEP (an ‘audit’ test) for a number of reasons, including different ‘elements of performance’ sampled by both tests, different examinee sampling frames, or differing changes in student motivation.”

“As NAEP adjusts to its confirmatory role, there must be an additive effort to temper expectations that NAEP and State results should be identical.”

My colleagues who ignore this last conclusion of Dr. Ho's are doing the policy makers and implementers of the NCLB "law of the land" a disservice by suggesting that statewide assessments are somehow inferior simply because their results are not replicated on NAEP. Let's get back to speaking about the science of assessment and experimental comparison and leave the passion and politics to someone else.

Monday, November 03, 2008

Policy Wonks We Are—Implications for NCME Members

This is the "full text" of a contribution I made to the NCME newsletter. I thought you might like to get the full inside story!

Mark Reckase's call for NCME members to become more involved in educational policy is timely and relevant, while perhaps also a little misleading. For example, some of my colleagues and I have been working with states, local schools, and the USDOE on implementing policy decisions for many years. Testifying at legislative hearings, making presentations to Boards of Education, reviewing documents like the Technical Standards, and advising policy makers are all examples of how psychometricians and measurement experts already help formulate and guide policy. Nonetheless, I still hear many members of technical advisory committees (experts in psychometrics and applied measurement) “cop out” when asked to apply their experience, wisdom, and expertise to issues related to education policy, often claiming that they are technical experts and that the question at hand is “a matter of policy.”

I have commented before, and I still believe, that we no longer live in a world where the policy and technical aspects of measurement can remain independent. In fact, some good arguments can be made that when such independence (perhaps bordering on isolation) between policy and good measurement practice exists, poor decisions can result. When researchers generate policy governing the implementation of ideas, they must carefully consider a variety of measurement issues (e.g., validity, student motivation, remediation, retesting, and standard setting) to avoid disconnects between what is arguably a good purpose (e.g., the rigorous standards of NCLB) and desired outcomes (e.g., all students meeting standards).

In this brief text, I will entertain the three primary questions asked by Dr. Reckase: (1) Should NCME become more involved in education policy? Why or why not? (2) How should other groups and individuals in the measurement community be involved in education policy? (3) What resources and supports are necessary to engage measurement professionals in education policy conversations? In what ways should NCME be involved in providing these?

I think I have already answered the first question, but let me elaborate. I maintain that we measurement professionals are already involved in policy making. Some of us influence policy directly (as in testifying before legislatures developing new laws governing education). Some of us influence policy in more subtle ways, by researching aspects of current or planned policy that we do not like or endorse. We often use conference presentations as a venue to voice our opinions regarding what we think is wrong with education and how to fix it, which inevitably means making a policy recommendation.

Not only do I believe that NCME and its members are involved in policy making, but I also believe it is critically important for all researchers and practitioners in the measurement community to seek out opportunities to influence relevant policy. I recall recently being involved in some litigation regarding the fulfillment of education policy and the defensibility of the service provider’s methods. After countless hours of preparation, debate, deposition, and essentially legal confrontation, I asked my colleague (also a measurement practitioner) why we bother defending best practice when there are so many agendas, so many different ways to interpret policy, so many points of view regarding the “correct way” to implement a measure. Her response was surprising—she said we do it because it is the “right thing to do” and that if we stop defending the right way to do things, policy makers will make policy that is convenient but not necessarily correct. Her argument was not about defining right from wrong; her argument was that if we were not there instigating debate there would be none, and resulting decisions would most likely be poorly informed.

So, my simple answer to the second question is to get involved. If you don't like NCLB, what did you do to inform the policy debate before it became the law of the land? If you think current ESL, ELL, or bilingual education is insufficient to meet the demands of our ever-increasing population in these areas, what are you doing to help shape the policies affecting them? Across the country, debate rages regarding the need for “national standards” or state-by-state comparability. Why aren't NCME, AERA, and all the other organizations seemingly affected by such issues banding together to drive the national debate? Do we not all claim to be researchers? If so, is not an open debate what we want and need? When was the debate in which it was decided that the purpose of a high school diploma was college readiness? When did we agree to switch the rhetoric from getting everyone “proficient” by 2014 to getting everyone “on grade level” by 2014? The input of measurement experts was sorely missing in state legislation regarding these issues. It is still desperately needed.

For the purpose of this discussion, let's assume that all measurement and research practitioners agree that we need to take part in policy discussions directly. What resources, tools, and/or procedures can we use to facilitate these discussions, and how can NCME help? I stipulate that there is a feeling of uneasiness surrounding the engagement of researchers and measurement practitioners in policy debates or decisions.

Perhaps this is an unfounded concern, but there seems to be an air, forgive me, of such debates being below our standards of scientific research. Policy research is very difficult (to generate and to read), so why leave the comforts of a safe “counterbalanced academic research design” to mingle with such “squishy” issues as the efficacy of policy implementation? Perhaps NCME could establish a division or subgroup on Federal and State Policy that would focus on measurement research as it applies to education policy (policy, lawmaking, and rule implementation) to lend more credibility to such a scientific endeavor. Maybe NCME could work with other groups with similar interests (like AERA, ATP, and CCSSO) and maybe even get a spot in the cabinet of the next Secretary of Education for the purpose of promoting the credibility of measurement research and application for informing policy. Perhaps less ambitious steps, like including more policy research in measurement publications, sponsoring more policy discussions at national conventions, and encouraging more policy-related coursework in measurement-related Ph.D. programs, would be a good place for NCME (and other organizations) to start.

Let me close with a simple example of why this interaction between applied measurement and education policy is so important. Many of you are firm believers in the quality of the NAEP assessments. Some of you have even referred to NAEP as the “gold standard” for assessment. NAEP is arguably the most researched and highest quality assessment system around. Yet, to this day many of my customers (typically the educational policy makers and policy implementers in their states) ask me simple questions: Why is NAEP the standard of comparison for our NCLB assessments? NAEP does not measure our content standards very well; why are our NAEP scores being scrutinized? What research exists demonstrating that NAEP is a good vehicle to judge education policy—both statewide and for NCLB?

Don't get me wrong. My argument here is not against NAEP or the concept of using NAEP as a standard for statewide comparability. My question is why my customers, the very people making educational policy at the state level, were not at the table when such issues were being debated and adopted. Did such a debate even take place? As measurement experts, when our customers come to us for advice or guidance, or with a request for research regarding the implementation of some new policy, I believe it is our obligation to know and understand the implications of such a request from a policy point of view, not just a measurement point of view. Otherwise, we will be acting in isolation and increasing the divide between sound measurement practice and viable educational policy.

Wednesday, July 09, 2008

Why I Stopped Reading Editorials

I gave up reading editorials quite a long time ago. Not because they are too often misleading or inaccurate (many of them are), or because they are too often purposely written to be controversial and sensational (again, many of them are). Rather, I quit reading because the whole exercise of editorials seems rather futile to me.

Let me explain. People who write editorials usually have a strong position, with reasons and rationales for why they feel that way. Informed readers of editorials either agree with that position and its reasons and rationales, or they disagree, usually from a strong position with a viewpoint directly opposite that of the editorial writer. In either case, the editorial does little to change anyone's opinion; it just stirs up a lot of emotion. Therefore, the only people who might benefit from reading editorials are those who have not yet made up their minds. However, if the topic is well enough defined to cause a debate in the editorial pages, I wonder how many people really have no position or opinion. Hence, the futility. So I just quit reading them.

Occasionally, friends, family, colleagues, or even readers of TrueScores send me editorials and ask for a reaction or an opinion. Not too long ago this happened regarding Dr. Chris Domaleski's op-ed "Tests: Some good news that you may have missed," from May 29, 2008, in the Atlanta Journal-Constitution. Chris is a colleague, customer, and friend of mine, and I found his comments very well written and well supported, and his message very helpful for all those impacted by testing in Georgia. His message, simplified and summarized, was this: testing is complicated, necessary, and beneficial, and ill-informed rhetoric does not help improve learning. (This is my summary of his message and not his own words.)

Unfortunately, it would seem the ill-informed rhetoric continues. I am referring to Michael Moore's post on SavannahNow, called "Politics of school testing." It is too bad, but apparently Mr. Moore did not read Dr. Domaleski's comments. First, Mr. Moore claims that the state "blindsided" the schools regarding the poor results on the state's CRCT. I don't know how this can be, as the law of the land has required that states move to "rigorous" content standards and, further, expects that no child be left behind in attaining those standards. Georgia has implemented a new curriculum with teacher and educator input. Field testing, data review, content review, and alignment reviews have been conducted by educators across Georgia, all under the "Peer Review" requirements of the federal NCLB legislation. Passing standards were established with impact data and sanctioned by the State Board of Education. How can anyone be blindsided by such an open and public process?

Mr. Moore also states that he has seen no analysis of the assessment and no discussion of how "...a curriculum and test can be so far out of line." Hmm… I wonder if Mr. Moore is not more upset with the poor performance of the students. It could be that the curriculum and the assessment fit together very well. In fact, the required alignment studies, as well as the educators working with the Department to review the items, should ensure they are aligned. Since the curriculum is new, perhaps the students have not learned it as well as they should.

Mr. Moore then mistakenly claims that the CRCT in Georgia is constructed out of a huge bank of questions that the test service provider (in this case CTB/McGraw-Hill) owns, and that this is part of a "...larger national agenda." I am not much into conspiracy theories, but a quick review of the solicitation seeking contractor help would reveal that the test questions are to be created for use and ownership in Georgia only. Mr. Moore also claims that the multiple-choice format "...seldom reflect the actual goals of the standards." I admit that some things, such as direct student writing, are difficult to measure with multiple-choice test questions; yet many aspects of the learning system do lend themselves to objective assessment via multiple-choice and other objective test questions.

I don't want to get into a debate with Mr. Moore about how the State of Georgia manages the trade-offs between budget pressures (multiple-choice questions are much less expensive in total than subjective but rich open-ended responses) and curriculum coverage of more difficult aspects of the curriculum he outlines, such as inquiry-based activities. It is an over simplification, however, to simply dismiss the issues and suggest or imply that all would be well if Georgia abandoned objective measures.

At the end of the day, I disagree with Mr. Moore and agree with Dr. Domaleski that less rhetoric and more fact-based discussions are needed. If we build the test to measure the curriculum, and the curriculum is new and rigorous, it is unlikely that students will perform well at first. If we build a test on which all students perform well, what good does a new and rigorous curriculum do us? Students will receive credit without learning.

Wednesday, June 04, 2008

The Academic Debate about Formative Assessments

There are some things in educational measurement that are not debated. Foremost, the purpose of instruction is to improve learning. The purpose of assessment is to improve instruction, which in turn improves learning. In other words, it’s all about the learning—debate over.

Some researchers (myself included) have become sloppy with our language, labeling assessments "for learning" as formative and assessments "of learning" as summative. So, under this lax jargon, a multiple-choice quiz used by the teacher in the classroom at the end of instruction for the purpose of tailoring additional instruction would be deemed "formative." If you follow the rhetoric from national "experts," technical advisory committees, or other learned people, then I have just offended many!

Currently, there is much discussion regarding formative assessments and the need to balance the multitude of assessments that might be used during a school year. A good place to start might be with the paper by Perie et al. (2006) posted to the CCSSO SCASS website. You and I might not agree with the classifications or the terminology, but the classification scheme used by these authors helps to contextualize the debate quite well and may even allow you to make up your own mind.

What does, however, put peanut butter into my cognitive gears is all of the arguing and wasted effort I hear regarding what exactly does or does not constitute a "real" formative assessment. I even heard one nationally recognized measurement expert comment that, by definition, no assessment constructed by anyone other than a teacher can be called a formative assessment. I try to remind myself (and others) that at the end of the day only one thing matters: What have you done to improve learning? I doubt that arguing about definitions of formative, benchmark, or interim assessments helps with this.