Tuesday, September 28, 2010

Trends in Alternate Assessments

While working for Pearson, we have seen a shift in the assessment of students receiving special education services that, for the most part, seems to be in the right direction. The movement away from off-grade level testing was at first met with resistance and disbelief from state departments, contractors, and teachers but now seems to be reluctantly accepted. The introduction of the 1% (alternate achievement standards) and 2% (modified achievement standards) assessments forced states to look at the number of students taking special education tests. Federal accountability allows states to count as proficient only 3% of their student population tested outside the general assessment. More than 3% of students may take the 1% and 2% assessments, but proficient scores beyond the 3% cap will not count as proficient for school, district, and state Adequate Yearly Progress (AYP) reporting. Some states had a larger percentage of students being assessed with special education tests, and participation requirements or guidelines became necessary to decrease the number of students taking these assessments and to help ensure that students were being assessed appropriately. There is still considerable controversy about whether students are being assessed with the right test.
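The cap arithmetic described above can be sketched in a few lines. This is only an illustration of the simplified rule as stated in this post (a single combined 3% cap); the function name, parameters, and enrollment figures are hypothetical, and actual AYP calculations involve details not covered here.

```python
def countable_proficient(total_enrolled, alt_proficient, cap_percent=3):
    """Split proficient alternate-assessment scores into counted and excess.

    Assumption: one combined cap (3% of enrollment) applies to proficient
    scores from the 1% and 2% assessments, as described in the post.
    """
    # Integer arithmetic avoids floating-point rounding at the cap boundary.
    cap = total_enrolled * cap_percent // 100
    counted = min(alt_proficient, cap)
    excess = alt_proficient - counted  # proficient scores that do not count for AYP
    return counted, excess


# Hypothetical district: 10,000 students, 350 proficient on alternate tests.
counted, excess = countable_proficient(10_000, 350)
print(counted, excess)
```

In this hypothetical case the cap is 300 students, so 50 proficient scores would fall outside what can be reported as proficient.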

The 1% assessment (designed for students with the most significant cognitive disabilities) allows the assessment of academic content standards through content that has been linked to grade-level standards. Common approaches to the 1% assessment have included checklists, portfolios, and structured observation, all requiring substantial teacher involvement. Assessment of these students using pre-requisite skills linked to the grade-level curriculum was not popular with teachers. Prior to the No Child Left Behind Act (NCLB), many of these students had been taught life skills, and assessing academic content was criticized as being unimportant for this student population. However, once teachers started teaching the pre-requisite skills associated with the grade-level curriculum, we heard many positive reports. Teachers were surprised to find that their students could handle the academic content. One of the most common things heard at educator meetings was teachers saying that they never knew their students could do some of the things now being taught.

Providing psychometric evidence to support the validity and reliability of the 1% assessment has been challenging. The student population taking the 1% assessment is a unique and varied group. Creating a standardized test to meet the needs of the population requires assessment techniques (checklists, portfolios, and structured observations) that require a great deal of teacher involvement and fall outside the more traditional types of psychometric analyses used for multiple-choice assessments. The role of the teacher in the assessment of the 1% population is a source of controversy because of their high level of involvement. In order to develop a standardized test that requires a great deal of input from the teacher, additional research studies to evaluate reliability and validity should be included as part of the test development process. Approaches that we have seen used include interrater reliability studies and validity audits. These types of studies provide evidence that the assessments are being administered as intended and that teachers are appropriately evaluating student performance. These approaches provide evidence that the results of the 1% assessments are a true indication of what the student can do rather than what the teacher says the student can do.
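As an illustration of what an interrater reliability study might compute, here is a minimal sketch of Cohen's kappa, which measures agreement between two raters beyond what chance would produce. The post does not say which statistic was used, so the choice of kappa is an assumption, and the ratings below are made-up example data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same set of student tasks."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of tasks on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical checklist scores (0-2) from a teacher and a second observer.
teacher = [2, 2, 1, 0, 2, 1, 0, 1]
observer = [2, 1, 1, 0, 2, 1, 0, 2]
print(round(cohens_kappa(teacher, observer), 3))
```

A value near 1 would suggest the teacher's ratings are reproducible by an independent observer; a value near 0 would suggest agreement no better than chance.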

The federal government’s legislation to allow an optional 2% assessment (for students who are not able to progress at the same rate as their non-disabled peers) has been met with varying levels of acceptance. At this time, only two states have received approval of their 2% assessments, and 15 states are in the process of developing one. However, there is some talk that with the reauthorization of the Elementary and Secondary Education Act (ESEA) and the movement towards Common Core, the 2% assessment will go away. The communication from the Department of Education in the context of Race to the Top and Common Core has been that the students participating in 2% assessments should be able to participate in the new general assessments developed for the state consortia. However, actual changes to the NCLB assessment and reporting requirements will have to be legislated, most likely during ESEA reauthorization.

Many states simply did not bother to develop a 2% test. It’s an expensive endeavor for an optional assessment. Those states that have developed (or are in the process of developing) a 2% test have struggled to find a cost-effective approach to setting modified achievement standards and to modifying or developing grade-level items that are accessible to this group of students.

There seems to be a need for an assessment that falls between the 1% test and the general test offered with accommodations. But there are differences of opinion about how students should be performing on the 2% test. If students perform poorly, is that to be expected because they shouldn’t be able to perform well on grade-level material, or does that indicate that the test has not been modified enough for these students to show what they know? If students perform well on the assessment, does that mean that the modifications have been done well or that the wrong students are taking the test? We would like to think that the intent of the legislation was for states to develop a test that assesses grade-level content in such a way that students could be successful. Even so, we have heard the argument that even if students taking the 2% test are not doing well, they are still performing better than they would have on the general test.

Most 2% assessments developed or in development use a multiple-choice format, and traditional psychometric analyses associated with multiple-choice items work well here. But there have been discussions about what the data for a 2% test should look like. Of particular interest is whether a vertical scale should or could be developed for a 2% assessment. Recent investigations show that the vertical scale data do not look like vertical scale data seen on general assessments, but it is unclear whether this is a problem or whether this is to be expected. Our initial recommendation to one state was not to develop a vertical scale since the vertical scale focuses on a year’s worth of student growth and a year’s worth of growth for a 2% student may be very different from what we see in the general population. But after collecting vertical scale data for that 2% assessment, the data looked better than expected though not close enough to a general assessment vertical scale to recommend its implementation. Further research is being conducted.

Growth models for both the 1% and 2% assessments are also being developed. Again, the type of growth expected from the students taking these assessments is questionable, especially for the 1%. The question is how to capture the types of growth these students do show. Models are being implemented now, and we are curious to see what the evaluation of these models will show. Are we able to capture growth for these students?

As the lifespan of alternate assessments under NCLB has increased, they have received increasing scrutiny under peer review and by assessment experts. The tension between creating flexible, individualized assessments and meeting technical requirements for validity and reliability has led to increased structure, and often increased standardization, in the development of alternate assessments. Yet, for myriad reasons, alternate assessments do not, and should not, look like the current primarily multiple-choice format of general assessments. The unique nature of alternate assessments has allowed psychometricians and other researchers to better understand non-traditional measurement. Providing reliability and validity evidence for assessments with alternate and modified achievement standards has required innovative thinking, and that thinking has already been informing assessment design ideas for the Common Core assessment systems, which are expected to be innovative, flexible, and, to some extent, performance based.

Natasha J. Williams, Ph.D.
Senior Research Scientist
Psychometric and Research Services

Thursday, September 16, 2010

Under Pressure

I cannot seem to get the rhythmic refrain of the famous Queen/David Bowie song “Under Pressure” out of my head when thinking about Common Core and the Race to the Top (RTTT) Assessment Consortia these days. Yes, this is a remarkable time in education, presenting us with opportunities to reform teaching, learning, assessment, and the role of data in educational decision making, and all of those opportunities come with pressure. But when Bowie’s voice starts ringing in my head, I am not thinking about those issues. I am instead worried about a relatively small source of pressure the assessment systems must bear: one that amounts to about 2% of the target population of test takers.

Currently, under the Elementary and Secondary Education Act (ESEA), states are allowed to develop three categories of achievement standards: general achievement standards, alternate achievement standards, and modified achievement standards. These standards all refer to criteria students must meet on ESEA assessments to reach different proficiency levels. Modified achievement standards only became part of the No Child Left Behind Act (NCLB) reporting options in 2007* after years of pressure from states. It was felt that the general assessment and alternate assessments did not fully meet states’ needs for accurate and appropriate measurement of all students. There were many students for whom neither the alternate assessment nor the general assessment was appropriate. These kids were often referred to as the “grey area” or “gap” kids.

I do not think anyone would have argued that the modified achievement standards legislation fully addressed the needs of this group of kids, but it did provide several benefits. States that opted to create assessments with modified achievement standards were able to explicitly focus on developing appropriate and targeted assessments for a group of students with identifiably different needs. The legislation also drew national attention in academics, teaching, and assessment to the issue of “gap” students. This raised important questions, including:

-- Which students are not being served by our current instructional and assessment systems?
-- Is it because of the system, the students, or both?
-- What is the best way to move underperforming students forward?

In the relatively short time since legislative sanction of modified assessments, significant amounts of research and development have been undertaken. However, as I asserted that the legislation did not fully meet the needs of “gap” kids, I also assert that the research and development efforts have yet to unequivocally answer any of the questions that the legislation raised. Though research has not yet answered those questions, this does not mean that the research has not improved our understanding of the 2% population and how they learn. And it does not mean that we should stop pursuing this research agenda.

Now, in the context of the RTTT Assessment competition, the 2% population seems to be disappearing, or is being re-subsumed into the general assessment population. I do not think that the Education Department means to decrease attention on the needs of students with disabilities or to negatively impact students with disabilities. There is still significant emphasis given to meeting the needs of students with disabilities and consistently underperforming students in the RTTT Assessment RFP and in the proposals submitted by the two K-12 consortia. However, the proposals do seem to indicate that the general assessment will need to meet the needs of these populations, offering both appropriate and accurate measurement of students’ Knowledge, Skills, and Abilities (KSAs) and individual growth. I wonder how much attention these students will receive in test development, research, and validation efforts when the test developers are also tasked with creating innovative assessments, designing technologically enhanced interactive items, moving all test-takers online, and myriad other issues. The test development effort was already under significant pressure before the needs of students previously involved in assessments with modified achievement standards were lumped in.

I applaud the idea of creating an assessment system that is accessible, appropriate, and accurate for the widest population of students possible. I also hope that the needs of all students will truly inform the development process from the start. However, I cannot help worrying. We are far from finished with the research agenda designed to help us better understand students who have not traditionally performed well on the general assessment. With so many questions left unanswered, and with so many new test development issues to consider, I hope that students with disabilities and under-performing students are not, once again, left in a “gap” in the comprehensive assessment system.

* April 19, 2007 Federal Register (34 C.F.R. Part 200) officially sanctioned the development of modified achievement standards.

Kelly Burling, Ph.D.
Senior Research Scientist
Psychometric and Research Services