Assessment Bibliography

Assessment is a critical and integral part of education. Any effective educational system must use reliable, valid, and fair assessments that convey accurate information about student learning to educators and other stakeholders. There are two major types of assessments: summative and formative. Summative assessments are administered at predetermined times to evaluate student learning; common examples of summative assessments are the statewide end-of-year tests that are increasingly being used for accountability purposes. Formative assessments, on the other hand, are used to help shape instruction and student learning experiences. Much recent research focuses on formative assessments, which both theory and empirical evidence suggest play important roles in enhancing and supporting student learning. Effective educational systems, however, implement both formative and summative assessments. The following annotated bibliography summarizes salient recent literature on educational assessment.

Formative Assessments (To Inform Instructional Practices)

Arter, J. A., & Chappuis, J. (2006). Creating & recognizing quality rubrics. Upper Saddle River, NJ: Pearson.

Rubrics have been used by teachers for many years and have been the subject of numerous studies. Research shows that when rubrics are created and used properly, they tend to be effective. This practitioner-oriented book and CD-ROM set provides information on designing and using rubrics, as well as on helping others to use them.

Black, P., & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139–148.

This article remains one of the seminal, practitioner-friendly pieces of assessment literature. Black and Wiliam begin by discussing the inputs of education and the outputs of student learning, referring to the teaching and learning that happen in the classroom as what happens "inside the black box." Through an extensive review of research, the authors identify three important findings about formative assessment: (1) formative assessments can raise test scores; (2) formative assessments can be improved; and (3) there is evidence for how to improve them. The article makes a strong argument for widespread and fundamental change but cautions the public to be wary of those who would claim this as a "magic bullet."

Furtak, E. M., Ruiz-Primo, M. A., Shemwell, J. T., Ayala, C. C., Brandon, P. R., Shavelson, R. J., et al. (2008). On the fidelity of implementing embedded formative assessments and its relation to student learning. Applied Measurement in Education, 21, 360–389.

Teachers vary significantly in their academic and professional expertise. Moreover, classroom, school, district, and state educational environments vary significantly across different locations. Given this variation in teacher expertise and classroom environment, there can be wide variation in how individual teachers implement the same curriculum. As such, analyses of assessment systems' effects on student learning must take into account the fidelity of teachers' implementations of such systems.

This experimental study examines the effects of fidelity of implementation of a formative assessment system on student learning. The study population consisted of six middle-school physical science teachers and their students. The targeted assessment system was a series of formative embedded assessments devised by the Assessment Development Team; teachers were trained in this assessment system during a summer institute. During the school year, teachers were asked to videotape their lessons; these videotapes were coded to determine the extent to which each teacher implemented the assessment system as originally intended. Student learning was measured by pre- and post-test assessments containing multiple-choice and short-answer questions related to the content of the curriculum. The results show a 0.71 correlation between student learning and teachers' fidelity of implementation.

Gallagher, C., & Worth, P. (2008). Formative assessment policies, programs, and practices in the Southwest region. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southwest.

Formative assessments are important instructional tools because they produce descriptive metrics on student achievement and can thus be used for purposes of diagnosis and instructional modification. Little work has been done, however, to elucidate how state educational policies related to formative assessment translate into district- and school-level practice. The authors summarize the characteristics of formative assessment systems in Arkansas, Louisiana, New Mexico, Oklahoma, and Texas. Data were collected via interview and document analysis. The authors find a large amount of variation in formative assessment policies among the study states. Two states lacked any formal definition of formative assessment, and the definitions espoused by the remaining three states varied. The authors also note variation in the extent to which the study states regulate and provide professional support related to formative assessments. This report suggests that additional work is needed to ensure a clear and consistent approach to formative assessment at the state, district, and school levels.

Graham, S., Harris, K., & Herbert, M. (2011). The benefits of formative assessment. Washington, DC: Alliance for Excellent Education.

Poor writing skills contribute significantly to many American high school students' failure to graduate. Furthermore, many high school graduates do not write well enough to be competitive in college and the workplace. The authors, citing a need for more evidence-based practices to help bolster American students' writing, propose classroom-based writing assessments as a tool to help teachers improve their students' writing. The classroom-based writing assessment is a formative approach that allows teachers to evaluate their instructional effectiveness, modify their instructional practices, and provide feedback to students. The report provides a meta-analysis to evaluate whether formative writing assessment can improve student writing. The authors conclude that formative writing assessment is effective in helping students improve their writing. The authors identify best practices for formative writing assessment and issue various recommendations to help teachers implement such assessments in the classroom.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–122.

Feedback is an integral aspect of instruction and learning, but the type of feedback and the way it is given can be differentially effective. This article provides an analysis of feedback and reviews evidence gathered from 12 previous meta-analyses (of 196 studies total) related to the impact of feedback on learning and achievement. The authors discuss the different effects of feedback based on who provided it, the type of feedback provided, and when and how it was provided. Applications for classroom instruction and interventions with struggling students or those who have learning disabilities are also presented.

Heritage, M. (2007). What do teachers need to know and do? Phi Delta Kappan, 89(2), 140–145.

Many of today's teachers view assessment as being in competition with teaching rather than as a tool to inform instruction and move learning forward. This article discusses the core elements of formative assessment, conceptualizes their integration, and considers the knowledge and skills that teachers need in order to implement formative assessment effectively.

Heritage, M., Kim, J., Vendlinski, T., & Herman, J. (2009). From evidence to action: A seamless process in formative assessment? Educational Measurement: Issues and Practice, 28(3), 24–31.

This report follows a study of teacher mathematics knowledge conducted by the National Center for Research on Evaluation, Standards, and Student Testing at UCLA. The study results showed that while teachers were able to determine student performance levels from assessment data, they had difficulty determining how to modify instruction in response to evidence of student learning. The authors argue that teachers must understand how learning progresses in their content domain so that they know what to do next given a student's current learning level. Implications for effective formative assessment and considerations on how teachers can be supported are discussed.

Hunt, E., & Pellegrino, J. W. (2002, Spring). Issues, examples, and challenges in formative assessment. New Directions for Teaching and Learning, 89, 73–85.

There has been a large body of recent research on human cognition and its impact on learning. In light of this work, the authors argue that the types of assessments commonly used for certification and prediction are not necessarily best suited to facilitating student learning. In particular, prior descriptive and experimental research, along with research reviews, affirms the efficacy of continuous formative assessments, which create a two-way interaction between students and teachers such that instruction can be targeted to students' particular knowledge states. However, formative assessment, despite its benefits, has not been implemented widely because it increases teacher burden. The authors propose the use of technology to alleviate this burden and therefore increase formative assessment use in the classroom. The authors then examine two assessment models, DIAGNOSER and SMART, which attempt to address current issues in assessment by injecting aspects of continuous formative assessment into instruction.

Koedinger, K. R., McLaughlin, E. A., & Heffernan, N. T. (2010). A quasi-experimental evaluation of an online formative assessment and tutoring system. Journal of Educational Computing Research, 43(4), 489–510.

There is much research and anecdotal evidence suggesting that computer-based assessment can increase student learning due to the ability of computer systems to generate immediate and targeted feedback. Furthermore, classroom computer use can alleviate teacher burden by automating the time-consuming process of grading. One such system, the ASSISTments program, is a web-based mathematics tutoring system for middle school students. The authors conducted a quasi-experimental study to examine the effect of the ASSISTments system on the year-end test scores of middle school students. The study population consisted of 1,240 seventh-grade students from four schools in an urban Massachusetts school district. ASSISTments was deployed at three of the four study schools. Student achievement was assessed via the Massachusetts Comprehensive Assessment System (MCAS) at the end of the students' sixth- and seventh-grade years. The results show that students who used ASSISTments performed significantly better on the year-end mathematics assessment than students who did not use the system. This performance gap was particularly salient for students in special education. Furthermore, students who used the system more demonstrated higher MCAS scores. The study results support the use of computer-based assessments in the classroom to improve student achievement.

Louie, J., Sanchez, M. T., North, C., Cazabon, M., Mello, D., & Kagle, M. (2012). A descriptive analysis of state-supported formative assessment initiatives in New York and Vermont. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northeast and Islands.

A significant body of research evidence shows that formative assessments can enhance student achievement. As more educational entities seek to implement formative assessment systems, there is a need for information about how such systems have been implemented in practice. The authors analyze the components and support structures of two state-endorsed, local-level formative assessment initiatives: the New York State Formative Assessment Technical Assistance Study (NY-FATA) and the Vermont Formative Assessment Project (VT-FAP). NY-FATA was a three-year pilot program (2007–10) conducted in a large urban school district; VT-FAP is an ongoing pilot launched in 2006 and spearheaded by the state's Department of Education. Study data were collected via interviews and analysis of publicly available information. The authors summarize eight components of the two initiatives: goals, origins, leaders and roles, use of external partners, targeted participants, funding, professional development, and formal evaluation of implementation. The authors also discuss nine strategies that both NY-FATA and VT-FAP used to promote, sustain, and disseminate information regarding the initiatives. Salient strategies included providing training by experts, identifying local partnerships early, and promoting safe spaces for teachers to try novel teaching strategies.

Militello, M., Schweid, J., & Sireci, S. G. (2010). Formative assessment systems: Evaluating the fit between school districts' needs and assessment systems' characteristics. Educational Assessment, Evaluation and Accountability, 22, 29–52.

This article examines the importance of "fit" between school districts' intended purposes for using formative assessment systems and the actual characteristics of such systems. Recent federal-level education reform initiatives, like No Child Left Behind, have led to a shift in how educational reform is achieved in practice at the district level: whereas previously districts would simply implement prescribed educational strategies, districts now focus on obtaining measurable student outcomes and on using assessment data to increase student achievement over the course of a single school year. Traditional assessment methodologies, however, are largely summative, and results generally arrive too late to allow teachers to implement pedagogical reforms within a single school year. Formative assessment systems are a possible solution to the need to rapidly obtain actionable assessment data to inform within-school year instruction. The authors conducted a 1.5-year study at three anonymous school districts in the Northeast that each had implemented a different formative assessment system in middle school mathematics. In order to evaluate "fit," two sets of qualitative data were obtained: district data were collected via interviews with educators, and data about the formative assessment systems were collected via interviews with test developers and via consultation of technical documentation. The results indicate that fit between district need and formative assessment system varies dramatically depending on the stated use of the assessment system versus its actual use within the districts. The authors conclude that fit is the most important factor in choosing a formative assessment system; poor fit can lead to user frustration or misuse of data.

Miller, D., & Lavin, F. (2007). "But now I feel I want to give it a try": Formative assessment, self-esteem and a sense of competence. The Curriculum Journal, 18(1), 3–25.

This descriptive study examined the effects of formative assessment on students' self-perceptions. The student study population consisted of 370 children aged 10–12 and their teachers in four school districts in eastern Scotland. The 16 study teachers were selected because they used various formative assessment strategies in their classrooms. Student self-perceptions were measured quantitatively via Rosenberg's Self-Esteem Scale and qualitatively via group discussions and interviews with both students and teachers. The results show that, over the course of the treatment period, students' self-worth and self-competence increased to a statistically significant extent. Males demonstrated increases in both self-worth and self-competence whereas females showed increases only in self-competence. Students who benefited most from the implementation of formative assessment techniques were those who were identified by their teachers as lacking self-confidence in learning. Finally, students who had prior exposure to formative assessment showed statistically significant gains in self-perception whereas students new to formative assessment did not show gains to the point of statistical significance. Overall, this study supports the use of formative assessment in increasing student self-perceptions, although the authors caution that significant changes in student self-perceptions may not manifest in the short-term.

Pinchok, N., & Brandt, W. C. (2009). Connecting formative assessment research to practice: An introductory guide for educators. Naperville, IL: Learning Point Associates.

Formative assessments are the most instructionally sensitive types of assessment because they are embedded within instructional activities and are linked to ongoing teaching and learning activities in the classroom. This report defines formative assessment, examines the research, and outlines the components needed to develop a high-quality, research-based formative assessment plan with the intention of building the knowledge and capacity of state, district, and school personnel to implement effective formative assessment practices.

Ross, J. A., Hogaboam-Gray, A., & Rolheiser, C. (2002). Student self-evaluation in grade 5–6 mathematics: Effects on problem solving achievement. Educational Assessment, 8(1), 43–59.

This study looked at the effect of student self-evaluation on achievement in mathematics. Study participants consisted of 24 teachers of fifth- and sixth-grade mathematics students. Twelve teachers were assigned to the treatment group, which was trained in how to implement systematic self-evaluation procedures in the classroom. The other twelve teachers were assigned to a control group that did not make use of such procedures. Student achievement was assessed via pre- and post-treatment performance tasks. Students also completed self-evaluation questionnaires at pre-treatment. The results show that student achievement was higher in the treatment classrooms, in which self-evaluation procedures were implemented, than in the control classrooms. The study supports the use of student self-evaluations in improving student achievement in mathematics.

Sharkey, N. S., & Murnane, R. J. (2006). Tough choices in designing a formative assessment system. American Journal of Education, 112(4), 572–588.

This article follows one large urban school district's attempt to implement a formative assessment system for measuring students' mathematics skills and describes the lessons the district learned. Drawing on interviews with teachers and administrators, the article presents the difficult choices educators face when choosing formative assessments and deciding how to use them.

Stiggins, R. J. (2001). Student-involved classroom assessment. Upper Saddle River, NJ: Prentice Hall.

Written for teachers and preservice teachers, this book guides readers on how to assess students in their daily activities while emphasizing student self-assessment. The author explains how to match achievement targets to assessment methods and what each type of assessment method can and cannot assess. Clear and practical guidelines are also provided for constructing classroom assessments.

Yin, Y., Shavelson, R. J., Ayala, C. C., Ruiz-Primo, M. A., Brandon, P. R., & Furtak, E. M. (2008). On the impact of formative assessment on student motivation, achievement, and conceptual change. Applied Measurement in Education, 21, 335–359.

Formative assessments show great promise in improving student achievement. However, much of the support for the efficacy of formative assessments does not derive from real-world, field-based educational settings. This study provides field-based analyses of the effects of formative assessments on student achievement. The study population consisted of 12 middle-school science teachers and their students. The teachers were randomly assigned to either an experimental cohort that was trained in the use of embedded formative assessments in the classroom or a control cohort that was not trained to use such assessments. Teachers in both cohorts taught the same science unit using an identical curriculum. Students took pre- and post-tests on motivation and science knowledge. Students also took post-tests on conceptual change. The results show that embedded formative assessments have no statistically significant effect on student motivation, achievement, or conceptual change. The authors also find wide variance in student learning outcomes among different classrooms. Overall, the study suggests that formative embedded assessments are efficacious only when teachers modify their instructional practices as a result of such assessments.

Interim or Progress Monitoring Assessments (To Inform Teachers of Student Progress Within the School Year)

Goertz, M. E., Oláh, L. N., & Riggan, M. (2009). From testing to teaching: The use of interim assessments in classroom instruction. Philadelphia, PA: Consortium for Policy Research in Education.

Interim assessments are used to evaluate student knowledge and skills within a limited time frame. These assessment results can then be aggregated and analyzed across classrooms, schools, and districts. This study, focusing on mathematics, examines the use of interim assessments in nine elementary schools and discusses the implications for the design of more effective interim assessments.

Henderson, S., Petrosino, A., Guckenburg, S., & Hamilton, S. (2007). Measuring how benchmark assessments affect student achievement. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northeast and Islands.

Benchmark assessments, which are usually formative in nature, are promising because they provide timely and useful student metrics aligned with state standards. Little research, however, has examined the relationship between benchmark assessments and student achievement. In this study, the authors used a quasi-experimental design to analyze the effects of a quarterly benchmark assessment program on mathematics achievement in a sample of Massachusetts eighth-graders. The study schools were 22 middle schools implementing a formative quarterly benchmark system and 44 matched control schools. Student mathematics achievement was determined via the Massachusetts Comprehensive Assessment System (MCAS). Data from the experimental and control schools were compared at the one-year post-implementation mark. The results show no statistically significant effect of benchmark assessments on mathematics gains. The authors caution, however, that the lack of effect may be due to the short period of follow-up.

Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28(3), 5–13.

The need to measure student performance throughout the year has resulted in an influx of assessment products into states and districts, labeled with catchphrases such as "benchmark," "diagnostic," "formative," and "predictive." These types of products are considered interim assessments and can be integral to a state's or district's comprehensive assessment system. In many cases, however, there is little evidence demonstrating that commercially available interim assessment systems positively affect student achievement. This report strives to develop a framework that district and state leaders can use to evaluate these systems for purchase or development.

Accountability Assessments (To Determine Proficiency)

Beatty, A. (2010). Best practices for state assessment systems: Part I—Summary of a workshop. Washington, DC: National Academies Press.

Data from state assessment systems are used for a multitude of purposes. Through the years, state assessment systems have attempted to collect data in a variety of ways; from portfolios to multiple-choice items, the options have run the gamut. This book provides insight and perspective on state assessment systems. The information is provided through a summary of two workshops designed to assist those working on these projects.

Crawford, L., Tindal, G., & Stieber, S. (2001). Using oral reading rate to predict student performance on statewide achievement tests. Educational Assessment, 7(4), 303–323.

Curriculum-based measurement (CBM) is a progress-monitoring method that allows teachers to look at student learning over time. CBM can provide important supplementary information to the data garnered from statewide assessments, which are generally given infrequently and do not provide high-resolution insight into student learning. The authors present a longitudinal study examining correlations between the CBM practice of measuring oral reading rate and student performance on statewide reading and mathematics assessments. The study population consisted of 51 students in six blended classrooms in a rural Oregon school district; these students participated in the study during their second- and third-grade years. To assess reading rate, the students were given 200–250 word passages to read; reading rate was calculated as the number of correct words read aloud per minute. Findings show that there is an average increase in reading rate of 42 words per minute between Grades 2 and 3; this reading rate increase is independent of the benchmark reading rate assessed at the second-grade level. Moreover, moderate correlations exist between reading rate and performance on statewide mathematics and reading assessments. Importantly, all second-grade students reading at a threshold reading rate passed the statewide reading assessment the next year; 94 percent of third-grade students reading at a threshold rate passed the statewide reading assessment the same year. The results support the use of CBM methods to predict performance on and supplement information from large-scale statewide assessments.

Goodman, S. F., & Turner, L. J. (2010). Teacher incentive pay and educational outcomes: Evidence from the New York City bonus program. New York: Columbia University.

Many industries successfully make use of performance-based pay systems to increase worker productivity and effectiveness. Within the public school sector, however, performance-based pay systems are rare. In 2007, New York City Public Schools implemented the Schoolwide Performance Bonus Program, by which schools were awarded monetary bonuses if they met state achievement goals in mathematics and reading. These bonuses were awarded on a graded scale: if 100 percent of the achievement goals were met, teachers each received $3,000; if 75 percent of the achievement goals were met, teachers each received $1,500. This study examined the effects of the bonus system on student achievement, teacher absenteeism, classroom activities, and teacher quality. The schools in this quantitative study included 181 K–8 schools selected to participate in the bonus system and 128 schools selected to serve as controls. The authors find minimal effects of the bonus system on any of the examined metrics. The authors noted the occurrence of free-riding, particularly in large schools that lacked faculty cohesiveness. A small positive effect on mathematics achievement was observed in schools in which free-riding was at a minimum. Overall, this study does not support the use of performance-based pay systems in public school systems as a means to improve student achievement.

Somers, M., Zhu, P., & Wong, E. (2011). Whether and how to use state tests to measure student achievement in a multi-state randomized experiment: An empirical assessment based on four recent evaluations. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance.

In many impact studies, student achievement is assessed by tests developed for the express purpose of the impact study. This study examines whether state assessment results can be used as data points in impact evaluations. The authors reviewed four randomized studies on the use of state assessments in impact evaluations. The results show that, due to the breadth of content covered on state tests, these assessments can reliably be used to measure general impacts. Though state tests were less reliable than corresponding study-administered tests, this differential in reliability did not reach statistical significance. By contrast, state tests are not effective when used to measure impacts on specific achievement outcomes. An additional finding was that, though state tests differ in content, pooling of state test data across different states does not affect the outcome of impact evaluations. Finally, state tests are as effective as study-administered tests in establishing reliable student baseline data points. Overall, this study suggests that state tests can be used as complements, but not replacements, for study-administered tests in impact evaluations.

Computer Adaptive Tests

Computer adaptive testing (CAT) refers to the use of technology and an item selection algorithm to administer an assessment that is tailored to specific characteristics of the test taker. Adaptive testing uses the pattern of student responses to select the next item and can yield more precise estimates of student ability. CAT makes it possible to administer more advanced, more complex items to the subset of students for whom they are appropriate, as well as to administer items appropriate to students at the lower end of the ability range. CAT offers several advantages: (1) scores can be determined immediately and more precisely than with paper-and-pencil tests; (2) test administration is consistent; (3) there is flexibility in format and timing; and (4) it may increase efficiency by reducing testing time, since students do not spend time answering questions that are too easy or too difficult for them.
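As a concrete illustration of the selection loop described above, the sketch below implements a minimal CAT using a Rasch (one-parameter logistic) model: after each response the ability estimate is updated, and the next item is the one that is most informative at that estimate. The item bank, difficulty values, ability bounds, fixed test length, and simulated examinee are illustrative assumptions for this sketch, not a description of any particular operational testing system.

```python
# Minimal computer adaptive testing (CAT) sketch under a Rasch (1PL) model.
# Assumptions: a small illustrative item bank, a fixed-length stopping rule,
# and a simulated examinee; real systems add content balancing, exposure
# control, and more robust ability estimation.
import math
import random

def prob_correct(theta, b):
    """Rasch model: probability of a correct response at ability theta for item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta, i.e., p * (1 - p)."""
    p = prob_correct(theta, b)
    return p * (1.0 - p)

def select_next_item(theta, difficulties, administered):
    """Choose the unadministered item that is most informative at the current estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))

def update_theta(theta, responses):
    """Take one Newton-Raphson step toward the maximum-likelihood ability estimate."""
    gradient = sum(u - prob_correct(theta, b) for b, u in responses)
    information = sum(item_information(theta, b) for b, _ in responses)
    theta += gradient / max(information, 1e-6)
    return max(-4.0, min(4.0, theta))  # keep the estimate in a plausible range

def run_cat(difficulties, answer, max_items=10):
    """Administer up to max_items adaptively and return the final ability estimate."""
    theta, administered, responses = 0.0, set(), []
    for _ in range(max_items):
        idx = select_next_item(theta, difficulties, administered)
        administered.add(idx)
        u = answer(difficulties[idx])            # 1 = correct, 0 = incorrect
        responses.append((difficulties[idx], u))
        theta = update_theta(theta, responses)   # re-estimate ability after each response
    return theta

if __name__ == "__main__":
    random.seed(0)
    bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # illustrative difficulties
    true_theta = 0.8                                               # simulated examinee ability
    simulate = lambda b: 1 if random.random() < prob_correct(true_theta, b) else 0
    print("estimated ability:", round(run_cat(bank, simulate), 2))
```

In practice the loop would also stop once the standard error of the ability estimate falls below a threshold rather than only after a fixed number of items.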

The International Association for Computerized Adaptive Testing (IACAT) is an organization whose mission is to advance the science and practice of human assessment (http://www.iacat.org/).

Davey, T. (2011). A guide to computer adaptive testing systems. Washington, DC: Council of Chief State School Officers, Technical Issues in Large-Scale Assessment (TILSA), State Collaborative on Assessment and Student Standards.

This guide is intended to help those responsible for procuring computer adaptive testing (CAT) systems make better-informed decisions by listing questions that will help them tease out the differences between competing CAT delivery systems. By uncovering these many subtle technical differences, the author intends for readers to understand how a given system affects the quality, comparability, and usefulness of the test scores it produces.

Finkelman, M., & Roussos, L. A. (2009). Conditional exposure control method for multidimensional adaptive testing. Journal of Educational Measurement, 46(1), 84–103.

In computerized adaptive testing (CAT), ensuring the security of test items is a crucial practical consideration. A common approach to reducing item theft is to define maximum item exposure rates, i.e., to limit the proportion of examinees to whom a given item can be administered. Numerous methods for controlling exposure rates have been proposed for tests employing the unidimensional 3PL model. The present article explores the issues associated with controlling exposure rates when a multidimensional item response theory (MIRT) model is utilized and exposure rates must be controlled conditional upon ability. This situation is complicated by the exponentially increasing number of possible ability values in multiple dimensions. The article introduces a new procedure, the generalized Stocking-Lewis method, which controls the exposure rate for students of comparable ability as well as with respect to the overall population. A realistic simulation study compares the new method with three other approaches: Kullback-Leibler information with no exposure control, Kullback-Leibler information with unconditional Sympson-Hetter exposure control, and random item selection.
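To make the exposure-control idea referenced in this entry more concrete, the short sketch below illustrates the basic unconditional Sympson-Hetter step: a selected item is actually administered only with a pre-calibrated probability, and otherwise the next most informative item is tried. The item identifiers and exposure parameters are illustrative assumptions, calibration of those parameters (normally an iterative simulation that drives realized exposure below a target rate) is omitted, and this is not the article's generalized Stocking-Lewis procedure.

```python
# Sketch of the unconditional Sympson-Hetter administration step.
# Assumption: items are already ranked from most to least informative for the
# current examinee, and each item id maps to a calibrated exposure probability.
import random

def administer_with_exposure_control(ranked_items, exposure_probs, rng=random):
    """Return the first candidate item that passes its probabilistic exposure check."""
    for item in ranked_items:
        if rng.random() <= exposure_probs.get(item, 1.0):
            return item                     # administer this item
        # otherwise skip the item for this examinee and try the next best one
    return ranked_items[-1]                 # fall back to the last candidate

if __name__ == "__main__":
    random.seed(1)
    ranked = ["item_A", "item_B", "item_C"]                  # best to worst by information
    probs = {"item_A": 0.4, "item_B": 0.8, "item_C": 1.0}    # illustrative exposure parameters
    print(administer_with_exposure_control(ranked, probs))
```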

Wauters, K., Desmet, P., & Van den Noortgate, W. (2010). Adaptive item-based learning environments based on the item response theory: possibilities and challenges. Journal of Computer Assisted Learning, 26(6), 549–562.

The popularity of intelligent tutoring systems (ITSs) is increasing rapidly. To make learning environments more efficient, researchers have been exploring the possibility of automatically adapting the learning environment to the learner or the context. One possible adaptation technique is adaptive item sequencing, which matches the difficulty of items to the learner's knowledge level. This is already accomplished to a certain extent in adaptive testing environments, where the test is tailored to the person's ability level by means of item response theory (IRT). Even though IRT has been a prevalent computerized adaptive test (CAT) approach for decades, and applying IRT in item-based ITSs could lead to advantages similar to those seen in CAT (e.g., higher motivation and more efficient learning), research on the application of IRT in such learning environments is highly restricted or absent. The purpose of this paper is to explore the feasibility of applying IRT in adaptive item-based ITSs. The authors discuss the two main challenges associated with IRT application in such learning environments: the challenge of the data set and the challenge of the algorithm. They conclude that applying IRT appears to be a viable solution for adaptive item selection in item-based ITSs, provided that some modifications are implemented, and they call for further research on the adequacy of the proposed solutions.

Zenisky, A. L., & Sireci, S.G. (2002). Technological innovations in large-scale assessment. Applied Measurement in Education, 15(4), 337–362.

Advances in computer technology have substantially influenced the ways in which tests are made, administered, scored, and reported to examinees. These changes are particularly evident in computer-based testing, where the use of computers has allowed test developers to re-envision what test items look like and how they are scored. By integrating technology into assessments, it is increasingly possible to create test items that can sample as broad or as narrow a range of behaviors as needed while preserving a great deal of fidelity to the construct of interest. Some current technological developments in computer-based testing are reviewed and illustrated, with a focus on novel item formats and automated scoring methodologies.

Evidence-Based Design of Assessment

Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.) (2004). How people learn: Brain, mind, experience, and school. Washington, DC: National Academy Press.

This book analyzes the science of learning in infants, educators, experts, and people in general and discusses its implications for what we teach, how we teach it, and how we assess what our children learn. By understanding theories of and insights into learning, teachers can make connections between learning behaviors and classroom activities and increase the impact of in-depth learning.

Shepard, L. A. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4–14.

The changes in curriculum, learning theory, and assessment from a behaviorist and scientific perspective to a more cognitive, constructivist, and sociocultural perspective highlight the discrepancy between new views of instruction and traditional views of testing. This article elaborates on the changes needed in order for assessment practices to align with and support the social-constructive model of present-day teaching and learning. The author also considers the need for teacher education programs to address the issues surrounding the use of assessment.

Castellano, K. E., & Ho, A. D. (2013). A practitioner's guide to growth models. Washington, DC: Council of Chief State School Officers. Retrieved from http://scholar.harvard.edu/files/andrewho/files/a_pracitioners_guide_to_growth_models.pdf

As statistical models and accountability systems become increasingly complex, interpretations of the resulting growth models frequently do not align with intuition. Through a guidebook structure, the authors aim to help users bring their interpretations into line with how growth models actually function. The publication presents a practical guide to understanding growth models by establishing naming conventions, grouping models by similarities and contrasts, listing critical questions, or analytical lenses, through which current or proposed growth models should be evaluated, and then using each of those lenses to examine each model systematically.

Assessment & Special Populations

Abedi, J. (2009). Computer testing as a form of accommodation for English language learners. Educational Assessment, 14(3–4), 195–211.

Research shows that performance gaps on assessments in all content areas exist between English language learners (ELLs) and non-ELLs. Interestingly, work also shows that this performance gap between ELLs and non-ELLs decreases as the complexity of the language used on any given assessment decreases. This suggests that the language of assessments themselves may be obscuring the true academic performance of ELLs in content areas. In this quantitative study, the author examines the effects of various testing accommodations on reducing the performance gap between ELLs and non-ELLs; the author also analyzes the validity and feasibility of such accommodations. The accommodations studied were a computer-based pop-up glossary, a customized English dictionary, extra testing time, and small-group testing. The student population consisted of ELL and non-ELL fourth- and eighth-grade students from one urban public school district in southern California. A set of reading measures was administered to measure English reading proficiency. Students were then given a standardized mathematics assessment and were assigned randomly to either an accommodation cohort or a control cohort that did not receive testing accommodation. The results show that in Grade 4, extra time and computer testing increase ELL mathematics assessment scores while non-ELL scores are not affected. In Grade 8, only computer testing is effective in reducing the ELL/non-ELL performance gap. Across both grades, the pop-up glossary was extensively used by ELLs. Because the accommodations do not change non-ELL scores, assessment validity is not compromised.

Abedi, J. (2002). Standardized achievement tests and English language learners: Psychometrics issues. Educational Assessment, 8(3), 231–257.

Given the increasing number of English language learners (ELLs) in U.S. classrooms, there is a growing need to evaluate how such students can be included fairly within large-scale assessments. Prior research shows that ELL achievement on assessments may be obscured by their difficulty in processing the assessment items linguistically. That is, ELL achievement on content assessments may be masked by language difficulties related to the test items themselves. This study looks at the relationship between language background and student achievement on standardized assessments. Standardized test data and student background information were collected from four anonymous locations in the United States. The results show that ELL assessment performance in reading, science, and mathematics is lower than that of non-ELLs. Moreover, increasing linguistic complexity of test items correlates with an increasing disparity between ELL and non-ELL performance. This language proficiency-achievement interaction is most pronounced in content areas, like reading, that necessitate a higher level of language comprehension. Conversely, the performance gap between ELLs and non-ELLs is least pronounced in the science, technology, engineering, and mathematics (STEM) fields, in which language proficiency potentially plays a reduced role in content mastery.

Allen, S. F. (2007). Assessing the development of young children in child care: A survey of formal assessment practices in one state. Early Childhood Education Journal, 34(6), 455–465.

In Florida, Early Learning Coalitions (ELCs) are state-created agencies that manage, at the local level, subsidized child care and school readiness preparation for preschool children. Given the increasing focus on the early identification of learning delays in children and the relative dearth of literature on how early childhood assessments are implemented at the local level, the author surveyed ELCs in Florida regarding their implementation of formal assessments. ELC chairs were asked to complete an online survey on the formal assessment instruments used in their ELC. The results from the survey analysis show a wide diversity in the number and types of instruments used in each ELC. Many ELCs used formal assessment instruments, including screening tools and multiple-domain tools. In all districts, classroom teachers were involved in administering at least one formal assessment system. Few ELCs employed formal assessments that included parental input or were tied to the curriculum. The author posits that the extreme diversity of implementations of formal assessment systems is an indicator of the complexity of using formal assessment systems at the early childhood level.

Downs, A., & Strand, P. S. (2006). Using assessment to improve the effectiveness of early childhood education. Journal of Child and Family Studies, 15, 671–680.

Early childhood education, early intervention, and early childhood special education (ECE/ECSE) programs have been growing in importance in the United States since federal statutes of the 1960s required the identification and targeting of educationally at-risk preschool-age children. This article provides a meta-analysis of three assessment approaches (traditional standardized testing; teacher-rating and work sampling; and curriculum-based assessment and measurement) as they relate to ECE/ECSE programs. Traditional standardized testing allows teachers to see their students' standing relative to their peers but gives little information to help teachers devise effective interventions. Teacher-rating and work sampling are similarly limited in their ability to inform interventions but can be used to identify at-risk students. Curriculum-based assessment and measurement can also identify at-risk students and, moreover, can allow teachers to assess learning development over time. The authors conclude with five principles to help ECE/ECSE programs choose relevant assessment systems that can best inform interventions and improve student learning.

Jia, Y., Eslami, Z. R., & Burlbaw, L. M. (2006). ESL teachers' perceptions and factors influencing their use of classroom-based reading assessment. Bilingual Research Journal, 30(2), 407–430.

With the statewide assessments mandated by the No Child Left Behind Act has come fear that these high-stakes tests reduce rather than increase instructional quality. High-stakes statewide assessments are administered infrequently and generally do not provide teachers with sufficient information to modify their instructional practices. Classroom-based reading assessments, on the other hand, have been shown to assist teachers in identifying student needs and making corresponding adjustments in classroom practices. This study examines ESL teachers' perceptions and use of classroom-based reading assessments. The study population consisted of seven elementary and six middle-school ESL-endorsed language arts and reading teachers from nine schools and four districts in Houston, Texas. Data were collected via observations, interviews, and analysis of classroom documents. The results show that teachers highly value classroom-based reading assessments because they generate immediate, discrete, and continuous student achievement information. By contrast, teachers regard statewide tests as less effective in the context of everyday teaching. Statewide tests are perceived to impinge upon classroom-based reading assessments. Finally, the authors find that teachers lack decision-making power because the use of classroom-based reading assessments is under the direction of district and school authorities rather than the teachers themselves. Overall, the study results promote increased district and school awareness and support of classroom-based assessment.

Kieffer, M. J., Rivera, M., & Francis, D. J. (2012). Practical guidelines for the education of English language learners: Research-based recommendations for the use of accommodations in large-scale assessments. Portsmouth, NH: RMC Research Corporation, Center on Instruction.

As the English language learner (ELL) population increases in American schools, it is becoming increasingly important to consider how ELLs can be accounted for fairly within the context of large-scale assessments. The authors present a meta-analysis of research on testing accommodations for ELLs on large-scale assessments. The authors considered 20 studies that, in aggregate, analyzed nine testing accommodations. When the study populations of the 20 studies were pooled together, more than 9,400 ELLs were represented in a total student population of more than 33,000. The results of the meta-analysis show that three testing accommodations have modest but statistically significant effects on ELL performance on large-scale assessments: simplified English on assessment items can reduce the ELL/non-ELL performance gap by 9–19 percent; use of English dictionaries and glossaries can reduce 11–21 percent of the performance gap; and allowing ELLs additional testing time can reduce 15–31 percent of the performance gap. The authors recommend using such testing accommodations when assessing ELLs but emphasize that accommodations should be a small part of a much larger strategy of integrating ELLs into classroom instruction and assessment.

Lonigan, C. J., Allan, N. P., & Lerner, M. D. (2011). Assessment of preschool early literacy skills: Linking children's educational needs with empirically supported instructional activities. Psychology in the Schools, 48(5), 488–501.

Data from nationwide summative content-area assessments indicate a clear dearth of literacy skills among a large section of American school children. Moreover, longitudinal studies show that preschool-level literacy correlates with literacy level in the elementary years and beyond. As such, it is important to identify literacy-related deficits early in a child's education. In this paper, the authors analyze the pros and cons of three types of preschool-level early literacy assessments: informal, diagnostic, and screening/progress-monitoring. The authors find that both informal and screening/progress-monitoring assessments, though simple to implement, fail to provide detailed insight into specific domains of students' literacy needs. Diagnostic assessments provide this level of resolution but come at a much greater cost in time and money. As such, the authors propose a two-step process of preschool-level early literacy assessment. Informal and screening/progress-monitoring assessments should be used as a first step to broadly identify students performing at a low literacy level. Once such students are identified, the more intensive diagnostic assessments should be used to determine the precise domains of literacy in which low-literacy students are lacking.

O'Bryon, E. C., & Rogers, M. R. (2010). Bilingual school psychologists' assessment practices with English language learners. Psychology in the Schools, 47(10), 1018–1034.

English language learners (ELLs) in America now account for 20 percent of all students. At present, little is known about how ELL language proficiency assessments are given in schools, or even who among school staff is principally responsible for administering such assessments. This study sought to clarify the assessment practices of bilingual school psychologists. The study population consisted of 276 bilingual school psychologists distributed nationwide. The study participants were given a survey that asked about their level of education, language background, state and district ELL context, knowledge of language proficiency assessment practices, personal usage of assessment methods, and personal usage of interpreters. The analysis shows that most psychologists use multifaceted, comprehensive approaches to language proficiency assessment; most of these approaches included acculturation measures. Moreover, participation in continuing education predicts the use of best practices in the areas of assessing language proficiency, using interpreters, and implementing assessments with shared-language students. Finally, many participants cite a lack of comfort in performing assessments on language proficiency, suggesting that targeted training in this area is needed for school personnel.

Parker, C. E., Louie, J., & O'Dwyer, L. (2009). New measures of English language proficiency and their relationship to performance on large-scale content assessments. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northeast and Islands.

There is an increasing need to accommodate and include English language learners (ELLs) fairly in large-scale statewide assessments. This study examined whether English language proficiency measures can predict student achievement on content-area assessments. The study population consisted of fifth- and eighth-grade ELLs in New Hampshire, Rhode Island, and Vermont. English language proficiency in four domains (listening, speaking, reading, and writing) was assessed via the Assessing Comprehension and Communication in English State-to-State for English Language Learners (ACCESS for ELLs) assessments. Content-area knowledge was assessed via the New England Common Assessment Program (NECAP) assessments. After controlling for student and school covariates, the authors find that NECAP reading scores correlate significantly with ACCESS reading, writing, and speaking scores for both grades; NECAP writing scores correlate significantly with ACCESS reading and writing scores for fifth graders and all domains for eighth graders; NECAP mathematics scores correlate significantly with ACCESS reading and writing scores for both grades. Overall, the study suggests that proficiency in the literacy-related domains predicts content-area performance better than proficiency in the oral-related domains of English.

Pray, L. (2005). How well do commonly used language instruments measure English oral-language proficiency? Bilingual Research Journal, 29(2), 387–409.

The outcomes of language assessments play significant roles in assigning students to language support and special education programs. As such, particular attention should be paid to the validity of commonly used language assessments. Beginning in 2001, several studies reported that common language assessments classified students as non-native speakers though samples of their natural language suggested the contrary. This study examines the validity of language assessments by administering them to native English speakers, whose performance reflects the assessments' extent of construct validity. The study instruments were oral versions of three commonly used language assessments: the Woodcock-Munoz Language Survey (WMLS), IDEA Proficiency Test (IPT), and Language Assessment Scales-Oral (LAS-O). Study participants consisted of 40 elementary (Grades 4 and 5) students in an urban school district in the southwest United States. The results show that none of the native English-speaking students were classified as fluent in English by the WMLS, 87 percent of native speakers were classified as fluent by the IPT, and all native speakers were classified as fluent by the LAS-O. Moreover, the author notes that scores on the LAS-O vary widely among different graders. As such, the construct validity of the WMLS, IPT, and LAS-O are called into question.

Rivera, C., & Collum, E. (2004). An analysis of state assessment policies: Addressing the accommodation of English language learners. Arlington, VA: The George Washington University, Center for Equity and Excellence in Education.

As English language learner (ELL) populations increase in the United States and federal and state governments seek to include such students fairly in large-scale assessments, it is becoming increasingly important to examine the research base focusing on testing accommodations for ELL students. The authors review 15 experimental or quasi-experimental studies looking at accommodation effects on ELL or non-ELL assessment performance. The authors find that several accommodations work particularly well in the context of ELL assessment: native language versions of assessments, simplified English-language versions of assessments, and a combination of direct linguistic support accommodations with indirect accommodations like increased testing time. The authors also examine state ELL assessment accommodation policies for 2000–01 and note that, in aggregate, states list 75 accommodations for ELLs, of which 44 directly addressed ELL needs. Twenty-eight states provide guidelines to inform how students should be identified to receive accommodations. Overall, the authors note that many state policies related to ELL testing accommodations fail to comprehensively guide the process of identifying and accommodating students who require assessment modifications.

Sato, E., Rabinowitz, S., Gallagher, C., & Huang, C. W. (2010). Accommodations for English language learner students: The effect of linguistic modification of math test item sets. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance.

Both theory and research findings suggest that English language learner (ELL) performance on mathematics assessment may be limited not by lack of mathematics content knowledge but by the linguistic complexity of the assessment items themselves. This study examined how ELL achievement on mathematics assessments is affected by linguistic modification of the assessment items. The assessment instruments consisted of two sets of 25 multiple-choice questions, one set of which was modified linguistically. Assessment items covered a broad array of mathematics concepts, including algebra, geometry, data analysis, and number operations. Three groups of students in Grades 7 and 8 (ELLs, non-ELA-proficient non-ELLs, and ELA-proficient non-ELLs) were randomly assigned to one of the assessment instruments; 4,617 total students participated in the study. Performance was calculated with four scoring approaches: summed number correct and three item response theory models (1PL, 2PL, and 3PL). Principal findings are that the average performance differential between the non-modified and linguistically modified assessments is largest for ELLs. Furthermore, the linguistic modifications demonstrate validity by not changing the mathematics construct meant to be assessed. Finally, the differences in achievement across groups differ according to the scoring methodology used.

Schappe, J. F. (2005). Early childhood assessment: A correlational study of the relationships among student performance, student feelings, and teacher perceptions. Early Childhood Education Journal, 33(3), 187–193.

Extant assessment instruments used for the early childhood population fail to cover both the complexity of student thought and the variance in student performance. Many such assessments consist of simple checklists or anecdotal teacher notes. An alternative assessment methodology is performance assessment, in which students are engaged actively in a task and are rated based on scalable criteria. This study examines student performance, student feelings, and teacher perceptions of a formal performance assessment. The study population consisted of 71 students aged 56–71 months from one suburban school, along with their teachers. The teachers were administered a perception survey for each of their students; these surveys asked the teachers to rate the extent to which a given student exhibited a certain skill or behavior. Researchers then administered to the students a performance assessment centered on building a neighborhood community. Finally, students were administered post-test self-assessments that examined student feelings towards the assessment. The researchers find significant positive correlations between student performance and teacher perception. The researchers also discover that no significant correlations exist between student feelings and either student performance or teacher perceptions, indicating that at the early childhood stage, affective measures do not predict performance. Overall, the study demonstrates the usefulness of performance assessment in the early childhood population.

Stevens, R. A., Butler, F. A., & Castellon-Wellington, M. (2001). Academic language and content assessment: Measuring the progress of English language learners. Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing.

Given the increasing use of large-scale assessments and the growing English language learner (ELL) population in the United States, there is a pressing need to analyze how ELLs can be included fairly within administrations of large-scale assessments. In this paper, the authors examine the correlations between ELL student performance on a language proficiency assessment and a social studies content assessment. The study population consisted of 102 ELL and 19 English-only (EO) students spread across six seventh-grade sheltered social studies classrooms in a southern California school district. The language proficiency assessment used was the Language Assessment Scales (LAS) Reading Component, Form 3A; the content assessment used was the Iowa Tests of Basic Skills (ITBS) Social Studies Test for Seventh Grade, Form L. Overall, the authors determine that the language and linguistic difficulty of the ITBS is far greater than that of the LAS. Findings show that ELL performance on the LAS and performance on the ITBS are only weakly correlated. ELLs with the highest ITBS scores had higher LAS scores than ELLs who performed poorly on the ITBS. Moreover, EOs had better average scores on both the LAS and the ITBS than ELLs. Interestingly, the subgroups of ELLs with the highest ITBS scores responded to ITBS test items similarly to EOs. This suggests that these ELLs were limited in their content achievement not by a lack of language knowledge but by a lack of content knowledge. For the ELLs who scored poorly on the ITBS, both language and content knowledge deficits likely contributed to their low scores.

Wolf, M. K., Farnsworth, T., & Herman, J. (2008). Validity issues in assessing English language learners' language proficiency. Educational Assessment, 13, 80–107.

One of the stipulations of the No Child Left Behind (NCLB) Act is that states must implement valid assessments to evaluate the English language proficiency (ELP) of English language learners (ELLs). It is challenging, however, for states to validate the assessment systems they have implemented in response to NCLB. This article reviews issues related to the validity of ELP assessments. Publicly available information was gathered for 49 states and Washington, DC on the ELP assessments, and the associated validity evidence, that each state used during the 2006–07 school year. The data indicated that 43 states implemented their ELP assessments within the prior five years. Only 13 assessments, however, were accompanied by publicly available validation studies. Several issues are considered in the context of validating ELP assessments, including identifying the purpose of the assessment and aligning the assessment items with statewide standards and instruction. The authors conclude that there is a need for additional validation work on state ELP assessment systems.

Assessment & Cross-Cutting Issues

Black, P. (2000). Research and the development of educational assessment. Oxford Review of Education, 26(3–4), 407–419.

The author presents examples of how research studies have affected policy and practice related to formative assessment, high-stakes summative tests, and large-scale assessment surveys. In the area of formative assessment, for example, research strongly supports the efficacy of such assessments in increasing student achievement. These research studies, however, have not translated into persistent classroom implementation of formative assessments. This lack of implementation is partly due to the absence of coordination among researchers in the field. In the area of high-stakes summative tests, poor understanding of reliability- and validity-related research has limited research's impact on practice. The author concludes that research effects on policy and practice in educational assessment have been inconsistent and weak. The author cites the absence of research coordination among studies and the absence of an organizing research center as principal contributors to the weak effect of research on the policy and practice of assessment.

Brown, R. S., & Coughlin, E. (2007). The predictive validity of selected benchmark assessments used in the mid-Atlantic region. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic.

Many states are implementing additional assessments to supplement the information from end-of-year summative state assessments. In many cases, states and districts seek to use these supplemental assessments to predict and improve performance on end-of-year summative state assessments. It is therefore important to establish validity for these supplemental assessments. This report examines benchmark assessments' ability to predict achievement on end-of-year state assessments. Four commercially available benchmark assessments were evaluated: Measures of Academic Progress (MAP), STAR Math/STAR Reading, Study Island, and TerraNova. All of these benchmark assessments were used by the study states, which consisted of Delaware, the District of Columbia, Maryland, New Jersey, and Pennsylvania. The authors reviewed the validity of the benchmark assessments via published technical materials and additional information provided by the assessment publishers. Overall, the authors find little evidence to support the validity of the benchmark assessments in terms of predicting state assessment scores. Only TerraNova shows strong predictive validity; even then, TerraNova is only predictive of performance on the Pennsylvania state assessment for Grades 5, 8, and 11. The authors conclude that much additional work is needed to clarify the predictive validity of commercially available benchmark assessments for statewide achievement tests.
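
Predictive validity in this sense is typically quantified by relating students' benchmark scores to their subsequent state assessment scores, for example through a correlation coefficient or a simple regression. The sketch below illustrates that computation; the scores, variable names, and output are hypothetical and are not drawn from the report or from any of the four products it reviews.

```python
# Illustrative sketch: estimating a benchmark assessment's predictive validity
# as the correlation between fall benchmark scores and end-of-year state test
# scores. All values here are hypothetical.
import numpy as np

benchmark = np.array([210, 195, 230, 188, 242, 205, 219, 233])          # fall benchmark scale scores
state_test = np.array([1480, 1420, 1550, 1405, 1590, 1460, 1500, 1565])  # end-of-year state scores

# Pearson correlation between benchmark and state test scores
r = np.corrcoef(benchmark, state_test)[0, 1]

# Least-squares prediction line: state_test ~ slope * benchmark + intercept
slope, intercept = np.polyfit(benchmark, state_test, 1)

print(f"predictive validity coefficient r = {r:.2f}")
print(f"prediction line: state score ~ {slope:.1f} * benchmark + {intercept:.0f}")
```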

Clarke-Midura, J., & Dede, C. (2010). Assessment, technology, and change. Journal of Research on Technology in Education, 42(3), 309–328.

Traditional assessment methods are inadequate for evaluating higher-level cognitive skills. The authors argue that assessment still has not, on a large scale, incorporated recent technological advances and current knowledge about human cognition. The authors provide an overview of research into the use of technology to enhance and improve assessment so that students' higher-level cognitive skills can be effectively evaluated. The authors then summarize their own work, which involved quasi-experimental studies examining whether realistic virtual reality simulations can increase student achievement in science. Results indicate that such virtual simulations do indeed allow students to exercise authentic inquiry and enhance their engagement and efficacy with the tested subject matter. The authors also report a mismatch between student learning as demonstrated on multiple choice post-tests and student learning as demonstrated via virtual reality-based assessments. These findings suggest that virtual reality-enhanced assessments and traditional multiple choice assessments assess different aspects of student learning.

Darling-Hammond, L., & Adamson, F. (2010). Beyond basic skills: The role of performance assessment in achieving 21st century standards of learning. Stanford, CA: Stanford University, Stanford Center for Opportunity Policy in Education.

To best prepare students for modern academic and professional demands, appropriate assessment should go beyond standard multiple choice exams. Assessments must be able to evaluate higher-order thinking, and large-scale summative assessments generally fail to achieve this goal. The authors provide a summary of research and lessons learned on performance assessments, which are formative, non-multiple choice assessments designed to provide deeper insight into students' higher-level cognitive skills. After providing definitions and concepts related to performance assessments, the authors cite salient examples of performance assessments being implemented across the country and the world. The authors then frame performance assessments within Bloom's taxonomy of the cognitive domain; namely, performance assessments target the top half of the taxonomy. Finally, the authors discuss practical challenges and potential solutions associated with implementing performance assessments.

Darling-Hammond, L. (2010). Performance counts: Assessment systems that support high-quality learning. Washington, DC: Council of Chief State School Officers.

The Common Core State Standards were devised in part to keep U.S. students competitive with their high-achieving international peers. This white paper envisions an ideal assessment system built upon best practices from research and empirical evidence garnered at the national and international levels. The author offers an analysis of high-achieving international education systems, including those in Australia, Hong Kong, and Singapore. The author concludes that such systems make extensive use of multiple methods of open-ended assessment that are deeply embedded into curricula. Moreover, teachers are highly involved in curricular development and assessment scoring. Finally, assessments in high-achieving education systems are connected with well-defined curricular frameworks and are characterized by strong benchmark metrics. Given these research- and evidence-based best practices, the author provides guidelines for devising an ideal assessment system applicable to the United States. Key components of such a system include strong state leadership roles, clear curricular frameworks, digital libraries of curricula and assessments, extensive professional development, and use of technology.

Eyal, L. (2012). Digital assessment literacy—The core role of the teacher in a digital environment. Educational Technology and Society, 15(2), 37–49.

This article defines and characterizes "digital assessment literacy," a new term in the educational literature. Digital assessment literacy encompasses the competencies and perceptions that modern teachers must have in order to adapt to the novel assessment needs and teaching approaches necessitated by the digital environment. The digital environment offers benefits to both students and teachers. For teachers, digitalization of assessment data streamlines the assessment process, thus reducing teacher burden. For students, novel digital assessment options, including digital portfolios, forums, online peer assessments, blogs, and wikis, push students to move beyond the algorithmic, lower-level thinking required by standard multiple choice tests. In the digital environment, the pedagogical focus falls on formative, short performance tasks rather than summative, large assessments. The author proposes three levels of digital assessment literacy: basic, intermediate, and advanced. Basic digital assessment literacy involves the proper use of a learning management system (LMS). An LMS provides teachers with immediate access to comprehensive student assessment information that can then be used to give rapid and personalized feedback to students, teachers, administrators, and parents. Intermediate digital assessment literacy requires a teacher to be able to use a variety of digital tools, assessment types, and technologies; it also requires a growing awareness of the security and ethical risks associated with digital learning. Advanced digital assessment literacy involves teachers disseminating their teaching and assessment methods to the broader teaching community, as well as the ability to engage students in cooperative learning and peer evaluation.

Herman, J., & Linn, R. (2013). On the road to assessing deeper learning: The status of Smarter Balanced and PARCC assessment consortia. Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing.

This report analyzes the potential effects of the Smarter Balanced Assessment Consortium (Smarter Balanced) and the Partnership for Assessment of Readiness for College and Careers (PARCC) systems on the depth of student learning. Both Smarter Balanced and PARCC seek to develop assessments aligned with the Common Core State Standards. The authors assess the possible effects of Smarter Balanced and PARCC on the depth of student learning in English language arts and mathematics. The depth of knowledge (DOK) scheme, which divides DOK into four levels (DOK1–4, with depth of knowledge increasing from DOK1 to DOK4), was used to analyze the extent to which Smarter Balanced and PARCC would promote deeper learning. Through analysis of Smarter Balanced's content specifications and sample items, the authors find that 49 percent of the system's mathematics assessment targets are at DOK3 and 21 percent are at DOK4; furthermore, 43 percent of English language arts assessment targets are at DOK3 and 25 percent are at DOK4. Smarter Balanced thus targets substantially greater depth of knowledge than standard state assessments. The authors do not provide a detailed DOK breakdown for PARCC but conclude from analysis of its initial item design framework and proposed test items that PARCC would also test deep learning to a larger extent than current state assessments. Overall, the authors conclude that all students taking Smarter Balanced or PARCC assessments will be assessed on their depth of knowledge.

Nguyen, D. M., Hsieh, Y. C. J., and Allen, G. D. (2006). The impact of web-based assessment and practice on students' mathematics learning attitudes. Journal of Computers in Mathematics and Science Teaching, 25(3), 251–279.

Web-based learning provides rapid feedback, a variety of question types, automated student diagnosis, and administrative relief for teachers. Moreover, prior work suggests that web-based learning renders mathematics more enjoyable and engaging for students. The authors describe a quasi-experimental qualitative and quantitative study examining the effects of web-based assessment and practice on the attitudes of middle school mathematics students. The authors also examined any differences in learning attitudes across gender and ethnic lines within the study population. Study participants consisted of 74 fourth-grade mathematics students from a South Texas middle school; all students were taught by the same teacher. Students were randomly assigned to a web cohort or a traditional cohort. Both cohorts practiced the study instrument, which consisted of four randomized practice question banks from the Connected Mathematics series. The web cohort took the questions via computer; the traditional cohort took the questions via standard paper-and-pencil delivery. In addition to the study instrument, written pre- and post-surveys and post-interviews were administered to evaluate student attitudes towards computers, mathematics, and learning mathematics via computer. The study lasted for three weeks, during which students worked on the instrument three times per week for 30 minutes per session. The results demonstrate that students in the web cohort showed improvement in overall mathematics attitudes relative to traditional cohort students, whose attitudes remained stable throughout the study. Within the web cohort, Hispanic students developed more confidence than white students in their mathematics ability, and males developed more confidence than females. White females developed stronger positive attitudes toward the usefulness of web learning in mathematics than African-American females. Overall, this study supports prior research showing that digital learning improves student attitudes towards mathematics learning.

Partnership for Assessment of Readiness for College and Careers (n.d.). PARCC assessment design. Washington, DC: Achieve, Inc.

The state of Illinois is a member of the Partnership for Assessment of Readiness for College and Careers (PARCC). PARCC is a consortium of 23 states and territories that is devising K–12 assessments in mathematics and English based upon the Common Core State Standards. These common assessments are intended to facilitate student preparedness for the collegiate and professional environments. This resource presents an overview of the PARCC assessment design. The PARCC assessment system addresses six priority purposes ranging from assessing all Common Core Standards to providing accountability metrics. The resource also describes the multiple components of PARCC, including summative and non-summative assessments, all of which are designed to be administered via computer.

Pellegrino, J. W. (2004). The evolution of educational assessment: Considering the past and imagining the future. Princeton, NJ: ETS, Policy Evaluation and Research Center, Policy Information Center.

The author looks at a variety of issues that have impacted the past century of assessment in American education and considers their influence on both where the field has been and where it should be headed. After reviewing past policies and practices that have shaped the educational assessment community, the article explores solutions to current concerns and identifies major challenges that the educational assessment community will have to face to significantly improve approaches to educational assessment.

Pellegrino, J. W. (2006). Rethinking and redesigning curriculum, instruction and assessment: What contemporary research and theory suggests. Washington, DC: National Center on Education and the Economy.

The author presents a model of instruction in which curriculum, instruction, and assessment form a triad. The author argues that, to best enhance student learning, all three elements of this triad must be coordinated with one another. Aligning this triad, however, is difficult. This difficulty, the author posits, is attributable to the absence of an integrated theory of student learning. After providing a summary of critical principles regarding student learning and understanding, the author maintains that high-stakes summative assessments are at direct odds with the goal of enhancing student learning to meet increasing educational expectations. The author proposes an ideal assessment system of the future, which would use technology to achieve coordination across the educational triad of curriculum, instruction, and assessment.

Shavelson, R., Ruiz-Primo, M. A., Li, M., & Ayala, C. C. (2003). Evaluating new approaches to assessing learning. Los Angeles: University of California, Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing.

The authors argue that new assessment approaches are needed because current assessments generally evaluate lower-level cognition and fail to capture high-level cognitive aspects of student learning. New assessment methods, including performance assessments and predict-observe-explain demonstrations, can better examine how students are able to apply their knowledge. The authors propose a framework to help evaluate the information obtained from these new assessment methods. The authors illustrate an "assessment square" consisting of construct, assessment, observation, and interpretation; four analyses—conceptual, logical, cognitive, and quantitative/qualitative—are associated with this assessment square and lead to interpretation of assessment data. The authors then illustrate how the assessment square can be applied to evaluating science assessment data.

Smith, J. K. (2003). Reconsidering reliability in classroom assessment and grading. Educational Measurement: Issues and Practice, 22(4), 26–33.

This article explores reliability in classroom assessment and grading and asks whether traditional measurement theory is appropriate for classroom settings. The author focuses on issues regarding high school grading and assessment and argues that relying on high school grades to make critical decisions about a student's future may be inappropriate from a reliability standpoint. An alternative conception of reliability, called "sufficiency of information," is proposed and explored. By using a "sufficiency of information" approach, teachers can decide whether they have collected enough information to provide accurate and meaningful measures of student performance, making their assessments more reliable.
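
For context, the "traditional measurement theory" questioned here is classical test theory, in which an observed score decomposes into a true score plus error and reliability is the proportion of observed-score variance attributable to true scores. The formulation below is the standard textbook definition, not something derived from the article itself:

$$
X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
$$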

Stiggins, R., & Chappuis, J. (2005). Using student-involved classroom assessment to close achievement gaps. Theory Into Practice, 44(1), 11–18.

Research shows that as district, state, and federal reliance on high-stakes, large-scale achievement tests increases, dropout rates rise and graduation rates fall as low-performing students become discouraged. The authors propose that a critical component of addressing the achievement gap is encouraging low-performing students to believe in their own abilities to succeed and persist educationally. The authors argue that involving students in ongoing formative classroom assessments can instill self-confidence in students and in turn lead to better student outcomes. The authors review the literature, showing that student involvement in ongoing classroom assessments does lead to better student achievement. The authors then propose four conditions that should be met in order to effectively implement assessments in a way that reduces the achievement gap.

Torgesen, J. K. (2006). A comprehensive K–3 reading assessment plan: Guidance for school leaders. Portsmouth, NH: RMC Research Corporation, Center on Instruction.

Regular reading assessments for students in the early grades, during which they are beginning to decode and comprehend text, are critically important because such assessments provide data on student progress and can help identify students who are not meeting progress goals in reading. The author presents a comprehensive assessment plan, based on research and practical evidence, for K–3 reading. The author proposes that a comprehensive assessment plan for early reading should be able to identify at-risk students, monitor student reading progress, collect information that can inform instructional modifications, and assess the effectiveness of instructional practices. Critical to a successful K–3 assessment system is regular monitoring of student levels of phonemic awareness, phonics, fluency, vocabulary, and reading comprehension. The author discusses four types of assessments that should be used in any comprehensive reading assessment plan: screening tests, progress monitoring tests, diagnostic tests, and reading outcome tests. Details are provided as to how to select appropriate assessment types, when such tests should be administered, to whom the tests should be given, and who within school staff should administer them.

Grigorenko, E. L. (2009). Dynamic assessment and response to intervention: Two sides of one coin. Journal of Learning Disabilities, 42(2), 111–132.

The author provides a systematic theoretical comparison of dynamic testing and assessment (DTA) and response to intervention (RTI). Comparisons between DTA and RTI are made along the lines of context, premise, main concepts, process, types of evidence, and expertise requirements. DTA emerged in the 1920s with the goal of incorporating instruction into assessment. RTI emerged in the 1980s with the goal of using assessment to examine instructional efficacy. Overall, both DTA and RTI ultimately involve combining assessment and intervention holistically. Since DTA has a longer history than RTI, however, the experiences in developing and implementing DTA can inform the effective development and deployment of RTI. The author suggests that if the largely artificial barriers between the two constructs are removed, the rigor of both DTA and RTI can be enhanced.

Hughes, C., & Dexter, D. D. (2010). Field studies of RTI programs (Rev. ed.). Retrieved from http://www.rtinetwork.org/learn/research/field-studies-rti-programs

Though theory suggests that response to intervention (RTI) techniques are promising in identifying and addressing the needs of at-risk students, little research has been done on the practical effectiveness of the holistic RTI process when implemented in the field. The authors review 16 research studies examining the effectiveness of various RTI methods. The RTI methods examined by the studies ranged from the problem-solving model, in which interventions are modified for individual students, to the standard protocol model, in which interventions are preselected for use. The studies also examined RTI initiatives administered by local educational agencies and RTI programs designed and administered in school districts by non-district researchers. The authors find that all studies show RTI leads to improvement in student achievement. However, most studies focused on elementary-level reading; as such, additional work is needed to clarify RTI effects at other grade levels and in other content areas. The authors moreover find that RTI implementation does not increase referral or placement rates in special education environments. Finally, the authors identify common supports that allow for more reliable and effective implementation of RTI.

Koellner, K., Colsman, M., & Risley, R. (2011). Multidimensional assessment: Guiding response to intervention in mathematics. Teaching Exceptional Children, 44(2), 48–56.

Research suggests that mathematics achievement increases when instructional practices build upon students' pre-existing competencies. Multidimensional assessments are one possible way to uncover competencies in at-risk students. In this case study, the authors discuss Danny, a fourth-grade student in an affluent suburban school district. Danny scores below proficiency on his statewide mathematics assessment and performs below grade level in class; Danny's mathematics ability has not improved despite individual instruction with a paraprofessional. The authors administered a multidimensional assessment via the Add+Vantage Math Recovery interview protocol, which examines five areas of number development. The high resolution of the multidimensional assessment revealed that Danny was making errors based on overgeneralization of place-value concepts that may be tied to the spatial nature of place-value mats. Danny also lacked comprehension of composite number units. Moreover, the assessment indicated that Danny was, in fact, strong in several mathematics areas, including reading numbers and counting. The authors suggest that an effective instructional intervention for Danny would be to have him create composite number units himself rather than providing pre-made units to him. Overall, this case study shows that multidimensional screening can lead to better response-to-intervention results because of the very high resolution such screenings afford.

Mahdavi, J. N., & Haager, D. (2007). Linking progress monitoring results to interventions. Perspectives on Language and Literacy, 33(2), 25–29.

Formative progress-monitoring assessments can be useful tools for determining student growth, particularly for students with difficulties in reading. There is a need, however, for information on how data from such assessments can be used to modify and improve instruction in the field. The authors present case studies of two second-grade students with reading difficulties. Both students shared the same teacher, who used the Dynamic Indicators of Basic Early Literacy Skills (DIBELS), a progress-monitoring assessment, to establish baseline oral reading fluency. Both students were classified as at risk of not reaching grade-level proficiency. Since DIBELS provides benchmark oral reading rates for various times of the year, the instructor was able to project a slope of reading improvement for the students and could assess each student's adherence to that slope with additional DIBELS assessments. Upon finding that the projected reading improvement slopes were not being met by either student, the teacher used additional probes to attain a higher-resolution look at specific areas of improvement. Results from these high-resolution probes were used to provide both of the second-grade students with targeted instruction specifically tailored to areas in which they needed improvement.
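
The slope-based monitoring described above amounts to drawing an aimline from a student's baseline score to the grade-level benchmark and checking whether subsequent probe scores keep pace with it. A minimal sketch of that arithmetic follows; the baseline, benchmark target, timeline, and probe scores are hypothetical and are not taken from the two case studies.

```python
# Illustrative sketch of progress monitoring against a projected aimline.
# All numbers are hypothetical, not data from the case studies above.

baseline_wcpm = 28        # oral reading fluency at week 0 (words correct per minute)
benchmark_wcpm = 68       # grade-level benchmark expected at week 18
weeks_to_benchmark = 18

# Expected rate of improvement: the slope of the aimline
slope = (benchmark_wcpm - baseline_wcpm) / weeks_to_benchmark   # about 2.2 wcpm per week

# Later progress-monitoring probes: (week, observed words correct per minute)
probes = [(3, 32), (6, 36), (9, 41)]

for week, observed in probes:
    expected = baseline_wcpm + slope * week
    status = "on track" if observed >= expected else "below aimline"
    print(f"week {week}: observed {observed} vs. expected {expected:.1f} -> {status}")
```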

Olinghouse, N. G., Lambert, W., & Compton, D. (2006). Monitoring children with reading disabilities' response to phonics intervention: Are there differences between intervention aligned and general skill progress monitoring assessments? Exceptional Children, 73(1), 90–106.

There is an increasing need for effective progress-monitoring assessments for the special education population. This study compares the predictive abilities of two progress-monitoring assessment instruments: oral reading fluency (ORF) and an intervention-aligned word list (IAWL). ORF is a commonly used assessment instrument that has been shown to indicate student reading competence. The IAWL is a novel assessment instrument that the authors created specifically for this study. The IAWL consists of 50 words that students are asked to read; compared to the ORF, the IAWL is tied to the curriculum and is untimed. Study participants were 40 students in a metropolitan school district in the southeastern United States; all students were identified as having learning disabilities. The reading intervention used was the Phonological and Strategy Training Program (PHAST). All students were given a battery of pre- and post-test assessments. Moreover, the ORF and IAWL progress-monitoring assessments were administered at pre-intervention and at regular intervals afterwards. The results show that the IAWL accounted for variance in performance changes in decoding and word identification, whereas the ORF accounted for variance in performance changes in passage reading fluency. Overall, the IAWL was more sensitive to intervention-specific performance changes whereas the ORF was a better indicator of more general measures of reading competence. Both the IAWL and the ORF, however, leave 49–74 percent of the unique variance in performance gains unaccounted for.

Ysseldyke, J., Burns, M. K., Scholin, S. E., & Parker, D. C. (2010). Instructionally valid assessment within response to intervention. Teaching Exceptional Children, 42(4), 54–61.

Though many local education agencies seek to implement response to intervention (RTI) techniques, many schools use assessments in ways that are invalid and incompatible with RTI principles. The authors describe how assessment is best used to support RTI. The authors argue that an assessment's purpose in RTI is to help identify at-risk students, determine what educational interventions should be applied to at-risk students, and measure the response to a given intervention. That is, the primary goal of assessment in RTI is to help teachers determine what instructional modifications are most effective for at-risk students; this assessment goal stands in contrast to the traditional conception of assessment as measuring student achievement against pre-specified achievement benchmarks. The authors present characteristics of instructionally relevant RTI assessment: namely, such assessments are precise, frequent, and sensitive to student change. The authors conclude by proposing that educators think of RTI as a resource allocation scheme, with more resources being allocated to students at higher tiers of interventional need.
