5 research outputs found

    Reduced Rank Classification and Estimation of the Actual Error Rate

    No full text
    170 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 1978. U of I Only. Restricted to the U of I community indefinitely during batch ingest of legacy ETD.

    Recommendations for conducting differential item functioning (DIF) analyses for students with disabilities based on previous

    No full text
    Abstract: The purpose of this study is to help ensure that strategies for differential item functioning (DIF) detection for students with disabilities are appropriate and lead to meaningful results. We surveyed existing DIF studies for students with disabilities and described them in terms of study design, statistical approach, sample characteristics, and DIF results. Based on descriptive and graphical summaries of previous DIF studies, we make recommendations for future studies of DIF for students with disabilities.

    ETS R&D Scientific and Policy Contributions Series: ETS Contributions to the Quantitative Assessment of Item, Test, and Score Fairness

    No full text
    Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS R&D Scientific and Policy Contributions series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS R&D Scientific and Policy Contributions series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of ETS. The Daniel Eignor Editorship is named in honor of Dr. Daniel R. Eignor, who from 2001 until 2011 served the Research and Development division as Editor for the ETS Research Report series. The Eignor Editorship has been created to recognize the pivotal leadership role that Dr. Eignor played in the research publication process at ETS.

Abstract: Quantitative fairness procedures have been developed and modified by ETS staff over the past several decades. ETS has been a leader in fairness assessment, and its efforts are reviewed in this report. The first section deals with differential prediction and differential validity procedures that examine whether test scores predict a criterion, such as performance in college, across different subgroups in a similar manner. The bulk of this report focuses on item-level fairness, or differential item functioning, which is addressed in the various subsections of the second section. In the third section, I consider research pertaining to whether tests built to the same set of specifications produce scores that are related in the same way across different gender and ethnic groups. Limitations of the approaches reviewed here are discussed in the final section. Key words: fairness, differential prediction, differential item functioning, score equity assessment, ETS, quantitative methods

Foreword: Since its founding in 1947, ETS has conducted a significant and wide-ranging research program that has focused on, among other things, psychometric and statistical methodology; educational evaluation; performance assessment and scoring; large-scale assessment and evaluation; cognitive, developmental, personality, and social psychology; and education policy. This broad-based research program has helped build the science and practice of educational measurement, as well as inform policy debates. In 2010, we began to synthesize these scientific and policy contributions, with the intention to release a series of reports sequentially over the course of the next few years. These reports constitute the ETS R&D Scientific and Policy Contributions Series. In the seventh report in the series, Neil Dorans looks at quantitative fairness assessment procedures developed and modified by ETS staff, which have helped to make ETS a leader in fairness assessment. Almost since the inception of the organization in 1947, ETS has been concerned with issues of fairness.
In the late 1940s and early 1950s, William Turnbull, who later became the second president of ETS, was an early advocate of fairness, recommending the comparison of prediction equations as a method for assessing test fairness. In the 1980s, interest in fairness in the assessment community shifted from scores to items, as evidenced in widespread studies of differential item functioning (DIF). ETS, under the direction of Gregory Anrig, the third ETS president, established the industry standard for fairness assessment at the item level, and ETS has been in the vanguard in conducting DIF analyses as a standard psychometric check of test quality for over a quarter of a century.

Dorans is a distinguished presidential appointee at ETS. A major focus of his operational and research efforts during his career at ETS has been the quantitative evaluation of fairness at the item and score levels. Since the early 1980s, he has been involved with most score equatings of the SAT® test. Equating is essential to the process of producing fair scores. The architect of the recentered SAT scales, he has also performed score linking studies relating the SAT I to the ACT and the Prueba de Aptitud Academica (PAA™). Dorans co-edited a book on score linking and scale aligning, and co-authored a book on computer adaptive testing. After each class of procedures is described, limitations of these procedures are mentioned.

Fair Prediction of a Criterion

Turnbull (1951a) concluded his early ETS treatment of fairness with the following statement: "Fairness, like its amoral brother, validity, resides not in tests or test scores but in the relation to its uses" (pp. 4-5). While several ETS authors had addressed the relatively lower performance of minority groups on tests of cognitive ability and its relationship to grades (e.g., Campbell, 1964), Cleary (1968) conducted one of the first differential prediction studies. That study has been widely cited and critiqued. A few years after the Cleary article, the field was replete with differential validity studies, which focus on comparing correlation coefficients, and differential prediction studies, which focus on comparing regression functions, in large part because of interest engendered by the Supreme Court decision Griggs v. Duke Power Co. in 1971. This decision included the terms business necessity and adverse impact, both of which affected employment testing. Adverse impact is a substantially different rate of selection in hiring, promotion, transfer, training, or other employment-related decisions for any race, sex, or ethnic group. Business necessity can be used by an employer as a justification for using a selection mechanism that appears to be neutral with respect to sex, race, national origin, or religious group even though it excludes members of one sex, race, national origin, or religious group at a substantially higher rate than members of other groups. The employer must prove that the selection requirement having the adverse impact is job related and consistent with business necessity. In other words, in addition to avoiding the explicit use of race, ethnicity, or gender as part of the selection process, the selection instrument had to have demonstrated validity for its use. Ideally, this validity would be the same for all subpopulations.

Linn (1972) considered the implications of the Griggs decision for test makers and users. A main implication was that there would be a need for empirical demonstrations that test scores predict criterion performance, such as how well one does on the job.
(In an educational context, test scores may be used with other information to predict the criterion of average course grade.) Reliability alone would not be an adequate justification for the use of test scores. Linn also noted that for fair prediction to hold, the prediction model must include all the appropriate variables; otherwise, misspecification of the model can give the appearance of statistical bias. The prediction model should include all the predictors needed to predict Y, and the functional form used to combine the predictors should be the correct one. The reliabilities of the predictors were also noted to play a role.

These limitations of differential validity and differential prediction studies were cogently summarized in four pages by Linn (1975), who noted that differential prediction analyses should be preferred to differential validity studies because differences in predictor or criterion variability can produce differential validity even when the prediction model is fair. Differential prediction analyses examine whether the same prediction models hold across different groups. Fair prediction or selection requires invariance of prediction equations across groups; that is, R(Y | X, G) does not depend on G, where R is the symbol for the function used to predict Y, the criterion score, from X, the predictor, and G is a variable indicating subgroup membership. Petersen and Novick (1976) compared several models for assessing fair selection, including the regression model that Linn (1976) addressed in his discussion.
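The comparison of prediction equations described above can be illustrated with a short sketch: fit the regression of the criterion on the predictor separately within each group and compare the intercepts, slopes, and correlations. The data, variable names, and use of ordinary least squares below are illustrative assumptions, not the specific procedures of the studies cited.

    # Illustrative sketch of a differential prediction check (not the cited studies' exact method):
    # fit Y = b0 + b1 * X separately in each group and compare the prediction equations.
    import numpy as np

    def fit_line(x, y):
        """Ordinary least squares fit of y on x; returns (intercept, slope)."""
        X = np.column_stack([np.ones_like(x), x])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coef[0], coef[1]

    rng = np.random.default_rng(0)

    # Hypothetical data: test scores (X) and criterion scores (Y) for two groups.
    x_ref = rng.normal(50, 10, 500)
    y_ref = 0.8 * x_ref + 5 + rng.normal(0, 8, 500)
    x_foc = rng.normal(45, 10, 300)
    y_foc = 0.8 * x_foc + 5 + rng.normal(0, 8, 300)   # same true equation -> fair prediction

    b0_r, b1_r = fit_line(x_ref, y_ref)
    b0_f, b1_f = fit_line(x_foc, y_foc)
    print(f"reference: Y = {b0_r:.2f} + {b1_r:.2f} X")
    print(f"focal:     Y = {b0_f:.2f} + {b1_f:.2f} X")

    # Fair prediction in this sense requires the two equations to coincide;
    # differential validity would instead compare the two correlations.
    r_ref = np.corrcoef(x_ref, y_ref)[0, 1]
    r_foc = np.corrcoef(x_foc, y_foc)[0, 1]
    print(f"validity (correlation): reference {r_ref:.2f}, focal {r_foc:.2f}")

As the text notes, groups can show similar prediction equations yet different correlations (or vice versa) when predictor or criterion variability differs, which is why the comparison of regressions, not correlations, is preferred.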
Differential Item Functioning (DIF)

During the 1980s, the focus in the profession shifted to DIF studies. Although interest in item bias studies began in the 1960s (Angoff, 1993), it was not until the 1980s that interest in fair assessment at the item level became widespread. During the 1980s, the measurement profession engaged in the development of item-level models for a wide array of purposes. DIF procedures developed as part of that shift in attention from the score to the item. Moving the focus of attention to prediction of item scores, which is what DIF is about, represented a major change from focusing primarily on fairness in a domain where so many factors could spoil the validity effort to one where analyses could be conducted in a relatively simple, less confounded way. While factors such as multidimensionality can complicate a DIF analysis, as described by Shealy and Stout (1993), they are negligible compared to the many influences that can undermine a differential prediction study.

Around 100 ETS research bulletins, memoranda, or reports have been produced on the topics of item fairness, DIF, or item bias. The vast majority of these studies were published in the late 1980s and early 1990s. The major emphases of these reports can be sorted into categories and are treated in subsections of this section: Differential Item Functioning Methods, Matching Variable Issues, Study Group Definitions, and Sample Size and Power Issues. The DIF methods section begins with some definitions, followed by a review of procedures that were suggested before the term DIF was introduced. Most of the section then describes the following procedures: Mantel-Haenszel (MH), standardization (STAND), item response theory (IRT), and SIBTEST.

Differential Item Functioning (DIF) Methods

Two reviews of DIF methods were conducted by ETS staff: Dorans and Potenza (1994), which was shortened and published, and a second review by Middleton.

Null DIF is the absence of DIF. One definition of null DIF, observed-score null DIF, is that all individuals with the same score on a test should have the same proportion answering the item correctly, regardless of whether they are from the reference or the focal group. The latent-variable definition of null DIF can be used to compare the performance of focal and reference subgroups that are matched with respect to a latent variable. An observed difference in average item scores between two groups that may differ in their distributions of score on the matching variable is referred to as impact. With impact, we compare groups that may or may not be comparable with respect to the construct being measured by the item; with DIF, we compare groups that are comparable with respect to an estimate of their standing on that construct. The studied item score refers to the scoring rule used for the items being studied for DIF. Studied items can be scored either as correct/incorrect (i.e., binary) or using more than two response categories (i.e., polytomous). The matching variable is a variable used in the process of comparing the reference and focal groups (e.g., total test score or subscore) so that comparable groups are formed. In other words, matching is a way of establishing score equivalence between the groups of interest in DIF analyses.

Angoff and Sharon (1974) also employed an analysis of variance (ANOVA) method, but by then the transformed item difficulty (TID), or delta-plot, method had been adopted for item bias research. Angoff (1972) introduced this approach, which was rooted in Thurstone's absolute scaling model; it had been employed earlier by Tucker (1951) in a study of academic ability on vocabulary items. The method uses an inverse normal transformation to convert item proportion-correct values for two groups to normal deviates that are expected to form an ellipse. Items that deviate from the ellipse exhibit the item difficulty-by-group interaction that is indicative of what was called item bias. Angoff and Ford (1971, 1973) are the standard references for this approach. The delta-plot method is akin to the Rasch (1960) model approach to assessing DIF. If items differ in their discriminatory power and the groups under study differ in terms of proficiency, then items will exhibit item-by-group interactions even when there are no differences in item functioning. This point was noted by several scholars.
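As a rough illustration of the delta-plot idea, the sketch below converts each group's proportions correct to the delta metric (13 plus 4 times the inverse-normal deviate of the proportion incorrect, the conventional ETS scaling discussed later in this section), fits a principal-axis line to the paired deltas, and flags items that fall far from the line. The data values are hypothetical, and the principal-axis fit is one simple way to operationalize the ellipse comparison described above rather than the exact published procedure.

    # Sketch of the transformed item difficulty (delta-plot) idea: convert each group's
    # proportion correct to a "delta" and look for items that stray from the line
    # relating the two groups' deltas. The 13 + 4z scaling and the principal-axis fit
    # are conventional choices; details of the published method may differ.
    import numpy as np
    from scipy.stats import norm

    def to_delta(p):
        """Inverse-normal transform of proportion correct; harder items get larger deltas."""
        return 13.0 + 4.0 * norm.ppf(1.0 - np.asarray(p))

    # Hypothetical proportion-correct values for the same items in two groups.
    p_ref = np.array([0.85, 0.70, 0.55, 0.40, 0.60])
    p_foc = np.array([0.80, 0.62, 0.47, 0.33, 0.35])   # last item is unusually hard for the focal group

    d_ref, d_foc = to_delta(p_ref), to_delta(p_foc)

    # Principal-axis (major-axis) line through the scatter of (d_ref, d_foc) points.
    cov = np.cov(d_ref, d_foc)
    slope = (cov[1, 1] - cov[0, 0] +
             np.sqrt((cov[1, 1] - cov[0, 0]) ** 2 + 4 * cov[0, 1] ** 2)) / (2 * cov[0, 1])
    intercept = d_foc.mean() - slope * d_ref.mean()

    # Perpendicular distance of each item from the line: large distances flag
    # the item-difficulty-by-group interactions described above.
    dist = (slope * d_ref - d_foc + intercept) / np.sqrt(slope ** 2 + 1)
    for i, d in enumerate(dist):
        print(f"item {i + 1}: distance from principal axis = {d:+.2f}")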
Two procedures may be viewed as precursors of the eventual move to condition directly on total score that was adopted by the STAND and MH approaches.

Mantel-Haenszel (MH): Original implementation at ETS

In their seminal paper, Mantel and Haenszel (1959) introduced a new procedure for the study of matched groups. Holland and Thayer (1986, 1988) adapted the procedure for use in assessing DIF. This adaptation, the MH method, is used at ETS as the primary DIF detection device. The basic data used by the MH method are in the form of M 2-by-2 contingency tables, or one large three-dimensional 2-by-2-by-M table, where M is the number of levels of the matching variable. Under rights-scoring of the items, in which responses are coded as either correct or incorrect (including omissions), the proportions of rights and wrongs on each item in the target population can be arranged into a contingency table for each item being studied. There are two levels for group: the focal group (f), which is the focus of analysis, and the reference group (r), which serves as a basis of comparison for the focal group. There are also two levels for item response: right (R) or wrong (W); and there are M score levels on the matching variable (e.g., total score). Finally, the item being analyzed is referred to as the studied item. The data can thus be arranged as a 2 (groups)-by-2 (item scores)-by-M (score levels) contingency table (see Table 1, 2-by-2-by-M Contingency Table).

Under the null DIF hypothesis, the odds of getting the item correct at a given level of the matching variable are the same in the focal-group and reference-group portions of the population, and this equality holds across all M levels of the matching variable. Under the alternative hypothesis, the odds of a correct response in the reference group are a constant multiple α of the corresponding odds in the focal group at every level of the matching variable. Note that when α = 1, the alternative hypothesis reduces to the null DIF hypothesis. The parameter α is called the common odds ratio in the M 2-by-2 tables because, under Ha, the value of α is the odds ratio that is the same for all m. Holland and Thayer (1988) reported that the MH approach is the test possessing the most statistical power for detecting departures from the null DIF hypothesis that are consistent with the constant odds-ratio hypothesis.

Mantel and Haenszel (1959) also provided an estimate of the constant odds ratio, which ranges from 0 to ∞ and for which a value of 1 can be taken to indicate null DIF. This odds-ratio metric is not particularly meaningful to test developers, who are used to working with numbers on an item difficulty scale. In general, odds are converted to natural-log odds because the latter are symmetric around zero and easier to interpret. At ETS, test developers use item difficulty estimates in the delta metric, which has a mean of 13 and a standard deviation of 4. Large values of delta correspond to difficult items, while easy items have small values of delta.
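The following sketch shows how the common odds ratio can be estimated from the M 2-by-2 tables and re-expressed on the delta scale. The counts are hypothetical, and the -2.35·ln(α) rescaling at the end (often called MH D-DIF) is the customary conversion to the delta metric, stated here as an assumption rather than quoted from the report.

    # Sketch of the MH common odds-ratio estimate from M 2-by-2 tables and its
    # conversion to the delta metric. The -2.35 * ln(alpha) rescaling is assumed.
    import numpy as np

    # Hypothetical counts for one studied item at M = 4 levels of the matching variable:
    # rows are matching-score levels; columns are (right, wrong) counts.
    ref = np.array([[30, 20], [45, 15], [60, 10], [70,  5]], dtype=float)   # reference group
    foc = np.array([[20, 25], [30, 20], [40, 15], [45, 10]], dtype=float)   # focal group

    n_m = ref.sum(axis=1) + foc.sum(axis=1)          # total count at each matching level

    # Mantel-Haenszel common odds-ratio estimate:
    #   alpha_hat = sum_m (R_rm * W_fm / N_m) / sum_m (R_fm * W_rm / N_m)
    alpha_hat = (ref[:, 0] * foc[:, 1] / n_m).sum() / (foc[:, 0] * ref[:, 1] / n_m).sum()

    mh_d_dif = -2.35 * np.log(alpha_hat)             # delta-metric effect size

    print(f"MH common odds ratio: {alpha_hat:.3f}")
    print(f"MH D-DIF (delta metric): {mh_d_dif:+.2f}")
    # alpha_hat = 1 (MH D-DIF = 0) corresponds to null DIF; negative MH D-DIF values
    # indicate the item is harder for the focal group than for matched reference-group members.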
One simulation study found that both the MH and STAND methods had problems detecting IRT DIF in items with nonzero lower asymptotes. Its two major findings were the need to have enough items in the matching variable to ensure reliable matching for either method and the need to include the studied item in the matching variable in MH analyses. This study thus provided support for the analytical argument that had been made earlier for inclusion of the studied item in the matching variable.

Longford and colleagues demonstrated how to use a random-effects or variance-component model to aggregate DIF results for groups of items. In particular, they showed how to combine DIF estimates from several administrations to obtain variance components for administration differences in DIF within an item. In their examples, they demonstrated how to use their models to improve estimation within an administration and how to combine evidence across items in randomized DIF studies. Subsequently, ETS researchers have employed Bayesian methods with the goal of pooling data across administrations to yield more stable DIF estimates within an administration. These approaches are discussed in the section on sample size and power issues.

Allen and Holland (1993) used a missing-data framework to address the missing-data problem in DIF analyses where "no response" to the self-reported group identification question is large, a common problem in applied settings. They showed how MH and STAND statistics can be affected by different assumptions about nonresponses.

Zwick and her colleagues examined DIF in the context of computer adaptive testing (CAT), in which tests are tailored to the individual test taker on the basis of his or her responses to previous items. Zwick, Thayer, and Wingersky (1993) described in great detail a simulation study in which they examined the performance of MH and STAND procedures that had been modified for use with data collected adaptively. The modification to the DIF procedures involved replacing the standard number-right matching variable with a matching variable based on IRT, obtained by converting a maximum likelihood estimate of ability to an expected number-right true score on all items in the reference pool. Examinees whose expected true scores fell in the same one-unit intervals were considered to be matched. They found that DIF statistics computed in this way for CAT were similar to those obtained with the traditional matching variable of performance on the total test. In addition, they found that pretest DIF statistics were generally well behaved, but the MH DIF statistics tended to have larger standard errors for the pretest items than for the CAT items. Zwick, Thayer, and Wingersky (1994) addressed the effect of using alternative matching methods for pretest items. Using a more elegant matching procedure did not lead to a reduction of the MH standard errors and produced DIF measures that were nearly identical to those from the earlier study. Further investigation showed that the MH standard errors tended to be larger when items were administered to examinees with a wide ability range, whereas the opposite was true of the standard errors of the STAND DIF statistic.

CAT can be thought of as a very complex form of item sampling. The sampling procedure used by the National Assessment of Educational Progress (NAEP) is another form of complex sampling. Allen and Donoghue (1995) used a simulation study to examine the effect of complex sampling of items on the measurement of DIF using the MH DIF procedure. Data were generated using a three-parameter logistic (3PL) IRT model according to a balanced incomplete block (BIB) design. The length of each block of items and the number of DIF items in the matching variable were varied, as were the difficulty, discrimination, and presence of DIF in the studied item. Block, booklet, pooled-booklet, and other approaches to matching on more than the block were compared to a complete-data analysis using the transformed log-odds on the delta scale. The pooled-booklet approach was recommended for use when items are selected for examinees according to a BIB data collection design.

Zwick, Donoghue, and Grima (1993) noted that some forms of performance assessment may in fact be more likely to tap construct-irrelevant factors than multiple-choice items are. The assessment of DIF can be used to investigate the effect on subpopulations of the introduction of performance tasks. Two extensions of the MH procedure were explored, including a test of conditional association.

Standardization (STAND)
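STAND, like MH, conditions directly on the matching variable: it compares the focal and reference groups' conditional proportions correct on the studied item at each score level and summarizes the differences with a weight function. The sketch below computes such a standardized proportion difference using focal-group counts as weights; the weighting choice and the data are illustrative assumptions rather than the operational STAND specification.

    # Sketch of a standardization (STAND) style index: the weighted average difference
    # between focal- and reference-group proportions correct, conditional on the matching
    # variable. Focal-group weights are assumed here; the operational procedure may differ.
    import numpy as np

    # Hypothetical data at M = 4 matching-score levels for one studied item.
    p_foc = np.array([0.42, 0.58, 0.71, 0.83])          # focal-group proportion correct at each level
    p_ref = np.array([0.48, 0.63, 0.77, 0.86])          # reference-group proportion correct
    n_foc = np.array([120, 200, 180, 90], dtype=float)  # focal-group counts used as weights

    std_p_dif = np.sum(n_foc * (p_foc - p_ref)) / n_foc.sum()
    print(f"standardized proportion difference: {std_p_dif:+.3f}")
    # Values near zero are consistent with null DIF as defined earlier; negative values
    # mean matched focal-group members answer the item correctly less often.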

    The Stability of the Score Scales for the SAT Reasoning Test™ From 2005 to 2010

    No full text
    Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. SAT REASONING TEST is a trademark of the College Board. As part of its nonprofit mission, ETS conducts and disseminates the results of research to advance quality and equity in education and assessment for the benefit of ETS's constituents and the field. The results indicate a significant upward scale drift (11 points on average), which may be caused by sources other than random equating errors.