8 research outputs found

    ETS Contributions to the Quantitative Assessment of Item, Test, and Score Fairness (ETS R&D Scientific and Policy Contributions Series)

    No full text
    Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS R&D Scientific and Policy Contributions series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS R&D Scientific and Policy Contributions series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of ETS.

The Daniel Eignor Editorship is named in honor of Dr. Daniel R. Eignor, who from 2001 until 2011 served the Research and Development division as Editor for the ETS Research Report series. The Eignor Editorship has been created to recognize the pivotal leadership role that Dr. Eignor played in the research publication process at ETS.

Abstract

Quantitative fairness procedures have been developed and modified by ETS staff over the past several decades. ETS has been a leader in fairness assessment, and its efforts are reviewed in this report. The first section deals with differential prediction and differential validity procedures that examine whether test scores predict a criterion, such as performance in college, across different subgroups in a similar manner. The bulk of this report focuses on item-level fairness, or differential item functioning, which is addressed in the various subsections of the second section. In the third section, I consider research pertaining to whether tests built to the same set of specifications produce scores that are related in the same way across different gender and ethnic groups. Limitations of the approaches reviewed here are discussed in the final section.

Key words: fairness, differential prediction, differential item functioning, score equity assessment, ETS, quantitative methods

Foreword

Since its founding in 1947, ETS has conducted a significant and wide-ranging research program that has focused on, among other things, psychometric and statistical methodology; educational evaluation; performance assessment and scoring; large-scale assessment and evaluation; cognitive, developmental, personality, and social psychology; and education policy. This broad-based research program has helped build the science and practice of educational measurement, as well as inform policy debates. In 2010, we began to synthesize these scientific and policy contributions, with the intention to release a series of reports sequentially over the course of the next few years. These reports constitute the ETS R&D Scientific and Policy Contributions Series.

In the seventh report in the series, Neil Dorans looks at quantitative fairness assessment procedures developed and modified by ETS staff, which have helped to make ETS a leader in fairness assessment. Almost since the inception of the organization in 1947, ETS has been concerned with the issues of fairness.
In the late 1940s and early 1950s, William Turnbull, who later became the second president of ETS, was an early advocate of fairness, recommending the comparison of prediction equations as a method for assessing test fairness. In the 1980s, interest in fairness in the assessment community shifted from scores to items, as evidenced in widespread studies of differential item functioning (DIF). ETS, under the direction of Gregory Anrig, the third ETS president, established the industry standard for fairness assessment at the item level, and ETS has been in the vanguard in conducting DIF analyses as a standard psychometric check of test quality for over a quarter of a century.

Dorans is a distinguished presidential appointee at ETS. A major focus of his operational and research efforts during his career at ETS has been on the quantitative evaluation of fairness at the item and score levels. Since the early 1980s, he has been involved with most score equatings of the SAT® test. Equating is essential to the process of producing fair scores. The architect of the recentered SAT scales, he has also performed score linking studies relating the SAT I to the ACT and the Prueba de Aptitud Academica (PAA™). Dorans co-edited a book on score linking and scale aligning, and co-authored a book on computer adaptive testing. In addition to describing these procedures, the report notes their limitations.

Fair Prediction of a Criterion

Turnbull (1951a) concluded his early ETS treatment of fairness with the following statement: "Fairness, like its amoral brother, validity, resides not in tests or test scores but in the relation to its uses" (pp. 4-5). While several ETS authors had addressed the relatively lower performance of minority groups on tests of cognitive ability and its relationship to grades (e.g., Campbell, 1964), Cleary (1968) conducted one of the first differential prediction studies. That study has been widely cited and critiqued. A few years after the Cleary article, the field was replete with differential validity studies, which focus on comparing correlation coefficients, and differential prediction studies, which focus on comparing regression functions (a minimal illustration of such a comparison appears below), in large part because of interest engendered by the Supreme Court decision Griggs v. Duke Power Co. in 1971. This decision included the terms business necessity and adverse impact, both of which affected employment testing. Adverse impact is a substantially different rate of selection in hiring, promotion, transfer, training, or other employment-related decisions for any race, sex, or ethnic group. Business necessity can be used by an employer as a justification for using a selection mechanism that appears to be neutral with respect to sex, race, national origin, or religious group even though it excludes members of one sex, race, national origin, or religious group at a substantially higher rate than members of other groups. The employer must prove that the selection requirement having the adverse impact is job related and consistent with business necessity. In other words, in addition to avoiding the explicit use of race, ethnicity, or gender as part of the selection process, the selection instrument had to have demonstrated validity for its use. Ideally, this validity would be the same for all subpopulations.

Linn (1972) considered the implications of the Griggs decision for test makers and users. A main implication was that there would be a need for empirical demonstrations that test scores predict criterion performance, such as how well one does on the job.
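The following is a minimal, generic sketch of the kind of comparison a differential prediction analysis makes: the same linear prediction equation (criterion regressed on a test score) is fit separately in a reference and a focal group, and the fitted intercepts and slopes are compared. The data and variable names are hypothetical illustrations, not taken from any ETS study.

```python
# Minimal sketch of a differential prediction check: fit criterion ~ predictor
# separately in a reference group and a focal group, then compare equations.
import numpy as np

rng = np.random.default_rng(0)

def fit_line(x, y):
    """Return [intercept, slope] from an ordinary least-squares fit."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical data: test score X predicting a criterion Y (e.g., first-year GPA).
x_ref = rng.normal(50, 10, 1000)
y_ref = 0.5 + 0.04 * x_ref + rng.normal(0, 0.5, 1000)
x_foc = rng.normal(45, 10, 400)
y_foc = 0.5 + 0.04 * x_foc + rng.normal(0, 0.5, 400)  # same true equation here

b_ref = fit_line(x_ref, y_ref)
b_foc = fit_line(x_foc, y_foc)
print("reference intercept/slope:", np.round(b_ref, 3))
print("focal     intercept/slope:", np.round(b_foc, 3))
# Fair prediction in the sense discussed above requires these prediction equations
# to coincide; large, systematic differences would signal differential prediction.
```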
In an educational context, test scores may be used with other information to predict the criterion of average course grade. Reliability alone would not be an adequate justification for use of test scores. Linn also noted that for fair prediction to hold, the prediction model must include all the appropriate variables; otherwise, misspecification of the model can give the appearance of statistical bias. The prediction model should include all the predictors needed to predict Y, and the functional form used to combine the predictors should be the correct one. The reliabilities of the predictors were also noted to play a role. These limitations of differential validity and differential prediction studies have been cogently summarized in the literature. Linn (1975) later noted that differential prediction analyses should be preferred to differential validity studies because differences in predictor or criterion variability can produce differential validity even when the prediction model is fair. Differential prediction analyses examine whether the same prediction models hold across different groups. Fair prediction or selection requires invariance of prediction equations across groups, R(X, G = f) = R(X, G = r), where R is the symbol for the function used to predict Y, the criterion score, from X, the predictor, and G is a variable indicating subgroup membership. Petersen and Novick (1976) compared several models for assessing fair selection, including the regression model, which Linn (1976) also addressed in his discussion.

Differential Item Functioning (DIF)

During the 1980s, the focus in the profession shifted to DIF studies. Although interest in item bias studies began in the 1960s (Angoff, 1993), it was not until the 1980s that interest in fair assessment at the item level became widespread. During the 1980s, the measurement profession engaged in the development of item-level models for a wide array of purposes. DIF procedures developed as part of that shift in attention from the score to the item. Moving the focus of attention to prediction of item scores, which is what DIF is about, represented a major change from focusing primarily on fairness in a domain where so many factors could spoil the validity effort to a domain where analyses could be conducted in a relatively simple, less confounded way. While factors such as multidimensionality can complicate a DIF analysis, as described by Shealy and Stout (1993), they are negligible compared to the many influences that can undermine a differential prediction study.

Around 100 ETS research bulletins, memoranda, or reports have been produced on the topics of item fairness, DIF, or item bias. The vast majority of these studies were published in the late 1980s and early 1990s. The major emphases of these reports can be sorted into categories and are treated in subsections of this section: Differential Item Functioning Methods, Matching Variable Issues, Study Group Definitions, and Sample Size and Power Issues. The DIF methods section begins with some definitions, followed by a review of procedures that were suggested before the term DIF was introduced. Most of the section then describes the following procedures: Mantel-Haenszel (MH), standardization (STAND), item response theory (IRT), and SIBTEST.

Differential Item Functioning (DIF) Methods

Two reviews of DIF methods were conducted by ETS staff: Dorans and Potenza (1994), which was shortened and published, and a second review involving Middleton. Null DIF is the absence of DIF.
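For reference, the two null DIF conditions discussed in the passage that follows can be written compactly. The notation here (studied item score u, observed matching score X, latent variable θ, group indicator G) is assumed for illustration and is not taken verbatim from the report:

$$E[\,u \mid X = x,\ G = f\,] \;=\; E[\,u \mid X = x,\ G = r\,] \quad \text{for all } x \qquad \text{(observed-score null DIF)}$$

$$E[\,u \mid \theta,\ G = f\,] \;=\; E[\,u \mid \theta,\ G = r\,] \quad \text{for all } \theta \qquad \text{(latent-variable null DIF)}$$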
One definition of null DIF, observed-score null DIF, is that all individuals with the same score on a test should have the same proportion answering the item correctly, regardless of whether they are from the reference or focal group. The latent-variable definition of null DIF can be used to compare the performance of focal and reference subgroups that are matched with respect to a latent variable. An observed difference in average item scores between two groups that may differ in their distributions of score on the matching variable is referred to as impact. With impact, we compare groups that may or may not be comparable with respect to the construct being measured by the item; with DIF, we compare groups that are comparable with respect to an estimate of their standing on that construct.

The studied item score refers to the scoring rule used for the items being studied for DIF. Studied items can be scored either as correct/incorrect (i.e., binary) or using more than two response categories (i.e., polytomous). The matching variable is a variable used in the process of comparing the reference and focal groups (e.g., total test score or subscore) so that comparable groups are formed. In other words, matching is a way of establishing score equivalence between the groups that are of interest in DIF analyses.

Angoff and Sharon (1974) also employed an analysis of variance (ANOVA) method, but by then the transformed item difficulty (TID) or delta-plot method had been adopted for item bias research. Angoff (1972) introduced this approach, which was rooted in Thurstone's absolute scaling model; it had earlier been employed by Tucker (1951) in a study of academic ability on vocabulary items. The method uses an inverse normal transformation to convert item proportion-correct values for two groups to normal deviates that are expected to form an ellipse. Items that deviate from the ellipse exhibit the item-difficulty-by-group interaction that is indicative of what was called item bias. Ford (1971, 1973) are the standard references for this approach. The delta-plot method is akin to the Rasch (1960) model approach to assessing DIF. If items differ in their discriminatory power and the groups under study differ in terms of proficiency, then items will exhibit item-by-group interactions even when there are no differences in item functioning. This point was noted by several scholars. Two procedures may be viewed as precursors of the eventual move to condition directly on total score that was adopted by the STAND and MH procedures.

Mantel-Haenszel (MH): Original implementation at ETS. In their seminal paper, Mantel and Haenszel (1959) introduced a new procedure for the study of matched groups. Holland and Thayer (1986, 1988) adapted the procedure for use in assessing DIF. This adaptation, the MH method, is used at ETS as the primary DIF detection device. The basic data used by the MH method are in the form of M 2-by-2 contingency tables, or one large three-dimensional 2-by-2-by-M table, where M is the number of levels of the matching variable. Under rights-scoring for the items, in which responses are coded as either correct or incorrect (including omissions), the proportions of rights and wrongs on each item in the target population can be arranged into a contingency table for each item being studied. There are two levels for group: the focal group (f) that is the focus of analysis, and the reference group (r) that serves as a basis for comparison for the focal group.
There are also two levels for item response, right (R) or wrong (W), and there are M score levels on the matching variable (e.g., total score). Finally, the item being analyzed is referred to as the studied item. The data for each studied item can thus be arranged in a 2 (groups)-by-2 (item scores)-by-M (score levels) contingency table (see Table 1). The null DIF hypothesis states that, at each level m of the matching variable, the odds of a correct response on the studied item are the same in the reference and focal groups; the constant odds-ratio alternative states that, at each level m, the odds in the reference group equal α times the odds in the focal group. In other words, under the null hypothesis the odds of getting the item correct at a given level of the matching variable are the same in both the focal group and the reference group portions of the population, and this equality holds across all M levels of the matching variable. Note that when α = 1, the alternative hypothesis reduces to the null DIF hypothesis. The parameter α is called the common odds ratio in the M 2-by-2 tables because, under H_a, the value of α is the odds ratio that is the same for all m. Holland and Thayer (1988) reported that the MH approach is the test possessing the most statistical power for detecting departures from the null DIF hypothesis that are consistent with the constant odds-ratio hypothesis.

Mantel and Haenszel (1959) also provided an estimate of the constant odds ratio, which ranges from 0 to ∞ and for which a value of 1 can be taken to indicate null DIF. This odds-ratio metric is not particularly meaningful to test developers, who are used to working with numbers on an item difficulty scale. In general, odds are converted to log odds (natural logarithm) because the latter metric is symmetric around zero and easier to interpret. At ETS, test developers use item difficulty estimates in the delta metric, which has a mean of 13 and a standard deviation of 4. Large values of delta correspond to difficult items, while easy items have small values of delta. (A rough computational sketch of the MH statistic and its conversion to the delta metric appears at the end of this excerpt.)

One study found that both the MH and STAND methods had problems detecting IRT DIF in items with nonzero lower asymptotes. Its two major findings were the need to have enough items in the matching variable to ensure reliable matching for either method, and the need to include the studied item in the matching variable in MH analysis. This study thus provided support for the analytical argument for inclusion of the studied item that had been made earlier.

Longford and colleagues demonstrated how to use a random-effects or variance-component model to aggregate DIF results for groups of items. In particular, they showed how to combine DIF estimates from several administrations to obtain variance components for administration differences for DIF within an item. In their examples, they demonstrated how to use their models to improve estimation within an administration and how to combine evidence across items in randomized DIF studies. Subsequently, ETS researchers have employed Bayesian methods with the goal of pooling data across administrations to yield more stable DIF estimates within an administration. These approaches are discussed in the section on sample size and power issues.

Allen and Holland (1993) used a missing data framework to address the missing data problem in DIF analyses where "no response" to the self-reported group identification question is large, a common problem in applied settings. They showed how MH and STAND statistics can be affected by different assumptions about nonresponses.

Zwick and her colleagues examined DIF in the context of computer adaptive testing (CAT), in which tests are tailored to the individual test taker on the basis of his or her responses to previous items.
Zwick, Thayer, and Wingersky (1993) described in great detail a simulation study in which they examined the performance of MH and STAND procedures that had been modified for use with data collected adaptively. The modification to the DIF procedures involved replacing the standard number-right matching variable with a matching variable based on IRT, which was obtained by converting a maximum likelihood estimate of ability to an expected number-right true score on all items in the reference pool. Examinees whose expected true scores fell in the same one-unit intervals were considered to be matched. They found that DIF statistics computed in this way for CAT were similar to those obtained with the traditional matching variable of performance on the total test. In addition, they found that pretest DIF statistics were generally well behaved, but the MH DIF statistics tended to have larger standard errors for the pretest items than for the CAT items. Zwick, Thayer, and Wingersky (1994) addressed the effect of using alternative matching methods for pretest items. Using a more elegant matching procedure did not lead to a reduction of the MH standard errors and produced DIF measures that were nearly identical to those from the earlier study. Further investigation showed that the MH standard errors tended to be larger when items were administered to examinees with a wide ability range, whereas the opposite was true of the standard errors of the STAND DIF statistic.

CAT can be thought of as a very complex form of item sampling. The sampling procedure used by the National Assessment of Educational Progress (NAEP) is another form of complex sampling. Allen and Donoghue (1995) used a simulation study to examine the effect of complex sampling of items on the measurement of DIF using the MH DIF procedure. Data were generated using a three-parameter logistic (3PL) IRT model according to a balanced incomplete block design. The length of each block of items and the number of DIF items in the matching variable were varied, as were the difficulty, discrimination, and presence of DIF in the studied item. Block, booklet, pooled-booklet, and other approaches to matching on more than the block were compared to a complete-data analysis using the transformed log-odds on the delta scale. The pooled-booklet approach was recommended for use when items are selected for examinees according to a balanced incomplete block (BIB) data collection design.

Zwick, Donoghue, and Grima (1993) noted that some forms of performance assessment may in fact be more likely to tap construct-irrelevant factors than multiple-choice items are. The assessment of DIF can be used to investigate the effect on subpopulations of the introduction of performance tasks. Two extensions of the MH procedure were explored, one of them a test of conditional association.
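As a rough computational illustration of the MH machinery described above, and not code from any ETS report, the following sketch computes the MH common odds-ratio estimate from M hypothetical 2-by-2 tables and converts it to a difference on the delta metric (MH D-DIF = −2.35 ln α̂). The counts are made up for the example.

```python
# Hedged sketch: Mantel-Haenszel common odds ratio and MH D-DIF on the delta scale.
# Counts are hypothetical; R = right, W = wrong, r = reference, f = focal,
# one row per level m of the matching variable (e.g., total score).
import math

# (R_r, W_r, R_f, W_f) at each matching-score level m
tables = [
    (30, 20, 25, 25),
    (45, 15, 40, 20),
    (60, 10, 55, 15),
]

num = 0.0  # sum over m of R_rm * W_fm / N_m
den = 0.0  # sum over m of R_fm * W_rm / N_m
for r_right, r_wrong, f_right, f_wrong in tables:
    n_m = r_right + r_wrong + f_right + f_wrong  # total count at level m
    num += r_right * f_wrong / n_m
    den += f_right * r_wrong / n_m

alpha_mh = num / den                    # common odds-ratio estimate (1.0 = null DIF)
mh_d_dif = -2.35 * math.log(alpha_mh)   # difference expressed on the ETS delta metric

print(f"alpha_MH = {alpha_mh:.3f}, MH D-DIF = {mh_d_dif:.3f}")
# Under the usual convention, negative MH D-DIF values indicate that the studied
# item is differentially harder for the focal group than for matched reference
# group members; values near zero are consistent with null DIF.
```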

    Repurposing a Business Learning Outcomes Assessment to College Students Outside of the United States: Validity and Reliability Evidence (ETS Research Report Series)

    No full text

    Weighting Test Samples in IRT Linking and Equating: Toward an Improved Sampling Design for Complex Equating (ETS Research Report Series)

    No full text
    Abstract: Several factors could cause variability in item response theory (IRT) linking and equating procedures, such as variability across examinee samples and/or test items, seasonality, regional differences, native language diversity, gender, and other demographic variables. Hence, the following question arises: Is it possible to select optimal samples of examinees so that IRT linking and equating can be more precise at an administration level as well as over a large number of administrations? This is a question of optimal sampling design in linking and equating. To obtain an improved sampling design for invariant linking and equating across testing administrations, we applied weighting techniques to yield a weighted sample distribution that is consistent with the target population distribution. The goal is to obtain a stable Stocking-Lord test characteristic curve (TCC) linking and a true-score equating that is invariant across administrations. To study the weighting effects on linking, we first selected multiple subsamples from a data set. We then compared the linking parameters from the subsamples with those from the full data set and examined whether the linking parameters from the weighted sample yielded smaller mean square errors (MSE) than those from the unweighted subsample. To study the weighting effects on true-score equating, we also compared the distributions of the equated scores. Generally, the findings were that the weighting produced good results.
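To make the weighting idea concrete, here is a minimal, hypothetical sketch (not the authors' procedure): poststratification-style weights are computed so that a subsample's distribution over discrete strata matches an assumed target population distribution; weights of this kind could then enter the weighted estimates used in a Stocking-Lord TCC linking. Stratum names and proportions are invented for illustration.

```python
# Minimal, hypothetical sketch of weighting a subsample toward a target
# population distribution over discrete strata (not the report's procedure).
import numpy as np

# Assumed example strata proportions: target (population) vs. observed subsample.
target_props = {"stratum_A": 0.40, "stratum_B": 0.35, "stratum_C": 0.25}
sample_strata = np.array(["stratum_A"] * 200 + ["stratum_B"] * 500 + ["stratum_C"] * 300)

n = len(sample_strata)
sample_props = {s: float(np.mean(sample_strata == s)) for s in target_props}

# Weight for each examinee: target proportion / sample proportion of that stratum.
weights = np.array([target_props[s] / sample_props[s] for s in sample_strata])
weights *= n / weights.sum()  # normalize so the weights sum to the sample size

for s in target_props:
    w_prop = weights[sample_strata == s].sum() / n
    print(f"{s}: sample {sample_props[s]:.2f} -> weighted {w_prop:.2f} (target {target_props[s]:.2f})")

# In an equating application, these weights would be applied when estimating the
# score or ability distributions that feed the Stocking-Lord linking, so that
# successive administrations are linked with respect to a common target population.
```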

    Identifying the Most Important 21st Century Workforce Competencies: An Analysis of the Occupational Information Network (O*NET) (ETS Research Report Series)

    No full text
    Consistent with this conclusion, an analysis correlating component scores with wages found that 4 of these 5 competencies were strongly related to wages, the exception being teamwork.

    A Note on Explaining Away and Paradoxical Results in Multidimensional Item Response Theory (ETS Research Report Series)

    No full text
    Abstract: Hooker and colleagues addressed a paradoxical situation that can arise in the application of multidimensional item response theory (MIRT) models to educational test data. We demonstrate that this MIRT paradox is an instance of the explaining-away phenomenon in Bayesian networks, and we attempt to enhance the understanding of MIRT models by placing the paradox in a broader statistical modeling perspective.
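The explaining-away phenomenon the abstract refers to can be illustrated with a deliberately tiny, generic Bayesian-network example (not the MIRT example from the report): two independent binary causes and an effect that occurs if either cause is present. Conditioning on the effect makes the causes dependent, so learning that one cause is present lowers the probability of the other.

```python
# Generic illustration of explaining away (not the MIRT example from the report):
# two independent binary "skills" A and B, and an observed outcome C = A or B.
# Observing C = 1 makes A and B dependent: learning A = 1 lowers P(B = 1 | C = 1).
from itertools import product

p_a, p_b = 0.5, 0.5  # hypothetical prior probabilities

def joint(a, b):
    """Joint probability of the independent causes A = a and B = b."""
    return (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)

# P(B = 1 | C = 1): when b = 1, the outcome C = a or b is automatically 1.
num = sum(joint(a, 1) for a in (0, 1))
den = sum(joint(a, b) for a, b in product((0, 1), repeat=2) if (a or b))
p_b_given_c = num / den

# P(B = 1 | C = 1, A = 1): once A = 1 is known, C = 1 carries no extra information about B.
p_b_given_c_and_a = joint(1, 1) / (joint(1, 0) + joint(1, 1))

print(f"P(B=1 | C=1)      = {p_b_given_c:.3f}")       # 0.667
print(f"P(B=1 | C=1, A=1) = {p_b_given_c_and_a:.3f}")  # 0.500 -> A 'explains away' B
```

In the MIRT setting discussed in the report, an analogous conditional dependence is what can make the estimate of one ability dimension decrease after an additional correct response, the paradoxical behavior the authors connect to explaining away.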

    Building a Case to Develop Noncognitive Assessment Products and Services Targeting Workforce Readiness at ETS (ETS Research Memorandum Series)

    No full text
    Abstract

The goal of this paper is to establish the case for conducting research on the readiness of individuals for the workforce as part of the Workforce Readiness Initiative at ETS, with specific emphasis, at least initially, on noncognitive indicators of readiness. We begin by defining a conceptual framework that encompasses noncognitive constructs and measures, followed by a brief review of the literature highlighting the importance of noncognitive predictors in education and the workforce. Next, we examine the importance of research on workforce readiness and consider how ETS can conduct workforce readiness research. Finally, we give an overview of work accomplished in this area at ETS, include an action plan of workforce research in progress, and summarize future planned directions.

Of the five main personality factors, conscientiousness has been shown to be the most consistently related to job performance. Given the large body of evidence supporting the importance of noncognitive variables in education and in the workforce, there are a number of ways in which researching noncognitive predictors of workforce skills serves ETS as an organization. This paper highlights arguments from a policy perspective, as well as arguments based on scientific merit and business practices.

Importance of Workforce Readiness Research

ETS has emerged as a leader in alerting the American public to the need for a better prepared workforce. The ETS Policy Information Report, America's Perfect Storm (Kirsch et al., 2007), describes three converging forces affecting the nation's workforce.

The First Force

The Second Force

Research on the noncognitive predictors of workplace performance clearly addresses issues related to ongoing structural changes to the economy. Labor markets are different than in decades past, due in part to "industrial and corporate restructuring, declines in unionization, technological change, and globalization" (Kirsch et al., 2007, p. 6). These changes favor workers who possess a different set of skills than were required under the old economic structure, and several of the most important of these skills can be characterized as noncognitive. Interestingly, these noncognitive skills were rated more important than skills traditionally taught and assessed by high schools and colleges.
In short, the business community is explicitly stating that classic cognitive skills are not enough for workplace success and that noncognitive skills are important as well. Addressing these business needs is a vital component of pursuing workforce research at ETS.

The Third Force

The third force identified by America's Perfect Storm is demographic change: specifically, there will be a dramatic increase in racial and ethnic diversity over this time period. The changing makeup of the American populace will result in an increase in the importance of noncognitive workforce assessments. Although cognitive assessments consistently demonstrate validity in workforce studies (and are legally defensible when shown to be directly relevant to the job), the use of traditional cognitive assessments alone can present problems in business contexts. For example, using traditional cognitive instruments as the sole predictor in selection contexts typically leads to some racial and ethnic minorities being selected at a lower rate than whites. However, as previously reviewed, noncognitive assessments typically result in less disparate impact than do traditional cognitive assessments. Thus, an increase in noncognitive workforce assessment has the potential to result in selection and training practices that produce a more diverse workplace for the 21st century workforce.

Developing the Case for ETS Involvement in Workforce Readiness Research

Mission Consistency

Any work conducted by ETS should be consistent with its mission and its status as a tax-exempt nonprofit organization. The first part of the ETS mission is "to advance quality and equity in education by providing fair and valid assessments, research and related services."

The lobbying restriction. This means that the organization cannot participate in political activities. This will not change as a result of conducting workforce-readiness research.

The public benefit test. "The organization must operate for the advantage of public, rather than private, interests. Private interests can be benefited, but only incidentally. Further, the principal beneficiaries of the organization's activities must be sufficiently numerous and well-defined so that the community is, in some way, served" (Bennett, 2008).

Research on workforce readiness is also consistent with the history of ETS. For example, Carl Brigham, inventor of the SAT, was interested in using assessment as a guide for instruction in addition to using it for selection.

Having outlined the policy and mission-related goals in conducting workforce research, how can ETS contribute to understanding workforce readiness? To address these policy goals, ETS can develop valid, reliable assessments of noncognitive skills and identify workers who have not accumulated these skills from their previous education. Next, ETS can develop interventions to improve these skills and follow up by assessing the effectiveness of these interventions. Examples of potential avenues of research from a policy perspective are underscored in the section on future directions below.

Scientific Merit

From a scientific perspective, ETS can provide unique contributions to the understanding of the relation between noncognitive constructs and performance. With access to large samples of graduating students, ETS can leverage its connections by following students after graduation to monitor workforce performance.
Workforce research can contribute toward a better solution for the longstanding criterion problem, one of the most important and difficult problems in workforce and general validation research; an example is the current Call Center Study at ETS, outlined in the Action Plan section. In addition, ETS has several advantages over typical firms that study workforce readiness. First, ETS's status as a nonprofit organization places it in a unique position to make substantial scientific contributions. That is, because it is not accountable to shareholders, ETS is free to take risks and attempt to answer more difficult workforce research questions than can the average consulting firm, which is more concerned with its financial bottom line. Furthermore, ETS's status as a world-class testing agency provides it with resources that would not be available to most firms. For example, ETS's psychometric capabilities are unmatched by most, if not all, for-profit consulting firms.

Business Value

Bennett (2008) stated that, in order to fund its mission, a nonprofit organization such as ETS must be responsive to market needs; the employer survey by Casner-Lotto and Barrington provides evidence of such needs. It is an open issue whether noncognitive skill assessments at ETS ought to be designed or advertised for purposes of selection or hiring decisions. But selection is only one of many possible uses for workforce assessments. ETS can provide unique insight for assessments designed for training and development, promotion, and succession-planning purposes. These kinds of noncognitive assessments can help organizations understand where employees might best be placed to improve skills or how to target interventions to promote the skills they would like to improve.

Much as the Myers-Briggs Type Indicator (MBTI) is widely used to assess employee skills and fit in an organization, ETS has the capability to develop and test products that would identify skilled workers for an organization. The MBTI purports to measure 16 personality types along four dimensions (extraversion-introversion, sensing-intuition, thinking-feeling, and judgment-perception), and it remains widely used despite suffering from significant psychometric problems in terms of both validity and reliability.

The TOEIC® test is an example of a successful ETS product that assesses skills and learning in a business context, as evidenced by its approximately 6.6 million annual tests administered. By measuring speaking, writing, listening, and reading proficiency in workplace English, the TOEIC not only assesses learning in a workplace context, but also meets a market need from both the business and educational communities. The PRAXIS™ series of teacher licensure and certification assessments also represents an existing workforce assessment at ETS that focuses on the profession of teaching, identifying and measuring general and subject-specific teaching skills. Finally, ETS has produced an iSkills™ certification assessment that incorporates computer literacy and critical thinking skills, an assessment designed to hone the proficiencies of college students as they prepare for the workforce. Just as each of these products was developed at ETS to assess workforce needs, creating products that assess important noncognitive skills can contribute to market needs for noncognitive measures that may lead to improved workforce readiness and success.
ETS should thus seek to develop valid and reliable assessments of constructs such as teamwork and work ethic (conscientiousness) that have previously been rated as very important to workforce success.

Summary of Perspectives

In the end, the goal of workforce research in an educational testing organization is

    The Stability of the Score Scales for the SAT Reasoning Test™ From 2005 to 2010 (ETS Research Report Series)

    No full text
    A significant upward scale drift (11 points on average) has been observed, which may be caused by sources other than random equating errors.