87 research outputs found

    Building a validity argument for the listening component of the Test de connaissance du français in the context of Quebec immigration

    Full text link
    Language testing is a ubiquitous practice in immigration contexts, used as a data collection procedure to assess immigrants’ ability to communicate in the language of the host country in order to promote social and economic integration as well as productivity in the workplace (McNamara & Shohamy, 2008). Unlike English tests, little attention has been directed to the interpretation and uses of scores from French proficiency tests, which prompts – indeed, requires – validation research to justify test use. Drawing on advances in test validity theory (Kane, 2006, 2013), this study builds a validity argument for the listening component of the Test de connaissance du français (TCF) in the context of Quebec immigration. Test validity theory has evolved considerably since the traditional tripartite model of content, predictive, and construct validity (Cronbach & Meehl, 1955): it has since been conceptualized as a unitary construct (Messick, 1989) and, more recently, theorized in terms of argumentation (Kane, 2006, 2013), borrowing concepts from models of inference (Toulmin, 1958/2003) that include scoring, generalization, explanation, extrapolation, and decision inferences, each of which plays a key role in a validity argument. In an argument-based approach to validity, claims about testing instruments rest on warrants that must be supported by backings in the form of empirical studies; these backings are foundational for the claims and also support the inferences that authorize each claim in the argument.
More specifically, this study gathered empirical evidence to support the scoring, generalization, and explanation inferences, posing three research questions that addressed construct representation, potential bias, and test method usefulness. The questions concerned the listening subskills that the TCF assesses, differential item functioning (DIF) across gender, first language, age, and geographical location, and the option functioning of multiple-choice (MC) items in the assessment of second language listening comprehension. Although many statistical and measurement models are readily available for analyzing test response data, this study used confirmatory factor analysis (CFA) to examine the listening subskills operationalized in the TCF, specifying the models following suggestions from a panel of experts. The unidimensional Rasch model was used to generate item difficulty parameters across the subgroups of interest for the DIF analyses, and the nominal response model (NRM) was used to model the response options of the MC items.
The results from these three studies yielded backings for each of the selected inferences in the validity argument for the TCF. Based on the CFA models recommended by the panel of experts, the results suggest that the test forms under study primarily assess examinees’ understanding of explicitly stated information in aural discourse, thereby underrepresenting the listening construct. A few items were found to target the ability to infer implicit ideas and understanding of the general topic or main idea; however, this latter subskill appeared in only one test form, suggesting that the forms are not equivalent. The DIF analysis flagged multiple items across test forms and between the subgroups of interest, but very few could be associated with a potential source of bias, such as speech perception, literary genre, or vocabulary familiarity. Because many items flagged for DIF could not be linked to a potential bias, this question was only partially answered, which attenuates the validity argument. The results from the NRM suggest that most items functioned well, while a few were potentially double-keyed. The argument-based approach to validity proved helpful in assembling empirical evidence into a coherent whole to support and build a case for the interpretation and uses of the TCF in the context of immigration; this evidence can in turn be used to address the identified weaknesses, providing a means to attenuate the potential rebuttals that threaten the validity of the argument. Some caveats of the validation framework are also outlined, relating to the accessibility of data needed to address the extrapolation and decision inferences in immigration contexts, but as Newton and Shaw (2014, p. 142) put it, “the argument-based approach underlies the fact that validation is not simply a one-off study but a program: potentially a very intensive program”. Such a program can include key stakeholders, such as government officials, who can help complete the validity argument for the TCF in Quebec immigration.
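
To illustrate the kind of DIF screening the abstract describes, the sketch below is a minimal, hypothetical Python example; it is not the study’s code, and the data, subgroup split, and 0.5-logit flagging threshold are assumptions made here for illustration. It calibrates Rasch item difficulties separately for two examinee subgroups and flags items whose between-group difficulty gap exceeds the threshold.

```python
import numpy as np


def rasch_difficulties(responses, n_iter=200, lr=1.0):
    """Rough Rasch (1PL) calibration via alternating mean-gradient steps.

    responses: (n_persons, n_items) matrix of 0/1 scores.
    Returns item difficulties in logits, centred at zero.
    A sketch only, not a production calibration routine.
    """
    n_persons, n_items = responses.shape
    theta = np.zeros(n_persons)   # person abilities
    beta = np.zeros(n_items)      # item difficulties
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
        theta += lr * (responses - p).mean(axis=1)   # ascend person likelihood
        beta -= lr * (responses - p).mean(axis=0)    # ascend item likelihood
        beta -= beta.mean()                          # centre items to fix the scale
    return beta


def flag_dif(group_a, group_b, threshold=0.5):
    """Flag items whose difficulty differs between groups by more than `threshold` logits."""
    gap = rasch_difficulties(group_a) - rasch_difficulties(group_b)
    return [(item, round(float(g), 2)) for item, g in enumerate(gap) if abs(g) > threshold]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated 0/1 responses for two hypothetical subgroups (e.g., split by gender or L1);
    # no DIF is built in, so the flagged list should normally be empty.
    group_a = (rng.random((500, 30)) < 0.6).astype(float)
    group_b = (rng.random((480, 30)) < 0.6).astype(float)
    print("Items flagged for DIF:", flag_dif(group_a, group_b))
```

An operational analysis would rely on established IRT software, properly calibrated samples, and significance tests or effect-size classifications rather than a raw logit gap, but comparing subgroup-specific item difficulties is the core idea.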

    Advancing Human Assessment: The Methodological, Psychological and Policy Contributions of ETS

    Get PDF
    This book describes the extensive contributions made toward the advancement of human assessment by scientists from one of the world’s leading research institutions, Educational Testing Service. The book’s four major sections detail research and development in measurement and statistics, education policy analysis and evaluation, scientific psychology, and validity. Many of the developments presented have become de facto standards in educational and psychological measurement, including in item response theory (IRT), linking and equating, differential item functioning (DIF), and educational surveys such as the National Assessment of Educational Progress (NAEP), the Programme for International Student Assessment (PISA), the Progress in International Reading Literacy Study (PIRLS), and the Trends in International Mathematics and Science Study (TIMSS). In addition to its comprehensive coverage of contributions to the theory and methodology of educational and psychological measurement and statistics, the book gives significant attention to ETS work in cognitive, personality, developmental, and social psychology, and to education policy analysis and program evaluation. The chapter authors are long-standing experts who provide broad coverage and thoughtful insights that build upon decades of experience in research and best practices for measurement, evaluation, scientific psychology, and education policy analysis. Opening with a chapter on the genesis of ETS and closing with a synthesis of the enormously diverse set of contributions made over its 70-year history, the book is a useful resource for all interested in the improvement of human assessment.

    Testing in the Professions

    Get PDF
    Testing in the Professions focuses on current practices in credentialing testing as a guide for practitioners. With a broad focus on the key components, issues, and concerns surrounding the test development and validation process, this book brings together a wide range of research and theory—from design and analysis of tests to security, scoring, and reporting. Written by leading experts in the field of measurement and assessment, each chapter includes authentic examples of how various practices are implemented and of current issues observed in credentialing programs. The volume begins with an exploration of the various types of credentialing programs as well as key differences in the interpretation and evaluation of test scores. The next set of chapters discusses key test development steps, including test design, content development, analysis, and evaluation. The final set of chapters addresses specific topics that span the testing process, including communication with stakeholders, security, program evaluation, and legal principles. As a response to the growing number of professions and professional designations that are tied to testing requirements, Testing in the Professions is a comprehensive source for up-to-date measurement and credentialing practices.

    Advancing Human Assessment: The Methodological, Psychological and Policy Contributions of ETS

    Get PDF
    Educational Testing Service (ETS); large-scale assessment; policy research; psychometrics; admissions test

    Assessing the Impact of Characteristics of the Test, Common-items, and Examinees on the Preservation of Equity Properties in Mixed-format Test Equating

    Get PDF
    Preservation of equity properties was examined for four equating methods (IRT true score, IRT observed score, frequency estimation, and chained equipercentile) in a mixed-format test under a common-item nonequivalent groups (CINEG) design. Equating of mixed-format tests under a CINEG design can be influenced by factors such as attributes of the test, the common-item set, and the examinees. In addition, unidimensionality may not hold because multiple item formats are included: different item formats can measure different latent constructs and thus produce a multidimensional test structure. The purpose of this study was to examine the impact of test structure (unidimensional versus within-item multidimensional, modeled through a bifactor model), differences in group ability distributions (equivalent versus nonequivalent), and characteristics of the common-item set (format representative versus non-representative) on each equating method’s ability to preserve equity properties. The major findings can be summarized as follows. IRT equating methods outperformed traditional equating methods in terms of equity preservation across all conditions. The traditional methods performed similarly when groups were equivalent; however, large discrepancies between them emerged as a direct function of increasing mean group ability differences. The IRT true score method was most successful in preserving First-Order Equity regardless of test structure. All methods preserved Second-Order Equity similarly under unidimensional test structures, while the IRT true score method was superior to all other equating methods in terms of Second-Order Equity when the test structure was multidimensional. The methods produced similar results for the Same Distribution property when the groups were equivalent, and the IRT observed score method preserved it best when mean group ability differences increased, regardless of the underlying test structure. Lower equity indices were observed when the common-item set was representative of the total test, particularly when group differences were large, and similar patterns in the performance of the equating methods were observed regardless of the underlying test structure. These results are discussed in relation to the mixed-format test equating literature. Limitations of the current study are discussed and suggestions for future research are provided.
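
For reference, the equity criteria named in this abstract are conventionally stated as follows, where X is the old-form score, Y the new-form score, e_Y(X) the equated score, and θ the latent ability; this is the standard textbook formulation, not notation taken from the study itself.

```latex
% First-order equity: after equating, examinees at every ability level
% have the same expected score whichever form they take.
E\big[e_Y(X)\mid\theta\big] = E\big[Y\mid\theta\big] \qquad \text{for all } \theta

% Second-order equity: conditional measurement precision is also preserved.
\mathrm{SD}\big[e_Y(X)\mid\theta\big] = \mathrm{SD}\big[Y\mid\theta\big] \qquad \text{for all } \theta
```

The Same Distribution property referenced above is usually understood as the related requirement that the distribution of equated scores match the distribution of reference-form scores in the target population, rather than matching conditionally on θ.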

    A comparison of traditional test blueprinting and item development to assessment engineering in a licensure context

    Get PDF
    With the need for larger and larger banks of items to support adaptive testing and to meet security concerns, large-scale item generation is a requirement for many certification and licensure programs. As part of the mass production of items, it is critical that the difficulty and discrimination of the items be known without the need for pretesting. One approach to meeting this need is item templating, an assessment engineering (AE) approach intended to control item difficulty and other psychometric operating characteristics for a class of items developed from each template. Important advantages can accrue from having exchangeable items that operate in a psychometrically similar manner, in terms of item bank development (reduced time and lower cost), pretesting efficiency, test security, and so forth. This study describes one method of using AE and item templates in a licensure context to yield sets of items with statistical characteristics that match the needs of the program while reducing the need for pilot testing. It is shown that item variants developed in this way fit the Rasch calibration/scoring model as well as, if not better than, items developed in traditional ways, and that item variants from the same template yield similar classical and IRT statistics. One key result of the study is a method of using AE to evaluate the performance of item writers over time.
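
As a rough illustration of how such a fit comparison could be carried out, the following hypothetical Python sketch computes the standard Rasch infit and outfit mean-square statistics per item and summarizes the share of acceptably fitting items among template variants versus traditionally written items. The 0.7–1.3 acceptance band is a common rule of thumb and, like the function and variable names, is an assumption rather than something reported in the abstract.

```python
import numpy as np


def rasch_item_fit(responses, theta, beta):
    """Infit and outfit mean-square statistics for each item under the Rasch model.

    responses: (n_persons, n_items) matrix of 0/1 scores
    theta:     (n_persons,) estimated person abilities (logits)
    beta:      (n_items,)   estimated item difficulties (logits)
    """
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    resid_sq = (responses - p) ** 2                      # squared residuals
    info = p * (1.0 - p)                                 # model variance of each response
    outfit = (resid_sq / info).mean(axis=0)              # unweighted, outlier-sensitive
    infit = resid_sq.sum(axis=0) / info.sum(axis=0)      # information-weighted
    return infit, outfit


def fit_rate_by_item_type(responses, theta, beta, is_template_variant, lo=0.7, hi=1.3):
    """Share of items inside the acceptance band, for template variants vs. traditional items."""
    infit, outfit = rasch_item_fit(responses, theta, beta)
    ok = (infit > lo) & (infit < hi) & (outfit > lo) & (outfit < hi)
    mask = np.asarray(is_template_variant, dtype=bool)
    return float(ok[mask].mean()), float(ok[~mask].mean())
```

In practice, the person and item parameters passed in would come from an operational Rasch calibration, and classical item statistics such as proportion correct and item-total correlations would be compared alongside these IRT-based fit indices.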

    Theoretical and Practical Advances in Computer-based Educational Measurement

    Get PDF
    This open access book presents a large number of innovations in the world of operational testing. It brings together different but related areas and provides insight into their possibilities, advantages, and drawbacks. The book addresses not only improvements in the quality of educational measurement and innovations in (inter)national large-scale assessments, but also several advances in psychometrics and improvements in computerized adaptive testing, and it offers examples of the impact of new technology on assessment. Given its scope, the book will appeal to a broad audience within the educational measurement community. It contributes to theoretical knowledge and also pays attention to the practical implementation of innovations in testing technology.