
    An Examination of Parameter Recovery Using Different Multiple Matrix Booklet Designs

    Educational large-scale assessments examine students’ achievement in various content domains and thus provide key findings to inform educational research and evidence-based educational policies. To this end, large-scale assessments involve hundreds of items to test students’ achievement across these domains. Administering all of these items to every student would overburden them, reduce participation rates, and consume too much time and too many resources. Hence, multiple matrix sampling is used, in which the test items are distributed across various test forms called “booklets”, and each student is administered one booklet, containing a subset of items that can sensibly be answered within the allotted test timeframe. However, there are numerous possibilities as to how these booklets can be designed, and the choice of booklet design can influence the precision of parameter recovery at both the global and subpopulation levels. One popular booklet design with many desirable characteristics is the Balanced Incomplete 7-Block, or Youden squares, design. Extensions of this design are used in many large-scale assessments such as TIMSS and PISA. This doctoral project examines the degree to which item and population parameters are recovered in real and simulated data, in relation to matrix sparseness, when various balanced incomplete block booklet designs are used. To this end, key factors (e.g., number of items, number of persons, number of items per person, and the match between the distributions of item and person parameters) are experimentally manipulated to learn how they affect the precision with which these designs recover true population parameters. In doing so, the project expands the empirical knowledge base on the statistical properties of booklet designs, which in turn could help improve the design of future large-scale studies.
Generally, the results show that for a typical large-scale assessment (with a sample size of at least 3,000 students and more than 100 test items), population and item parameters are recovered accurately and without bias under the various multi-matrix booklet designs. This holds both at the global population level and at the subgroup (sub-population) level. Further, for such a large-scale assessment, the match between the distribution of person abilities and the distribution of item difficulties is found to have a negligible effect on the precision with which person and item parameters are recovered under these designs. These results lend further support to the use of multi-matrix booklet designs as a reliable test abridgment technique in large-scale assessments, and for the accurate measurement of performance gaps between policy-relevant subgroups within populations. However, item position effects were not fully considered, and different results are possible if similar studies are performed (a) under conditions where items poorly measure student abilities (e.g., with skewed ability distributions), or (b) under conditions with large amounts of missing data caused by non-response rather than missingness by design. This should be investigated in future studies.
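The Balanced Incomplete 7-Block (Youden squares) design named above can be illustrated with a minimal sketch. The block-to-booklet assignment below is the classic (7, 3, 1) balanced incomplete block design generated from the difference set {1, 2, 4}; the block and booklet labels are illustrative, not taken from TIMSS, PISA, or the dissertation itself.

```python
from itertools import combinations

# Hypothetical 7-booklet assignment: 7 item blocks, 3 blocks per booklet,
# every block in 3 booklets, every pair of blocks together exactly once.
BOOKLETS = [
    (1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 7),
    (5, 6, 1), (6, 7, 2), (7, 1, 3),
]

def check_bibd(booklets, n_blocks=7, blocks_per_booklet=3, pair_lambda=1):
    """Verify the defining properties of a balanced incomplete block design."""
    # Every booklet holds the same number of blocks.
    assert all(len(b) == blocks_per_booklet for b in booklets)
    # Every block appears equally often across booklets.
    counts = {blk: sum(blk in b for b in booklets)
              for blk in range(1, n_blocks + 1)}
    assert len(set(counts.values())) == 1
    # Every pair of blocks co-occurs in exactly `pair_lambda` booklets,
    # which is what makes block-position and pairing effects balanced.
    for pair in combinations(range(1, n_blocks + 1), 2):
        assert sum(set(pair) <= set(b) for b in booklets) == pair_lambda
    return counts

print(check_bibd(BOOKLETS))  # each block appears in exactly 3 booklets
```

The balance conditions checked here are exactly what lets item parameters be linked across booklets despite each student seeing only 3 of the 7 blocks.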

    A governance framework for algorithmic accountability and transparency

    Algorithmic systems are increasingly being used as part of decision-making processes in both the public and private sectors, with potentially significant consequences for individuals, organisations and societies as a whole. Algorithmic systems in this context refer to the combination of algorithms, data and the interface process that together determine the outcomes that affect end users. Many types of decisions can be made faster and more efficiently using algorithms. A significant factor in the adoption of algorithmic systems for decision-making is their capacity to process large amounts of varied data sets (i.e. big data), which can be paired with machine learning methods in order to infer statistical models directly from the data. The same properties of scale, complexity and autonomous model inference, however, are linked to increasing concerns that many of these systems are opaque to the people affected by their use and lack clear explanations for the decisions they make. This lack of transparency risks undermining meaningful scrutiny and accountability, which is a significant concern when these systems are applied as part of decision-making processes that can have a considerable impact on people's human rights (e.g. critical safety decisions in autonomous vehicles; allocation of health and social service resources, etc.). This study develops policy options for the governance of algorithmic transparency and accountability, based on an analysis of the social, technical and regulatory challenges posed by algorithmic systems. Based on a review and analysis of existing proposals for governance of algorithmic systems, a set of four policy options is proposed, each of which addresses a different aspect of algorithmic transparency and accountability: 1. awareness raising: education, watchdogs and whistleblowers; 2. accountability in public-sector use of algorithmic decision-making; 3. regulatory oversight and legal liability; and 4. global coordination for algorithmic governance.

    Determination and evaluation of clinically efficient stopping criteria for the multiple auditory steady-state response technique

    Background: Although the auditory steady-state response (ASSR) technique utilizes objective statistical detection algorithms to estimate behavioural hearing thresholds, the audiologist still has to decide when to terminate ASSR recordings, which reintroduces a certain degree of subjectivity. Aims: The present study aimed at establishing clinically efficient stopping criteria for a multiple 80-Hz ASSR system. Methods: In Experiment 1, data from 31 normal-hearing subjects were analyzed off-line to propose stopping rules. Accordingly, ASSR recordings are stopped when (1) all 8 responses reach significance and remain significant for 8 consecutive sweeps; (2) the mean noise level is ≤ 4 nV (if, at this “≤ 4 nV” criterion, p-values are between 0.05 and 0.1, the measurement is extended only once by 8 sweeps); or (3) a maximum of 48 sweeps is reached. In Experiment 2, these stopping criteria were applied to 10 normal-hearing and 10 hearing-impaired adults to assess their efficiency. Results: Applying these stopping rules yielded ASSR thresholds comparable to those reported in other multiple-ASSR research with normal-hearing and hearing-impaired adults. Furthermore, in 80% of the cases, ASSR thresholds could be obtained within a timeframe of 1 hour. Investigating the significant response amplitudes of the hearing-impaired adults through cumulative curves indicated that a noise-stop criterion higher than “≤ 4 nV” can probably be used. Conclusions: The proposed stopping rules can be used in adults to determine accurate ASSR thresholds within an acceptable timeframe of about 1 hour. However, additional research with infants and with adults having varying degrees and configurations of hearing loss is needed to optimize these criteria.
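The three stopping rules can be sketched as a single decision function. The function signature, variable names, and data structures below are assumptions made for illustration; the abstract does not describe the study's actual recording software.

```python
# Hypothetical sketch of the three ASSR stopping rules described above.
MAX_SWEEPS = 48          # rule 3: hard ceiling on recording length
NOISE_STOP_NV = 4.0      # rule 2: mean-noise stopping threshold in nV
CONSECUTIVE_SIG = 8      # rule 1: required run of all-significant sweeps

def should_stop(sweep, sig_history, mean_noise_nv, p_values, extended):
    """Return (stop, extended) for a multiple-ASSR recording.

    sweep         -- number of sweeps recorded so far
    sig_history   -- per-sweep flags: True if all 8 responses were significant
    mean_noise_nv -- current mean noise level in nV
    p_values      -- current p-value for each of the 8 responses
    extended      -- True if the one-time 8-sweep extension was already used
    """
    # Rule 3: never record beyond the sweep ceiling.
    if sweep >= MAX_SWEEPS:
        return True, extended
    # Rule 1: all responses significant, maintained for 8 consecutive sweeps.
    if len(sig_history) >= CONSECUTIVE_SIG and all(sig_history[-CONSECUTIVE_SIG:]):
        return True, extended
    # Rule 2: noise floor reached; extend once by 8 sweeps if any response
    # is borderline (0.05 < p < 0.1), otherwise stop.
    if mean_noise_nv <= NOISE_STOP_NV:
        borderline = any(0.05 < p < 0.1 for p in p_values)
        if borderline and not extended:
            return False, True
        return True, extended
    return False, extended
```

For example, a recording at 3 nV mean noise with one borderline p-value continues for one 8-sweep extension, then stops regardless of the outcome.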

    11th Annual Undergraduate Research Symposium


    Reliability and validity of PROMIS measures administered by telephone interview in a longitudinal localized prostate cancer study

    Purpose: To evaluate the reliability and validity of six PROMIS measures (anxiety, depression, fatigue, pain interference, physical function, and sleep disturbance) administered by telephone to a diverse, population-based cohort of localized prostate cancer patients. Methods: Newly diagnosed men were enrolled in the North Carolina Prostate Cancer Comparative Effectiveness and Survivorship Study. PROMIS measures were telephone-administered pre-treatment (baseline) and at 3 and 12 months post-treatment initiation (N = 778). Reliability was evaluated using Cronbach’s alpha. Dimensionality was examined with bifactor models and explained common variance (ECV). Ordinal logistic regression models were used to detect potential differential item functioning (DIF) for key demographic groups. Convergent and discriminant validity were assessed by correlations with the legacy instruments Memorial Anxiety Scale for Prostate Cancer and SF-12v2. Known-groups validity was examined by age, race/ethnicity, comorbidity, and treatment. Results: Each PROMIS measure had high Cronbach’s alpha values (0.86–0.96) and was sufficiently unidimensional. Floor effects were observed for the anxiety, depression, and pain interference measures; ceiling effects were observed for physical function. No DIF was detected. Convergent validity was established with moderate to strong correlations between PROMIS and legacy measures of similar constructs (0.41–0.77). Discriminant validity was demonstrated by weak correlations between measures of dissimilar domains (−0.20 to −0.31). PROMIS measures detected differences across age, race/ethnicity, and comorbidity groups; no differences were found by treatment. Conclusions: This study provides support for the reliability and construct validity of six PROMIS measures in prostate cancer, as well as the utility of telephone administration for assessing HRQoL in low-literacy and hard-to-reach populations.
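Cronbach's alpha, the reliability index reported above, is straightforward to compute from an items-by-respondents score matrix. The sketch below uses a toy response matrix invented for illustration; it is not data from the study.

```python
# Minimal sketch of Cronbach's alpha:
#   alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)
def cronbach_alpha(items):
    """items: list of per-item score lists (same respondents in each)."""
    k = len(items)           # number of items
    n = len(items[0])        # number of respondents

    def var(xs):             # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Total score for each respondent across all items.
    totals = [sum(item[j] for item in items) for j in range(n)]
    return (k / (k - 1)) * (1 - sum(var(i) for i in items) / var(totals))

responses = [  # toy matrix: 3 items x 5 respondents, Likert-style scores
    [4, 3, 5, 2, 4],
    [4, 2, 5, 3, 4],
    [3, 3, 4, 2, 5],
]
print(round(cronbach_alpha(responses), 3))  # -> 0.871
```

Values in the 0.86 to 0.96 range reported for the PROMIS measures indicate high internal consistency, i.e. respondents answer the items of a scale in a strongly correlated way.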

    Requires improvement: urgent change for 11–16 education


    Others' publications about EHDI


    The Use of ICT for the Assessment of Key Competences

    This report assesses current trends in the area of ICT for learning and assessment in view of their value for supporting the assessment of Key Competences. Based on an extensive review of the literature, it provides an overview of current ICT-enabled assessment practices, with a particular focus on more recent developments that support the holistic assessment of Key Competences for Lifelong Learning in Europe. The report presents a number of relevant cases, discusses the potential of emerging technologies, and addresses innovation and policy issues for eAssessment. It covers both summative and formative assessment, considering how ICT can leverage the potential of more innovative assessment formats, such as peer assessment and portfolio assessment, and how more recent technological developments, such as Learning Analytics, could in the future foster assessment for learning. Reflecting on the use of different ICT tools and services for each of the eight Key Competences for Lifelong Learning, the report derives policy options for further exploiting the potential of ICT for competence-based assessment.