An Examination of Parameter Recovery Using Different Multiple Matrix Booklet Designs
Educational large-scale assessments examine students' achievement in various content
domains and thus provide key findings to inform educational research and evidence-based
educational policies. To this end, large-scale assessments involve hundreds of items to test
students' achievement in these domains. Administering all of these items to individual
students would overburden them, reduce participation rates, and consume too much time and
too many resources. Hence, multiple matrix sampling is used, in which the test items are distributed across
various test forms called "booklets", and each student is administered a booklet containing a
subset of items that can sensibly be answered within the allotted test timeframe. However,
there are numerous possibilities as to how these booklets can be designed, and the manner of booklet design can influence the precision of parameter recovery at both the global and subpopulation levels. One popular booklet design with many desirable characteristics is the
Balanced Incomplete 7-Block, or Youden squares, design. Extensions of this booklet design
are used in many large-scale assessments such as TIMSS and PISA. This doctoral project
examines the degree to which item and population parameters are recovered in real and
simulated data in relation to matrix sparseness, when using various balanced incomplete
block booklet designs. To this end, key factors (e.g., number of items, number of persons,
number of items per person, and the match between the distributions of item and person
parameters) are experimentally manipulated to learn how these factors affect the precision
with which these designs recover true population parameters. In doing so, the project expands
the empirical knowledge base on the statistical properties of booklet designs, which in turn
could help improve the design of future large-scale studies.
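The balance properties that make the 7-block design attractive can be illustrated with a small sketch. This is a hypothetical example (the cluster numbering and the choice of Python are illustrative assumptions, not taken from the project): it builds the classical 7-booklet, 7-cluster design with 3 clusters per booklet and verifies that every cluster appears equally often and every pair of clusters co-occurs in exactly one booklet, which is what allows all items to be linked on a common scale.

```python
from itertools import combinations
from collections import Counter

# Hypothetical illustration of a balanced incomplete block design:
# 7 booklets, 7 item clusters, 3 clusters per booklet (a Youden square layout).
booklets = [
    (1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 7),
    (5, 6, 1), (6, 7, 2), (7, 1, 3),
]

# Every cluster appears in exactly r = 3 booklets ...
cluster_counts = Counter(c for b in booklets for c in b)
assert all(n == 3 for n in cluster_counts.values())

# ... and every pair of clusters co-occurs in exactly one booklet (lambda = 1),
# so responses from different booklets can be linked via shared clusters.
pair_counts = Counter(
    pair for b in booklets for pair in combinations(sorted(b), 2)
)
assert all(n == 1 for n in pair_counts.values())
assert len(pair_counts) == 21  # all C(7,2) cluster pairs are covered
```

Each student thus answers only 3/7 of the item pool, while the design as a whole still covers every item and every item-pair linkage.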
Generally, the results show that for a typical large-scale assessment (with a sample size of at
least 3,000 students and more than 100 test items), population and item parameters are recovered accurately and without bias in the various multi-matrix booklet designs. This is
true both at the global population level and at the subgroup or sub-population levels. Further,
for such a large-scale assessment, the match between the distribution of person abilities and
the distribution of item difficulties is found to have no meaningful effect on the precision
with which person and item parameters are recovered, when using these multi-matrix booklet
designs.
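A parameter-recovery study of this kind can be sketched in miniature. The sketch below is an illustrative assumption throughout (the sample sizes echo the text, but the 7-block assignment and the crude logit calibration stand in for the project's actual estimation methods): it simulates Rasch responses, removes data by design so each student sees only 3 of 7 item clusters, and checks that the recovered item difficulties track the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 3000, 105  # sizes in line with a typical large-scale assessment

theta = rng.normal(0, 1, n_persons)  # true person abilities
b = rng.normal(0, 1, n_items)        # true item difficulties

# Rasch model: P(correct) = sigmoid(theta - b); generate a full response matrix
p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
full = rng.random((n_persons, n_items)) < p

# Missing by design: each person sees 3 of 7 item clusters (7-block design),
# so 4/7 of the matrix is structurally missing.
clusters = np.array_split(np.arange(n_items), 7)
booklets = [(1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 7),
            (5, 6, 1), (6, 7, 2), (7, 1, 3)]
seen = np.zeros((n_persons, n_items), dtype=bool)
for i in range(n_persons):
    for c in booklets[i % 7]:
        seen[i, clusters[c - 1]] = True

# Crude difficulty estimate from observed proportions correct; a logit
# transform stands in here for a full IRT calibration.
observed = np.where(seen, full, np.nan)
p_hat = np.nanmean(observed, axis=0)
b_hat = -np.log(p_hat / (1 - p_hat))

# Despite the sparseness, recovered difficulties correlate highly with truth
print(round(float(np.corrcoef(b, b_hat)[0, 1]), 2))
```

With 3,000 simulated students, roughly 1,286 responses remain per item, which is why the design-induced sparseness costs little precision.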
These results give further support to the use of multi-matrix booklet designs as a reliable test
abridgment technique in large-scale assessments, and for accurate measurement of
performance gaps between policy-relevant subgroups within populations. However, item position effects were not fully considered, and different results are possible if similar studies
are performed (a) with conditions involving items that poorly measure student abilities (e.g.,
with students having skewed ability distributions); or, (b) simulating conditions where there
is a lot of missing data because of non-response, instead of just missing by design. This
should be further investigated in future studies.
The assessment of students' achievement in various domains through large-scale assessments provides important insights for educational research and evidence-based educational policy. However, testing achievement in many content areas always requires the use of hundreds of items. If every single student were presented with all test items, this would place too great a burden on the students, who would consequently be less motivated to complete all tasks. Moreover, administering all items to the entire sample would be very time- and resource-intensive. For these reasons, large-scale assessments often rely on a multi-matrix design in which different test-booklet versions (so-called booklets) are randomly assigned to the test takers. These booklets do not contain all items but only a subset of the item pool, with only some of the items overlapping between the different booklets. This ensures that students can complete all items presented to them within the allotted testing time. However, there are numerous variants of how these booklets can be assembled, and the particular booklet design in turn affects the precision of parameter estimation at the population and subpopulation levels. One well-established booklet design is the Balanced-Incomplete-7-Block design, also called the Youden squares design, which is used in different forms in many large-scale assessments such as TIMSS and PISA. The present thesis uses both real and simulated data to examine the precision with which item and person parameters can be estimated under various balanced incomplete block designs, as a function of the proportion of values missing by design. To this end, various design parameters were varied (e.g., number of items, sample size, number of items per booklet, degree of match between item and person parameters), and it was then analyzed how these factors affect the precision with which population parameters are estimated. The thesis thus aims to expand the empirical knowledge of the statistical properties of booklet designs, thereby contributing to the improvement of future large-scale assessments.
The results of the present thesis showed that, for a typical large-scale assessment (with a sample size of at least 3,000 students and at least 100 items), person and item parameters were estimated precisely at both the population and subpopulation levels under all variants of the balanced incomplete block design that were used. It was further shown that, for samples of at least 3,000 students, the match between the achievement distribution and the distribution of item difficulties had no meaningful influence on the precision with which the various booklet designs estimated person and item parameters.
These results corroborate that, using multi-matrix designs, policy-relevant achievement differences between groups of students in the population can be estimated reliably and precisely. One limitation of the present study is that item position effects were not comprehensively considered. It therefore cannot be ruled out that the results would differ if (a) items were used that measure students' achievement poorly (e.g., given a skewed distribution of achievement scores), or (b) high proportions of missing values were present that were not produced by the multi-matrix design. This should be investigated in future studies.
A governance framework for algorithmic accountability and transparency
Algorithmic systems are increasingly being used as part of decision-making processes in both the public and private sectors, with potentially significant consequences for individuals, organisations and societies as a whole. Algorithmic systems in this context refer to the combination of algorithms, data and the interface process that together determine the outcomes that affect end users. Many types of decisions can be made faster and more efficiently using algorithms. A significant factor in the adoption of algorithmic systems for decision-making is their capacity to process large amounts of varied data sets (i.e. big data), which can be paired with machine learning methods in order to infer statistical models directly from the data. The same properties of scale, complexity and autonomous model inference however are linked to increasing concerns that many of these systems are opaque to the people affected by their use and lack clear explanations for the decisions they make. This lack of transparency risks undermining meaningful scrutiny and accountability, which is a significant concern when these systems are applied as part of decision-making processes that can have a considerable impact on people's human rights (e.g. critical safety decisions in autonomous vehicles; allocation of health and social service resources, etc.). This study develops policy options for the governance of algorithmic transparency and accountability, based on an analysis of the social, technical and regulatory challenges posed by algorithmic systems. Based on a review and analysis of existing proposals for governance of algorithmic systems, a set of four policy options are proposed, each of which addresses a different aspect of algorithmic transparency and accountability: 1. awareness raising: education, watchdogs and whistleblowers; 2. accountability in public-sector use of algorithmic decision-making; 3. regulatory oversight and legal liability; and 4. 
global coordination for algorithmic governance.
Determination and evaluation of clinically efficient stopping criteria for the multiple auditory steady-state response technique
Background: Although the auditory steady-state response (ASSR) technique utilizes objective statistical detection algorithms to estimate behavioural hearing thresholds, the audiologist still has to decide when to terminate ASSR recordings, introducing once more a certain degree of subjectivity.
Aims: The present study aimed at establishing clinically efficient stopping criteria for a multiple 80-Hz ASSR system.
Methods: In Experiment 1, data from 31 normal-hearing subjects were analyzed off-line to propose stopping rules. Consequently, ASSR recordings are stopped when (1) all 8 responses reach significance and significance can be maintained for 8 consecutive sweeps; (2) the mean noise levels are ≤4 nV (if, at this "≤4-nV" criterion, p-values are between 0.05 and 0.1, measurements are extended only once by 8 sweeps); and (3) a maximum of 48 sweeps is attained. In Experiment 2, these stopping criteria were applied to 10 normal-hearing and 10 hearing-impaired adults to assess their efficiency.
Results: The application of these stopping rules resulted in ASSR threshold values that were comparable to other multiple-ASSR research with normal-hearing and hearing-impaired adults. Furthermore, in 80% of the cases, ASSR thresholds could be obtained within a time-frame of 1 hour. Investigating the significant response amplitudes of the hearing-impaired adults through cumulative curves indicated that a noise-stop criterion higher than "≤4 nV" can probably be used.
Conclusions: The proposed stopping rules can be used in adults to determine accurate ASSR thresholds within an acceptable time-frame of about 1 hour. However, additional research with infants, and with adults with varying degrees and configurations of hearing loss, is needed to optimize these criteria.
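The three stopping rules can be sketched as a single decision function. This is a hedged illustration only: the function name, the per-sweep lists of p-values, and the rule ordering are assumptions, not the study's actual implementation.

```python
def should_stop(sweep, p_values_history, mean_noise_nv, extended=False):
    """Decide, after each sweep, whether to stop an 80-Hz multiple-ASSR recording.

    p_values_history: one list per completed sweep, holding the p-values of the
    8 simultaneously recorded responses (data structure is an assumption).
    extended: whether the one-time 8-sweep extension has already been used.
    """
    ALPHA, NOISE_STOP_NV, MAX_SWEEPS = 0.05, 4.0, 48

    # Rule 3: hard ceiling of 48 sweeps.
    if sweep >= MAX_SWEEPS:
        return True, "maximum of 48 sweeps reached"

    # Rule 1: all 8 responses significant, maintained over 8 consecutive sweeps.
    last8 = p_values_history[-8:]
    if len(last8) == 8 and all(all(p < ALPHA for p in ps) for ps in last8):
        return True, "all 8 responses significant for 8 consecutive sweeps"

    # Rule 2: mean noise level <= 4 nV. If any p-value sits between .05 and .1
    # at this point, extend once by 8 sweeps before stopping.
    if mean_noise_nv <= NOISE_STOP_NV:
        borderline = any(ALPHA <= p < 0.1 for p in p_values_history[-1])
        if borderline and not extended:
            return False, "borderline p-values: extend once by 8 sweeps"
        return True, "mean noise level <= 4 nV"

    return False, "continue recording"
```

For example, a recording with low noise but borderline p-values would be extended once, then stopped at the next evaluation.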
Reliability and validity of PROMIS measures administered by telephone interview in a longitudinal localized prostate cancer study
Purpose: To evaluate the reliability and validity of six PROMIS measures (anxiety, depression, fatigue, pain interference, physical function, and sleep disturbance) telephone-administered to a diverse, population-based cohort of localized prostate cancer patients. Methods: Newly diagnosed men were enrolled in the North Carolina Prostate Cancer Comparative Effectiveness and Survivorship Study. PROMIS measures were telephone-administered pre-treatment (baseline), and at 3 months and 12 months post-treatment initiation (N = 778). Reliability was evaluated using Cronbach's alpha. Dimensionality was examined with bifactor models and explained common variance (ECV). Ordinal logistic regression models were used to detect potential differential item functioning (DIF) for key demographic groups. Convergent and discriminant validity were assessed by correlations with the legacy instruments Memorial Anxiety Scale for Prostate Cancer and SF-12v2. Known-groups validity was examined by age, race/ethnicity, comorbidity, and treatment. Results: Each PROMIS measure had high Cronbach's alpha values (0.86–0.96) and was sufficiently unidimensional. Floor effects were observed for the anxiety, depression, and pain interference measures; ceiling effects were observed for physical function. No DIF was detected. Convergent validity was established with moderate to strong correlations between PROMIS and legacy measures (0.41–0.77) of similar constructs. Discriminant validity was demonstrated with weak correlations between measures of dissimilar domains (−0.20 to −0.31). PROMIS measures detected differences across age, race/ethnicity, and comorbidity groups; no differences were found by treatment. Conclusions: This study provides support for the reliability and construct validity of six PROMIS measures in prostate cancer, as well as the utility of telephone administration for assessing HRQoL in low-literacy and hard-to-reach populations
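Cronbach's alpha, the reliability coefficient reported above, can be computed with a few lines of code. The sketch below is illustrative only: the data are simulated (one latent trait plus noise), not the study's, and the function name is an assumption.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a persons x items matrix of scored responses.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Simulated example: five items each loading on a single latent trait,
# which yields high internal consistency (alpha near 0.9).
rng = np.random.default_rng(1)
trait = rng.normal(size=500)
data = trait[:, None] + rng.normal(scale=0.7, size=(500, 5))
print(round(cronbach_alpha(data), 2))
```

Values in the 0.86–0.96 range reported for the PROMIS measures indicate that the items within each measure vary together tightly relative to the total-score variance.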
The Use of ICT for the Assessment of Key Competences
This report assesses current trends in the area of ICT for learning and assessment in view of their value for supporting the assessment of Key Competences. Based on an extensive review of the literature, it provides an overview of current ICT-enabled assessment practices, with a particular focus on more recent developments that support the holistic assessment of Key Competences for Lifelong Learning in Europe. The report presents a number of relevant cases, discusses the potential of emerging technologies, and addresses innovation and policy issues for eAssessment. It covers both summative and formative assessment and examines how ICT can leverage the potential of more innovative assessment formats, such as peer assessment and portfolio assessment, and how more recent technological developments, such as Learning Analytics, could in the future foster assessment for learning. Reflecting on the use of different ICT tools and services for each of the eight Key Competences for Lifelong Learning, it derives policy options for further exploiting the potential of ICT for competence-based assessment.