    A Study of Chance-Corrected Agreement Coefficients for the Measurement of Multi-Rater Consistency

    Chance-corrected agreement coefficients such as the Cohen and Fleiss Kappas are commonly used for the measurement of consistency in the decisions made by clinical observers or raters. However, the way that they estimate the probability of agreement (Pe) or cost of disagreement (De) 'by chance' has been strongly questioned, and alternatives have been proposed, such as the Aickin Alpha coefficient and the Gwet AC1 and AC2 coefficients. A well-known paradox illustrates deficiencies of the Kappa coefficients which may be remedied by scaling Pe or De according to the uniformity of the scoring. The AC1 and AC2 coefficients result from the application of this scaling to the Brennan-Prediger coefficient, which may be considered a simplified form of Kappa. This paper examines some commonly used multi-rater agreement coefficients, including AC1 and AC2. It then proposes an alternative subject-by-subject scaling approach that may be applied to weighted and unweighted multi-rater Cohen and Fleiss Kappas and also to Intra-Class Correlation (ICC) coefficients.
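
    The two-rater forms of these coefficients differ only in how the chance term Pe is estimated. The following is a minimal sketch, not taken from the paper, assuming two raters and a nominal scale; the toy data are hypothetical and chosen to reproduce the well-known high-agreement/low-Kappa paradox mentioned above.

        # Chance-corrected agreement for two raters: Cohen's Kappa, Brennan-Prediger, Gwet's AC1.
        # Illustrative sketch only; data and category labels are hypothetical.
        from collections import Counter

        def chance_corrected(ratings_a, ratings_b, categories):
            n = len(ratings_a)
            q = len(categories)
            po = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n   # observed agreement
            pa, pb = Counter(ratings_a), Counter(ratings_b)
            # Cohen: chance agreement from the product of the raters' marginal proportions
            pe_cohen = sum((pa[k] / n) * (pb[k] / n) for k in categories)
            # Brennan-Prediger: uniform chance agreement of 1/q
            pe_bp = 1.0 / q
            # Gwet AC1: chance term shrinks as the scoring becomes less uniform
            pi = {k: (pa[k] + pb[k]) / (2 * n) for k in categories}
            pe_ac1 = sum(p * (1 - p) for p in pi.values()) / (q - 1)

            def coef(pe):
                return (po - pe) / (1 - pe)

            return {"kappa": coef(pe_cohen), "BP": coef(pe_bp), "AC1": coef(pe_ac1)}

        # 90% raw agreement, yet Kappa turns slightly negative while BP and AC1 stay high
        a = ["yes"] * 18 + ["no", "yes"]
        b = ["yes"] * 18 + ["yes", "no"]
        print(chance_corrected(a, b, ["yes", "no"]))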

    A review of agreement measure as a subset of association measure between raters

    Agreement can be regarded as a special case of association, and not the other way round. In virtually all life-science and social-science research, subjects are classified into categories by raters, interviewers or observers, and both association and agreement measures can be obtained from the results of such studies. The distinction between association and agreement for a given data set is that, for two responses to be perfectly associated, we require that the category of one response can be predicted from the category of the other, while for two responses to agree, they must fall into the identical category. Hence, once there is agreement between two responses, association already exists; however, strong association may exist between two responses without any strong agreement. Many approaches have been proposed by various authors for measuring each of these quantities. In this work, we present some up-to-date developments on these measures and their statistics.
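
    The distinction can be made concrete with a minimal sketch (hypothetical 3-category data, not from the review): rater B's category is perfectly predictable from rater A's, so association is perfect, yet the two raters never agree.

        # Perfect association without agreement: B is a deterministic shift of A.
        a = [1, 2, 3, 1, 2, 3, 1, 2]
        b = [x % 3 + 1 for x in a]                        # 1 -> 2, 2 -> 3, 3 -> 1
        agreement = sum(x == y for x, y in zip(a, b)) / len(a)
        print(agreement)                                  # 0.0 despite perfect predictability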

    Reliability Evidence for the NC Teacher Evaluation Process Using a Variety of Indicators of Inter-Rater Agreement

    In this study, various statistical indexes of agreement were calculated using empirical data from a group of evaluators (n = 45) of early childhood teachers. The group of evaluators rated ten fictitious teacher profiles using the North Carolina Teacher Evaluation Process (NCTEP) rubric. The exact and adjacent agreement percentages were calculated for the group of evaluators. Kappa, weighted Kappa, Gwet’s AC1, Gwet’s AC2, and ICCs were used to interpret the level of agreement between the group of raters and a panel of expert raters. Similar to previous studies, Kappa statistics were low in the presence of high levels of agreement. Weighted Kappa and Gwet’s AC1 were less conservative than Kappa values. Gwet’s AC2 statistic was not defined for most evaluators, as an issue was found with the statistic when raters do not use each category on the rating scale a minimum number of times. Overall, summary statistics were 68.7% for exact agreement and 87.6% for adjacent agreement across 2,250 ratings (45 evaluators’ ratings of ten profiles across five NCTEP Standards). Inter-rater agreement coefficients ranged from .486 for Kappa, through .563 for Gwet’s AC1 and .667 for weighted Kappa, to .706 for Gwet’s AC2. While each statistic yielded different results for the same data, the inter-rater reliability of evaluators of early childhood teachers was acceptable or higher for the majority of this group of raters when described with summary statistics and precise measures of inter-rater reliability.
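
    As a minimal sketch of the exact and adjacent agreement calculations (the scores below are hypothetical 5-point rubric ratings, not the NCTEP data; adjacent agreement is taken here, as is conventional, to mean within one scale point):

        # Exact vs. adjacent agreement between one evaluator and an expert panel score.
        def exact_and_adjacent(rater, experts):
            n = len(rater)
            exact = sum(r == e for r, e in zip(rater, experts)) / n
            adjacent = sum(abs(r - e) <= 1 for r, e in zip(rater, experts)) / n
            return exact, adjacent

        rater   = [3, 4, 2, 5, 3, 4, 1, 3, 4, 2]
        experts = [3, 3, 2, 5, 4, 4, 2, 3, 5, 2]
        print(exact_and_adjacent(rater, experts))         # (0.6, 1.0)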

    Many-Faceted Rasch Modeling Expert Judgment in Test Development

    The purpose of this study was to model expert judgment in test and instrument development using the many-faceted Rasch model. A 150-item Value Orientation Inventory-2 (VOI-2) assessing the value of physical education curriculum goals was developed and evaluated by 128 university educators and 103 school-based physical educators. The experts were asked to rate the consistency of each item in representing one part of the broad curriculum goals using a 5-point rating scale. The many-faceted Rasch model was used to calibrate the rating scores, and 6 facets—gender, ethnicity, employment type, rater, content area, and item—were defined. Severity and consistency of the experts' judgments were examined and corrected before being applied to item evaluation. Further, the impact of group membership on expert judgment was examined. Items were then evaluated based on their logit scores and the consistency of their performance. Results suggest that most VOI-2 items were representative of the content domain and that the raters were truly experts. The many-faceted Rasch model demonstrates a psychometrically appropriate technique for applying expert judgment in test development.
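
    At the core of the approach is Linacre's many-facet extension of the Rasch rating-scale model; in a generic three-facet form (additional facets, such as those used in this study, enter as further additive terms) it can be written as

        \log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k

    where B_n is the measure of the object being rated, D_i the difficulty of item i, C_j the severity of rater j, and F_k the threshold between rating categories k-1 and k. This is a generic statement of the model, not the exact parameterization reported in the study.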

    Assessing and inferring intra and inter-rater agreement

    This research work aims to provide a scientific contribution in the field of subjective decision making, since the assessment of consensus, or equivalently the degree of agreement, among a group of raters, as well as between several series of evaluations provided by the same rater on categorical scales, is a subject of both scientific and practical interest. Specifically, the work focuses on the analysis of measures of agreement commonly adopted for assessing the performance (evaluative abilities) of one or more human raters (i.e. a group of raters) providing subjective evaluations about a given set of items/subjects. This topic is common to many contexts, ranging from medical (diagnosis) to engineering (usability testing), industrial (visual inspection) or agribusiness (sensory analysis) settings. In the thesis, the performance of the agreement indexes under study, belonging to the family of kappa-type agreement coefficients, has been assessed mainly with regard to their inferential aspects, focusing on scenarios with small sample sizes that do not satisfy the asymptotic conditions required for the applicability of standard inferential methods. Such scenarios have been poorly investigated in the specialized literature, although they are of evident interest in many experimental contexts. A critical analysis of the specialized literature highlighted two criticisms regarding the adoption of agreement coefficients: 1) the degree of agreement is generally characterized by a straightforward benchmarking procedure that does not take into account the sampling uncertainty; 2) there is no evidence in the literature of a synthetic index able to assess the performance of a rater and/or of a group of raters in terms of more than one evaluative ability (for example, repeatability and reproducibility). Regarding the former criticism, an inferential benchmarking procedure based on non-parametric confidence intervals, built via bootstrap resampling techniques, is suggested. The statistical properties of the suggested benchmarking procedure have been investigated via a Monte Carlo simulation study exploring many scenarios defined by varying the level of agreement, the sample size and the dimension of the rating scale. The simulation study was carried out for different agreement coefficients and different confidence-interval constructions, in order to provide a comparative analysis of their performances. Regarding the latter criticism, a novel composite index is proposed that assesses a rater's ability to provide evaluations that are both repeatable (i.e. stable over time) and reproducible (i.e. consistent over different rating scales). The inferential benchmarking procedure has also been extended to the proposed composite index, and its performance has been investigated under different scenarios via Monte Carlo simulation. The proposed tools have been successfully applied to two real case studies, concerning the assessment of university teaching quality and the sensory analysis of some food and beverage products, respectively.
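
    As a minimal sketch of the inferential benchmarking idea (illustrative only: a percentile bootstrap over subjects for a two-rater Cohen's Kappa with an assumed benchmark scale; the thesis's own coefficients, interval constructions and benchmarks may differ):

        # Benchmark the lower bootstrap confidence bound of an agreement coefficient,
        # rather than its point estimate, so that sampling uncertainty is accounted for.
        import random
        from collections import Counter

        def cohen_kappa(a, b):
            n = len(a)
            po = sum(x == y for x, y in zip(a, b)) / n
            ca, cb = Counter(a), Counter(b)
            pe = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
            return (po - pe) / (1 - pe)

        def bootstrap_benchmark(a, b, n_boot=2000, alpha=0.05, seed=1):
            random.seed(seed)
            n = len(a)
            stats = []
            for _ in range(n_boot):
                idx = [random.randrange(n) for _ in range(n)]   # resample subjects with replacement
                stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
            stats.sort()
            lo = stats[int(alpha / 2 * n_boot)]
            hi = stats[int((1 - alpha / 2) * n_boot) - 1]
            # Assign the qualitative benchmark using the lower confidence bound
            scale = [(0.8, "very good"), (0.6, "good"), (0.4, "moderate"), (0.2, "fair")]
            return (lo, hi), next((name for cut, name in scale if lo >= cut), "poor")

        ratings_a = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1]
        ratings_b = [1, 2, 2, 2, 3, 3, 1, 2, 3, 1, 2, 2, 1, 1, 3, 3, 2, 2, 3, 1]
        print(bootstrap_benchmark(ratings_a, ratings_b))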

    Inter-rater reliability of treatment fidelity and therapeutic alliance measures for psychological therapies for anxiety in young people with autism spectrum disorders

    Objectives: This article presents work undertaken to establish inter-rater reliability for a measure of treatment fidelity and a measure of therapeutic alliance for therapies for anxiety for young people with autism spectrum disorders. The discussion and decision-making processes behind achieving consensus of raters are rarely published. Margolin et al. (1998) have highlighted this issue and called for researchers to communicate the details of their observational and rating procedures. This article is a response to their call for greater transparency so that these methods are readily accessible for comparison with other studies. Methods: Participants were young people with autism spectrum disorders receiving treatment for anxiety, clinical staff treating these young people and the independent raters assessing the treatment sessions. We report: (i) the processes involved in establishing inter-rater reliability for two instruments, and (ii) the results obtained with a sample of young people with autism spectrum disorders using these instruments. Results and conclusions: Results demonstrate that it was possible to attain satisfactory inter-rater reliability with each of these two instruments with a client group with autism spectrum disorders, even though the instruments were originally designed for typically-developing populations.

    Comparisons of Artifact Correction Procedures for Meta-Analysis: An Empirical Examination on Correcting Reliabilities

    This study reviewed some challenges and issues in artifact-correction meta-analysis, particularly around using reliability estimates to correct for measurement error. Two individual-correction procedures—the Hunter-Schmidt procedure and the procedure developed by Raju, Burke, Normand, and Langlois (the RBNL procedure)—are addressed in this research. The purpose of this study is to use real-world data to examine the differences between the meta-analytic estimates produced by the two artifact-correction procedures and those produced by traditional bare-bones meta-analysis, under the condition of inter-dependent reliabilities. When artifact indicators, such as the reliability of the predictor and the reliability of the outcome, are shown to be significantly inter-correlated, the impact of this inter-correlation on meta-analysis results needs investigation. The current study revealed that neither the choice of artifact correction nor the choice of analysis procedure produced any significant differences in the estimation results, whereas it was the choice of the reliability estimates that generated noticeable differences. In addition, the violation of the assumption of independent reliabilities did not greatly impact the meta-analytic estimates.
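
    A minimal sketch of the difference between a bare-bones estimate and an individually corrected estimate (illustrative numbers only; the full Hunter-Schmidt and RBNL procedures also correct the variance and handle further artifacts, none of which is reproduced here):

        # Correcting observed correlations for unreliability before pooling.
        from math import sqrt

        # (sample size, observed correlation, predictor reliability, outcome reliability)
        studies = [(120, 0.25, 0.80, 0.70),
                   (250, 0.31, 0.85, 0.75),
                   (80,  0.18, 0.70, 0.65)]

        # Bare-bones meta-analysis: sample-size-weighted mean of the observed correlations
        bare_bones = sum(n * r for n, r, _, _ in studies) / sum(n for n, *_ in studies)

        # Individual correction: divide each r by its attenuation factor sqrt(rxx * ryy),
        # then weight by n * (rxx * ryy), i.e. sample size times the squared attenuation factor
        corrected = [(r / sqrt(rxx * ryy), n * rxx * ryy) for n, r, rxx, ryy in studies]
        corrected_mean = sum(w * rc for rc, w in corrected) / sum(w for _, w in corrected)

        print(round(bare_bones, 3), round(corrected_mean, 3))   # corrected mean is larger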