Inequalities between multi-rater kappas
Multivariate analysis of psychological data
Reformulation and Generalisation of the Cohen and Fleiss Kappas
The assessment of consistency in the categorical or ordinal decisions made by observers or raters is an important problem, especially in the medical field. The Fleiss Kappa, Cohen Kappa and Intra-class Correlation (ICC), as commonly used for this purpose, are compared and a generalised approach to these measurements is presented. Differences between the Fleiss Kappa and multi-rater versions of the Cohen Kappa are explained, and it is shown how both may be applied to ordinal scoring with linear, quadratic or other weighting. The relationship between the quadratically weighted Fleiss and Cohen Kappa and the pair-wise ICC is clarified and generalised to multi-rater assessments. The AC1 coefficient is considered as an alternative measure of consistency, and the relevance of the Kappas and AC1 to measuring content validity is explored.
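As a rough illustration of the quantities this abstract compares, the Python sketch below computes a pairwise quadratically weighted Cohen kappa and a Fleiss kappa for a small set of multi-rater ordinal scores. The data and the use of scikit-learn and statsmodels are assumptions of this sketch, not material from the paper.

# Sketch: quadratically weighted Cohen kappa (pairwise) and Fleiss kappa
# for ordinal ratings by multiple raters. Illustrative data only.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

# ratings[i, j] = ordinal score (0..4) given by rater j to subject i
ratings = np.array([
    [0, 1, 0],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 1],
    [3, 3, 2],
])

# Pairwise Cohen kappa with quadratic weights (raters 0 and 1)
qwk = cohen_kappa_score(ratings[:, 0], ratings[:, 1], weights="quadratic")

# Fleiss kappa needs a subjects-by-categories count table
counts, _ = aggregate_raters(ratings)
fk = fleiss_kappa(counts, method="fleiss")

print(f"quadratically weighted Cohen kappa (raters 0 vs 1): {qwk:.3f}")
print(f"Fleiss kappa (all raters, unweighted): {fk:.3f}")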
A Comparison of Reliability Coefficients for Ordinal Rating Scales
Kappa coefficients are commonly used for quantifying reliability on a categorical scale, whereas correlation coefficients are commonly applied to assess reliability on an interval scale. Both types of coefficients can be used to assess the reliability of ordinal rating scales. In this study, we compare seven reliability coefficients for ordinal rating scales: the kappa coefficients included are Cohen's kappa, linearly weighted kappa, and quadratically weighted kappa; the correlation coefficients included are the intraclass correlation ICC(3,1), Pearson's correlation, Spearman's rho, and Kendall's tau-b. The primary goal is to provide a thorough understanding of these coefficients such that the applied researcher can make a sensible choice for ordinal rating scales. A second aim is to find out whether the choice of coefficient matters. We studied to what extent we reach the same conclusions about inter-rater reliability with different coefficients, and to what extent the coefficients measure agreement in a similar way, using analytic methods as well as simulated and empirical data. Using analytic methods, it is shown that differences between quadratic kappa and the Pearson and intraclass correlations increase as agreement becomes larger. Differences between the three coefficients are generally small if differences between rater means and variances are small. Furthermore, using simulated and empirical data, it is shown that differences between all reliability coefficients tend to increase as agreement between the raters increases. Moreover, for the data in this study, the same conclusion about inter-rater reliability was reached in virtually all cases with the four correlation coefficients. In addition, with quadratically weighted kappa we reached a conclusion similar to that of the correlation coefficients in a great number of cases. Hence, for the data in this study, it does not really matter which of these five coefficients is used. Moreover, the four correlation coefficients and quadratically weighted kappa tend to measure agreement in a similar way: their values are very highly correlated for the data in this study.
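A minimal Python sketch of the seven coefficients named in this abstract, computed for two hypothetical raters on a 5-point ordinal scale. The ratings are invented, and ICC(3,1) is computed directly from the Shrout and Fleiss two-way mixed-model formula rather than taken from the paper.

# Sketch: the seven reliability coefficients compared in the abstract,
# evaluated on made-up ratings from two raters.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

r1 = np.array([1, 2, 2, 3, 4, 5, 3, 2, 4, 5])
r2 = np.array([1, 2, 3, 3, 4, 4, 3, 1, 5, 5])

def icc_3_1(x, y):
    # ICC(3,1): two-way mixed model, consistency, single rater (Shrout & Fleiss)
    data = np.column_stack([x, y]).astype(float)
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse)

print("Cohen kappa          ", cohen_kappa_score(r1, r2))
print("linear weighted kappa", cohen_kappa_score(r1, r2, weights="linear"))
print("quadratic kappa      ", cohen_kappa_score(r1, r2, weights="quadratic"))
print("ICC(3,1)             ", icc_3_1(r1, r2))
print("Pearson r            ", pearsonr(r1, r2)[0])
print("Spearman rho         ", spearmanr(r1, r2)[0])
print("Kendall tau-b        ", kendalltau(r1, r2)[0])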
Reliability of an Observational Method Used to Assess Tennis Serve Mechanics in a Group of Novice Raters
Background: Previous research has developed an observational tennis serve analysis (OTSA) tool to assess serve mechanics. The OTSA has displayed substantial agreement between the two health care professionals who developed the tool; however, it is currently unknown whether the OTSA is reliable when administered by novice users.
Purpose: The purpose of this investigation was to determine if reliability for the OTSA could be established in novice users via an interactive classroom training session.
Methods: Eight observers underwent a classroom instructional training protocol highlighting the OTSA. Following training, observers participated in two different rating sessions approximately a week apart. Each observer independently viewed 16 non-professional tennis players performing a first serve. All observers were asked to rate the tennis serve using the OTSA. Both intra- and inter-observer reliability were determined using Kappa coefficients.
Results: Kappa coefficients for intra- and inter-observer agreement ranged from 0.09 to 0.83, depending on the body position. The majority of body positions yielded moderate agreement or better.
Conclusion: This study suggests that the majority of components associated with the OTSA are reliable and can be taught to novice users via a classroom training session.
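The kind of computation described in the Methods can be sketched as follows in Python. The binary present/absent ratings per body position are simulated here purely for illustration and are not the OTSA study data.

# Sketch: intra-observer (session 1 vs session 2 for one observer) and
# inter-observer (all 8 observers, one session) agreement via kappa.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

rng = np.random.default_rng(0)
n_serves, n_observers = 16, 8

# session1[i, j] = rating (0/1) of serve i by observer j; session2 likewise,
# simulated so that most first-session ratings are repeated
session1 = rng.integers(0, 2, size=(n_serves, n_observers))
session2 = np.where(rng.random((n_serves, n_observers)) < 0.8,
                    session1, 1 - session1)

# Intra-observer reliability for observer 0: Cohen kappa across sessions
intra = cohen_kappa_score(session1[:, 0], session2[:, 0])

# Inter-observer reliability in session 1: Fleiss kappa across 8 observers
counts, _ = aggregate_raters(session1)
inter = fleiss_kappa(counts, method="fleiss")

print(f"intra-observer kappa (observer 0): {intra:.2f}")
print(f"inter-observer Fleiss kappa:       {inter:.2f}")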
A family of multi-rater kappas that can always be increased and decreased by combining categories
FSW – Publications without appointment, Universiteit Leiden
The problem with Kappa
It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, has been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.
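The skew-reversal claim can be illustrated numerically. In the Python sketch below, Powers' kappa is stood in for by Informedness (sensitivity + specificity - 1), which is an assumption of this sketch; the classifier's error rates and sample size are likewise made up. Cohen kappa shifts when prevalence is reversed, while Informedness does not.

# Sketch: effect of reversing dataset skew on chance-corrected agreement.
# A fixed classifier (sensitivity 0.9, specificity 0.7 -- invented numbers)
# is scored at a positive prevalence of 0.9 and at the opposite skew, 0.1.

def cohen_kappa(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (po - pe) / (1 - pe)

def informedness(tp, fn, fp, tn):
    # sensitivity + specificity - 1; used here as a stand-in for Powers Kappa
    return tp / (tp + fn) + tn / (tn + fp) - 1

sens, spec = 0.9, 0.7
for prevalence in (0.9, 0.1):          # original skew, then the opposite skew
    pos, neg = 1000 * prevalence, 1000 * (1 - prevalence)
    tp, fn = sens * pos, (1 - sens) * pos
    tn, fp = spec * neg, (1 - spec) * neg
    print(f"prevalence {prevalence:.1f}: "
          f"Cohen kappa = {cohen_kappa(tp, fn, fp, tn):.3f}, "
          f"informedness = {informedness(tp, fn, fp, tn):.3f}")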
Corrected Zegers-ten Berge coefficients are special cases of Cohen's weighted kappa
Multivariate analysis of psychological data
Kappa coefficients for dichotomous-nominal classifications
Two types of nominal classifications are distinguished, namely regular nominal classifications and dichotomous-nominal classifications. The first type does not include an 'absence' category (for example, no disorder), whereas the second type does include an 'absence' category. Cohen's unweighted kappa can be used to quantify agreement between two regular nominal classifications with the same categories, but there are no coefficients for assessing agreement between two dichotomous-nominal classifications. Kappa coefficients for dichotomous-nominal classifications with identical categories are defined. All coefficients proposed belong to a one-parameter family. It is studied how the coefficients for dichotomous-nominal classifications are related and whether the values of the coefficients depend on the number of categories. It turns out that the values of the new kappa coefficients can be strictly ordered in precisely two ways. The orderings suggest that the new coefficients measure the same thing, but to a different extent. If one accepts the use of magnitude guidelines, it is recommended that stricter criteria be used for the new coefficients that tend to produce higher values.
A comparison of Cohen's kappa and agreement coefficients by Corrado Gini
Multivariate analysis of psychological data
A comparison of multi-way similarity coefficients for binary sequences
Multivariate analysis of psychological data
- …