    Inter-Coder Agreement for Computational Linguistics

    This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks, but that their use makes the interpretation of the value of the coefficient even harder.
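
    To make the chance correction these coefficients share concrete, here is a minimal sketch (not taken from the article) of Cohen's kappa for two coders over nominal categories; the categories and annotations are invented for the example.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on nominal labels."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of items the two coders label identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement under Cohen's model: each coder has her own label
    # distribution, and chance agreement is the product of the two, per category.
    dist_a, dist_b = Counter(coder_a), Counter(coder_b)
    expected = sum((dist_a[c] / n) * (dist_b[c] / n)
                   for c in set(coder_a) | set(coder_b))
    return (observed - expected) / (1 - expected)

# Invented toy annotations for two coders over ten items.
a = ["dialog_act", "dialog_act", "other", "other", "dialog_act",
     "other", "dialog_act", "other", "other", "dialog_act"]
b = ["dialog_act", "other", "other", "other", "dialog_act",
     "other", "dialog_act", "dialog_act", "other", "dialog_act"]
print(f"kappa = {cohens_kappa(a, b):.3f}")
```

    Roughly speaking, Scott's pi and Krippendorff's alpha differ mainly in how the expected-agreement term is estimated: pi pools both coders into a single label distribution, and alpha adds a small-sample correction and, in its weighted form, graded notions of disagreement.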

    Analyzing collaborative learning processes automatically

    In this article we describe the emerging area of text classification research focused on the problem of collaborative learning process analysis, both from a broad perspective and more specifically in terms of a publicly available tool set called TagHelper tools. Analyzing the variety of pedagogically valuable facets of learners’ interactions is a time-consuming and effortful process. Improving automated analyses of such highly valued processes of collaborative learning by adapting and applying recent text classification technologies would make it a less arduous task to obtain insights from corpus data. This endeavor also holds the potential for enabling substantially improved on-line instruction, both by providing teachers and facilitators with reports about the groups they are moderating and by triggering context-sensitive collaborative learning support on an as-needed basis. In this article, we report on an interdisciplinary research project that has been investigating the effectiveness of applying text classification technology to a large CSCL corpus that has been analyzed by human coders using a theory-based multidimensional coding scheme. We report promising results and include an in-depth discussion of important issues such as reliability, validity, and efficiency that should be considered when deciding on the appropriateness of adopting a new technology such as TagHelper tools. One major technical contribution of this work is a demonstration that an important piece of the work towards making text classification technology effective for this purpose is designing and building linguistic pattern detectors, otherwise known as features, that can be extracted reliably from texts and that have high predictive power for the categories of discourse actions that the CSCL community is interested in.
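
    As a rough illustration of the general recipe described above (not the TagHelper implementation itself), the sketch below trains a text classifier on utterances that human coders have already labelled with one dimension of a coding scheme. The corpus, labels, and feature choices are invented placeholders, and scikit-learn is assumed to be available.

```python
# Sketch of classifying coded utterances; data and coding dimension are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "I think the beam will bend because the load is off centre",
    "ok",
    "can you explain why you chose that material?",
    "let's just copy the answer from the book",
    "what if we doubled the cross section instead?",
    "yeah sure",
]
# One dimension of a hypothetical multidimensional coding scheme.
labels = ["reasoning", "off_task", "elicitation", "off_task", "reasoning", "off_task"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # simple surface word/bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(utterances, labels)

# In practice the classifier's agreement with held-out human codes would be
# evaluated; here we just predict the code for a new, unlabelled utterance.
print(model.predict(["could you say more about why that works?"]))
```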

    Reliability measurement without limits

    In computational linguistics, a reliability measurement of 0.8 on some statistic such as κ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with a low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common-sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
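
    The distinction the paper turns on, random versus systematic disagreement, can be made concrete with a small invented simulation (not from the paper itself):

```python
# Simulated sketch: two coders with similar raw agreement against a gold
# standard, but one disagrees at random and the other disagrees systematically.
import random

random.seed(0)
n = 10_000
# Each item has one observable feature and a true binary label that depends on it.
features = [random.random() for _ in range(n)]
gold = [1 if x > 0.5 else 0 for x in features]

# Coder A: disagrees with the gold label 15% of the time, completely at random.
coder_random = [1 - y if random.random() < 0.15 else y for y in gold]

# Coder B: disagrees only on borderline items (feature near 0.5), so the errors
# form a pattern that a machine learner could pick up and be rewarded for when
# the test data contain the same pattern.
coder_systematic = [1 - y if 0.45 < x < 0.60 else y
                    for x, y in zip(features, gold)]

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

print("agreement, random noise     :", percent_agreement(gold, coder_random))
print("agreement, systematic errors:", percent_agreement(gold, coder_systematic))
```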

    Assessing agreement on classification tasks: the kappa statistic

    Currently, computational linguists and cognitive scientists working in the area of discourse and dialogue argue that their subjective judgments are reliable using several different statistics, none of which are easily interpretable or comparable to each other. Meanwhile, researchers in content analysis have already experienced the same difficulties and come up with a solution in the kappa statistic. We discuss what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argue that we would be better off as a field adopting techniques from content analysis.
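
    For reference (the standard formulation, not quoted from the paper), kappa corrects observed agreement for the agreement expected by chance:

```latex
% P(A): observed proportion of items on which the coders agree
% P(E): agreement expected by chance, estimated from the coders' label distributions
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
```

    A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.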

    An Empirical Approach to Temporal Reference Resolution

    This paper presents the results of an empirical investigation of temporal reference resolution in scheduling dialogs. The algorithm adopted is primarily a linear-recency-based approach that does not include a model of global focus. A fully automatic system has been developed and evaluated on unseen test data with good results. This paper presents the results of an intercoder reliability study, a model of temporal reference resolution that supports linear recency and has very good coverage, the results of the system evaluated on unseen test data, and a detailed analysis of the dialogs assessing the viability of the approach.
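
    A toy sketch of the linear-recency idea (an illustration only, not the paper's actual model): an underspecified temporal expression inherits its missing fields from the most recently mentioned time that supplies them, with no model of global focus.

```python
# Invented example of recency-based resolution of partial temporal expressions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimeMention:
    month: Optional[int] = None   # 1-12, None if the utterance left it unspecified
    day: Optional[int] = None     # 1-31, None if unspecified

def resolve(partial: TimeMention, history: list) -> TimeMention:
    """Linear recency: fill each unspecified field from the most recent
    earlier mention that specifies it, searching backwards through the dialog."""
    resolved = TimeMention(month=partial.month, day=partial.day)
    for earlier in reversed(history):
        if resolved.month is None and earlier.month is not None:
            resolved.month = earlier.month
        if resolved.day is None and earlier.day is not None:
            resolved.day = earlier.day
    return resolved

# "How about January 14th?" ... "The 16th works better for me."
history = [TimeMention(month=1, day=14)]
print(resolve(TimeMention(day=16), history))   # -> TimeMention(month=1, day=16)
```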

    Analytic frameworks for assessing dialogic argumentation in online learning environments

    Over the last decade, researchers have developed sophisticated online learning environments to support students engaging in argumentation. This review first considers the range of functionalities incorporated within these online environments. The review then presents five categories of analytic frameworks focusing on (1) formal argumentation structure, (2) normative quality, (3) nature and function of contributions within the dialog, (4) epistemic nature of reasoning, and (5) patterns and trajectories of participant interaction. Example analytic frameworks from each category are presented in detail rich enough to illustrate their nature and structure. This rich detail is intended to facilitate researchers’ identification of possible frameworks to draw upon in developing or adopting analytic methods for their own work. Each framework is applied to a shared segment of student dialog to facilitate this illustration and comparison process. Synthetic discussions of each category consider the frameworks in light of the underlying theoretical perspectives on argumentation, pedagogical goals, and online environmental structures. Ultimately, the review underscores the diversity of perspectives represented in this research, the importance of clearly specifying theoretical and environmental commitments throughout the process of developing or adopting an analytic framework, and the role of analytic frameworks in the future development of online learning environments for argumentation.

    Multidisciplinary group performance—measuring integration intensity in the context of the North West London Integrated Care Pilot

    Introduction: Multidisciplinary Group meetings (MDGs) are seen as key facilitators of integration, moving from individual to multidisciplinary decision-making, and from a focus on individual patients to a focus on patient groups. We have developed a method for coding MDG transcripts to identify whether they are or are not vehicles for delivering the anticipated efficiency improvements across various providers, and apply it to a test case in the North West London Integrated Care Pilot. Methods: We defined ‘integrating’ as the process within the MDG meeting that enables or promotes improved collaboration, improved understanding, and improved awareness of self and others within the local healthcare economy, such that efficiency improvements could be identified and action taken. Utterances within the MDGs are coded according to three distinct domains grounded in concepts from the communication, group decision-making, and integrated care literatures: the Valence, the Focus, and the Level. Standardized weighted integrative intensity scores are calculated across ten time deciles in the Case Discussion, providing a graphical representation of its integrative intensity. Results: Intra- and inter-rater reliability of the coding scheme was very good as measured by the Prevalence and Bias-adjusted Kappa Score. The standardized weighted integrative intensity graph closely mirrored the verbatim transcript and is a convenient representation of complex communication dynamics. The trend in integrative intensity can be calculated, and the characteristics of the MDG can be pragmatically described. Conclusion: This is a novel and potentially useful method for researchers, managers and practitioners to better understand MDG dynamics and to identify whether participants are integrating. The degree to which participants use MDG meetings to develop an integrated way of working is likely to require management, leadership and shared values.
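
    The per-decile scoring step can be sketched as follows; the intensity values and the absence of any real weighting scheme here are simplifications invented for illustration, not the pilot's actual codes or weights.

```python
# Schematic sketch: split a case discussion's coded utterances into ten equal
# time slices and report a mean "integrative intensity" per slice.
def decile_intensity(utterance_scores, n_bins=10):
    """utterance_scores: one intensity value per utterance, in temporal order."""
    n = len(utterance_scores)
    bins = [[] for _ in range(n_bins)]
    for i, score in enumerate(utterance_scores):
        bins[min(i * n_bins // n, n_bins - 1)].append(score)
    return [sum(b) / len(b) if b else 0.0 for b in bins]

# Invented example: each utterance already mapped to a single intensity value.
scores = [0, 1, 1, 2, 0, 3, 2, 2, 1, 3, 3, 2, 1, 0, 2, 3, 3, 3, 2, 1]
print(decile_intensity(scores))   # one value per time decile, ready to plot
```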

    Exploiting `Subjective' Annotations

    Many interesting phenomena in conversation can only be annotated as a subjective task, requiring interpretative judgements from annotators. This leads to data which is annotated with lower levels of agreement, not only due to errors in the annotation but also due to differences in how annotators interpret conversations. This paper constitutes an attempt to find out how subjective annotations with a low level of agreement can profitably be used for machine-learning purposes. We analyse the (dis)agreements between annotators for two different cases in a multimodal annotated corpus and explicitly relate the results to the way machine-learning algorithms perform on the annotated data. Finally we present two new concepts, namely 'subjective entity' classifiers and 'consensus objective' classifiers, and give recommendations for using subjective data in machine-learning applications.
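
    One plausible reading of the two concepts (the paper's own definitions are authoritative) can be sketched as follows, assuming scikit-learn: a 'subjective' classifier is trained on a single annotator's labels, while a 'consensus' classifier is trained only on the items where the annotators agree. The items and labels below are invented.

```python
# Sketch: per-annotator versus agreement-only training on disagreeing annotations.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

items = ["that's hilarious", "hmm ok", "you're kidding", "right", "no way", "fine"]
annotator_1 = ["amused", "neutral", "amused", "neutral", "amused", "neutral"]
annotator_2 = ["amused", "neutral", "neutral", "neutral", "amused", "amused"]

def train(texts, labels):
    return make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

# One classifier per annotator, each modelling that annotator's interpretation.
subjective_1 = train(items, annotator_1)
subjective_2 = train(items, annotator_2)

# One classifier trained only on the items where the two annotators agree.
agreed = [(t, a) for t, a, b in zip(items, annotator_1, annotator_2) if a == b]
consensus = train([t for t, _ in agreed], [a for _, a in agreed])

print(subjective_1.predict(["you're kidding me"]),
      consensus.predict(["you're kidding me"]))
```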