2,319 research outputs found

    Evaluating Quadratic Weighted Kappa as the Standard Performance Metric for Automated Essay Scoring

    Get PDF
    Automated Essay Scoring (AES) tools aim to improve the efficiency and consistency of essay scoring by using machine learning algorithms. In the existing research on this topic, most researchers agree that human-automated score agreement remains the benchmark for assessing the accuracy of machine-generated scores. To measure the performance of AES models, the Quadratic Weighted Kappa (QWK) is commonly used as the evaluation metric. However, we have identified several limitations of using QWK as the sole metric for evaluating AES model performance. These limitations include its sensitivity to the rating scale, the potential for the so-called "kappa paradox" to occur, the impact of prevalence, the impact of the position of agreements in the diagonal agreement matrix, and its limitation in handling a large number of raters. Our findings suggest that relying solely on QWK as the evaluation metric for AES performance may not be sufficient. We further discuss additional metrics that can more comprehensively evaluate the performance and accuracy of AES models.
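
    As a small illustration of the "kappa paradox" mentioned in this abstract, the hypothetical example below (not taken from the paper) computes QWK with scikit-learn's cohen_kappa_score for two raters who agree on 80% of essays yet obtain a negative kappa because one score level dominates the distribution:

        # Hypothetical example (not from the paper): high raw agreement, low QWK.
        from sklearn.metrics import cohen_kappa_score

        human   = [2, 2, 2, 2, 3, 2, 2, 2, 2, 2]   # hypothetical human scores
        machine = [2, 2, 2, 2, 2, 2, 2, 2, 2, 3]   # hypothetical machine scores

        raw_agreement = sum(h == m for h, m in zip(human, machine)) / len(human)
        qwk = cohen_kappa_score(human, machine, weights="quadratic")

        print(f"raw agreement = {raw_agreement:.2f}")  # 0.80
        print(f"QWK           = {qwk:.2f}")            # about -0.11 despite 80% agreement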

    Improving fairness in machine learning systems: What do industry practitioners need?

    Full text link
    The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understanding of real-world needs. Through 35 semi-structured interviews and an anonymous survey of 267 ML practitioners, we conduct the first systematic investigation of commercial product teams' challenges and needs for support in developing fairer ML systems. We identify areas of alignment and disconnect between the challenges faced by industry practitioners and solutions proposed in the fair ML research literature. Based on these findings, we highlight directions for future ML and HCI research that will better address industry practitioners' needs. Comment: To appear in the 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019).

    Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring

    Full text link
    Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic. Most existing AES systems assume they will grade essays written for the same prompt used in training, and they assign only a holistic score. However, such settings conflict with real educational situations: pre-graded essays for a particular prompt are lacking, and detailed trait scores for sub-rubrics are required. Thus, predicting various trait scores of unseen-prompt essays (called cross-prompt essay trait scoring) is a remaining challenge for AES. In this paper, we propose a robust model: a prompt- and trait relation-aware cross-prompt essay trait scorer. We encode a prompt-aware essay representation through essay-prompt attention and by utilizing a topic-coherence feature extracted by a topic-modeling mechanism without access to labeled data; therefore, our model considers the prompt adherence of an essay even in a cross-prompt setting. To facilitate multi-trait scoring, we design a trait-similarity loss that encapsulates the correlations of traits. Experiments prove the efficacy of our model, showing state-of-the-art results for all prompts and traits. Significant improvements on low-resource prompts and inferior traits further indicate our model's strength. Comment: Accepted at ACL 2023 (Findings, long paper).
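
    The trait-similarity loss itself is specific to the paper, but a rough sketch of one way such a regularizer could encapsulate trait correlations (our own illustrative assumption, not the authors' implementation) is to penalize the gap between the correlation matrix of predicted trait scores in a batch and a reference correlation matrix estimated from training labels:

        # Hypothetical sketch of a trait-similarity regularizer (NOT the paper's code):
        # push the correlation structure of predicted trait scores toward a reference
        # trait-correlation matrix estimated from the training labels.
        import torch

        def trait_similarity_loss(pred, ref_corr, eps=1e-8):
            # pred: (batch, n_traits) predicted trait scores
            # ref_corr: (n_traits, n_traits) reference trait correlations
            centered = pred - pred.mean(dim=0, keepdim=True)
            normed = centered / (centered.std(dim=0, keepdim=True) + eps)
            corr = (normed.t() @ normed) / (pred.size(0) - 1)   # batch trait correlations
            return ((corr - ref_corr) ** 2).mean()

        pred_traits = torch.rand(16, 4, requires_grad=True)     # toy data: 16 essays, 4 traits
        ref_corr = torch.eye(4)                                 # placeholder reference matrix
        loss = trait_similarity_loss(pred_traits, ref_corr)
        loss.backward()                                         # differentiable, usable as a regularizer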

    When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs

    Full text link
    The use of machine learning (ML) models to assess and score textual data has become increasingly pervasive in an array of contexts including natural language processing, information retrieval, search and recommendation, and credibility assessment of online content. A significant disruption at the intersection of ML and text is the emergence of text-generating large language models such as generative pre-trained transformers (GPTs). We empirically assess the differences in how ML-based scoring models trained on human content assess the quality of content generated by humans versus GPTs. To do so, we propose an analysis framework that encompasses essay-scoring ML models, human- and ML-generated essays, and a statistical model that parsimoniously considers the impact of type of respondent, prompt genre, and the ML model used for assessment. A rich testbed is utilized that encompasses 18,460 human-generated and GPT-based essays. Results of our benchmark analysis reveal that transformer pretrained language models (PLMs) more accurately score human essay quality as compared to CNN/RNN and feature-based ML methods. Interestingly, we find that the transformer PLMs tend to score GPT-generated text 10-15% higher on average, relative to human-authored documents. Conversely, traditional deep learning and feature-based ML models score human text considerably higher. Further analysis reveals that although the transformer PLMs are exclusively fine-tuned on human text, they more prominently attend to certain tokens appearing only in GPT-generated text, possibly due to familiarity/overlap in pre-training. Our framework and results have implications for text classification settings where automated scoring of text is likely to be disrupted by generative AI. Comment: Data available at: https://github.com/nd-hal/automated-ML-scoring-versus-generatio
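
    The statistical model is not specified in the abstract, but a minimal sketch of that kind of comparison, assuming a hypothetical table with columns score, respondent_type, prompt_genre, and scoring_model (illustrative only, not the authors' analysis code), could look like this:

        # Illustrative only (hypothetical column names, not the authors' code):
        # regress assessed essay quality on respondent type, prompt genre,
        # and the scoring model used for assessment.
        import pandas as pd
        import statsmodels.formula.api as smf

        df = pd.read_csv("essay_scores.csv")   # hypothetical file, one row per scored essay
        fit = smf.ols(
            "score ~ C(respondent_type) + C(prompt_genre) + C(scoring_model)",
            data=df,
        ).fit()
        print(fit.summary())                    # compares mean scores across groups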

    Automated Essay Scoring: A Literature Review

    Get PDF
    In recent decades, large-scale English language proficiency testing and testing research have seen an increased interest in constructed-response essay-writing items (Aschbacher, 1991; Powers, Burstein, Chodorow, Fowles, & Kukich, 2001; Weigle, 2002). The TOEFL iBT, for example, includes two constructed-response writing tasks, one of which is an integrative task requiring the test-taker to write in response to information delivered both aurally and in written form (Educational Testing Service, n.d.). Similarly, the IELTS academic test requires test-takers to write in response to a question that relates to a chart or graph that the test-taker must read and interpret (International English Language Testing System, n.d.). Theoretical justification for the use of such integrative, constructed-response tasks (i.e., tasks which require the test-taker to draw upon information received through several modalities in support of a communicative function) dates back to at least the early 1960s. Carroll (1961, 1972) argued that tests which measure linguistic knowledge alone fail to predict the knowledge and abilities that score users are most likely to be interested in, i.e., the actual use of language knowledge for communicative purposes in specific contexts.

    Professional Judgment in an Era of Artificial Intelligence and Machine Learning

    Get PDF
    Though artificial intelligence (AI) in healthcare and education now accomplishes diverse tasks, there are two features that tend to unite the information processing behind efforts to substitute it for professionals in these fields: reductionism and functionalism. True believers in substitutive automation tend to model work in human services by reducing the professional role to a set of behaviors initiated by some stimulus, which are intended to accomplish some predetermined goal, or to maximize some measure of well-being. However, true professional judgment hinges on a way of knowing the world that is at odds with the epistemology of substitutive automation. Instead of reductionism, an encompassing holism is a hallmark of professional practice: an ability to integrate facts and values, the demands of the particular case and the prerogatives of society, and the delicate balance between mission and margin. Any presently plausible vision of substituting AI for education and healthcare professionals would necessitate a corrosive reductionism. The only way these sectors can progress is to maintain, at their core, autonomous professionals capable of carefully intermediating between technology and the patients it would help treat, or the students it would help learn.