Automated Essay Scoring: A Literature Review
In recent decades, large-scale English language proficiency testing and testing research have seen an increased interest in constructed-response essay-writing items (Aschbacher, 1991; Powers, Burstein, Chodorow, Fowles, & Kukich, 2001; Weigle, 2002). The TOEFL iBT, for example, includes two constructed-response writing tasks, one of which is an integrative task requiring the test-taker to write in response to information delivered both aurally and in written form (Educational Testing Service, n.d.). Similarly, the IELTS academic test requires test-takers to write in response to a question that relates to a chart or graph that the test-taker must read and interpret (International English Language Testing System, n.d.). Theoretical justification for the use of such integrative, constructed-response tasks (i.e., tasks which require the test-taker to draw upon information received through several modalities in support of a communicative function) dates back to at least the early 1960s. Carroll (1961, 1972) argued that tests which measure linguistic knowledge alone fail to predict the knowledge and abilities that score users are most likely to be interested in, i.e., actual use of language knowledge for communicative purposes in specific contexts.
Evaluating Quadratic Weighted Kappa as the Standard Performance Metric for Automated Essay Scoring
Automated Essay Scoring (AES) tools aim to improve the efficiency and consistency of essay scoring by using machine learning algorithms. In the existing research work on this topic, most researchers agree that human-automated score agreement remains the benchmark for assessing the accuracy of machine-generated scores. To measure the performance of AES models, the Quadratic Weighted Kappa (QWK) is commonly used as the evaluation metric. However, we have identified several limitations of using QWK as the sole metric for evaluating AES model performance. These limitations include its sensitivity to the rating scale, the potential for the so-called "kappa paradox" to occur, the impact of prevalence, the impact of the position of agreements in the diagonal agreement matrix, and its limitation in handling a large number of raters. Our findings suggest that relying solely on QWK as the evaluation metric for AES performance may not be sufficient. We further discuss insights into additional metrics to comprehensively evaluate the performance and accuracy of AES models.
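As a quick illustration of one limitation named in this abstract, the following is a minimal sketch (not taken from the paper) that computes QWK with scikit-learn and shows how the "kappa paradox" can arise: when one score level dominates, QWK can be near zero even though exact agreement is high. The score values are invented for illustration.

```python
# Minimal sketch (illustrative, not from the paper): QWK versus exact agreement
# on a hypothetical 1-4 essay-score scale with heavily skewed score prevalence.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 3, 3, 3, 3, 3, 3, 3, 2, 4])  # human rater scores
model = np.array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])  # AES model predicts all 3s

agreement = np.mean(human == model)                         # 0.80 exact agreement
qwk = cohen_kappa_score(human, model, weights="quadratic")  # 0.00 in this toy case

print(f"exact agreement: {agreement:.2f}")
print(f"quadratic weighted kappa: {qwk:.2f}")  # near zero despite high agreement
```

In this toy example QWK collapses to zero even though 80% of the scores match exactly, which is the prevalence effect the abstract refers to.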
Improving fairness in machine learning systems: What do industry practitioners need?
The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understanding of real-world needs. Through 35 semi-structured interviews and an anonymous survey of 267 ML practitioners, we conduct the first systematic investigation of commercial product teams' challenges and needs for support in developing fairer ML systems. We identify areas of alignment and disconnect between the challenges faced by industry practitioners and solutions proposed in the fair ML research literature. Based on these findings, we highlight directions for future ML and HCI research that will better address industry practitioners' needs.
Comment: To appear in the 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019).
Examining the Effects of Changes in Automated Rater Bias and Variability on Test Equating Solutions
Many studies have examined the quality of automated raters, but none have focused on the potential effects of systematic rater error on the psychometric properties of test scores. This simulation study examines the comparability of test scores under multiple rater bias and variability conditions, and addresses questions of their effects on test equating solutions. Effects are characterized by a comparison of equated and observed raw scores and estimates of examinee ability across the bias and variability scenarios. Findings suggest that the presence of, and changes in, rater bias and variability affect the equivalence of total raw scores, particularly at higher and lower ends of the score scale. The effects are shown to be larger where variability levels are higher, and, generally, where more constructed response items are used in the equating. Preliminary findings also suggest that consistently higher rater variability may have a slightly larger negative impact on the comparability of scores than does reducing rater bias and variability under the conditions examined here. Finally, a non-equivalent groups anchor test (NEAT) equating design may be slightly more robust to changes in rater bias and variability than a single group equating design for the bias scenarios investigated.
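As a rough, hypothetical companion to the kind of simulation described above (not the study's actual design), the sketch below perturbs an automated rater's bias and variability and shows how the resulting constructed-response score distributions shift relative to a baseline human rating; all function names and parameter values are invented for illustration.

```python
# Illustrative sketch (assumed setup, not the paper's simulation design):
# shift an automated rater's bias and noise and compare the score distributions.
import numpy as np

rng = np.random.default_rng(0)

n_examinees = 5000
true_score = rng.normal(0.0, 1.0, n_examinees)  # latent writing ability

def rate(theta, bias, sd, n_points=6):
    """Map ability to a 0..n_points-1 rubric score with rater bias and noise."""
    raw = theta + bias + rng.normal(0.0, sd, theta.shape)
    return np.clip(np.round((raw + 3) * (n_points - 1) / 6), 0, n_points - 1)

baseline = rate(true_score, bias=0.0, sd=0.30)   # original rating conditions
biased   = rate(true_score, bias=0.25, sd=0.30)  # automated rater drifts upward
noisy    = rate(true_score, bias=0.0, sd=0.60)   # automated rater gets noisier

for name, scores in [("baseline", baseline), ("biased", biased), ("noisy", noisy)]:
    print(f"{name:9s} mean={scores.mean():.2f} sd={scores.std():.2f}")
```

Even this toy setup makes the abstract's point visible: added bias moves the score means, while added variability widens the distribution, and either change would propagate into an equating solution built on those scores.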
Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring
Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic. Most existing AES systems assume that the essays to be graded come from the same prompt used in training, and they assign only a holistic score. However, such settings conflict with real educational situations: pre-graded essays for a particular prompt are lacking, and detailed trait scores for sub-rubrics are required. Thus, predicting various trait scores of unseen-prompt essays (called cross-prompt essay trait scoring) remains a challenge for AES. In this paper, we propose a robust model: a prompt- and trait relation-aware cross-prompt essay trait scorer. We encode a prompt-aware essay representation via essay-prompt attention and utilize a topic-coherence feature extracted by a topic-modeling mechanism without access to labeled data; therefore, our model considers the prompt adherence of an essay even in a cross-prompt setting. To facilitate multi-trait scoring, we design a trait-similarity loss that encapsulates the correlations among traits. Experiments prove the efficacy of our model, showing state-of-the-art results for all prompts and traits. Significant improvements on low-resource prompts and inferior traits further indicate our model's strength.
Comment: Accepted at ACL 2023 (Findings, long paper).
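The abstract does not spell out the trait-similarity loss, so the following is only a hedged PyTorch sketch of one plausible reading: a per-trait regression loss plus a term that penalizes disagreement between predicted scores of traits that correlate strongly in the training labels. The function name, the reference correlation matrix, and the alpha weight are all assumptions, not the authors' formulation.

```python
# Hedged sketch (one plausible reading of the abstract, not the authors' exact loss):
# multi-trait MSE plus a penalty that pulls predictions for strongly correlated
# traits toward each other, weighted by a reference trait-correlation matrix.
import torch

def trait_similarity_loss(pred, gold, trait_corr, alpha=0.1):
    """pred, gold: (batch, n_traits); trait_corr: (n_traits, n_traits);
    alpha: hypothetical weight for the similarity term."""
    mse = torch.mean((pred - gold) ** 2)
    diff = pred.unsqueeze(2) - pred.unsqueeze(1)       # (batch, T, T) pairwise gaps
    sim_penalty = torch.mean(trait_corr.clamp(min=0) * diff ** 2)
    return mse + alpha * sim_penalty

# Toy usage with random scores for 4 traits.
pred = torch.rand(8, 4)
gold = torch.rand(8, 4)
corr = torch.full((4, 4), 0.5).fill_diagonal_(1.0)     # stand-in correlation matrix
print(trait_similarity_loss(pred, gold, corr))
```

The design intuition is simply that if two rubric traits (say organization and coherence) usually move together in the gold labels, large gaps between their predicted scores should cost the model something.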
When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs
The use of machine learning (ML) models to assess and score textual data has become increasingly pervasive in an array of contexts including natural language processing, information retrieval, search and recommendation, and credibility assessment of online content. A significant disruption at the intersection of ML and text is the emergence of text-generating large language models such as generative pre-trained transformers (GPTs). We empirically assess the differences in how ML-based scoring models trained on human content assess the quality of content generated by humans versus GPTs. To do so, we propose an analysis framework that encompasses essay-scoring ML models, human- and ML-generated essays, and a statistical model that parsimoniously considers the impact of type of respondent, prompt genre, and the ML model used for assessment. A rich testbed is utilized that encompasses 18,460 human-generated and GPT-based essays. Results of our benchmark analysis reveal that transformer pretrained language models (PLMs) more accurately score human essay quality as compared to CNN/RNN and feature-based ML methods. Interestingly, we find that the transformer PLMs tend to score GPT-generated text 10-15% higher on average, relative to human-authored documents. Conversely, traditional deep learning and feature-based ML models score human text considerably higher. Further analysis reveals that although the transformer PLMs are exclusively fine-tuned on human text, they more prominently attend to certain tokens appearing only in GPT-generated text, possibly due to familiarity/overlap in pre-training. Our framework and results have implications for text classification settings where automated scoring of text is likely to be disrupted by generative AI.
Comment: Data available at: https://github.com/nd-hal/automated-ML-scoring-versus-generatio
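For readers who want to reproduce the flavor of this benchmark comparison, here is a minimal pandas sketch (hypothetical column names and numbers, not the released code at the repository above) that computes the mean score gap between GPT-generated and human-written essays for each scoring model.

```python
# Minimal sketch (hypothetical data, not the paper's released code): per-scorer
# mean score gap between GPT-generated and human-written essays.
import pandas as pd

# Hypothetical long-format results: one row per (essay, scoring model).
df = pd.DataFrame({
    "scorer": ["plm", "plm", "cnn", "cnn", "features", "features"],
    "source": ["human", "gpt", "human", "gpt", "human", "gpt"],
    "score":  [3.1, 3.5, 3.4, 3.0, 3.3, 2.9],
})

gap = (
    df.pivot_table(index="scorer", columns="source", values="score", aggfunc="mean")
      .assign(gpt_minus_human=lambda t: t["gpt"] - t["human"])
)
print(gap)  # positive gpt_minus_human means the scorer favors GPT text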
Professional Judgment in an Era of Artificial Intelligence and Machine Learning
Though artificial intelligence (AI) in healthcare and education now accomplishes diverse tasks, there are two features that tend to unite the information processing behind efforts to substitute it for professionals in these fields: reductionism and functionalism. True believers in substitutive automation tend to model work in human services by reducing the professional role to a set of behaviors initiated by some stimulus, which are intended to accomplish some predetermined goal, or maximize some measure of well-being. However, true professional judgment hinges on a way of knowing the world that is at odds with the epistemology of substitutive automation. Instead of reductionism, an encompassing holism is a hallmark of professional practice—an ability to integrate facts and values, the demands of the particular case and prerogatives of society, and the delicate balance between mission and margin. Any presently plausible vision of substituting AI for education and healthcare professionals would necessitate a corrosive reductionism. The only way these sectors can progress is to maintain, at their core, autonomous professionals capable of carefully intermediating between technology and the patients it would help treat, or the students it would help learn.