Automated Essay Scoring: A Literature Review
In recent decades, large-scale English language proficiency testing and testing research have seen an increased interest in constructed-response essay-writing items (Aschbacher, 1991; Powers, Burstein, Chodorow, Fowles, & Kukich, 2001; Weigle, 2002). The TOEFL iBT, for example, includes two constructed-response writing tasks, one of which is an integrative task requiring the test-taker to write in response to information delivered both aurally and in written form (Educational Testing Service, n.d.). Similarly, the IELTS academic test requires test-takers to write in response to a question that relates to a chart or graph that the test-taker must read and interpret (International English Language Testing System, n.d.). Theoretical justification for the use of such integrative, constructed-response tasks (i.e., tasks which require the test-taker to draw upon information received through several modalities in support of a communicative function) dates back to at least the early 1960s. Carroll (1961, 1972) argued that tests which measure linguistic knowledge alone fail to predict the knowledge and abilities that score users are most likely to be interested in, i.e., actual use of language knowledge for communicative purposes in specific contexts.
Evaluating Quadratic Weighted Kappa as the Standard Performance Metric for Automated Essay Scoring
Automated Essay Scoring (AES) tools aim to improve the efficiency and consistency of essay scoring by using machine learning algorithms. In the existing research work on this topic, most researchers agree that human-automated score agreement remains the benchmark for assessing the accuracy of machine-generated scores. To measure the performance of AES models, the Quadratic Weighted Kappa (QWK) is commonly used as the evaluation metric. However, we have identified several limitations of using QWK as the sole metric for evaluating AES model performance. These limitations include its sensitivity to the rating scale, the potential for the so-called "kappa paradox" to occur, the impact of prevalence, the impact of the position of agreements in the diagonal agreement matrix, and its limitation in handling a large number of raters. Our findings suggest that relying solely on QWK as the evaluation metric for AES performance may not be sufficient. We further discuss insights into additional metrics to comprehensively evaluate the performance and accuracy of AES models.
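As a quick illustration of one limitation named in this abstract, the following is a minimal sketch (not taken from the paper) that computes QWK with scikit-learn and shows how the "kappa paradox" can arise: when one score level dominates, QWK can be near zero even though exact agreement is high. The score values are invented for illustration.

```python
# Minimal sketch (illustrative, not from the paper): QWK versus exact agreement
# on a hypothetical 1-4 essay-score scale with heavily skewed score prevalence.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 3, 3, 3, 3, 3, 3, 3, 2, 4])  # human rater scores
model = np.array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])  # AES model predicts all 3s

agreement = np.mean(human == model)                         # 0.80 exact agreement
qwk = cohen_kappa_score(human, model, weights="quadratic")  # 0.00 in this toy case

print(f"exact agreement: {agreement:.2f}")
print(f"quadratic weighted kappa: {qwk:.2f}")  # near zero despite high agreement
```

In this toy example QWK collapses to zero even though 80% of the scores match exactly, which is the prevalence effect the abstract refers to.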
Improving fairness in machine learning systems: What do industry practitioners need?
The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understanding of real-world needs. Through 35 semi-structured interviews and an anonymous survey of 267 ML practitioners, we conduct the first systematic investigation of commercial product teams' challenges and needs for support in developing fairer ML systems. We identify areas of alignment and disconnect between the challenges faced by industry practitioners and solutions proposed in the fair ML research literature. Based on these findings, we highlight directions for future ML and HCI research that will better address industry practitioners' needs.
Comment: To appear in the 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019).
Examining the Effects of Changes in Automated Rater Bias and Variability on Test Equating Solutions
Many studies have examined the quality of automated raters, but none have focused on the potential effects of systematic rater error on the psychometric properties of test scores. This simulation study examines the comparability of test scores under multiple rater bias and variability conditions, and addresses questions of their effects on test equating solutions. Effects are characterized by a comparison of equated and observed raw scores and estimates of examinee ability across the bias and variability scenarios. Findings suggest that the presence of, and changes in, rater bias and variability affect the equivalence of total raw scores, particularly at higher and lower ends of the score scale. The effects are shown to be larger where variability levels are higher, and, generally, where more constructed response items are used in the equating. Preliminary findings also suggest that consistently higher rater variability may have a slightly larger negative impact on the comparability of scores than does reducing rater bias and variability under the conditions examined here. Finally, a non-equivalent groups anchor test (NEAT) equating design may be slightly more robust to changes in rater bias and variability than a single group equating design for the bias scenarios investigated.
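As a rough, hypothetical companion to the kind of simulation described above (not the study's actual design), the sketch below perturbs an automated rater's bias and variability and shows how the resulting constructed-response score distributions shift relative to a baseline human rating; all function names and parameter values are invented for illustration.

```python
# Illustrative sketch (assumed setup, not the paper's simulation design):
# shift an automated rater's bias and noise and compare the score distributions.
import numpy as np

rng = np.random.default_rng(0)

n_examinees = 5000
true_score = rng.normal(0.0, 1.0, n_examinees)  # latent writing ability

def rate(theta, bias, sd, n_points=6):
    """Map ability to a 0..n_points-1 rubric score with rater bias and noise."""
    raw = theta + bias + rng.normal(0.0, sd, theta.shape)
    return np.clip(np.round((raw + 3) * (n_points - 1) / 6), 0, n_points - 1)

baseline = rate(true_score, bias=0.0, sd=0.30)   # original rating conditions
biased   = rate(true_score, bias=0.25, sd=0.30)  # automated rater drifts upward
noisy    = rate(true_score, bias=0.0, sd=0.60)   # automated rater gets noisier

for name, scores in [("baseline", baseline), ("biased", biased), ("noisy", noisy)]:
    print(f"{name:9s} mean={scores.mean():.2f} sd={scores.std():.2f}")
```

Even this toy setup makes the abstract's point visible: added bias moves the score means, while added variability widens the distribution, and either change would propagate into an equating solution built on those scores.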
Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring
Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic. Most existing AES systems assume that the essays to be graded come from the same prompt used in training, and they assign only a holistic score. However, such settings conflict with real educational situations: pre-graded essays for a particular prompt are lacking, and detailed trait scores for sub-rubrics are required. Thus, predicting various trait scores of unseen-prompt essays (called cross-prompt essay trait scoring) remains a challenge for AES. In this paper, we propose a robust model: a prompt- and trait relation-aware cross-prompt essay trait scorer. We encode a prompt-aware essay representation via essay-prompt attention and utilize a topic-coherence feature extracted by a topic-modeling mechanism without access to labeled data; therefore, our model considers the prompt adherence of an essay even in a cross-prompt setting. To facilitate multi-trait scoring, we design a trait-similarity loss that encapsulates the correlations among traits. Experiments prove the efficacy of our model, showing state-of-the-art results for all prompts and traits. Significant improvements on low-resource prompts and inferior traits further indicate our model's strength.
Comment: Accepted at ACL 2023 (Findings, long paper).
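The abstract does not spell out the trait-similarity loss, so the following is only a hedged PyTorch sketch of one plausible reading: a per-trait regression loss plus a term that penalizes disagreement between predicted scores of traits that correlate strongly in the training labels. The function name, the reference correlation matrix, and the alpha weight are all assumptions, not the authors' formulation.

```python
# Hedged sketch (one plausible reading of the abstract, not the authors' exact loss):
# multi-trait MSE plus a penalty that pulls predictions for strongly correlated
# traits toward each other, weighted by a reference trait-correlation matrix.
import torch

def trait_similarity_loss(pred, gold, trait_corr, alpha=0.1):
    """pred, gold: (batch, n_traits); trait_corr: (n_traits, n_traits);
    alpha: hypothetical weight for the similarity term."""
    mse = torch.mean((pred - gold) ** 2)
    diff = pred.unsqueeze(2) - pred.unsqueeze(1)       # (batch, T, T) pairwise gaps
    sim_penalty = torch.mean(trait_corr.clamp(min=0) * diff ** 2)
    return mse + alpha * sim_penalty

# Toy usage with random scores for 4 traits.
pred = torch.rand(8, 4)
gold = torch.rand(8, 4)
corr = torch.full((4, 4), 0.5).fill_diagonal_(1.0)     # stand-in correlation matrix
print(trait_similarity_loss(pred, gold, corr))
```

The design intuition is simply that if two rubric traits (say organization and coherence) usually move together in the gold labels, large gaps between their predicted scores should cost the model something.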
When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs
The use of machine learning (ML) models to assess and score textual data has become increasingly pervasive in an array of contexts including natural language processing, information retrieval, search and recommendation, and credibility assessment of online content. A significant disruption at the intersection of ML and text is the emergence of text-generating large language models such as generative pre-trained transformers (GPTs). We empirically assess the differences in how ML-based scoring models trained on human content assess the quality of content generated by humans versus GPTs. To do so, we propose an analysis framework that encompasses essay-scoring ML models, human- and ML-generated essays, and a statistical model that parsimoniously considers the impact of type of respondent, prompt genre, and the ML model used for assessment. A rich testbed is utilized that encompasses 18,460 human-generated and GPT-based essays. Results of our benchmark analysis reveal that transformer pretrained language models (PLMs) more accurately score human essay quality as compared to CNN/RNN and feature-based ML methods. Interestingly, we find that the transformer PLMs tend to score GPT-generated text 10-15% higher on average, relative to human-authored documents. Conversely, traditional deep learning and feature-based ML models score human text considerably higher. Further analysis reveals that although the transformer PLMs are exclusively fine-tuned on human text, they more prominently attend to certain tokens appearing only in GPT-generated text, possibly due to familiarity/overlap in pre-training. Our framework and results have implications for text classification settings where automated scoring of text is likely to be disrupted by generative AI.
Comment: Data available at: https://github.com/nd-hal/automated-ML-scoring-versus-generatio
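For readers who want to reproduce the flavor of this benchmark comparison, here is a minimal pandas sketch (hypothetical column names and numbers, not the released code at the repository above) that computes the mean score gap between GPT-generated and human-written essays for each scoring model.

```python
# Minimal sketch (hypothetical data, not the paper's released code): per-scorer
# mean score gap between GPT-generated and human-written essays.
import pandas as pd

# Hypothetical long-format results: one row per (essay, scoring model).
df = pd.DataFrame({
    "scorer": ["plm", "plm", "cnn", "cnn", "features", "features"],
    "source": ["human", "gpt", "human", "gpt", "human", "gpt"],
    "score":  [3.1, 3.5, 3.4, 3.0, 3.3, 2.9],
})

gap = (
    df.pivot_table(index="scorer", columns="source", values="score", aggfunc="mean")
      .assign(gpt_minus_human=lambda t: t["gpt"] - t["human"])
)
print(gap)  # positive gpt_minus_human means the scorer favors GPT text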
Professional Judgment in an Era of Artificial Intelligence and Machine Learning
Though artificial intelligence (AI) in healthcare and education now accomplishes diverse tasks, there are two features that tend to unite the information processing behind efforts to substitute it for professionals in these fields: reductionism and functionalism. True believers in substitutive automation tend to model work in human services by reducing the professional role to a set of behaviors initiated by some stimulus, which are intended to accomplish some predetermined goal, or maximize some measure of well-being. However, true professional judgment hinges on a way of knowing the world that is at odds with the epistemology of substitutive automation. Instead of reductionism, an encompassing holism is a hallmark of professional practice—an ability to integrate facts and values, the demands of the particular case and prerogatives of society, and the delicate balance between mission and margin. Any presently plausible vision of substituting AI for education and healthcare professionals would necessitate a corrosive reductionism. The only way these sectors can progress is to maintain, at their core, autonomous professionals capable of carefully intermediating between technology and the patients it would help treat, or the students it would help learn.