
    Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses

    Deep-learning based Automatic Essay Scoring (AES) systems are being actively used in various high-stakes applications in education and testing. However, little research has been devoted to understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. Previous studies indicate that scoring models can be easily fooled; in this paper, we explore the reasons behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., a large change in output score with a small change in input essay content) and overstability (i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite being trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without requiring any context, making the models largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that they encode rich linguistic features such as parts of speech and morphology. Further, we find that the models have learnt dataset biases, making them oversensitive: the presence of a few words that co-occur strongly with a certain score class makes the model associate the essay with that score. This causes score changes in ∼95% of samples with the addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.
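    As a rough illustration of the two probes described in this abstract (not the paper's actual models, datasets, or attacks), the following minimal Python sketch measures how much a scorer's output moves under a large deletion (overstability) versus a tiny trigger-word insertion (oversensitivity); score_essay is a hypothetical stand-in for a trained AES model.

```python
# Minimal probe sketch; score_essay is a hypothetical stand-in for a trained AES model.
import random

def score_essay(text: str) -> float:
    # Toy scorer that rewards length and a few cue words, loosely mimicking
    # the bag-of-words behaviour the paper reports for real AES models.
    cue_words = {"moreover", "consequently", "furthermore"}
    tokens = text.lower().split()
    return min(6.0, 1.0 + 0.01 * len(tokens) + sum(t in cue_words for t in tokens))

def overstability_delta(essay: str, drop_fraction: float = 0.8) -> float:
    """Delete most of the essay; a small score change despite a large edit
    indicates overstability."""
    kept = [t for t in essay.split() if random.random() > drop_fraction]
    return abs(score_essay(essay) - score_essay(" ".join(kept)))

def oversensitivity_delta(essay: str, triggers=("moreover", "consequently")) -> float:
    """Append a few trigger words; a large score change from a tiny edit
    indicates oversensitivity."""
    return abs(score_essay(essay) - score_essay(essay + " " + " ".join(triggers)))

if __name__ == "__main__":
    essay = "School uniforms reduce distraction because students focus on learning. " * 20
    print("overstability delta:", overstability_delta(essay))
    print("oversensitivity delta:", oversensitivity_delta(essay))
```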

    Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring

    Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic. Most existing AES systems assume that the essays to be graded come from the same prompt used in training, and they assign only a holistic score. However, such settings conflict with real educational situations: pre-graded essays for a particular prompt are often lacking, and detailed trait scores on sub-rubrics are required. Thus, predicting various trait scores for unseen-prompt essays (called cross-prompt essay trait scoring) remains a challenge for AES. In this paper, we propose a robust model: a prompt- and trait relation-aware cross-prompt essay trait scorer. We encode a prompt-aware essay representation through essay-prompt attention and by utilizing a topic-coherence feature extracted by a topic-modeling mechanism without access to labeled data; therefore, our model considers the prompt adherence of an essay even in a cross-prompt setting. To facilitate multi-trait scoring, we design a trait-similarity loss that encapsulates the correlations among traits. Experiments demonstrate the efficacy of our model, showing state-of-the-art results for all prompts and traits. Significant improvements on low-resource prompts and inferior traits further indicate our model's strength. Comment: Accepted at ACL 2023 (Findings, long paper).
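    The trait-similarity loss is described only at a high level in this abstract; one plausible reading is a penalty on the gap between the correlation structures of predicted and gold trait scores. The PyTorch sketch below implements that reading as an assumption, not the paper's exact formulation.

```python
# Hedged sketch of a trait-similarity style loss (an assumed formulation, not
# necessarily the paper's): penalise the gap between the correlation matrices
# of predicted and gold trait scores across a batch of essays.
import torch

def correlation_matrix(scores: torch.Tensor) -> torch.Tensor:
    """scores: (batch, n_traits) -> (n_traits, n_traits) Pearson correlations."""
    centered = scores - scores.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (scores.shape[0] - 1)
    std = scores.std(dim=0).clamp_min(1e-8)
    return cov / (std.unsqueeze(0) * std.unsqueeze(1))

def trait_similarity_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Encourage predictions to preserve the correlation structure among traits."""
    return torch.mean((correlation_matrix(pred) - correlation_matrix(gold)) ** 2)

# Example: a batch of 16 essays scored on 4 hypothetical traits.
pred = torch.rand(16, 4, requires_grad=True)
gold = torch.rand(16, 4)
loss = torch.nn.functional.mse_loss(pred, gold) + 0.1 * trait_similarity_loss(pred, gold)
loss.backward()
```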

    A Statistical Approach to Automatic Essay Scoring

    Taking into consideration the escalating need for testing writing ability and the potential of Automatic Essay Scoring (AES) to support writing instruction and evaluation, the aim of the present study is to explore the relationship between stylometric indices, widely used in AES systems, and the degree of sophistication of learner essays, as captured by the scores provided by expert human raters. The data analyzed were obtained from a recently organized public AES competition and comprise persuasive essays written in the context of public schools in the United States. The stylometric information taken into consideration focuses mainly on measures of cohesion, as well as lexical diversity and syntactic sophistication. Results indicate a clear relationship between quantifiable features of learners' written responses and the impression they made on expert raters. This observation reinforces the importance of pursuing further experimentation into AES, which could yield significant educational and social benefits.
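    As a toy illustration of this kind of stylometric analysis (the essays, scores, and indices below are invented and far simpler than those used in the study), one can correlate simple proxies for lexical diversity and syntactic sophistication with human scores:

```python
# Toy stylometric correlation sketch; essays, scores, and indices are invented.
from scipy.stats import pearsonr

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(text: str) -> float:
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0.0

essays = [
    "School should start later. Students are tired. They sleep in class.",
    "Starting school later would help students because tired students cannot concentrate.",
    "Although some argue that earlier start times build discipline, research on adolescent "
    "sleep suggests that a later schedule improves attention, mood, and overall achievement.",
]
human_scores = [2, 3, 5]  # hypothetical scores from expert raters

print("TTR vs score:", pearsonr([type_token_ratio(e) for e in essays], human_scores))
print("Sentence length vs score:", pearsonr([mean_sentence_length(e) for e in essays], human_scores))
```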

    The role of feedback in the processes and outcomes of academic writing in English as a foreign language at intermediate and advanced levels

    Providing feedback on students' texts is one of the essential components of teaching second language writing. However, whether and to what extent students benefit from feedback has been an issue of considerable debate in the literature. While many researchers have stressed its importance, others have expressed doubts about its effectiveness. Despite these long-standing debates, instructors consider feedback a worthwhile pedagogical practice for second language learning. Based on this premise, I conducted three experimental studies to investigate the role of written feedback in Myanmar and Hungarian tertiary EFL classrooms. Additionally, I studied syntactic features and language-related error patterns in Hungarian and Myanmar students' writing, in order to understand how students with different levels of writing proficiency acted upon teacher and automated feedback. The first study examined the efficacy of feedback on Myanmar students' writing over a 13-week semester and how automated feedback provided by Grammarly could be integrated into writing instruction as an assistance tool for writing teachers. Results from pre- and post-tests demonstrated that students' writing performance improved on four assessment criteria: task achievement, coherence and cohesion, grammatical range and accuracy, and lexical range and accuracy. Further results from a written feedback analysis revealed that the free version of Grammarly provided feedback on lower-level writing issues such as articles and prepositions, whereas teacher feedback covered both lower- and higher-level writing concerns. These findings suggested a potential for integrating automated feedback into writing instruction. As limited attention had been given to how feedback influences aspects of writing development beyond accuracy, the second study examined how feedback influences the syntactic complexity of Myanmar students' essays. Results from paired-samples t-tests revealed no significant differences in the syntactic complexity of students' writing, whether the comparison was made between initial and revised texts or between pre- and post-tests. These findings suggested that, although feedback did not produce syntactic complexity gains, it also did not lead students to write less structurally complex texts. The syntactic complexity of students' revised texts varied among high-, mid-, and low-achieving students; these variations could be attributed to proficiency levels, writing prompts, genre differences, and feedback sources. The rationale for the third study was the theoretical orientation that differential success in learning from feedback depends largely on learners' engagement with the feedback rather than on the feedback itself. Along these lines, I examined Hungarian students' behavioural engagement (i.e., students' uptake or revisions prompted by written feedback) with teacher and automated feedback in an EFL writing course. In addition to the form-focused feedback examined in the first study, I considered meaning-focused feedback, as feedback in a writing course typically covers both linguistic and rhetorical aspects of writing. The results showed differences in feedback focus (the teacher provided both form- and meaning-focused feedback) with unexpected outcomes: students' uptake of feedback reflected moderate to low levels of engagement. Participants incorporated more form-focused feedback than meaning-focused feedback into their revisions. These findings contribute to our understanding of students' engagement with writing tasks, levels of trust, and the possible impact of students' language proficiency on their engagement with feedback. Following the finding that Myanmar and Hungarian students responded to feedback on their writing differently, I designed a follow-up study to compare syntactic features of their writing as indices of their English writing proficiency. In addition, I examined language-related errors in their texts to capture differences in the error patterns of the two groups. Results from paired-samples t-tests showed that most syntactic complexity indices distinguished the essays produced by the two groups: length of production units, sentence complexity, and subordination indices. Similarly, statistically significant differences were found in language-related error patterns: errors were more prevalent in Myanmar students' essays. The implications for research and pedagogical practices in EFL writing classes are discussed with reference to the rationale for each study.
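    The paired-samples comparisons reported above can be illustrated with a small sketch; the complexity index and values below are invented, not the dissertation's data.

```python
# Illustrative paired-samples t-test on a single syntactic complexity index
# (invented values, not the dissertation's data).
from scipy.stats import ttest_rel

# One complexity value per student: mean sentence length in the initial draft
# and in the draft revised after feedback.
initial_drafts = [11.2, 9.8, 13.5, 10.1, 12.0, 9.4, 11.7, 10.9]
revised_drafts = [11.6, 9.5, 13.9, 10.4, 11.8, 9.9, 12.1, 11.0]

t_stat, p_value = ttest_rel(initial_drafts, revised_drafts)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A non-significant p-value would be consistent with the reported finding that
# feedback neither increased nor reduced the structural complexity of revisions.
```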

    Automated Essay Evaluation Using Natural Language Processing and Machine Learning

    The goal of automated essay evaluation is to assign grades to essays and provide feedback using computers. Automated evaluation is increasingly being used in classrooms and online exams. The aim of this project is to develop machine learning models for performing automated essay scoring and to evaluate their performance. In this research, a publicly available essay dataset was used to train and test the efficacy of the adopted techniques. Natural language processing techniques were used to extract features from the essays in the dataset, and three different existing machine learning algorithms were applied to it. The data were divided into two parts: training data and testing data. The inter-rater reliability and performance of these models were compared with each other and with human graders. Among the three machine learning models, the random forest performed best in terms of agreement with human scorers, as it achieved the lowest mean absolute error on the test dataset.
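    A minimal sketch of this kind of pipeline is shown below, assuming a handful of generic hand-crafted features (length, lexical diversity, mean sentence length) rather than the project's actual feature set, and using toy essays and scores:

```python
# Toy pipeline sketch: hand-crafted NLP features + random forest, scored by MAE.
# The features, essays, and scores are illustrative, not the project's data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def extract_features(essay: str) -> list[float]:
    tokens = essay.lower().split()
    n_sentences = max(1, essay.count(".") + essay.count("!") + essay.count("?"))
    return [
        float(len(tokens)),                       # essay length
        len(set(tokens)) / max(1, len(tokens)),   # lexical diversity
        len(tokens) / n_sentences,                # mean sentence length
    ]

essays = ["Short essay. Few words.",
          "A longer essay with more varied vocabulary and several sentences. "
          "It develops an argument. It offers supporting evidence."] * 10
scores = [1, 4] * 10  # toy human scores

X = np.array([extract_features(e) for e in essays])
y = np.array(scores)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("MAE on held-out essays:", mean_absolute_error(y_test, model.predict(X_test)))
```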

    Machine and expert judgments of student perceptions of teaching behavior in secondary education: Added value of topic modeling with big data

    Research shows that effective teaching behavior is important for students' learning and outcomes, and scholars have developed various instruments for measuring domains of effective teaching behavior. Although student assessments are frequently used for evaluating teaching behavior, they are mainly in Likert-scale or categorical forms, which precludes students from freely expressing their perceptions of teaching. Drawing on an open-ended questionnaire from large-scale student surveys, this study uses a machine learning tool to extract teaching behavior topics from students' open-ended answers at scale and to test the convergent validity of the outcomes by comparing them with theory-driven manual coding outcomes based on expert judgments. We applied latent Dirichlet allocation (LDA) topic modeling, together with a visualization tool (LDAvis), to qualitative data collected from 173,858 secondary education students in the Netherlands. This data-driven machine learning analysis yielded eight topics corresponding to teaching behavior domains: Clear explanation, Student-centered supportive learning climate, Lesson variety, Likable characteristics of the teacher, Evoking interest, Monitoring understanding, Inclusiveness and equity, and Lesson objectives and formative assessment. In addition, we subjected 864 randomly selected student responses from the same dataset to manual coding and performed theory-driven content analysis, which resulted in nine teaching behavior domains and 19 sub-domains. Results suggest that machine learning and human analysis are complementary: by comparing the bottom-up (machine learning) and top-down (content analysis) approaches, we found that the proposed topic modeling approach reveals unique domains of teaching behavior, while the overlapping topics confirm the validity of the topic modeling outcomes.
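    The LDA step can be sketched as follows, using scikit-learn as a stand-in for whatever toolkit the study used; the student responses are invented examples and the number of topics is far smaller than in the actual study.

```python
# LDA sketch with scikit-learn as a stand-in toolkit; responses are invented
# examples and the number of topics is far smaller than in the actual study.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "the teacher explains clearly and gives good examples",
    "lessons are varied and the teacher keeps us interested",
    "she checks whether everyone understands before moving on",
    "the teacher is friendly and treats all students equally",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(responses)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top_terms)}")
# In the study, topics like these were compared against theory-driven manual
# coding to assess convergent validity.
```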