Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input
We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well suited to capturing adversarially crafted input consisting of grammatical but incoherent sequences of sentences. We develop a neural model of local coherence that can effectively learn connectedness features between sentences, and propose a framework for integrating and jointly training the local coherence model with a state-of-the-art AES model. We evaluate our approach against a number of baselines and experimentally demonstrate its effectiveness on both the AES task and the task of flagging adversarial input, further contributing to the development of an approach that strengthens the validity of neural essay scoring models.
Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses
Deep-learning-based Automatic Essay Scoring (AES) systems are being actively used in various high-stakes applications in education and testing. However, little research has gone into understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. While previous studies indicate that scoring models can be easily fooled, in this paper, we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., large change in output score with a little change in input essay content) and overstability (i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite being trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without the requirement of any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that rich linguistic features such as parts-of-speech and morphology are encoded by them. Further, we also find that the models have learnt dataset biases, making them oversensitive. The presence of a few words with high co-occurrence with a certain score class makes the model associate the essay sample with that score. This causes score changes in ∼95% of samples with the addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.
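As a concrete illustration of the word–score co-occurrence statistic described above, the following is a minimal sketch (toy data only; not the paper's actual model, dataset, or detection method) of how one might surface words strongly associated with a single score class:

```python
from collections import Counter, defaultdict

def class_association(essays, scores):
    """For each word, find the score class it co-occurs with most often,
    and the fraction of essays containing it that fall in that class."""
    word_class = defaultdict(Counter)
    for text, score in zip(essays, scores):
        # set() so each essay counts a word at most once (document frequency)
        for w in set(text.lower().split()):
            word_class[w][score] += 1
    assoc = {}
    for w, counts in word_class.items():
        best_class, best_n = counts.most_common(1)[0]
        assoc[w] = (best_class, best_n / sum(counts.values()))
    return assoc

# Toy corpus: "nuanced" appears only in high-scoring essays,
# "detail" only in low-scoring ones.
essays = [
    "a nuanced argument with evidence",
    "nuanced and well structured prose",
    "short answer with no detail",
    "no detail and weak structure",
]
scores = [5, 5, 2, 2]
assoc = class_association(essays, scores)
# assoc["nuanced"] -> (5, 1.0): perfectly associated with the top class
```

Words whose association fraction is close to 1.0 are exactly the kind of dataset-bias triggers the abstract describes: inserting a few of them can shift a bag-of-words-like scorer toward their associated class.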
Neural approaches to discourse coherence: modeling, evaluation and application
Discourse coherence is an important aspect of text quality that refers to the way different textual units relate to each other. In this thesis, I investigate neural approaches to modeling discourse coherence. I present a multi-task neural network where the main task is to predict a document-level coherence score and the secondary task is to learn word-level syntactic features. Additionally, I examine the effect of using contextualised word representations in single-task and multi-task setups. I evaluate my models on a synthetic dataset where incoherent documents are created by shuffling the sentence order in coherent original documents. The results show the efficacy of my multi-task learning approach, particularly when enhanced with contextualised embeddings, achieving new state-of-the-art results in ranking the coherent documents higher than the incoherent ones (96.9%). Furthermore, I apply my approach to the realistic domain of people’s everyday writing, such as emails and online posts, and further demonstrate its ability to capture various degrees of coherence. In order to further investigate the linguistic properties captured by coherence models, I create two datasets that exhibit syntactic and semantic alterations. Evaluating different models on these datasets reveals their ability to capture syntactic perturbations but their inability to detect semantic changes. I find that semantic alterations are instead captured by models that first build sentence representations from averaged word embeddings, then apply a set of linear transformations over input sentence pairs. Finally, I present an application for coherence models in the pedagogical domain. I first demonstrate that state-of-the-art neural approaches to automated essay scoring (AES) are not robust to adversarially created, grammatical, but incoherent sequences of sentences.
Accordingly, I propose a framework for integrating and jointly training a coherence model with a state-of-the-art neural AES system in order to enhance its ability to detect such adversarial input. I show that this joint framework maintains a performance comparable to the state-of-the-art AES system in predicting a holistic essay score while significantly outperforming it in adversarial detection.
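The synthetic evaluation protocol described in this abstract (shuffle sentences to create incoherent documents, then check whether the model ranks the coherent original higher) can be sketched as follows. This is an illustrative sketch, not the thesis's code; `make_incoherent` and `pairwise_ranking_accuracy` are hypothetical helper names:

```python
import random

def make_incoherent(sentences, rng):
    """Return a shuffled copy of a sentence list that differs from the
    original order -- the synthetic 'incoherent' counterpart document."""
    if len(sentences) < 2:
        return sentences[:]  # a one-sentence document cannot be reordered
    shuffled = sentences[:]
    while shuffled == sentences:
        rng.shuffle(shuffled)
    return shuffled

def pairwise_ranking_accuracy(pairs, coherence_score):
    """Fraction of (coherent, incoherent) pairs for which the model
    assigns the coherent document a strictly higher score."""
    correct = sum(coherence_score(c) > coherence_score(i) for c, i in pairs)
    return correct / len(pairs)

rng = random.Random(0)
coherent = [
    "The plan has three steps.",
    "First, we collect data.",
    "Then we train a model.",
    "Finally, we evaluate it.",
]
incoherent = make_incoherent(coherent, rng)
```

The 96.9% figure reported above is exactly this pairwise ranking accuracy computed over a large set of (original, shuffled) document pairs.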
Automated Short Essay Scoring Based on Semantic Similarity with SBERT
Essay-based exams are considered better at measuring understanding than multiple-choice questions. However, essay answers require more time and effort to evaluate, and inconsistencies often occur. An automated essay scoring system is therefore needed to help evaluators assign scores more quickly and more consistently. This study aims to evaluate the performance of an automated essay scoring model in which the test answer text and the answer key are compared semantically to determine how similar they are. The semantics of the essay text are obtained through word embeddings using the pretrained Siamese-BERT (SBERT) language model, which transforms the essay text into a vector of length 512. The automated scoring process begins with text preprocessing by applying case folding, followed by word embeddings on the preprocessed text with SBERT. The numeric vectors of the answer key and the test answer produced by the word embeddings are then compared with Cosine Similarity to obtain the semantic similarity, which also serves as the essay score output by the model. The model is evaluated by comparing its scores with those given by human evaluators. The metrics used to measure performance are Mean Absolute Error (MAE) and Pearson Correlation; the results show an average MAE of 0.26 and an average correlation of 0.78.
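The comparison-and-evaluation step of the pipeline described above can be sketched in plain Python. In the actual system the vectors come from a pretrained SBERT model (e.g. via a sentence-embedding library); here short toy vectors stand in for the 512-dimensional embeddings, and the function names are illustrative:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors; used directly
    as the essay score in the pipeline described above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mae(pred, gold):
    """Mean Absolute Error between model scores and human scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

def pearson(pred, gold):
    """Pearson correlation between model scores and human scores."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = sqrt(sum((p - mp) ** 2 for p in pred))
    sg = sqrt(sum((g - mg) ** 2 for g in gold))
    return cov / (sp * sg)

# Toy stand-ins for SBERT embeddings of the answer key and a student answer.
key_vec = [0.2, 0.8, 0.1]
answer_vec = [0.25, 0.7, 0.15]
score = cosine_similarity(key_vec, answer_vec)  # the model's essay score
```

A batch of such scores would then be compared against the human evaluators' scores with `mae` and `pearson`, which is how the reported averages of 0.26 (MAE) and 0.78 (correlation) were obtained.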