
    Analysis of Discourse Structure and Logical Structure in Argumentative Essays

    Doctoral thesis (Doctor of Information Science), Tohoku University

    Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses

    Deep-learning based Automatic Essay Scoring (AES) systems are being actively used in various high-stakes applications in education and testing. However, little research has been devoted to understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. Previous studies indicate that scoring models can be easily fooled; in this paper, we explore the reasons behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., large change in output score with a small change in input essay content) and overstability (i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite being trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without requiring any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that they encode rich linguistic features such as parts of speech and morphology. Further, we find that the models have learnt dataset biases, making them oversensitive: the presence of a few words that co-occur strongly with a certain score class makes the model associate the essay with that score, causing score changes in ∼95% of samples when only a few words are added. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models detect unusual attribution patterns and flag adversarial samples successfully.
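    As a rough illustration of the two failure modes described above, the sketch below probes a black-box scorer by perturbing essays and comparing scores. The `score_essay` callable, the trigger words, and the thresholds are hypothetical placeholders, not the authors' detection models.

```python
# Minimal probing sketch (not the paper's detection models).
# `score_essay` is a hypothetical black-box AES scorer, e.g. a BERT regressor.
import random

def oversensitivity_probe(essay, score_essay, trigger_words, threshold=1.0):
    """Flag an essay whose score jumps after appending a few trigger words."""
    base = score_essay(essay)
    perturbed = essay + " " + " ".join(trigger_words)
    return abs(score_essay(perturbed) - base) >= threshold

def overstability_probe(essay, score_essay, drop_ratio=0.5, threshold=0.5, seed=0):
    """Flag an essay whose score barely moves after deleting half of its words."""
    rng = random.Random(seed)
    kept = [w for w in essay.split() if rng.random() > drop_ratio]
    return abs(score_essay(" ".join(kept)) - score_essay(essay)) < threshold
```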

    Tapping the Potential of Coherence and Syntactic Features in Neural Models for Automatic Essay Scoring

    In the prompt-specific holistic score prediction task for Automatic Essay Scoring, the general approaches include pre-trained neural models, coherence models, and hybrid models that incorporate syntactic features into neural models. In this paper, we propose a novel approach to extracting and representing essay coherence features with prompt-learning NSP; it matches the state-of-the-art AES coherence model and achieves the best performance for long essays. We apply dense embeddings of syntactic features to augment a BERT-based model and achieve the best performance among hybrid methods for AES. In addition, we explore various ways of combining coherence, syntactic information, and semantic embeddings, which no previous study has done before. Our combined model also performs better than the existing state of the art for combined models, even though it does not outperform our syntactically enhanced neural model. We further offer analyses that can be useful for future study. Comment: Accepted to the 2022 International Conference on Asian Language Processing (IALP).
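    The coherence features above build on BERT's next-sentence prediction (NSP) objective via prompt learning; the sketch below shows only a vanilla NSP version of that idea, scoring how strongly each sentence is predicted to follow the previous one and averaging over the essay. The model choice and averaging scheme are assumptions for illustration.

```python
# Vanilla NSP coherence sketch (the paper uses a prompt-learning variant).
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").eval()

def nsp_coherence(sentences):
    """Average probability that each sentence is a continuation of the previous one."""
    probs = []
    for prev, nxt in zip(sentences, sentences[1:]):
        enc = tokenizer(prev, nxt, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits          # index 0 = "is next", 1 = "random"
        probs.append(torch.softmax(logits, dim=-1)[0, 0].item())
    return sum(probs) / max(len(probs), 1)

print(nsp_coherence(["The essay opens with a claim.",
                     "It then supports the claim with evidence.",
                     "Penguins are flightless birds."]))
```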

    Neural approaches to discourse coherence: modeling, evaluation and application

    Discourse coherence is an important aspect of text quality that refers to the way different textual units relate to each other. In this thesis, I investigate neural approaches to modeling discourse coherence. I present a multi-task neural network where the main task is to predict a document-level coherence score and the secondary task is to learn word-level syntactic features. Additionally, I examine the effect of using contextualised word representations in single-task and multi-task setups. I evaluate my models on a synthetic dataset where incoherent documents are created by shuffling the sentence order in coherent original documents. The results show the efficacy of my multi-task learning approach, particularly when enhanced with contextualised embeddings, achieving new state-of-the-art results in ranking the coherent documents higher than the incoherent ones (96.9%). Furthermore, I apply my approach to the realistic domain of people’s everyday writing, such as emails and online posts, and further demonstrate its ability to capture various degrees of coherence. In order to further investigate the linguistic properties captured by coherence models, I create two datasets that exhibit syntactic and semantic alterations. Evaluating different models on these datasets reveals their ability to capture syntactic perturbations but their inadequacy in detecting semantic changes. I find that semantic alterations are instead captured by models that first build sentence representations from averaged word embeddings, then apply a set of linear transformations over input sentence pairs. Finally, I present an application for coherence models in the pedagogical domain. I first demonstrate that state-of-the-art neural approaches to automated essay scoring (AES) are not robust to adversarially created, grammatical, but incoherent sequences of sentences. Accordingly, I propose a framework for integrating and jointly training a coherence model with a state-of-the-art neural AES system in order to enhance its ability to detect such adversarial input. I show that this joint framework maintains performance comparable to the state-of-the-art AES system in predicting a holistic essay score while significantly outperforming it in adversarial detection.
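    The shuffle-based evaluation described above can be sketched as follows: each coherent document is paired with a sentence-permuted copy, and a model is scored by how often it ranks the original higher. The `coherence_score` function is a hypothetical stand-in for any document-level coherence model.

```python
# Sketch of the synthetic shuffle-based ranking evaluation.
import random

def make_shuffled(sentences, seed=0):
    """Create an 'incoherent' version of a document by permuting its sentences."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

def ranking_accuracy(documents, coherence_score):
    """Fraction of documents ranked above their shuffled counterpart."""
    wins = sum(coherence_score(d) > coherence_score(make_shuffled(d)) for d in documents)
    return wins / len(documents)
```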

    Advancement Auto-Assessment of Students Knowledge States from Natural Language Input

    Knowledge assessment is a key element in adaptive instructional systems, and in particular in Intelligent Tutoring Systems (ITSs), because fully adaptive tutoring presupposes accurate assessment. However, this is a challenging research problem, as numerous factors affect the estimation of a student’s knowledge state, such as the difficulty level of the problem and the time spent solving it. In this research work, we tackle the problem from three perspectives: assessing students’ prior knowledge, assessing students’ short and long natural language responses, and knowledge tracing. Prior knowledge assessment is an important component of knowledge assessment, as it facilitates adapting the instruction from the very beginning, i.e., when the student starts interacting with the (computer) tutor. Grouping students with similar mental models and prior-knowledge patterns allows the system to select the right level of scaffolding for each group. While this does not adapt instruction to each individual learner, adapting to groups of students based on a limited number of prior-knowledge levels has the advantage of decreasing the authoring costs of the tutoring system. To identify or cluster students based on their prior knowledge, we have employed effective clustering algorithms. Automatically assessing open-ended student responses is another challenging aspect of knowledge assessment in ITSs. In dialogue-based ITSs, the main interaction between the learner and the system is natural language dialogue, in which students freely respond to various system prompts or initiate dialogue moves in mixed-initiative dialogue systems. Assessing freely generated student responses in such contexts is challenging, as students can express the same idea in different ways owing to different individual style preferences and varied cognitive abilities. To address this task, we have proposed several novel deep learning models, as they are capable of capturing rich high-level semantic features of text. Knowledge tracing (KT) is an important type of knowledge assessment which consists of tracking students’ mastery of knowledge over time and predicting their future performance. Despite the state-of-the-art results of deep learning on this task, it still has many limitations; for instance, most of the proposed methods ignore pertinent information (e.g., prior knowledge) that could enhance knowledge tracing. Working toward this objective, we have proposed a generic deep learning framework that accounts for the engagement level of students, the difficulty of questions, and the semantics of questions, and uses a novel time series model, the Temporal Convolutional Network, for future performance prediction. The advanced auto-assessment methods presented in this dissertation should enable better estimation of learners’ knowledge states and, in turn, better adaptive scaffolding, which should lead to more effective tutoring and better learning gains for students. Furthermore, the proposed methods should enable more scalable development and deployment of ITSs across topics and domains for the benefit of learners of all ages and backgrounds.
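    The prior-knowledge grouping step could look like the sketch below. The abstract only says that effective clustering algorithms were employed, so the choice of k-means, the use of binary pre-test item scores as features, and the number of groups are all assumptions for illustration.

```python
# Assumed setup: cluster students into two prior-knowledge groups from pre-test items.
import numpy as np
from sklearn.cluster import KMeans

pretest = np.array([      # rows = students, columns = pre-test items (1 = correct)
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
])
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pretest)
print(groups)             # prior-knowledge group used to pick a scaffolding level
```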

    Improving fairness in machine learning systems: What do industry practitioners need?

    The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understanding of real-world needs. Through 35 semi-structured interviews and an anonymous survey of 267 ML practitioners, we conduct the first systematic investigation of commercial product teams' challenges and needs for support in developing fairer ML systems. We identify areas of alignment and disconnect between the challenges faced by industry practitioners and the solutions proposed in the fair ML research literature. Based on these findings, we highlight directions for future ML and HCI research that will better address industry practitioners' needs. Comment: To appear in the 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019).

    A comparison of various machine learning algorithms and execution of flask deployment on essay grading

    Students’ performance can be assessed by grading the answers they write during examinations. Currently, students are assessed manually by teachers, which is a cumbersome task due to the increasing student-teacher ratio. Moreover, owing to the coronavirus disease (COVID-19) pandemic, most educational institutions have adopted online teaching and assessment. To measure the learning ability of a student, we need to assess them. The current grading system works well for multiple-choice questions, but there is no grading system for evaluating essays. In this paper, we studied different machine learning and natural language processing techniques for automated essay scoring/grading (AES/G). Data imbalance, i.e., the uneven distribution of essay scores in the training data, makes score prediction difficult. We handled this issue using a random oversampling technique, which produces an even distribution of essay scores. We also built a web application using Flask and deployed the machine learning models. All the models were then evaluated using accuracy, precision, recall, and F1-score. We found that the random forest algorithm outperformed the other algorithms with an accuracy of 97.67%, precision of 97.62%, recall of 97.67%, and F1-score of 97.58%.
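    The core pipeline described above (balance the skewed score distribution with random oversampling, train a random forest, report accuracy/precision/recall/F1) can be sketched as follows. The synthetic feature matrix is a stand-in for the paper's essay features, which are not specified here, and the Flask deployment step is omitted.

```python
# Sketch of oversampling + random forest scoring with the reported metrics.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Toy stand-in for essay feature vectors X with imbalanced score labels y.
X, y = make_classification(n_samples=300, n_classes=4, n_informative=8,
                           weights=[0.55, 0.25, 0.15, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Duplicate minority-score essays until every score class is equally represented.
X_bal, y_bal = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_bal, y_bal)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, pred, average="weighted")
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```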

    LFTK: Handcrafted Features in Computational Linguistics

    Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with the problem of inconsistent implementation across research works, there has been no categorization scheme or generally accepted feature names, which creates unwanted confusion. Also, most existing handcrafted feature extraction libraries are not open-source or are not actively maintained. As a result, a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded in past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system for public access to a rich set of pre-implemented handcrafted features. Our system is coined LFTK and is the largest of its kind. Find it at github.com/brucewlee/lftk. Comment: BEA @ ACL 2023.
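    For a sense of what a handcrafted linguistic feature is, the sketch below computes two classic ones (average sentence length and type-token ratio) from scratch in plain Python. It deliberately does not use or imitate LFTK's own API; see the repository linked above for the actual library.

```python
# From-scratch illustration of two classic handcrafted features (not the LFTK API).
import re

def handcrafted_features(text):
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
    }

print(handcrafted_features("Readability varies. Handcrafted features describe it simply."))
```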

    Assessing text readability and quality with language models

    Automatic readability assessment is considered a challenging task in NLP due to its high degree of subjectivity. The majority of prior work on assessing readability has focused on identifying the level of education necessary for comprehension, without considering text quality, i.e., how naturally the text flows from the perspective of a native speaker. In this thesis, we therefore aim to use language models, trained on well-written prose, to measure not only text readability in terms of comprehension but also text quality. We developed two word-level metrics based on the concordance of article text with predictions made using language models, and use them to assess text readability and quality. We evaluate both metrics on a set of corpora used for readability assessment or automated essay scoring (AES) by measuring the correlation between the scores assigned by our metrics and human raters. According to the experimental results, our metrics are strongly correlated with text quality, achieving correlations of 0.4-0.6 on 7 out of 9 datasets. We demonstrate that GPT-2 surpasses other language models, including a bigram model, an LSTM, and a bidirectional LSTM, on the task of estimating text quality in a zero-shot setting, and that a GPT-2 perplexity-based measure is a reasonable indicator of text quality.
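    The perplexity-based indicator mentioned at the end can be sketched as follows. The thesis defines its own word-level concordance metrics; this shows only the standard zero-shot GPT-2 perplexity computation that such measures build on.

```python
# Zero-shot GPT-2 perplexity as a rough text-quality indicator.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(enc.input_ids, labels=enc.input_ids).loss  # mean token NLL
    return torch.exp(loss).item()

# Lower perplexity suggests prose that reads more like the well-written training text.
print(perplexity("The committee approved the proposal after a short discussion."))
```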