
    A robust methodology for automated essay grading

    None of the available automated essay grading systems can be used to grade essays according to the National Assessment Program – Literacy and Numeracy (NAPLAN) analytic scoring rubric used in Australia. This thesis addresses that limitation: its objective is to develop a robust methodology for automatically grading essays against the NAPLAN rubric, using English-language heuristics and rules together with neural network modelling.

    Semantics-based automated essay evaluation

    Automated essay evaluation (AEE) is a widely used practical solution for replacing time-consuming manual grading of student essays. Automated systems are used in combination with human graders in classrooms as well as in various high-stakes assessments. During the 50 years since the field began, many challenges have arisen, including how to evaluate semantic content, provide automated feedback, determine the reliability of grades, and make the field more "exposed". In this dissertation we address several of these challenges and propose novel solutions for semantics-based essay evaluation. Most AEE research has been conducted by commercial organizations that protect their investments by releasing proprietary systems whose details are not publicly available. We provide as detailed a comparison as possible of 20 state-of-the-art approaches to automated essay evaluation, and we propose a new automated essay evaluation system named SAGE (Semantic Automated Grader for Essays) with all technological details revealed to the scientific community. Lack of consideration of text semantics is one of the main weaknesses of existing state-of-the-art systems. We address the evaluation of essay semantics from the perspectives of essay coherence and semantic error detection. Coherence describes the flow of information in an essay and allows us to evaluate connections within the discourse. We propose two groups of coherence attributes: those obtained in a high-dimensional semantic space and those obtained from sentence-similarity networks. Furthermore, we propose the Automated Error Detection (AED) system, which evaluates essay semantics from the perspective of consistency: it detects semantic errors using information extraction and logical reasoning, and can provide semantic feedback to the writer.
The proposed system SAGE achieves significantly higher grading accuracy than other state-of-the-art automated essay evaluation systems. In the last part of the dissertation we address the reliability of grades. Despite unified grading rules, human graders introduce bias into scores; consequently, a grading model has to implement a grading logic that may be a mixture of the grading logics of various graders. We propose an approach, based on an explanation methodology and clustering, for separating a set of essays into subsets that represent different graders. The results show that learning from the ensemble of separated models significantly improves average prediction accuracy on artificial and real-world datasets.
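The sentence-similarity-network idea above can be sketched in a few lines. This is an illustrative toy, not SAGE's actual implementation: the bag-of-words sentence vectors, the cosine measure, and the edge threshold are all assumptions standing in for the richer semantic space the thesis describes.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coherence_attributes(sentences, threshold=0.2):
    """Two toy coherence attributes of a sentence-similarity network."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    # Flow of information: similarity between adjacent sentences.
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    # Network edges: any sentence pair whose similarity clears the threshold.
    n = len(vecs)
    edges = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if cosine(vecs[i], vecs[j]) >= threshold
    )
    max_edges = n * (n - 1) / 2
    return {
        "mean_adjacent_similarity": sum(sims) / len(sims) if sims else 0.0,
        "network_density": edges / max_edges if max_edges else 0.0,
    }
```

A coherent essay yields a denser similarity network and higher adjacent-sentence similarity than one whose sentences drift between unrelated topics.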

    Defining and Assessing Critical Thinking: toward an automatic analysis of HiEd students’ written texts

    The main goal of this PhD thesis is to test, through two empirical studies, the reliability of a method aimed at automatically assessing Critical Thinking (CT) manifestations in Higher Education students’ written texts. The empirical studies were based on a critical review aimed at proposing a new classification for systematising different CT definitions and their related theoretical approaches. The review also investigates the relationship between the different adopted CT definitions and CT assessment methods. It highlights the need to focus on open-ended measures for CT assessment and to develop automatic tools based on Natural Language Processing (NLP) techniques to overcome the current limitations of open-ended measures, such as reliability and scoring costs. Based on a rubric developed and implemented by the Center for Museum Studies – Roma Tre University (CDM) research group for the evaluation and analysis of CT levels within open-ended answers (Poce, 2017), an NLP prototype for the automatic measurement of CT indicators was designed. The first empirical study, carried out on a group of 66 university teachers, showed satisfactory reliability levels of the CT evaluation rubric, while the evaluation carried out by the prototype was not yet sufficiently reliable. The results were used to understand how and under what conditions the model works best. The second empirical investigation aimed to identify which NLP features are most associated with six CT sub-dimensions, as assessed by human raters in essays written in Italian.
The study used a corpus of 103 pre-post essays written by students who attended a Master's Degree module in “Experimental Education and School Assessment”. Within the module, two activities were proposed to stimulate students' CT: Open Educational Resources (OER) assessment (mandatory and online) and OER design (optional and blended). The essays were assessed both by expert evaluators, considering six CT sub-dimensions, and by an algorithm that automatically calculates different kinds of NLP features. The study shows positive internal reliability and medium-to-high inter-coder agreement in the expert evaluation. Students' CT levels improved significantly in the post-test. Three NLP indicators significantly correlate with the CT total score: corpus length, syntax complexity, and an adapted measure of term frequency–inverse document frequency (tf-idf). The results collected during this PhD have both theoretical and practical implications for CT research and assessment. From a theoretical perspective, this thesis shows unexplored similarities among different CT traditions, perspectives, and study methods. These similarities could be exploited to open an interdisciplinary dialogue among experts and to build a shared understanding of CT. Automatic assessment methods can enhance the use of open-ended measures for CT assessment, especially in online teaching: they can support teachers and researchers in dealing with the growing amount of linguistic data produced within educational platforms (e.g., Learning Management Systems). To this end, it is pivotal to develop automatic methods for evaluating large amounts of data that would be impossible to analyse manually, providing teachers and evaluators with support for monitoring and assessing the competences students demonstrate online.
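As a rough illustration of two of the NLP features named above, here is a minimal corpus-length and tf-idf computation. The thesis uses an adapted tf-idf measure; the whitespace tokenisation and the exact weighting below are assumptions for the sketch.

```python
import math
from collections import Counter

def tfidf_features(essays):
    """Per-essay corpus length and mean tf-idf weight over a small corpus."""
    docs = [essay.lower().split() for essay in essays]
    n = len(docs)
    # Document frequency: in how many essays each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    features = []
    for doc in docs:
        tf = Counter(doc)
        # tf-idf: term frequency scaled by inverse document frequency.
        weights = {
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        }
        features.append({
            "corpus_length": len(doc),  # token count of the essay
            "mean_tfidf": sum(weights.values()) / len(weights),
        })
    return features
```

Terms shared by every essay receive zero weight, so a higher mean tf-idf loosely reflects more distinctive vocabulary, one plausible reason such a feature could track CT scores.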

    A FOCUS ON CONTENT: THE USE OF RUBRICS IN PEER REVIEW TO GUIDE STUDENTS AND INSTRUCTORS

    Students who are solving open-ended problems would benefit from formative assessment, i.e., from receiving helpful feedback and from having an instructor who is informed about their level of performance. Open-ended problems challenge existing assessment techniques. For example, such problems may have reasonable alternative solutions or conflicting objectives. Analyses of open-ended problems are often presented as free-form text, since they require arguments and justifications for one solution over others, and students may differ in how they frame the problems according to their knowledge, beliefs, and attitudes. This dissertation investigates how peer review may be used for formative assessment. Computer-Supported Peer Review in Education, a technology whose use is growing, has been shown to provide accurate summative assessment of student work, and peer feedback can indeed be helpful to students. A peer review process depends on the rubric that students use to assess and give feedback to each other. However, it is unclear how a rubric should be structured to produce feedback that is helpful to the student and, at the same time, to yield information that can be summarized for the instructor. The dissertation reports a study in which students wrote individual analyses of an open-ended legal problem and then exchanged feedback using Comrade, a web application for peer review. The study compared two conditions: some students used a rubric relevant to legal argument in general (the domain-relevant rubric), while others used a rubric that addressed the conceptual issues embedded in the open-ended problem (the problem-specific rubric). While both rubric types yield peer ratings of student work that approximate the instructor's scores, feedback elicited by the domain-relevant rubric was redundant across its dimensions; in contrast, peer ratings elicited by the problem-specific rubric distinguished among its dimensions.
Hierarchical Bayesian models showed that ratings from both rubrics can be fit by pooling information across students, but only problem-specific ratings are fit better given information about distinct rubric dimensions.
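The pooling behind those hierarchical fits can be illustrated with a toy shrinkage estimator. This is not the dissertation's model, just the basic partial-pooling move in miniature: each student's mean rating is pulled toward the grand mean, with sparsely rated students pulled hardest. The `prior_strength` pseudo-count is an assumption.

```python
def partially_pooled_means(ratings_by_student, prior_strength=2.0):
    """Shrink each student's mean rating toward the grand mean."""
    all_ratings = [r for rs in ratings_by_student.values() for r in rs]
    grand_mean = sum(all_ratings) / len(all_ratings)
    pooled = {}
    for student, rs in ratings_by_student.items():
        n = len(rs)
        # Weighted average of the student's own ratings and the grand mean;
        # prior_strength acts like a pseudo-count of prior observations.
        pooled[student] = (sum(rs) + prior_strength * grand_mean) / (n + prior_strength)
    return pooled
```

A full hierarchical Bayesian model infers the shrinkage strength from the data and, as in the study, can pool per rubric dimension as well as per student.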

    Predicting Text Quality: Metrics for Content, Organization and Reader Interest

    When people read articles, whether news, fiction or technical, most of the time if not always they form perceptions about their quality. Some articles are well written and others are poorly written. This thesis explores whether such judgements can be automated so that they can be incorporated into applications such as information retrieval and automatic summarization. Text quality does not involve a single aspect but is a combination of numerous and diverse criteria, including spelling, grammar, organization, informativeness, creative and beautiful language use, and page layout. In the education domain, comprehensive lists of such properties are outlined in the rubrics used for assessing writing. But computational methods for text quality have addressed only a handful of these aspects, mainly related to spelling, grammar and organization. In addition, some text quality aspects may be more relevant for one genre than for another, yet previous work has placed little focus on specialized metrics based on the genre of texts. This thesis proposes new insights and techniques to address these issues. We introduce metrics that score varied dimensions of quality such as content, organization and reader interest. For content, we present two measures: specificity and verbosity level. Specificity measures the amount of detail present in a text, while verbosity captures which details are essential to include. We measure organization quality by quantifying the regularity of the intentional structure in the article and by using the specificity levels of adjacent sentences in the text. Our reader-interest metrics aim to identify engaging and interesting articles. The development of these measures is backed by the use of articles from three different genres: academic writing, science journalism and automatically generated summaries. Proper presentation of content is critical during summarization because summaries have a word limit.
Our specificity and verbosity metrics are developed with this genre as the focus. The argumentation structure of academic writing lends support to the idea of using intentional structure to model organization quality. Science journalism articles convey research findings in an engaging manner and are ideally suited for the development and evaluation of measures related to reader interest.

    DOMAIN ADAPTATION FOR AUTOMATED ESSAY SCORING

    Master of Science

    Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication

    Syntactic complexity has been an area of significant interest in L2 writing development studies over the past 45 years. Despite the regularity with which syntactic complexity measures have been employed, the construct is still relatively under-developed and, as a result, the cumulative results of syntactic complexity studies can appear opaque. At least three reasons exist for this state of affairs: the lack of consistency and clarity with which indices of syntactic complexity have been described, the overly broad nature of the indices that have been regularly employed, and the omission of indices that focus on usage-based perspectives. This study seeks to address these three gaps through the development and validation of the Tool for the Automatic Assessment of Syntactic Sophistication and Complexity (TAASSC). TAASSC measures both large-grained and fine-grained clausal and phrasal indices of syntactic complexity, along with usage-based frequency/contingency indices of syntactic sophistication. Using TAASSC, this study will address L2 writing development in two main ways: through the examination of syntactic development longitudinally, and through the examination of human judgments of writing proficiency (e.g., expert ratings of TOEFL essays). This study will have important implications for second language acquisition, second language writing, and language assessment.
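As a crude illustration of the kind of large-grained indices TAASSC refines, the sketch below computes mean length of sentence and a subordinate-clause ratio. It is not TAASSC: real implementations parse the text, whereas the subordinator word list here is an assumed, naive proxy for dependent clauses.

```python
import re

# Assumed, incomplete list of English subordinators used as a clause proxy.
SUBORDINATORS = {"because", "although", "since", "while", "that",
                 "which", "who", "when", "if"}

def coarse_indices(text):
    """Mean length of sentence and subordinate clauses per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens_per_sent = [s.split() for s in sentences]
    n_tokens = sum(len(t) for t in tokens_per_sent)
    n_sub = sum(
        1
        for tokens in tokens_per_sent
        for w in tokens
        if w.lower().strip(",") in SUBORDINATORS
    )
    return {
        "mean_length_of_sentence": n_tokens / len(sentences),
        "subordinate_clauses_per_sentence": n_sub / len(sentences),
    }
```

Fine-grained indices of the kind TAASSC introduces would instead separate clausal from phrasal elaboration (e.g., dependents per nominal), which a dependency parse makes possible.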

    The Diagnosticity of Argument Diagrams

    Can argument diagrams be used to diagnose and predict argument performance? Argumentation is a complex domain with robust and often contradictory theories about the structure and scope of valid arguments. Argumentation is central to advanced problem solving in many domains and is a core feature of day-to-day discourse. Argumentation is, quite literally, all around us, and yet it is rarely taught explicitly. Novices often have difficulty parsing and constructing arguments, particularly in written and verbal form. Such formats obscure key argumentative moves and often mask the strengths and weaknesses of the argument structure with complicated phrasing or simple sophistry. Argument diagrams have a long history in the philosophy of argument and have seen increased application as instructional tools. Argument diagrams reify important argument structures, avoid the serial limitations of text, and are amenable to automatic processing. This thesis addresses the question posed above. In it I show that diagrammatic models of argument can be used to predict students' essay grades and that automatically induced models can be competitive with human grades. In the course of this analysis I survey analytical tools such as Augmented Graph Grammars that can be applied to formalize argument analysis, and detail a novel Augmented Graph Grammar formalism and implementation used in the study. I also introduce novel machine learning algorithms for regression and tolerance reduction. This work makes contributions to research on Education, Intelligent Tutoring Systems, Machine Learning, Educational Data Mining, Graph Analysis, and online grading.

    Automated essay scoring systems
