
    Can Language Models Identify Wikipedia Articles with Readability and Style Issues?

    Wikipedia is frequently criticised for poor readability and style. In this article, we investigate using GPT-2, a neural language model, to identify poorly written text in Wikipedia by ranking documents by their perplexity. We evaluate the properties of this ranking using human assessments of text quality, including readability, narrativity, and language use. We demonstrate that GPT-2 perplexity scores correlate moderately to strongly with narrativity, but only weakly with reading comprehension scores. Importantly, the model is sensitive to even small improvements to a text, such as those made in Wikipedia edits. We conclude by highlighting that Wikipedia's featured articles, counter-intuitively, contain the text with the highest perplexity scores. However, these examples highlight many of the complexities that need to be resolved before such an approach can be used in practice.
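
    The perplexity ranking this abstract describes can be reproduced in a few lines. Below is a minimal sketch, assuming the Hugging Face transformers implementation of GPT-2 (the paper does not specify its tooling); the example documents and the truncation to the model's context window are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Passing labels == input_ids makes the model return the mean
    # token-level cross-entropy; exponentiating gives perplexity.
    ids = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=model.config.n_positions)["input_ids"]
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Hypothetical documents; higher perplexity is read as poorer writing.
docs = {
    "well-edited": "The committee approved the budget after a brief debate.",
    "scrambled": "Budget the committee after approved debate brief a the.",
}
for name, ppl in sorted(((n, perplexity(t)) for n, t in docs.items()),
                        key=lambda kv: kv[1]):
    print(f"{ppl:8.1f}  {name}")
```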

    Examining the Role of Linguistic Flexibility in the Text Production Process

    A commonly held belief among educators, researchers, and students is that high-quality texts are easier to read than low-quality texts because they contain more engaging narrative and story-like elements. These assumptions, however, have typically failed to find support in the writing literature: although narrative elements may sometimes be associated with high-quality writing, the majority of research suggests that higher-quality writing is associated with decreased levels of text narrativity and readability. One potential explanation for this conflicting evidence lies in the situational influence of text elements on writing quality. In other words, the frequency of specific linguistic or rhetorical text elements alone may not be consistently indicative of essay quality; rather, these effects may be largely driven by individual differences in students' ability to leverage the benefits of these elements in appropriate contexts. This dissertation presents the hypothesis that writing proficiency is associated with an individual's flexible use of text properties, rather than simply the consistent use of a particular set of properties. Across three experiments, the dissertation relies on a combination of natural language processing and dynamic methodologies to examine the role of linguistic flexibility in the text production process. Overall, these studies provide important insights into the role of flexibility in writing skill and lay a strong foundation for future research and educational interventions.

    Sentiment and Sentence Similarity as Predictors of Integrated and Independent L2 Writing Performance

    This study aimed to utilize sentiment and sentence similarity analyses, two Natural Language Processing techniques, to see whether and how well they could predict L2 writing performance under integrated and independent task conditions. The data sources were an integrated L2 writing corpus of 185 literary analysis essays and an independent L2 writing corpus of 500 argumentative essays, both compiled in higher education contexts. Both essay groups were scored between 0 and 100. Two Python libraries, TextBlob and spaCy, were used to generate sentiment and sentence similarity data. Using sentiment (polarity and subjectivity) and sentence similarity variables, regression models were built and 95% prediction intervals were compared for the integrated and independent corpora. The results showed that integrated L2 writing performance could be predicted by subjectivity and sentence similarity, whereas only subjectivity predicted independent L2 writing performance. The prediction interval of subjectivity for the independent writing model was narrower than the corresponding interval for integrated writing. These results show that sentiment and sentence similarity algorithms can generate complementary data to improve more complex multivariate L2 writing performance prediction models.
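
    For readers unfamiliar with these two libraries, here is a minimal sketch of the pipeline described above: TextBlob supplies polarity and subjectivity, spaCy supplies sentence similarity, and statsmodels fits the regression and produces 95% prediction intervals. The toy essays, the scores, and the choice of the en_core_web_md model are assumptions for illustration, not the study's data.

```python
import numpy as np
import spacy
import statsmodels.api as sm
from textblob import TextBlob

nlp = spacy.load("en_core_web_md")  # medium model ships word vectors

def essay_features(text: str) -> list[float]:
    blob = TextBlob(text)
    sents = list(nlp(text).sents)
    # Mean cosine similarity between adjacent sentences (0.0 if only one).
    sims = [a.similarity(b) for a, b in zip(sents, sents[1:])] or [0.0]
    return [blob.sentiment.polarity,
            blob.sentiment.subjectivity,
            float(np.mean(sims))]

# Toy corpus and scores standing in for the 0-100-scored essays.
essays = [
    "The film was wonderful. Its story moved me deeply.",
    "The plot was dull. I disliked the ending strongly.",
    "This essay argues a point. The point is argued here.",
    "Great book overall. The characters felt very real.",
    "It was fine. Nothing about it stood out to me.",
    "Terrible pacing. The scenes dragged on without purpose.",
]
scores = np.array([85.0, 60.0, 40.0, 80.0, 55.0, 30.0])

X = sm.add_constant(np.array([essay_features(e) for e in essays]))
fit = sm.OLS(scores, X).fit()

# 95% prediction interval for a new, unscored essay.
new = sm.add_constant(np.array([essay_features("A fine film. I liked it.")]),
                      has_constant="add")
frame = fit.get_prediction(new).summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])
```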

    Applications of Text Analysis Tools for Spoken Response Grading


    A Statistical Approach to Automatic Essay Scoring

    Taking into consideration the escalating need for testing writing ability and the potential of Automatic Essay Scoring (AES) to support writing instruction and evaluation, the aim of the present study is to explore the relationship between stylometric indices widely used in AES systems and the degree of sophistication of learner essays, as captured by the scores of expert human raters. The data analyzed were obtained from a recently organized public AES competition and comprise persuasive essays written in the context of public schools in the United States. The stylometric information taken into consideration focuses mainly on measures of cohesion, lexical diversity, and syntactic sophistication. Results indicate a clear relationship between quantifiable features of learners' written responses and the impression they made on expert raters. This observation reinforces the importance of pursuing further experimentation in AES, which would yield significant educational and social benefits.
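
    As a concrete illustration of what such stylometric indices look like, the sketch below computes three simple proxies in plain Python: a type-token ratio for lexical diversity, mean sentence length as a crude syntactic measure, and adjacent-sentence lexical overlap as a cohesion signal. The study's actual feature set is richer; these proxies and the toy essay are assumptions for demonstration only.

```python
import re

def stylometric_indices(text: str) -> dict:
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    sent_tokens = [set(re.findall(r"[a-zA-Z']+", s.lower())) for s in sents]
    # Cohesion proxy: mean Jaccard overlap between adjacent sentences.
    overlaps = [len(a & b) / max(1, len(a | b))
                for a, b in zip(sent_tokens, sent_tokens[1:])]
    return {
        "lexical_diversity": len(set(tokens)) / max(1, len(tokens)),  # TTR
        "mean_sentence_length": len(tokens) / max(1, len(sents)),
        "adjacent_overlap": sum(overlaps) / max(1, len(overlaps)),
    }

essay = ("Testing writing at scale is demanding. Automatic scoring can "
         "support raters. Such systems rely on measurable text features.")
print(stylometric_indices(essay))
```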

    Defining and Assessing Critical Thinking: toward an automatic analysis of HiEd students’ written texts

    The main goal of this PhD thesis is to test, through two empirical studies, the reliability of a method aimed at automatically assessing manifestations of Critical Thinking (CT) in Higher Education students' written texts. The empirical studies were based on a critical review aimed at proposing a new classification for systematising different CT definitions and their related theoretical approaches. The review also investigates the relationship between the adopted CT definitions and CT assessment methods, and it highlights the need to focus on open-ended measures for CT assessment and to develop automatic tools based on Natural Language Processing (NLP) techniques to overcome the current limitations of open-ended measures, such as reliability and scoring costs. Based on a rubric developed and implemented by the Center for Museum Studies – Roma Tre University (CDM) research group for the evaluation and analysis of CT levels within open-ended answers (Poce, 2017), an NLP prototype for the automatic measurement of CT indicators was designed. The first empirical study, carried out on a group of 66 university teachers, showed satisfactory reliability levels for the CT evaluation rubric, while the evaluation carried out by the prototype was not yet sufficiently reliable. The results were used to understand how and under what conditions the model works better. The second empirical investigation aimed to understand which NLP features are most associated with six CT sub-dimensions as assessed by human raters in essays written in Italian. The study used a corpus of 103 pre-post essays by students who attended a Master's Degree module in "Experimental Education and School Assessment". Within the module, two activities were proposed to stimulate students' CT: Open Educational Resources (OER) assessment (mandatory and online) and OER design (optional and blended). The essays were assessed both by expert evaluators, considering six CT sub-dimensions, and by an algorithm that automatically calculates different kinds of NLP features. The study shows positive internal reliability and medium-to-high inter-coder agreement in the expert evaluation. Students' CT levels improved significantly in the post-test. Three NLP indicators correlate significantly with the CT total score: corpus length, syntax complexity, and an adapted measure of term frequency–inverse document frequency (tf-idf).
    The results collected during this PhD have both theoretical and practical implications for CT research and assessment. From a theoretical perspective, the thesis shows unexplored similarities among different CT traditions, perspectives, and study methods; these similarities could be exploited to open an interdisciplinary dialogue among experts and build a shared understanding of CT. Automatic assessment methods can enhance the use of open-ended measures for CT assessment, especially in online teaching, since they can support teachers and researchers in dealing with the growing volume of linguistic data produced within educational platforms (e.g. Learning Management Systems). To this end, it is pivotal to develop automatic methods for the evaluation of large amounts of data that would be impossible to analyse manually, providing teachers and evaluators with support for monitoring and assessing the competences students demonstrate online.
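
    To make the three reported indicators concrete, the sketch below computes rough analogues: a token count for corpus length, mean dependency-tree depth as a syntax-complexity proxy, and a mean tf-idf weight via scikit-learn. The Italian spaCy pipeline (it_core_news_sm), the depth-based proxy, and the tf-idf aggregation are assumptions, since the thesis's exact formulas are not given here.

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("it_core_news_sm")  # Italian pipeline, matching the essays' language

def tree_depth(token) -> int:
    # Depth of the dependency subtree rooted at this token.
    return 1 + max((tree_depth(c) for c in token.children), default=0)

def indicators(essay: str, corpus: list[str]) -> dict:
    doc = nlp(essay)
    depths = [tree_depth(s.root) for s in doc.sents]
    row = TfidfVectorizer().fit(corpus).transform([essay])
    return {
        "corpus_length": len([t for t in doc if not t.is_space]),
        "syntax_complexity": float(np.mean(depths)),  # mean parse-tree depth
        "mean_tfidf": float(row.sum() / max(1, row.nnz)),
    }

# Tiny illustrative corpus; the study used 103 pre-post student essays.
corpus = ["Il pensiero critico richiede analisi e valutazione delle fonti.",
          "Gli studenti valutano risorse educative aperte in modo critico."]
print(indicators(corpus[0], corpus))
```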