4,966 research outputs found

    Automated scoring of writing

    Get PDF
    For decades, automated essay scoring (AES) has operated behind the scenes of major standardized writing assessments to provide summative scores of students’ writing proficiency (Dikli in J Technol Learn Assess 5(1), 2006). Today, AES systems are increasingly used in low-stakes assessment contexts and as a component of instructional tools in writing classrooms. Despite substantial debate regarding their use, including concerns about writing construct representation (Condon in Assess Writ 18:100–108, 2013; Deane in Assess Writ 18:7–24, 2013), AES has attracted the attention of school administrators, educators, testing companies, and researchers and is now commonly used in an attempt to reduce human effort and address consistency issues in assessing writing (Ramesh and Sanampudi in Artif Intell Rev 55:2495–2527, 2021). This chapter introduces the affordances and constraints of AES for writing assessment, surveys research on AES effectiveness in classroom practice, and emphasizes implications for writing theory and practice.

    A Study on the Effectiveness of Automated Essay Marking in the Context of a Blended Learning Course Design

    Get PDF
    This paper reports on a study undertaken in a Chinese university to investigate the effectiveness of an online automated essay marking system in the context of a Blended Learning course design. Two groups of undergraduate learners studying English were required to write essays as part of their normal course. One group had their essays marked by an online automated essay marking and feedback system; the second, control group had theirs marked by a tutor who provided feedback in the normal way. Their essay scores and attitudes to the essay writing tasks were compared. Learners were not disadvantaged by the automated essay marking system: their mean performance was better (p<0.01) than that of the tutor-marked control group for seven of the essays and showed no difference for three essays. In no case did the tutor-marked group score higher than the automated system. Correlations indicated that both groups improved significantly in performance (p<0.05) over the duration of the course and that there was a significant relationship between the essay scores of the two groups (p<0.01). Attitudes toward the automated system compared with the tutor-marked system were more complex: there was a significant difference between the attitudes of those classified as low and high performers (p<0.05). In the discussion these findings are placed in a Blended Learning context.

    Automatic assessment of text-based responses in post-secondary education: A systematic review

    Full text link
    Text-based open-ended questions in academic formative and summative assessments help students become deep learners and prepare them to understand concepts for a subsequent conceptual assessment. However, grading text-based questions, especially in large courses, is tedious and time-consuming for instructors. Text processing models continue to progress with the rapid development of Artificial Intelligence (AI) tools and Natural Language Processing (NLP) algorithms. Especially after breakthroughs in Large Language Models (LLMs), there is immense potential to automate rapid assessment and feedback of text-based responses in education. This systematic review adopts a scientific and reproducible literature search strategy based on the PRISMA process, using explicit inclusion and exclusion criteria to study text-based automatic assessment systems in post-secondary education, screening 838 papers and synthesizing 93 studies. To understand how text-based automatic assessment systems have been developed and applied in education in recent years, three research questions are considered. All included studies are summarized and categorized according to a proposed comprehensive framework covering the input and output of the system, the research motivation, and the research outcomes, aiming to answer the research questions accordingly. Additionally, typical automated assessment systems, research methods, and application domains in these studies are investigated and summarized. This systematic review provides an overview of recent educational applications of text-based assessment systems and of the latest AI/NLP developments assisting text-based assessment in higher education. The findings will particularly benefit researchers and educators incorporating LLMs such as ChatGPT into their educational activities.

    Using the Developmental Path of Cause to Bridge the Gap between AWE Scores and Writing Teachers’ Evaluations

    Get PDF
    Supported by artificial intelligence (AI), the most advanced Automatic Writing Evaluation (AWE) systems have gained increasing attention for their ability to provide immediate scoring and formative feedback, yet teachers have been hesitant to implement them in their classes because correlations between the grades they assign and the AWE scores have generally been low. This raises the question of where improvements in evaluation may need to be made and what approaches are available to carry out this improvement. This mixed-method study involved 59 cause and effect essays collected from English language learners enrolled in six different sections of a college-level academic writing course and utilized the theory proposed by Slater and Mohan (2010) regarding the developmental path of cause. The study compared the results of raters who used this developmental path with the accuracy of AWE scores produced by Criterion, an AWE tool developed by Educational Testing Service (ETS), and with the grades reported by teachers. Findings suggested that if Criterion is to be used successfully in the classroom, writing teachers need to take a meaning-based approach to their assessment, which would allow them and their students to understand more fully how language constructs cause and effect. Using the developmental path of cause as an analytical framework for assessment may then help teachers assign grades that are more in sync with AWE scores, which in turn can help students gain more trust in the scores they receive from both their teachers and Criterion.

    Defining and Assessing Critical Thinking: toward an automatic analysis of HiEd students’ written texts

    Get PDF
    The main goal of this PhD thesis is to test, through two empirical studies, the reliability of a method aimed at automatically assessing Critical Thinking (CT) manifestations in Higher Education students’ written texts. The empirical studies were based on a critical review aimed at proposing a new classification for systematising different CT definitions and their related theoretical approaches. The review also investigates the relationship between the adopted CT definitions and CT assessment methods, and it highlights the need to focus on open-ended measures for CT assessment and to develop automatic tools based on Natural Language Processing (NLP) techniques to overcome current limitations of open-ended measures, such as reliability and scoring costs. Based on a rubric developed and implemented by the Center for Museum Studies – Roma Tre University (CDM) research group for the evaluation and analysis of CT levels within open-ended answers (Poce, 2017), an NLP prototype for the automatic measurement of CT indicators was designed. The first empirical study, carried out with a group of 66 university teachers, showed satisfactory reliability levels for the CT evaluation rubric, while the evaluation carried out by the prototype was not yet sufficiently reliable. The results were used to understand how and under what conditions the model works better. The second empirical investigation aimed to understand which NLP features are most strongly associated with six CT sub-dimensions as assessed by human raters in essays written in Italian. The study used a corpus of 103 pre-post essays by Master’s students who attended the module “Experimental Education and School Assessment”. Within the module, two activities were proposed to stimulate students’ CT: Open Educational Resources (OER) assessment (mandatory and online) and OER design (optional and blended). The essays were assessed both by expert evaluators, considering six CT sub-dimensions, and by an algorithm that automatically calculates different kinds of NLP features. The study showed positive internal reliability and medium-to-high inter-coder agreement in the expert evaluation. Students’ CT levels improved significantly in the post-test. Three NLP indicators correlated significantly with the total CT score: corpus length, syntax complexity, and an adapted term frequency–inverse document frequency (tf-idf) measure. The results collected during this PhD have both theoretical and practical implications for CT research and assessment. From a theoretical perspective, the thesis shows unexplored similarities among different CT traditions, perspectives, and study methods; these similarities could open up an interdisciplinary dialogue among experts and help build a shared understanding of CT. Automatic assessment methods can enhance the use of open-ended measures for CT assessment, especially in online teaching, by supporting teachers and researchers in dealing with the growing amount of linguistic data produced within educational platforms (e.g., Learning Management Systems). To this end, it is pivotal to develop automatic methods for evaluating large amounts of data that would be impossible to analyse manually, providing teachers and evaluators with support for monitoring and assessing the competences students demonstrate online.
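    The abstract names three NLP indicators that correlated with the total CT score: corpus length, syntax complexity, and an adapted tf-idf weight. The thesis’s actual feature extraction is not described here, so the following Python sketch only illustrates the standard term frequency–inverse document frequency weighting that the adapted measure builds on; the toy essays and the mean-per-essay aggregation are illustrative assumptions, not the author’s implementation.

    import math
    from collections import Counter

    def tf_idf(corpus):
        """Per-document tf-idf weights for a tokenised corpus (list of token lists)."""
        n_docs = len(corpus)
        # Document frequency: in how many essays does each term occur at least once?
        df = Counter(term for doc in corpus for term in set(doc))
        weights = []
        for doc in corpus:
            tf = Counter(doc)
            weights.append({
                term: (count / len(doc)) * math.log(n_docs / df[term])
                for term, count in tf.items()
            })
        return weights

    # Toy corpus standing in for student essays (the real study used Italian essays).
    essays = [
        "critical thinking requires evaluating evidence".split(),
        "open educational resources support online teaching".split(),
        "evaluating open resources develops critical thinking".split(),
    ]
    for i, w in enumerate(tf_idf(essays)):
        # A crude essay-level indicator: the mean tf-idf weight over the essay's terms.
        print(f"essay {i}: mean tf-idf = {sum(w.values()) / len(w):.3f}")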

    Technology and Testing

    Get PDF
    From early answer sheets filled in with number 2 pencils, to tests administered by mainframe computers, to assessments wholly constructed by computers, it is clear that technology is changing the field of educational and psychological measurement. The numerous and rapid advances have immediate impact on test creators, assessment professionals, and those who implement and analyze assessments. This comprehensive new volume brings together leading experts on the issues posed by technological applications in testing, with chapters on game-based assessment, testing with simulations, video assessment, computerized test development, large-scale test delivery, model choice, validity, and error issues. Including an overview of existing literature and ground-breaking research, each chapter considers the technological, practical, and ethical considerations of this rapidly-changing area. Ideal for researchers and professionals in testing and assessment, Technology and Testing provides a critical and in-depth look at one of the most pressing topics in educational testing today

    Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes

    Get PDF
    Automated writing evaluation (AWE) software is designed to provide instant computer-generated scores for a submitted essay along with diagnostic feedback. Most studies on AWE have been psychometric evaluations of its validity; however, studies on how effectively AWE is used in writing classes as a pedagogical tool are limited. This study employs a naturalistic classroom-based approach to explore the interaction between how an AWE program, MY Access!, was implemented in three different ways in three EFL college writing classes in Taiwan and how students perceived its effectiveness in improving writing. The findings show that, although the implementation of AWE was in general not perceived very positively by the three classes, it was perceived comparatively more favorably when the program was used to facilitate students’ early drafting and revising process, followed by human feedback from both the teacher and peers later in the process. This study also reveals that the autonomous use of AWE as a surrogate writing coach with minimal human facilitation frustrated students and limited their learning of writing. In addition, teachers’ attitudes toward AWE use and their technology-use skills, as well as students’ learner characteristics and goals for learning to write, may also play vital roles in determining the effectiveness of AWE. Given the limitations inherent in the design of AWE technology, language teachers need to be critically aware that the implementation of AWE requires well thought-out pedagogical designs and thorough consideration of its relevance to the objectives of learning to write.

    The role of feedback in the processes and outcomes of academic writing in English as a foreign language at intermediate and advanced levels

    Get PDF
    Providing feedback on students’ texts is one of the essential components of teaching second language writing. However, whether and to what extent students benefit from feedback has been an issue of considerable debate in the literature. While many researchers have stressed its importance, others have expressed doubts about its effectiveness. Despite these ongoing debates, instructors consider feedback a worthwhile pedagogical practice for second language learning. Based on this premise, I conducted three experimental studies to investigate the role of written feedback in Myanmar and Hungarian tertiary EFL classrooms. Additionally, I studied syntactic features and language-related error patterns in Hungarian and Myanmar students’ writing, in order to understand how students with different levels of writing proficiency acted upon teacher and automated feedback. The first study examined the efficacy of feedback on Myanmar students’ writing over a 13-week semester and how automated feedback provided by Grammarly could be integrated into writing instruction as an assistance tool for writing teachers. Results from pre- and post-tests demonstrated that students’ writing performance improved along the lines of four assessment criteria: task achievement, coherence and cohesion, grammatical range and accuracy, and lexical range and accuracy. Further results from a written feedback analysis revealed that the free version of Grammarly provided feedback on lower-level writing issues such as articles and prepositions, whereas teacher feedback covered both lower- and higher-level writing concerns. These findings suggested a potential for integrating automated feedback into writing instruction. As limited attention has been given to how feedback influences aspects of writing development beyond accuracy, the second study examined how feedback influences the syntactic complexity of Myanmar students’ essays. Results from paired-sample t-tests revealed no significant differences in the syntactic complexity of students’ writing, whether the comparison was made between initial and revised texts or between pre- and post-tests. These findings suggested that feedback on students’ writing does not lead them to write less structurally complex texts, despite not resulting in syntactic complexity gains. The syntactic complexity of students’ revised texts varied among high-, mid-, and low-achieving students; these variations could be attributed to proficiency levels, writing prompts, genre differences, and feedback sources. The rationale for the third study was the theoretical orientation that learners’ differential success in gaining from feedback depends largely on their engagement with the feedback rather than on the feedback itself. Along these lines, I examined Hungarian students’ behavioural engagement (i.e., students’ uptake or revisions prompted by written feedback) with teacher and automated feedback in an EFL writing course. In addition to the form-focused feedback examined in the first study, I considered meaning-focused feedback, as feedback in a writing course typically covers both linguistic and rhetorical aspects of writing. The results showed differences in feedback focus (the teacher provided both form- and meaning-focused feedback) with unexpected outcomes: students’ uptake of feedback reflected moderate to low levels of engagement.
Participants incorporated more form-focused feedback than meaning-focused feedback into their revisions. These findings contribute to our understanding of students’ engagement with writing tasks, their levels of trust, and the possible impact of students’ language proficiency on their engagement with feedback. Following the finding that Myanmar and Hungarian students responded differently to feedback on their writing, I designed a follow-up study to compare syntactic features of their writing as indices of their English writing proficiency. In addition, I examined language-related errors in their texts to capture differences in the error patterns of the two groups. Results from paired-sample t-tests showed that most syntactic complexity indices distinguished the essays produced by the two groups: length of production units, sentence complexity, and subordination indices. Similarly, statistically significant differences were found in language-related error patterns in their texts: errors were more prevalent in Myanmar students’ essays. The implications for research and pedagogical practices in EFL writing classes are discussed with reference to the rationale for each study.
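    The second study and the follow-up comparison both report paired-sample t-tests over syntactic complexity indices. The exact indices and data are not given in the abstract; the Python sketch below only shows how such a pre- vs post-test comparison is typically run with SciPy, using invented index values rather than the study’s data.

    from scipy import stats

    # One syntactic complexity index (e.g., mean length of production unit) per
    # student, measured on the pre-test and the post-test. Values are invented.
    pre  = [11.2, 9.8, 13.1, 10.4, 12.0, 9.5, 11.7, 10.9]
    post = [11.6, 9.9, 12.8, 10.7, 12.3, 9.4, 12.0, 11.1]

    t_stat, p_value = stats.ttest_rel(pre, post)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # A p-value above .05 would mirror the reported finding of no significant
    # change in syntactic complexity between pre- and post-tests.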

    A robust methodology for automated essay grading

    Get PDF
    None of the available automated essay grading systems can be used to grade essays according to the National Assessment Program – Literacy and Numeracy (NAPLAN) analytic scoring rubric used in Australia. This thesis is a humble effort to address this limitation. Its objective is to develop a robust methodology for automatically grading essays against the NAPLAN rubric, using heuristics and rules based on the English language together with neural network modelling.
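    The abstract describes a methodology that combines language-based heuristics and rules with neural network modelling to score essays against the NAPLAN analytic rubric, but it does not give the features, architecture, or criterion mapping. The Python sketch below is only a generic illustration of that pattern: the surface features, the 0-5 score scale for a single rubric criterion, and scikit-learn's MLPRegressor are assumptions standing in for the thesis's actual method.

    import re
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def heuristic_features(essay):
        """Simple rule-based surface features of an essay (illustrative only)."""
        sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
        words = essay.split()
        return [
            len(words),                                           # essay length
            len(words) / max(len(sentences), 1),                  # mean sentence length
            sum(len(w) > 6 for w in words) / max(len(words), 1),  # long-word ratio
        ]

    # Toy training data: essays paired with a human score on one rubric criterion (0-5).
    essays = [
        "Dogs are good. They help people.",
        "School is fun. I like it.",
        "Although companionship matters, the stronger argument for keeping pets rests on documented health benefits.",
        "Because recycling reduces landfill and conserves resources, councils should fund weekly collections.",
    ]
    scores = [2, 2, 4, 4]

    X = np.array([heuristic_features(e) for e in essays])
    model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    model.fit(X, scores)
    # Predict a criterion score for an unseen essay.
    print(model.predict([heuristic_features("Dogs help people in many documented ways.")]))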

    The Effects of Teacher Feedback and Automated Feedback on Cognitive and Psychological Aspects of Foreign Language Writing: A Mixed-Methods Research

    Get PDF
    Feedback plays a crucial role in the writing process. However, in the literature on foreign language (FL) writing, there is a dearth of studies comparing the effects of teacher feedback and automated feedback on both cognitive and psychological aspects of FL writing. To fill this gap, the current study compared the effects of teacher feedback and automated feedback on revision quality and writing proficiency development (the cognitive aspects) and on perceived usefulness and perceived ease of use of the feedback (the psychological aspects) in English writing among learners of English as a foreign language (EFL) in China. It also investigated students’ perceptions of the strengths and weaknesses of the two types of feedback. The study adopted a mixed-methods design. The quantitative strand collected data through (1) a pre-test and a post-test, which measured the participants’ English writing proficiency development; (2) a writing task, which received either teacher feedback or automated feedback; and (3) a closed-ended questionnaire, which examined students’ perceived usefulness and perceived ease of use of the feedback. The qualitative strand collected data through an open-ended questionnaire, which examined the participants’ perceptions of the strengths and weaknesses of teacher feedback or automated feedback, depending on the type of feedback they received. Chinese university EFL learners in two English classes (n = 35 in each class) taught by the same English teacher participated in the study: one class received teacher feedback while the other received automated feedback from Pigaiwang. While the two classes did not differ significantly on the pre-test of writing proficiency, students who received teacher feedback scored significantly higher on revision than those who received automated feedback. Students in the teacher feedback class also gave significantly higher ratings of perceived usefulness and perceived ease of use of the feedback than those in the automated feedback class. However, students in the automated feedback class obtained significantly higher scores on the post-test of writing proficiency. The qualitative results identified three themes of strengths and two themes of weaknesses for teacher feedback and automated feedback, respectively. The results suggest that while teacher feedback has a more positive effect on the psychological aspects of FL writing, automated feedback may be more effective in developing FL writing proficiency in the long run.