Enhancing Automatic Chinese Essay Scoring System from Figures-of-Speech
PACLIC 20 / Wuhan, China / 1-3 November, 2006
Automatic assessment of text-based responses in post-secondary education: A systematic review
Text-based open-ended questions in academic formative and summative
assessments help students become deep learners and prepare them to understand
concepts for a subsequent conceptual assessment. However, grading text-based
questions, especially in large courses, is tedious and time-consuming for
instructors. Text processing models continue to progress with the rapid
development of Artificial Intelligence (AI) tools and Natural Language
Processing (NLP) algorithms. Especially after breakthroughs in Large Language
Models (LLMs), there is immense potential to automate rapid assessment and
feedback of text-based responses in education. This systematic review adopts a
scientific and reproducible literature search strategy based on the PRISMA
process using explicit inclusion and exclusion criteria to study text-based
automatic assessment systems in post-secondary education, screening 838 papers
and synthesizing 93 studies. To understand how text-based automatic assessment
systems have been developed and applied in education in recent years, three
research questions are considered. All included studies are summarized and
categorized according to a proposed comprehensive framework, including the
input and output of the system, research motivation, and research outcomes,
aiming to answer the research questions accordingly. Additionally,
representative automated assessment systems, research methods, and application
domains across these studies are investigated and summarized. This systematic
review provides an overview of recent educational applications of text-based
assessment systems for understanding the latest AI/NLP developments assisting
in text-based assessments in higher education. Findings will particularly
benefit researchers and educators incorporating LLMs such as ChatGPT into their
educational activities.
Comment: 27 pages, 4 figures, 6 tables
A robust methodology for automated essay grading
None of the available automated essay grading systems can be used to grade essays according to the National Assessment Program – Literacy and Numeracy (NAPLAN) analytic scoring rubric used in Australia. This thesis is a humble effort to address this limitation. The objective of this thesis is to develop a robust methodology for automatically grading essays based on the NAPLAN rubric, using heuristics and rules grounded in the English language together with neural network modelling.
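To make the hybrid idea above concrete, here is a minimal illustrative sketch, not the thesis's actual implementation, of how rule-based English-language features might be combined with a small neural network to predict a rubric criterion score. The features, training data, and model size are all placeholder assumptions:

```python
# Illustrative sketch only (not the thesis implementation): hypothetical
# rule-based English-language features feeding a small neural network
# that predicts one NAPLAN-style criterion score. Features, data, and
# model size are placeholder assumptions.
import re
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

def rule_based_features(essay):
    """Heuristic features loosely aligned with NAPLAN writing criteria."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = essay.split()
    return [
        len(words),                                   # length: ideas/development proxy
        len(sentences),                               # structure proxy
        np.mean([len(w) for w in words]) if words else 0.0,  # vocabulary proxy
        essay.count(","),                             # punctuation usage
        len({w.lower() for w in words}) / max(len(words), 1),  # lexical diversity
    ]

# Placeholder training data: essays paired with criterion scores
essays = ["The quick brown fox jumps over the lazy dog. It ran fast."] * 20
scores = [3.0] * 20

X = np.array([rule_based_features(e) for e in essays])
y = np.array(scores)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print(model.predict(X_test[:1]))  # predicted criterion score for one essay
```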
Validity Arguments for Diagnostic Assessment Using Automated Writing Evaluation
Two examples demonstrate an argument-based approach to validation of diagnostic assessment using automated writing evaluation (AWE). Criterion® was developed by Educational Testing Service to analyze students’ papers grammatically, providing sentence-level error feedback. An interpretive argument was developed for its use as part of the diagnostic assessment process in undergraduate university English for academic purposes (EAP) classes. The Intelligent Academic Discourse Evaluator (IADE) was developed for use in graduate EAP university classes, where the goal was to help students improve their discipline-specific writing. The validation for each was designed to support claims about the intended purposes of the assessments. We present the interpretive argument for each and show some of the data gathered as backing for the respective validity arguments, which include the range of inferences one would make in claiming validity of the interpretations, uses, and consequences of diagnostic AWE-based assessments.
Sentiment and Sentence Similarity as Predictors of Integrated and Independent L2 Writing Performance
This study aimed to utilize sentiment and sentence similarity analyses, two Natural Language Processing techniques, to see if and how well they could predict L2 writing performance in integrated and independent task conditions. The data sources were an integrated L2 writing corpus of 185 literary analysis essays and an independent L2 writing corpus of 500 argumentative essays, both compiled in higher education contexts. Both essay groups were scored between 0 and 100. Two Python libraries, TextBlob and spaCy, were used to generate sentiment and sentence similarity data. Using sentiment (polarity and subjectivity) and sentence similarity variables, regression models were built and 95% prediction intervals were compared for the integrated and independent corpora. The results showed that integrated L2 writing performance could be predicted by subjectivity and sentence similarity, whereas only subjectivity predicted independent L2 writing performance. The prediction interval of subjectivity for the independent writing model was found to be narrower than the same interval for integrated writing. The results show that sentiment and sentence similarity analysis algorithms can generate complementary data to improve more complex multivariate L2 writing performance prediction models.
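As a quick illustration of the study's two predictors, the following sketch computes TextBlob polarity and subjectivity plus a spaCy-based sentence similarity for a single essay. Averaging similarity over adjacent sentence pairs is an assumption made here for illustration; the abstract does not specify the exact operationalization:

```python
# Minimal sketch of the two predictors: TextBlob sentiment (polarity,
# subjectivity) and spaCy sentence similarity. Adjacent-pair averaging
# is an illustrative assumption, not the study's stated method.
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

essay = ("The novel's ending is deeply moving. "
         "The author clearly sympathizes with the protagonist.")

# Sentiment: polarity in [-1, 1], subjectivity in [0, 1]
blob = TextBlob(essay)
print("polarity:", blob.sentiment.polarity)
print("subjectivity:", blob.sentiment.subjectivity)

# Sentence similarity: mean cosine similarity of adjacent sentence pairs
doc = nlp(essay)
sents = list(doc.sents)
sims = [sents[i].similarity(sents[i + 1]) for i in range(len(sents) - 1)]
print("mean adjacent-sentence similarity:", sum(sims) / max(len(sims), 1))
```

In a full pipeline, these per-essay values would become the predictor columns of the regression models described above.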
MoBiL: A hybrid feature set for Automatic Human Translation quality assessment
In this paper we introduce MoBiL, a hybrid Monolingual, Bilingual and Language modelling feature set, together with a feature selection and evaluation framework. The set includes translation quality indicators that can be used to automatically predict the quality of human translations in terms of content adequacy and language fluency. We compare MoBiL with the QuEst baseline set by using both in classifiers trained with support vector machine and relevance vector machine learning algorithms on the same data set. We also report a feature selection experiment to select fewer but more informative features from MoBiL. Our experiments show that classifiers trained on our feature set consistently outperform classifiers trained on the baseline feature set in predicting both adequacy and fluency. MoBiL also performs well with both support vector machine and relevance vector machine algorithms.
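A hedged sketch of the classification setup the abstract describes: an SVM over per-translation feature vectors, with an optional univariate feature-selection step echoing the "fewer but more informative features" experiment. The random matrix stands in for extracted MoBiL features, which are not reproduced here, and relevance vector machines are not in scikit-learn, so only the SVM side is shown:

```python
# Hedged sketch: SVM classifiers on placeholder translation-quality
# feature vectors, with and without univariate feature selection.
# The random data stands in for MoBiL features (not reproduced here).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))    # placeholder: 200 translations x 40 features
y = rng.integers(0, 2, size=200)  # placeholder adequacy labels (0 = low, 1 = high)

# Full feature set vs. 10 selected features
clf_full = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf_sel = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10),
                        SVC(kernel="rbf"))

print("full set CV accuracy:", cross_val_score(clf_full, X, y, cv=5).mean())
print("selected CV accuracy:", cross_val_score(clf_sel, X, y, cv=5).mean())
```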
Examining the predictive validity of the Duolingo English Test: Evidence from a major UK university
The COVID-19 pandemic has changed the university admissions and proficiency testing landscape. One change has been the meteoric rise in the use of the fully automated Duolingo English Test (DET) for university entrance purposes, offering test-takers a cheaper, shorter, more accessible alternative. This rapid-response study is the first to investigate the predictive value of DET test scores in relation to university students’ academic attainment, taking into account students’ degree level, academic discipline, and nationality. We also compared DET test-takers’ academic performance with that of students admitted using traditional proficiency tests. Credit-weighted first-year academic grades of 1881 DET test-takers (1389 postgraduate, 492 undergraduate) enrolled at a large, research-intensive London university in Autumn 2020 were positively associated with DET Overall scores for postgraduate students (adj. r = .195) but not undergraduate students (adj. r = −.112). This result was mirrored in correlational patterns for students admitted through IELTS (n = 2651) and TOEFL iBT (n = 436), contributing to criterion-related validity evidence. Students admitted with DET achieved lower academic success than the IELTS and TOEFL iBT test-takers, although sample characteristics may have shaped this finding. We discuss implications for establishing cut scores and for supporting test-takers’ academic language development through pre-sessional and in-sessional support.
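For readers unfamiliar with the outcome measure, this short sketch shows the arithmetic of a credit-weighted grade average and a raw Pearson correlation with DET scores. All numbers are invented, and the reported adj. r values presumably reflect adjustments beyond this raw correlation:

```python
# Illustrative arithmetic only: a credit-weighted grade average for one
# student, then a raw Pearson correlation between DET scores and grades
# across a hypothetical cohort. All values are invented.
import numpy as np
from scipy.stats import pearsonr

# One student's module grades and credit weights (hypothetical)
grades = np.array([65, 72, 58, 80])
credits = np.array([30, 15, 15, 60])
print(f"credit-weighted grade: {np.average(grades, weights=credits):.1f}")

# Cohort-level association (hypothetical, equal-length arrays)
det_scores = np.array([110, 125, 95, 130, 120, 105])
mean_grades = np.array([62.0, 68.5, 55.0, 71.2, 66.8, 60.3])
r, p = pearsonr(det_scores, mean_grades)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```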
ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models
AI-generated content (AIGC) presents a considerable challenge to educators
around the world. Instructors need to be able to detect text generated by
large language models, either with the naked eye or with the help of
tools. There is also a growing need to understand the lexical, syntactic and
stylistic features of AIGC. To address these challenges in English language
teaching, we first present ArguGPT, a balanced corpus of 4,038 argumentative
essays generated by 7 GPT models in response to essay prompts from three
sources: (1) in-class or homework exercises, (2) TOEFL and (3) GRE writing
tasks. Machine-generated texts are paired with a roughly equal number of
human-written essays at three score levels, matched by essay prompt. We then
hire English instructors to distinguish machine essays from human ones. Results
show that when first exposed to machine-generated essays, the instructors
detect them with only 61% accuracy. This rises to 67% after
one round of minimal self-training. Next, we perform linguistic analyses of
these essays, which show that machines produce sentences with more complex
syntactic structures while human essays tend to be lexically more complex.
Finally, we test existing AIGC detectors and build our own detectors using SVMs
and RoBERTa. Results suggest that a RoBERTa fine-tuned with the training set of
ArguGPT achieves above 90% accuracy in both essay- and sentence-level
classification. To the best of our knowledge, this is the first comprehensive
analysis of argumentative essays produced by generative large language models.
Machine-authored essays in ArguGPT and our models will be made publicly
available at https://github.com/huhailinguist/ArguGPT
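As a hedged illustration of the detector-building step, here is a minimal fine-tuning sketch for a RoBERTa-based human-vs-machine essay classifier using Hugging Face transformers. The tiny in-memory dataset stands in for ArguGPT's training split, and the hyperparameters are illustrative rather than the paper's:

```python
# Hedged sketch: fine-tune RoBERTa for binary human-vs-machine essay
# classification. The in-memory texts stand in for the ArguGPT training
# split; hyperparameters are illustrative, not the paper's.
import torch
from torch.utils.data import Dataset
from transformers import (RobertaTokenizerFast,
                          RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

texts = ["Essay written by a student ...",
         "Essay generated by a GPT model ..."] * 8
labels = [0, 1] * 8  # 0 = human, 1 = machine

class EssayDataset(Dataset):
    """Wraps tokenized essays and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=EssayDataset(texts, labels),
)
trainer.train()
```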