
    Automated assessment of non-native learner essays: Investigating the role of linguistic features

    Automatic essay scoring (AES) refers to the process of scoring free text responses to given prompts, considering human grader scores as the gold standard. Writing such essays is an essential component of many language and aptitude exams. Hence, AES became an active and established area of research, and there are many proprietary systems used in real-life applications today. However, not much is known about which specific linguistic features are useful for prediction and how much of this is consistent across datasets. This article addresses that by exploring the role of various linguistic features in automatic essay scoring using two publicly available datasets of non-native English essays written in test-taking scenarios. The linguistic properties are modeled by encoding lexical, syntactic, discourse and error types of learner language in the feature set. Predictive models are then developed using these features on both datasets and the most predictive features are compared. While the results show that the feature set used results in good predictive models with both datasets, the question “what are the most predictive features?” has a different answer for each dataset. Note: this article was accepted for publication in the International Journal of Artificial Intelligence in Education (IJAIED), to appear in early 2017 (journal URL: http://www.springer.com/computer/ai/journal/40593).
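    As a rough illustration of the pipeline this abstract describes, the sketch below shows how a few lexical and syntactic surface features might be extracted from essays and fed to an off-the-shelf regressor whose feature importances can then be compared across datasets. It is a minimal sketch, not the authors' system; the feature set, scoring scale, and all names are illustrative assumptions.

```python
# Minimal sketch (not the authors' system): encode a few lexical/syntactic
# proxies per essay, train a regressor against human scores, then compare
# feature importances per dataset. Feature choices here are illustrative.
from dataclasses import dataclass
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

@dataclass
class Essay:
    text: str
    score: float  # human grader score, treated as the gold standard

def extract_features(essay: Essay) -> list:
    tokens = essay.text.split() or [""]
    types = {t.lower() for t in tokens}
    sentences = [s for s in essay.text.split(".") if s.strip()] or [""]
    return [
        len(tokens),                                          # essay length (fluency proxy)
        len(types) / len(tokens),                             # type-token ratio (lexical variety)
        float(np.mean([len(t) for t in tokens])),             # mean word length (sophistication proxy)
        float(np.mean([len(s.split()) for s in sentences])),  # mean sentence length (syntax proxy)
    ]

def train_and_inspect(essays):
    X = np.array([extract_features(e) for e in essays])
    y = np.array([e.score for e in essays])
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    print("mean cross-validated R^2:", cross_val_score(model, X, y, cv=5).mean())
    model.fit(X, y)
    # Repeating this per dataset and ranking importances is one way to ask
    # "what are the most predictive features?" for each corpus separately.
    print("feature importances:", model.feature_importances_)
```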

    An Efficient Probabilistic Deep Learning Model for the Oral Proficiency Assessment of Student Speech Recognition and Classification

    Natural Language Processing is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. Speech recognition systems utilize machine learning algorithms and statistical models to analyze acoustic features of speech, such as pitch, duration, and frequency, to convert spoken words into written text. The Student English Oral Proficiency Assessment and Feedback System provides students with a comprehensive evaluation of their spoken English skills and offers tailored feedback to help them improve. It can be used in language learning institutions, universities, or online platforms to support language education and enhance oral communication abilities. This paper constructs a framework, termed Latent Dirichlet Integrated Deep Learning (LDiDL), for the assessment of student English proficiency. The system begins by collecting a comprehensive dataset of spoken English samples, encompassing various proficiency levels. Relevant features are extracted from the samples, including acoustic characteristics and linguistic attributes. Leveraging Latent Dirichlet Allocation (LDA), the system uncovers latent topics within the data, enabling a deeper understanding of the underlying themes present in the spoken English. To further enhance the analysis, a deep learning model is developed, integrating the LDA topics with the extracted features. This model is trained using appropriate techniques and evaluated using performance metrics. Utilizing the predictions made by the model, the system generates personalized feedback for each student, focusing on areas of improvement such as vocabulary, grammar, fluency, and pronunciation. The simulation uses native English speech audio for LDiDL training and classification. The experimental analysis shows that the proposed LDiDL model achieves an accuracy of 99% for the assessment of English proficiency.
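    To make the integration step concrete, here is a hedged sketch of the general idea of combining LDA topic proportions (derived from speech transcripts) with acoustic feature vectors in a neural classifier. It is not the paper's LDiDL architecture; the library choices, dimensions, and names are assumptions made for illustration.

```python
# Hedged sketch of combining LDA topic features with acoustic features in a
# neural classifier; this is not the paper's LDiDL architecture, and all
# names and dimensions are illustrative assumptions.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

def build_topic_features(transcripts: list, n_topics: int = 10) -> np.ndarray:
    """Infer latent topic proportions from transcripts of the speech samples."""
    counts = CountVectorizer(stop_words="english").fit_transform(transcripts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)  # one topic-distribution row per sample

def train_proficiency_classifier(transcripts, acoustic_features, labels):
    """acoustic_features: (n_samples, n_features) array of pitch/duration/frequency statistics."""
    topics = build_topic_features(transcripts)
    X = np.hstack([topics, acoustic_features])   # integrate LDA topics with extracted features
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    clf.fit(X, labels)                           # labels: proficiency levels
    return clf
```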

    Predicting ESL learners’ oral proficiency by measuring the collocations in their spontaneous speech

    Collocation, known as words that commonly co-occur, is a major category of formulaic language. There is now general consensus among language researchers that collocation is essential to effective language use in real-world communication situations (Ellis, 2008; Nesselhauf, 2005; Schmitt, 2010; Wray, 2002). Although a number of contemporary speech-processing theories assume the importance of formulaic language to spontaneous speaking (Bygate, 1987; de Bot, 1992; Kormos, 2006; Levelt, 1999), none of them gives an adequate explanation of the role that collocation plays in speech communication. In the practices of L2 speaking assessment, a test taker’s collocational performance is usually not separately scored, mainly because human raters can only focus on a limited range of speech characteristics (Luoma, 2004). This paper argues for the centrality of collocation evaluation to communication-oriented L2 oral assessment. Based on a logical analysis of the conceptual connections among collocation, speech-processing theories, and rubrics for oral language assessment, the author formulated a new construct called Spoken Collocational Competence (SCC). In light of Skehan’s (1998, 2009) trade-off hypothesis, he developed a series of measures for SCC, namely Operational Collocational Performance Measures (OCPMs), to cover three dimensions of learner collocation performance in spontaneous speaking: collocation accuracy, collocation complexity, and collocation fluency. He then investigated the empirical performance of these measures with 2344 lexical collocations extracted from sixty adult English as a second language (ESL) learners’ oral assessment data collected in two distinctive contexts of language use: conversing with an interlocutor on daily-life topics (the SPEAK exam) and giving an academic lecture (the TEACH exam). Multiple regression and logistic regression were performed on criterion measures of these learners’ oral proficiency (i.e., human holistic scores and oral proficiency certification decisions) as a function of the OCPMs. The study found that the participants generally achieved higher collocation accuracy and complexity in the TEACH exam than in the SPEAK exam. In addition, the OCPMs as a whole predicted the participants’ oral proficiency certification status (certified or uncertified) with high accuracy (Nagelkerke R2 = .968). However, the predictive power of OCPMs for human holistic scores seemed to be higher in the SPEAK exam (adjusted R2 = .678) than in the TEACH exam (adjusted R2 = .573). These findings suggest that L2 learners’ collocational performance in free speech deserves examiners’ closer attention and that SCC may contribute to the construct of oral proficiency somewhat differently across speaking contexts. Implications for L2 speaking theory, automated speech evaluation, and teaching and learning of oral communication skills are discussed.
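    For readers curious about the regression step, the sketch below shows one way per-speaker collocation measures could be related to certification decisions with logistic regression, including a Nagelkerke pseudo-R^2 derived from the model log-likelihoods. The column names are hypothetical and the code is an illustration, not the author's analysis.

```python
# Minimal sketch, assuming per-speaker collocation measures (accuracy,
# complexity, fluency) have already been computed. Column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def nagelkerke_r2(fit) -> float:
    """Nagelkerke pseudo-R^2 from a fitted statsmodels Logit result."""
    n = fit.nobs
    cox_snell = 1 - np.exp(2 * (fit.llnull - fit.llf) / n)
    max_cox_snell = 1 - np.exp(2 * fit.llnull / n)
    return cox_snell / max_cox_snell

def model_certification(df: pd.DataFrame):
    # df columns (assumed): colloc_accuracy, colloc_complexity, colloc_fluency,
    # certified (1 = certified, 0 = uncertified)
    X = sm.add_constant(df[["colloc_accuracy", "colloc_complexity", "colloc_fluency"]])
    fit = sm.Logit(df["certified"], X).fit(disp=False)
    print(fit.summary())
    print("Nagelkerke R^2:", nagelkerke_r2(fit))
    return fit
```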

    Development and validation of an automated essay scoring engine to assess students’ development across program levels

    As English as a second language (ESL) populations in English-speaking countries continue to grow steadily, the need for methods of accounting for students’ academic success in college has become increasingly self-evident. Holistic assessment practices often lead to subjective and vague descriptions of learner language level, such as beginner, intermediate, or advanced (Ellis & Larsen-Freeman, 2006). Objective measurements (e.g., the number of error-free T-units) used in second language production and proficiency research provide precise specifications of students’ development (Housen, Kuiken, & Vedder, 2012; Norris & Ortega, 2009; Wolfe-Quintero, Inagaki, & Kim, 1998); however, the process of obtaining a profile of a student’s development by using these objective measures requires many resources, especially time. In the ESL writing curriculum, class sizes are frequently expanding and instructors’ workloads are often high (Kellogg, Whiteford, & Quinlan, 2010); thus, time is scarce, making accountability for students’ development difficult to manage. The purpose of this research is to develop and validate an automated essay scoring (AES) engine to address the need for resources that provide precise descriptions of students’ writing development. Development of the engine utilizes measures of complexity, accuracy, fluency, and functionality (CAFF), which are guided by Complexity Theory and Systemic Functional Linguistics. These measures were built into computer algorithms by using a hybrid approach to natural language processing (NLP), which includes the statistical parsing of student texts and rule-based feature detection. Validation follows an interpretive argument-based approach to demonstrate the adequacy and appropriateness of AES scores. Results provide a mixed set of validity evidence both for and against the use of CAFFite measures for assessing development. Findings are meaningful for continued development and expansion of the AES engine into a tool that provides individualized diagnostic feedback for theory- and data-driven teaching and learning. The results also underscore the possibilities of using computerized writing assessment for measuring, collecting, analyzing, and reporting data about learners and their contexts to understand and optimize learning and teaching.
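    The following is a simplified sketch of the kind of parser-plus-rules measurement the abstract mentions: a statistical parser (spaCy here, purely as an assumption) supplies the analysis, and simple rules turn it into complexity-, fluency-, and lexis-style proxies. These proxies are illustrative and are not the CAFF algorithms built into the engine.

```python
# Simplified sketch: statistical parsing plus rule-based feature detection,
# producing complexity/fluency/lexis-style proxies. Not the engine's CAFF measures.
import spacy

nlp = spacy.load("en_core_web_sm")  # statistical parsing of the student text

CLAUSE_DEPS = {"ccomp", "xcomp", "advcl", "acl", "relcl", "csubj"}  # rule-based clause cues

def caf_profile(text: str) -> dict:
    doc = nlp(text)
    sents = list(doc.sents)
    words = [t for t in doc if t.is_alpha]
    clauses = sum(1 for t in doc if t.dep_ in CLAUSE_DEPS) + len(sents)
    return {
        "fluency_words": len(words),                                  # text length
        "mean_sentence_length": len(words) / max(len(sents), 1),      # complexity proxy
        "clauses_per_sentence": clauses / max(len(sents), 1),         # subordination proxy
        "lexical_diversity": len({t.lemma_.lower() for t in words}) / max(len(words), 1),
    }
```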

    Technology and Testing

    From early answer sheets filled in with number 2 pencils, to tests administered by mainframe computers, to assessments wholly constructed by computers, it is clear that technology is changing the field of educational and psychological measurement. The numerous and rapid advances have an immediate impact on test creators, assessment professionals, and those who implement and analyze assessments. This comprehensive new volume brings together leading experts on the issues posed by technological applications in testing, with chapters on game-based assessment, testing with simulations, video assessment, computerized test development, large-scale test delivery, model choice, validity, and error issues. Including an overview of existing literature and ground-breaking research, each chapter considers the technological, practical, and ethical considerations of this rapidly-changing area. Ideal for researchers and professionals in testing and assessment, Technology and Testing provides a critical and in-depth look at one of the most pressing topics in educational testing today.

    Assessment and Testing: Overview

    What language testing does is to compel attention to meaning of ideas in linguistics and applied linguistics. Until they are put into operation, described, and explained, ideas remain ambiguous and fugitive. A test forces choice, removes ambiguity, and reveals what has been elusive: thus a test is the most explicit form of description, on the basis of which the tester comes clean about his/her ideas. (Davies, 1990, p. 2) These words express what many applied linguists recognize: that language assessment and testing intersects almost all language-related issues that applied linguists study.

    Meeting the Challenges to Measurement in an Era of Accountability

    Under pressure and support from the federal government, states have increasingly turned to indicators based on student test scores to evaluate teachers and schools, as well as students themselves. The focus thus far has been on test scores in those subject areas where there is a sequence of consecutive tests, such as in mathematics or English/language arts, with a focus on grades 4-8. Teachers in these subject areas, however, constitute less than thirty percent of the teacher workforce in a district. Comparatively little has been written about the measurement of achievement in the other grades and subjects. This volume seeks to remedy this imbalance by focusing on the assessment of student achievement in a broad range of grade levels and subject areas, with particular attention to their use in the evaluation of teachers and schools. It addresses traditional end-of-course tests, as well as alternative measures such as portfolios, exhibitions, and student learning objectives. In each case, issues related to design and development, psychometric considerations, and validity challenges are covered from both a generic and a content-specific perspective. The NCME Applications of Educational Measurement and Assessment series includes edited volumes designed to inform research-based applications of educational measurement and assessment. Edited by leading experts, these books are comprehensive and practical resources on the latest developments in the field. The NCME series editorial board comprises Michael J. Kolen, Chair; Robert L. Brennan; Wayne Camara; Edward H. Haertel; Suzanne Lane; and Rebecca Zwick.

    Modeling statistics ITAs’ speaking performances in a certification test

    In light of the ever-increasing capability of computer technology and advancement in speech and natural language processing techniques, automated speech scoring of constructed responses is gaining popularity in many high-stakes assessment and low-stakes educational settings. Automated scoring is a highly interdisciplinary and complex subject, and much remains unknown about the strengths and weaknesses of automated speech scoring systems (Evanini & Zechner, 2020). Research in automated speech scoring has centered on a few proprietary systems owned by large testing companies. Consequently, existing systems only serve large-scale standardized assessment purposes. Application of automated scoring technologies in local assessment contexts is much desired but rarely realized because these systems’ inner workings have remained unfamiliar to many language assessment professionals. Moreover, assumptions about the reliability of human scores, on which automated scoring systems are trained, are untenable in many local assessment situations, where a myriad of factors would work together to co-determine the human scores. These factors may include the rating design, the test takers’ abilities, and the raters’ specific rating behaviors (e.g., severity/leniency, internal consistency, and application of the rating scale). In an attempt to apply automated scoring procedures to a local context, the primary purpose of this study is to develop and evaluate an appropriate automated speech scoring model for a local certification test of international teaching assistants (ITAs). To meet this goal, this study first implemented feature extraction and selection based on existing automated speech scoring technologies and the scoring rubric of the local speaking test. Then, the reliability of the human ratings was investigated based on both Classical Test Theory (CTT) and Item Response Theory (IRT) frameworks, focusing on detecting potential rater effects that could negatively impact the quality of the human scores. Finally, by experimenting with and comparing a series of statistical modeling options, this study investigated the extent to which the association between the automatically extracted features and the human scores could be statistically modeled to offer a mechanism that reflects the multifaceted nature of the performance assessment in a unified statistical framework. The extensive search for the speech or linguistic features, covering the sub-domains of fluency, pronunciation, rhythm, vocabulary, grammar, content, and discourse cohesion, revealed that a small set of useful variables could be identified. A large number of features could be effectively summarized as single latent factors that showed reasonably high associations with the human scores. Reliability analysis of human scoring indicated that both inter-rater reliability and intra-rater reliability were acceptable, and through a fine-grained IRT analysis, several raters who were prone to central tendency or randomness effects were identified. Model fit indices, model performance in prediction, and model diagnostics results in the statistical modeling indicated that the most appropriate approach to model the relationship between the features and the final human scores was a cumulative link model (CLM). In contrast, the most appropriate approach to model the relationship between the features and the ratings from the multiple raters was a cumulative link mixed model (CLMM).
These models suggested that higher ability levels were significantly related to the lapse of time, faster speech with fewer disfluencies, more varied and sophisticated vocabulary, more complex syntactic structures, and fewer rater effects. Based on the model’s predictions on unseen data, the rating-level CLMM achieved an accuracy of 0.64, a Pearson correlation of 0.58, and a quadratically-weighted kappa of 0.57, as compared to the human ratings on the 3-point scale. Results from this study could be used to inform the development, design, and implementation of a prototypical automated scoring system for prospective ITAs, as well as to provide empirical evidence for future scale development, rater training, and support for assessment-related instruction for the testing program and diagnostic feedback for the ITA test takers.
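    As a concrete, hedged illustration of the rating-level modeling described above, the sketch below fits a cumulative link (ordinal logit) model with statsmodels and scores held-out predictions with quadratically-weighted kappa. The feature columns are hypothetical summaries, and the rater random effects of the CLMM are not shown; a mixed ordinal model (e.g., R's ordinal::clmm) would be needed for that part.

```python
# Hedged sketch of a cumulative link (ordinal logit) model relating feature
# summaries to 3-point human ratings, evaluated with quadratically-weighted
# kappa on held-out data. Feature columns are hypothetical assumptions.
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statsmodels.miscmodels.ordinal_model import OrderedModel

def fit_clm(train: pd.DataFrame, test: pd.DataFrame):
    features = ["fluency_factor", "pronunciation_factor", "vocab_factor"]
    model = OrderedModel(train["rating"], train[features], distr="logit")
    result = model.fit(method="bfgs", disp=False)

    # Predicted category = most probable rating level for each unseen response.
    probs = result.predict(test[features])
    predicted = np.asarray(probs).argmax(axis=1) + int(train["rating"].min())

    qwk = cohen_kappa_score(test["rating"], predicted, weights="quadratic")
    print("quadratically-weighted kappa on held-out data:", qwk)
    return result
```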

    An argument-based validation study of the English Placement Test (EPT) – Focusing on the inferences of extrapolation and ramification

    English placement tests have been widely used in higher education as post-admission assessment instruments to measure admitted English as a second language (ESL) students’ English proficiency or readiness in academic English, usually upon their arrival at universities in English-speaking countries. Unlike commercial standardized English proficiency tests, many English placement tests are locally developed with comparatively limited resources and are relatively under-investigated in the field of language testing. Even less attention has been directed to the score interpretation and the impact of placement decisions on ESL students’ English learning and academic achievement. Undoubtedly, this scarcity of research on English placement tests is inappropriate in view of their status as one of the most frequently used language testing instruments, which may exert an immediate and strong impact on ESL students’ learning in general. By employing a mixed-methods approach, this dissertation project investigates the validity of the test score interpretation and use of the English Placement Test (EPT) used at Iowa State University (ISU) under an argument-based validity framework. More specifically, this study started with an interpretation and use argument for the EPT, which states the score meaning and intended impact of the EPT explicitly, and focused on the last two inferences in that argument, namely extrapolation and ramification. The extrapolation inference links expected scores on the EPT (scores that exhibit adequate test reliability) to target scores, or actual performance in the target domain. In this study, the extrapolation inference requires investigation of the relationship between ESL students’ English placement test performance and two external criteria of English performance: the TOEFL iBT and a self-assessment. The ramification inference links the use of the EPT results to its actual impact, and in this study it requires investigation of the impact of the placement decisions in a specific educational context. For the extrapolation inference, quantitative data such as test performance data on the EPT, the TOEFL iBT, and the self-assessment were collected and analyzed using multitrait-multimethod (MTMM) analysis techniques. The findings indicated that the EPT had moderate relationships with the TOEFL iBT and weak to moderate relationships with the self-assessment. The EPT showed some of the expected convergent correlations as well as discriminant correlations based on the MTMM correlation coefficient matrix as well as the factor loading parameters in a correlated trait-correlated uniqueness (CTCU) model. For the ramification inference, three types of analyses were conducted to seek support with regard to 1) test stakeholders’ perceptions of the EPT placement decisions, 2) the impact of the EPT placement on ESL students’ English learning, and 3) the relationship between ESL students’ EPT performance and their first-semester academic achievement. The interviews with test stakeholders were coded and analyzed to identify statements indicating their perceptions of the impact of the placement decisions. The qualitative findings were also used to help interpret the quantitative findings. Multiple paired-samples t-tests were used to investigate ESL students’ progress in the ESL courses that they were placed into.
In addition, a structural equation modeling (SEM) approach was used to model the relationship among students’ performance on the EPT, ESL courses, and their first-semester GPA, mediated by individual difference constructs such as learning motivation, academic self-efficacy, and self-regulated learning strategies. The qualitative analyses of the interviews with four groups of test stakeholders show that the interviewed ESL students in general experienced initial frustration regarding the placement decisions but that, in retrospect, they understood why they were placed into ESL courses and appreciated the benefits of taking the required courses, especially the ESL writing courses. The ESL course instructors were satisfied with the placement accuracy, even though they occasionally identified a few cases of potentially misplaced students in the ESL courses. The interviewed undergraduate academic advisors showed positive perceptions of the EPT and the placement decisions. They also reported observing that the majority of their ESL advisees were receptive to the EPT placement decisions. The analyses of ESL course performance data collected at the beginning and the end of the course indicate that ESL students in Engl99L, an ESL listening course focusing on listening strategies, made statistically significant progress in terms of score gain on the same listening test administered at two time points. However, only nine out of 38 ESL students made satisfactory progress with reference to the course standard. Students in Engl101B (a lower-level ESL academic English writing course) and Engl101C (a higher-level ESL academic English writing course) did not show much progress in terms of lexical complexity, syntactic complexity, and grammatical accuracy. However, the Engl101C students on average wrote longer essays at the end of the course. Based on the ratings of the essays written in the final exams using the EPT scoring rubric, 14 out of 18 Engl101B students (77.8%) and eight out of 16 Engl101C students (50%) showed satisfactory progress in these classes and were deemed ready for the next level of English study. The SEM analysis results indicate that ESL students’ EPT performance had a significant and direct impact on their academic achievement. Moreover, students’ EPT performance predicted their academic self-efficacy and affected extrinsic goal orientation. However, these motivational factors did not have a direct impact on academic achievement. The findings in this study contribute to building the validity argument for the EPT: two of the assumptions underlying the warrants for the extrapolation and ramification inferences were supported, and the other three were partially supported. The findings also contribute to a better understanding of the score interpretation and use of the EPT at Iowa State University and shed light on the future development of the EPT and other similar English placement tests. Both the findings and the research methodology can be informative for other institutions where English placement tests are used.
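    To illustrate two of the quantitative pieces of this design, the sketch below builds an MTMM-style correlation matrix across instruments and runs a paired-samples t-test on pre/post course scores. The column names are assumptions; the CTCU and SEM models reported in the study require dedicated SEM modeling and are not reproduced here.

```python
# Minimal sketch of an MTMM-style correlation matrix and a paired-samples
# t-test on course gains. Column names are assumed, not from the study's data.
import pandas as pd
from scipy import stats

def mtmm_matrix(scores: pd.DataFrame) -> pd.DataFrame:
    # scores columns (assumed), e.g.: EPT_listening, EPT_writing,
    # TOEFL_listening, TOEFL_writing, Self_listening, Self_writing.
    # Convergent evidence: the same trait measured by different methods should
    # correlate more strongly than different traits within one method.
    return scores.corr(method="pearson").round(2)

def pre_post_gain(pre: pd.Series, post: pd.Series) -> None:
    t, p = stats.ttest_rel(post, pre)  # paired-samples t-test on course gains
    print(f"mean gain = {(post - pre).mean():.2f}, t = {t:.2f}, p = {p:.4f}")
```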