1,391 research outputs found
Recommended from our members
Automatically Grading Learners’ English Using a Gaussian Process
There is a high demand around the world for the learning of English as a second language. Correspondingly, there is a need to assess the proficiency level of learners both during their studies and for formal qualifications. A number of automatic methods have been proposed to help meet this demand with varying degrees of success. This paper considers the automatic assessment of spoken English proficiency, which is still a challenging problem. In this scenario, the grader should be able to accurately assess the learner’s ability level from spontaneous, prompted, speech, independent of L1 language and the quality of the audio recording. Automatic graders are potentially more consistent than humans. However, the validity of the predicted grade varies. This paper proposes an automatic grader based on a Gaussian process. The advantage of using a Gaussian process is that as well as predicting a grade, it provides a measure of the uncertainty of its prediction. The uncertainty measure is sufficiently accurate to decide which automatic grades should be re-graded by humans. It can also be used to determine which candidates are hard to grade for humans and therefore need expert grading. Performance of the automatic grader is shown to be close to human graders on real candidate entries. Interpolation of human and GP grades further boosts performance.This work was supported by Cambridge English, University of Cambridge.This is the author accepted manuscript. The final version is available from ISCA via http://www.isca-speech.org/archive/slate_2015/sl15_007.htm
Recommended from our members
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to
considerable interest in automatic spoken language assessment, whether for use in computerassisted language learning (CALL) tools or for grading candidates for formal qualifications.
This thesis presents research conducted into the automatic assessment of spontaneous nonnative English speech, with a view to be able to provide meaningful feedback to learners. One
of the challenges in automatic spoken language assessment is giving candidates feedback on
particular aspects, or views, of their spoken language proficiency, in addition to the overall
holistic score normally provided. Another is detecting pronunciation and other types of errors
at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can
suffer issues of consistency. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker’s ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single
views, specifically each of pronunciation, rhythm, intonation and text. By training each
system on large quantities of candidate data, individual-view information should be possible
to extract. The relationships between the predictions of each system are evaluated to examine
whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict holistic score are investigated, namely averaging their
predictions and concatenating and attending over their intermediate representations. The
combined graders are compared to each other and to baseline approaches.
The tasks of error detection and error tendency diagnosis become particularly challenging
when the speech in question is spontaneous and particularly given the challenges posed by
the inconsistency of human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate’s speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
x
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated and a method presented for detecting individual accent and
lexical errors and diagnosing accent error tendencies at the speaker level
Recommended from our members
Towards automatic assessment of spontaneous spoken English
With increasing global demand for learning English as a second language, there has been considerable interest in
methods of automatic assessment of spoken language proficiency for use in interactive electronic learning tools as
well as for grading candidates for formal qualifications. This paper presents an automatic system to address the
assessment of spontaneous spoken language. Prompts or questions requiring spontaneous speech responses elicit
more natural speech which better reflects a learner’s proficiency level than read speech. In addition to the challenges
of highly variable non-native, learner, speech and noisy real-world recording conditions, this requires any automatic
system to handle disfluent, non-grammatical, spontaneous speech with the underlying text unknown. To handle these,
a strong deep learning based speech recognition system is applied in combination with a Gaussian Process (GP)
grader. A range of features derived from the audio using the recognition hypothesis are investigated for their efficacy
in the automatic grader. The proposed system is shown to predict grades at a similar level to the original examiner
graders on real candidate entries. Interpolation with the examiner grades further boosts performance. The ability to
reject poorly estimated grades is also important and measures are proposed to evaluate the performance of rejection
schemes. The GP variance is used to decide which automatic grades should be rejected. Back-off to an expert grader
for the least confident grades gives gains.Cambridge Assessment Englis
Impact of ASR performance on free speaking language assessment
In free speaking tests candidates respond in spontaneous speech to prompts. This form of test allows the spoken language proficiency of a non-native speaker of English to be assessed more fully than read aloud tests. As the candidate's responses are unscripted, transcription by automatic speech recognition (ASR) is essential for automated assessment. ASR will never be 100% accurate so any assessment system must seek to minimise and mitigate ASR errors. This paper considers the impact of ASR errors on the performance of free speaking test auto-marking systems. Firstly rich linguistically related features, based on part-of-speech tags from statistical parse trees, are investigated for assessment. Then, the impact of ASR errors on how well the system can detect whether a learner's answer is relevant to the question asked is evaluated. Finally, the impact that these errors may have on the ability of the system to provide detailed feedback to the learner is analysed. In particular, pronunciation and grammatical errors are considered as these are important in helping a learner to make progress. As feedback resulting from an ASR error would be highly confusing, an approach to mitigate this problem using confidence scores is also analysed
Recommended from our members
Modelling text meta-properties in automated text scoring for non-native English writing
Automated text scoring (ATS) is the task of automatically scoring a text based on some given grading criteria. This thesis focuses on ATS in the context of free-text writing exams aimed at learners of English as a foreign language (EFL). The benefit of an ATS system is primarily to provide instant and consistent feedback to language learners, and service reliability also forms a crucial part of an ATS system. Based on previous work, we investigated only partially explored meta-properties in text and integrated them into a machine learning based ATS model across multiple datasets:
In most previous work, the proposed models implicitly assume that texts produced by learners in an exam are written independently. However, this is not true for the exams where learners are required to compose multiple texts. We hence explicitly instructed our model which texts are written by the same learner, which boosts model performance in most cases.
We used three intra-exam properties within the same exam including prompt, genre and task as a starting point, and we showed that explicitly modelling these properties via frustratingly easy domain adaptation (FEDA) can positively affect model performance in some cases. Furthermore, modelling multiple intra-exam properties together is better than modelling any single property individually or no property in four out of five test sets.
We studied how to utilise and combine learners' responses from multiple writing exams. We also proposed a new variant of the transfer-learning ATS model which mitigates the drawbacks of previous work. This variant first builds a ranking model across multiple datasets via FEDA, and the ranking score of each text predicted by the ranking model is used as an extra feature in the baseline model. This variants gives improvement compared to a baseline model on the development sets in terms of root-mean-square error. Furthermore, the transfer-learning model utilising multiple datasets tuned on each development set is always better than the baseline model on the corresponding test set.
We found that different datasets favour different meta properties. We therefore combined all the models looking at different meta properties together using ensemble learning. Compared to the baseline model, the combined model has a statistically significant improvement on all the test sets in terms of root-mean-square error based on a permutation test.The Institute for Automated Language Teaching and Assessmen
Incorporating uncertainty into deep learning for spoken language assessment
There is a growing demand for automatic
assessment of spoken English proficiency.
These systems need to handle large vari-
ations in input data owing to the wide
range of candidate skill levels and L1s, and
errors from ASR. Some candidates will
be a poor match to the training data set,
undermining the validity of the predicted
grade. For high stakes tests it is essen-
tial for such systems not only to grade
well, but also to provide a measure of
their uncertainty in their predictions, en-
abling rejection to human graders. Pre-
vious work examined Gaussian Process
(GP) graders which, though successful, do
not scale well with large data sets. Deep
Neural Networks (DNN) may also be used
to provide uncertainty using Monte-Carlo
Dropout (MCD). This paper proposes a
novel method to yield uncertainty and
compares it to GPs and DNNs with MCD.
The proposed approach
explicitly
teaches
a DNN to have low uncertainty on train-
ing data and high uncertainty on generated
artificial data. On experiments conducted
on data from the Business Language Test-
ing Service (BULATS), the proposed ap-
proach is found to outperform GPs and
DNNs with MCD in uncertainty-based re-
jection whilst achieving comparable grad-
ing performance
Use of graphemic lexicons for spoken language assessment
Copyright © 2017 ISCA. Automatic systems for practice and exams are essential to support the growing worldwide demand for learning English as an additional language. Assessment of spontaneous spoken English is, however, currently limited in scope due to the difficulty of achieving sufficient automatic speech recognition (ASR) accuracy. "Off-the-shelf" English ASR systems cannot model the exceptionally wide variety of accents, pronunications and recording conditions found in non-native learner data. Limited training data for different first languages (L1s), across all proficiency levels, often with (at most) crowd-sourced transcriptions, limits the performance of ASR systems trained on non-native English learner speech. This paper investigates whether the effect of one source of error in the system, lexical modelling, can be mitigated by using graphemic lexicons in place of phonetic lexicons based on native speaker pronunications. Graphemicbased English ASR is typically worse than phonetic-based due to the irregularity of English spelling-to-pronunciation but here lower word error rates are consistently observed with the graphemic ASR. The effect of using graphemes on automatic assessment is assessed on different grader feature sets: audio and fluency derived features, including some phonetic level features; and phone/grapheme distance features which capture a measure of pronunciation ability
Reflecting Comprehension through French Textual Complexity Factors
International audienceResearch efforts in terms of automatic textual complexity analysis are mainly focused on English vocabulary and few adaptations exist for other languages. Starting from a solid base in terms of discourse analysis and existing textual complexity assessment model for English, we introduce a French model trained on 200 documents extracted from school manuals pre-classified into five complexity classes. The underlying textual complexity metrics include surface, syntactic, morphological, semantic and discourse specific factors that are afterwards combined through the use of Support Vector Machines. In the end, each factor is correlated to pupil comprehension metrics scores, spanning throughout multiple classes, therefore creating a clearer perspective in terms of measurements impacting the perceived difficulty of a given text. In addition to purely quantitative surface factors, specific parts of speech and cohesion have proven to be reliable predictors of learners' comprehension level, creating nevertheless a strong background for building dependable French textual complexity models
- …