29 research outputs found

Improving multiple-crowd-sourced transcriptions using a speech recogniser

Authors: Gales MJF, Knill KM, Tsiakoulis P, Van Dalen RC
Publication venue: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publication date: 01/04/2015

This paper introduces a method to produce high-quality transcriptions of speech data from only two crowd-sourced transcriptions. These transcriptions, produced cheaply by people on the Internet, for example through Amazon Mechanical Turk, are often of low quality. Often, multiple crowd-sourced transcriptions are combined to form one transcription of higher quality. However, the state of the art is to use essentially a form of majority voting, which requires at least three transcriptions for each utterance. This paper shows how to refine this approach to work with only two transcriptions. It then introduces a method that uses a speech recogniser (bootstrapped on a simple combination scheme) to combine transcriptions. When only two crowd-sourced transcriptions are available, on a noisy data set this improves the word error rate to gold-standard transcriptions by 21% relative. This paper reports on research supported by Cambridge English, University of Cambridge. This is the accepted manuscript of a paper that will be published in the Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. It is currently under an infinite embargo.
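The abstract reports its gain as a relative reduction in word error rate (WER). As a minimal sketch of how such a figure is usually computed, the Python below scores a hypothesis against a reference with word-level edit distance and then takes the relative reduction between two WER values. The function names and the example numbers are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Made-up example values, chosen only to illustrate the arithmetic.
example_wer = wer("the cat sat on the mat", "the cat sat on mat")
example_gain = relative_reduction(0.20, 0.158)
```

With these made-up values, a drop from 20% to 15.8% WER is a 21% relative reduction, although the absolute difference is only 4.2 points.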

Apollo (Cambridge)

IMPROVING MULTIPLE-CROWD-SOURCED TRANSCRIPTIONS USING A SPEECH RECOGNISER

Authors: K M Knill, M J F Gales, P Tsiakoulis, R C Van Dalen
Publication date: 03/04/2020

This paper introduces a method to produce high-quality transcriptions of speech data from only two crowd-sourced transcriptions. These transcriptions, produced cheaply by people on the Internet, for example through Amazon Mechanical Turk, are often of low quality. Often, multiple crowd-sourced transcriptions are combined to form one transcription of higher quality. However, the state of the art is to use essentially a form of majority voting, which requires at least three transcriptions for each utterance. This paper shows how to refine this approach to work with only two transcriptions. It then introduces a method that uses a speech recogniser (bootstrapped on a simple combination scheme) to combine transcriptions. When only two crowd-sourced transcriptions are available, on a noisy data set this improves the word error rate to gold-standard transcriptions by 21% relative.

CiteSeerX

Automatically Grading Learners’ English Using a Gaussian Process

Authors: Gales MJF, Knill KM, van Dalen RC
Publication venue: Speech and Language Technology in Education, SLaTE 2015
Publication date: 04/08/2015

There is a high demand around the world for the learning of English as a second language. Correspondingly, there is a need to assess the proficiency level of learners both during their studies and for formal qualifications. A number of automatic methods have been proposed to help meet this demand with varying degrees of success. This paper considers the automatic assessment of spoken English proficiency, which is still a challenging problem. In this scenario, the grader should be able to accurately assess the learner’s ability level from spontaneous, prompted speech, independent of L1 language and the quality of the audio recording. Automatic graders are potentially more consistent than humans. However, the validity of the predicted grade varies. This paper proposes an automatic grader based on a Gaussian process. The advantage of using a Gaussian process is that as well as predicting a grade, it provides a measure of the uncertainty of its prediction. The uncertainty measure is sufficiently accurate to decide which automatic grades should be re-graded by humans. It can also be used to determine which candidates are hard to grade for humans and therefore need expert grading. Performance of the automatic grader is shown to be close to human graders on real candidate entries. Interpolation of human and GP grades further boosts performance. This work was supported by Cambridge English, University of Cambridge. This is the author accepted manuscript. The final version is available from ISCA via http://www.isca-speech.org/archive/slate_2015/sl15_007.htm
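The property the abstract relies on, a prediction that comes with its own variance, can be illustrated with textbook Gaussian process regression. The sketch below is a generic NumPy implementation under stated assumptions: an RBF kernel, a constant prior mean equal to the average training grade, fixed hyperparameters, a one-dimensional toy feature, and a hypothetical `threshold` for flagging grades. None of these choices is taken from the paper.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale ** 2)


def gp_predict(x_train, y_train, x_test, noise_var=0.1):
    """Posterior predictive mean and variance of a GP regressor."""
    y_mean = y_train.mean()  # assumed constant prior mean
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss_diag = np.diag(rbf_kernel(x_test, x_test))
    chol = np.linalg.cholesky(k_tt)
    # alpha = K^{-1} (y - mean), computed through the Cholesky factor
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - y_mean))
    v = np.linalg.solve(chol, k_ts)
    mean = y_mean + k_ts.T @ alpha
    var = k_ss_diag - np.sum(v ** 2, axis=0) + noise_var
    return mean, var


# Toy data: one feature per candidate, grades on an arbitrary scale.
x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([2.0, 3.0, 4.0])
x_test = np.array([[1.5], [10.0]])  # one near the training data, one far away

mean, var = gp_predict(x_train, y_train, x_test)
threshold = 0.5  # hypothetical cut-off for sending a grade to a human
needs_human = var > threshold
```

The test point far from the training data falls back to the prior mean and receives the largest variance, which is the kind of case a variance-based rule would send for human re-grading.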

Apollo (Cambridge)

Sequence Teacher-Student Training of Acoustic Models for Automatic Free Speaking Language Assessment

Authors: Gales MJF, Knill KM, Ragni A, Wang Y, Wong JHM
Publication venue: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings
Publication date: 01/01/2018

A high performance automatic speech recognition (ASR) system is an important component of an automatic language assessment system for free speaking language tests. The ASR system is required to be capable of recognising non-native spontaneous English speech and to be deployable under real-time conditions. The performance of ASR systems can often be significantly improved by leveraging multiple complementary systems, such as an ensemble. Ensemble methods, however, can be computationally expensive, often requiring multiple decoding runs, which makes them impractical for deployment. In this paper, a lattice-free implementation of sequence-level teacher-student training is used to reduce this computational cost, thereby allowing for real-time applications. This method allows a single student model to emulate the performance of an ensemble of teachers, but without the need for multiple decoding runs. Adaptations of the student model to speakers from different first languages (L1s) and grades are also explored. Cambridge Assessment Englis

Crossref

Apollo (Cambridge)

White Rose Research Online

Towards automatic assessment of spontaneous spoken English

Authors: Gales MJF, Knill KM, Kyriakopoulos K, Malinin A, Rashid M, van Dalen RC, Wang Y
Publication venue: Speech Communication
Publication date: 01/01/2018

With increasing global demand for learning English as a second language, there has been considerable interest in methods of automatic assessment of spoken language proficiency for use in interactive electronic learning tools as well as for grading candidates for formal qualifications. This paper presents an automatic system to address the assessment of spontaneous spoken language. Prompts or questions requiring spontaneous speech responses elicit more natural speech, which better reflects a learner’s proficiency level than read speech. In addition to the challenges of highly variable non-native learner speech and noisy real-world recording conditions, this requires any automatic system to handle disfluent, non-grammatical, spontaneous speech with the underlying text unknown. To handle these, a strong deep learning based speech recognition system is applied in combination with a Gaussian Process (GP) grader. A range of features derived from the audio using the recognition hypothesis are investigated for their efficacy in the automatic grader. The proposed system is shown to predict grades at a similar level to the original examiner graders on real candidate entries. Interpolation with the examiner grades further boosts performance. The ability to reject poorly estimated grades is also important, and measures are proposed to evaluate the performance of rejection schemes. The GP variance is used to decide which automatic grades should be rejected. Back-off to an expert grader for the least confident grades gives gains. Cambridge Assessment Englis
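The rejection and interpolation steps named in the abstract can be sketched in a few lines of Python. This is only an illustration under assumptions: `back_off` rejects a fixed fraction of candidates with the largest predictive variance and substitutes a hypothetical expert grade, and `interpolate` is a plain linear combination with an assumed weight. The paper's actual rejection rule and interpolation weights are not given in the abstract.

```python
def back_off(auto_grades, variances, expert_grades, reject_fraction):
    """Replace the least confident automatic grades with expert grades."""
    n_reject = int(reject_fraction * len(auto_grades))
    # Indices ordered from least confident (largest variance) to most confident.
    order = sorted(range(len(auto_grades)),
                   key=lambda i: variances[i], reverse=True)
    rejected = set(order[:n_reject])
    return [expert_grades[i] if i in rejected else auto_grades[i]
            for i in range(len(auto_grades))]


def interpolate(auto_grade, examiner_grade, weight=0.5):
    """Linear interpolation between an automatic and an examiner grade."""
    return weight * auto_grade + (1.0 - weight) * examiner_grade


# Made-up grades and variances for four candidates.
final_grades = back_off(
    auto_grades=[3.0, 4.0, 2.0, 5.0],
    variances=[0.1, 0.9, 0.2, 0.8],
    expert_grades=[3.5, 4.5, 2.5, 5.5],
    reject_fraction=0.5,
)
```

Here the second and fourth candidates have the largest variances, so their automatic grades are the ones replaced by expert grades.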

Apollo (Cambridge)

Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech

Authors: Gales MJF, Knill KM, Kyriakopoulos K
Publication venue: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication date: 01/01/2020

Detecting individual pronunciation errors and diagnosing pronunciation error tendencies in a language learner based on their speech are important components of computer-aided language learning (CALL). The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous, especially given the inconsistency of human annotation of pronunciation errors. This paper presents an approach to these tasks by distinguishing between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.

Crossref

Apollo (Cambridge)

A hierarchical attention based model for off-topic spontaneous spoken response detection

Authors: Gales MJF, Knill K, Malinin A
Publication venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publication date: 01/01/2017

Automatic spoken language assessment and training systems are becoming increasingly popular to handle the growing demand to learn languages. However, current systems often assess only fluency and pronunciation, with limited content-based features being used. This paper examines one particular aspect of content-assessment, off-topic response detection. This is important for deployed systems as it ensures that candidates understood the prompt, and are able to generate an appropriate answer. Previously proposed approaches typically require a set of prompt-response training pairs, which limits flexibility as example responses are required whenever a new test prompt is introduced. Recently, the attention based neural topic model (ATM) was presented, which can assess the relevance of prompt-response pairs regardless of whether the prompt was seen in training. This model uses a bidirectional Recurrent Neural Network (BiRNN) embedding of the prompt combined with an attention mechanism to attend over the hidden states of a BiRNN embedding of the response to compute a fixed-length embedding used to predict relevance. Unfortunately, performance on prompts not seen in the training data is lower than on seen prompts. Thus, this paper adds the following contributions: several improvements to the ATM are examined; a hierarchical variant of the ATM (HATM) is proposed, which explicitly uses prompt similarity to further improve performance on unseen prompts by interpolating over prompts seen in training data given a prompt of interest via a second attention mechanism; an in-depth analysis of both models is conducted and the main failure mode identified. On spontaneous spoken data, taken from BULATS tests, these systems are able to assess relevance to both seen and unseen prompt

Apollo (Cambridge)

A deep learning approach to automatic characterisation of rhythm in non-native English speech

Authors: Gales MJF, Knill KM, Kyriakopoulos K
Publication venue: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication date: 01/01/2019

A speaker's rhythm contributes to the intelligibility of their speech and can be characteristic of their language and accent. For non-native learners of a language, the extent to which they match its natural rhythm is an important predictor of their proficiency. As a learner improves, their rhythm is expected to become less similar to their L1 and more to the L2. Metrics based on the variability of the durations of vocalic and consonantal intervals have been shown to be effective at detecting language and accent. In this paper, pairwise variability (PVI, CCI) and variance (varcoV, varcoC) metrics are first used to predict proficiency and L1 of non-native speakers taking an English spoken exam. A deep learning alternative to generalise these features is then presented, in the form of a tunable duration embedding, based on attention over an RNN over durations. The RNN allows relationships beyond pairwise to be captured, while attention allows sensitivity to the different relative importance of durations. The system is trained end-to-end for proficiency and L1 prediction and compared to the baseline. The values of both sets of features for different proficiency levels are then visualised and compared to native speech in the L1 and the L2. ALTA Institut
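For readers unfamiliar with the baseline metrics, the sketch below computes the raw and normalised pairwise variability index (PVI) and the varco coefficient from a list of interval durations, following their commonly cited definitions. The function names are illustrative, the use of the population standard deviation in `varco` is an assumption, and CCI is not covered. Applying `varco` to vocalic or consonantal intervals would give varcoV or varcoC respectively.

```python
from statistics import mean, pstdev


def raw_pvi(durations):
    """Mean absolute difference between successive interval durations."""
    return mean([abs(a - b) for a, b in zip(durations, durations[1:])])


def normalised_pvi(durations):
    """Pairwise differences normalised by each pair's mean duration, times 100."""
    return 100.0 * mean([abs(a - b) / ((a + b) / 2.0)
                         for a, b in zip(durations, durations[1:])])


def varco(durations):
    """Standard deviation of durations as a percentage of their mean."""
    return 100.0 * pstdev(durations) / mean(durations)


# Made-up interval durations in seconds, alternating short and long.
intervals = [0.1, 0.2, 0.1, 0.2]
rhythm_features = (raw_pvi(intervals), normalised_pvi(intervals), varco(intervals))
```

All three values are zero for perfectly regular intervals and grow as durations alternate more strongly, which is why only neighbouring pairs matter for the PVI variants.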

Crossref

Apollo (Cambridge)

A deep learning approach to assessing non-native pronunciation of English using phone distances

Authors: Gales MJF, Knill KM, Kyriakopoulos K
Publication venue: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication date: 01/01/2018

The way a non-native speaker pronounces the phones of a language is an important predictor of their proficiency. In grading spontaneous speech, the pairwise distances between generative statistical models trained on each phone have been shown to be powerful features. This paper presents a deep learning alternative to model-based phone distances in the form of a tunable Siamese network feature extractor to extract distance metrics directly from the audio frame sequence. Features are extracted at the phone instance level and combined to phone-level representations using an attention mechanism. Pair-wise distances between phone features are then projected through a feed-forward layer to predict score. The extraction stage is initialised on either a binary phone instance-pair classification task, or to mimic the model-based features, then the whole system is fine-tuned end-to-end, optimising the learning of the distance metric to the score prediction task. This method is therefore more adaptable and more sensitive to phone instance level phenomena. Its performance is compared agains

Apollo (Cambridge)

Articulation rate as a metric in spoken language assessment

Authors: Graham C, Nolan F
Publication venue: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication date: 01/01/2019

Copyright © 2019 ISCA. Automated evaluation of non-native pronunciation provides a consistent and more cost-efficient alternative to human evaluation. To that end, there is considerable interest in deriving metrics that are based on the cues human listeners use to judge pronunciation. Previous research reported the use of phonetic features such as vowel characteristics in automated spoken language evaluation. The present study extends this line of work on the significance of phonetic features in automated evaluation of L2 speech (both assessment and feedback). Predictive modelling techniques examined the relationship between various articulation rate metrics on the one hand, and the proficiency and L1 background of non-native English speakers on the other. It was found that the optimal predictive model was one in which the phonetic details of phoneme articulation were factored into the analysis of articulation rate. Model performance also varied according to the L1 background of speakers. The implications for assessment and feedback are discussed. Leverhulme ECF Fellowship; ALTA projec

Crossref

Apollo (Cambridge)