Computer analysis of children's non-native English speech for language learning and assessment
Children's speech appears to be more challenging for ASR than adults' speech, and non-native children's speech is more difficult still. This research investigates techniques to compensate for the effects of non-native and child speech on the performance of ASR systems. The study mainly uses hybrid DNN-HMM systems with conventional DNNs, LSTMs and more advanced TDNN models. This work uses the CALL-ST corpus and TLT-school corpus to study children's non-native English speech.
Initially, data augmentation using the AMI corpus and PF-STAR German corpus was explored on the CALL-ST corpus to address the lack of data. Feature selection, acoustic model adaptation and selection were also investigated on CALL-ST. More aspects of the ASR system, including pronunciation modelling, acoustic modelling, language modelling and system fusion, were explored on the TLT-school corpus, as this corpus has a larger amount of data. Then, the relationships between the CALL-ST and TLT-school corpora were studied and utilised to improve ASR performance.
The other part of the present work is text processing for non-native children's English speech. We focused on providing accept/reject feedback to learners based on the text generated by the ASR system from learners' spoken responses. A rule-based system and a machine learning-based system were proposed for making the judgement, and several aspects of the systems were evaluated. The influence of the ASR system on the text processing system was explored.
Mismatched Training Data Enhancement for Automatic Recognition of Children’s Speech using DNN-HMM
The increasing profusion of commercial automatic speech recognition technology applications has been driven by big-data techniques, using high-quality labelled speech datasets. Children's speech has greater time and frequency domain variability than typical adult speech, lacks good large-scale training data, and presents difficulties relating to capture quality. Each of these factors reduces the performance of systems that automatically recognise children's speech. In this paper, children's speech recognition is investigated using a hybrid acoustic modelling approach based on deep neural networks and Gaussian mixture models with hidden Markov model back ends. We explore the incorporation of mismatched training data to achieve a better acoustic model and improve performance in the face of limited training data, as well as training data augmentation using noise. We also explore two arrangements for vocal tract length normalisation and a gender-based data selection technique suitable for training a children's speech recogniser.
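The abstract above mentions training data augmentation using noise but gives no details. As a minimal sketch, assuming additive noise mixed at a chosen signal-to-noise ratio (a common approach, not necessarily the authors' method), the function below scales a noise signal so that the mixture reaches a target SNR. The names `mix_at_snr` and `power` and the synthetic sample values are hypothetical.

```python
import math


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the result has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # SNR_dB = 10 * log10(P_speech / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for real audio (illustration only).
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(1000)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the noise component and check the achieved SNR.
residual = [m - s for m, s in zip(mixed, speech)]
achieved_snr = 10 * math.log10(power(speech) / power(residual))
```

Running the same clean utterance through several SNR values yields multiple noisy copies, which is one way limited training data can be stretched.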
Can Generative Large Language Models Perform ASR Error Correction?
ASR error correction continues to serve as an important part of
post-processing for speech recognition systems. Traditionally, error correction
models are trained in a supervised manner using the decoding results of the underlying
ASR system and the reference text. This approach is computationally intensive
and the model needs to be re-trained when switching the underlying ASR model.
Recent years have seen the development of large language models and their
ability to perform natural language processing tasks in a zero-shot manner. In
this paper, we take ChatGPT as an example to examine its ability to perform ASR
error correction in the zero-shot or 1-shot settings. We use the ASR N-best
list as model input and propose unconstrained error correction and N-best
constrained error correction methods. Results on a Conformer-Transducer model
and the pre-trained Whisper model show that we can substantially improve the ASR
system performance with error correction using the powerful ChatGPT model.
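To make the N-best constrained idea above concrete, here is a minimal sketch that assumes the constraint is enforced by asking the language model to pick one hypothesis by number and falling back to the 1-best when the reply is unusable. The prompt wording and the function names `build_selection_prompt` and `parse_selection` are hypothetical, the example hypotheses are invented, and no actual model call is made.

```python
import re


def build_selection_prompt(nbest):
    """Format an N-best list as a numbered multiple-choice prompt."""
    options = "\n".join(f"{i}. {hyp}" for i, hyp in enumerate(nbest, start=1))
    return (
        "The following are N-best hypotheses from a speech recogniser "
        "for one utterance.\n"
        "Reply with the number of the most likely transcription.\n"
        + options
    )


def parse_selection(reply, nbest):
    """Map the model's reply back onto the N-best list.

    The output is always one of the original hypotheses, so the
    correction cannot introduce words the recogniser never proposed.
    """
    match = re.search(r"\d+", reply)
    if match:
        index = int(match.group())
        if 1 <= index <= len(nbest):
            return nbest[index - 1]
    return nbest[0]  # fall back to the 1-best hypothesis


# Invented hypotheses for illustration.
nbest = [
    "i scream for ice cream",
    "ice cream for ice cream",
    "i scream for i scream",
]
prompt = build_selection_prompt(nbest)
chosen = parse_selection("Option 2", nbest)
```

An unconstrained variant would instead accept whatever text the model returns, which allows corrections outside the N-best list at the cost of possible hallucinated words.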
Adapting an ASR Foundation Model for Spoken Language Assessment
A crucial part of an accurate and reliable spoken language assessment system
is the underlying ASR model. Recently, large-scale pre-trained ASR foundation
models such as Whisper have been made available. As the output of these models
is designed to be human-readable, punctuation is added, numbers are presented
as Arabic numerals and abbreviations are included. Additionally, these
models have a tendency to skip disfluencies and hesitations in the output.
Though useful for readability, these attributes are not helpful for assessing
the ability of a candidate and providing feedback. Here a precise transcription
of what a candidate said is needed. In this paper, we give a detailed analysis
of Whisper outputs and propose two solutions: fine-tuning and soft prompt
tuning. Experiments are conducted on both public speech corpora and an English
learner dataset. Results show that we can effectively alter the decoding
behaviour of Whisper to generate the exact words spoken in the response.
Comment: Proceedings of SLaT
N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space
Error correction models form an important part of Automatic Speech
Recognition (ASR) post-processing to improve the readability and quality of
transcriptions. Most prior works use the 1-best ASR hypothesis as input and
therefore can only perform correction by leveraging the context within one
sentence. In this work, we propose a novel N-best T5 model for this task, which
is fine-tuned from a T5 model and utilizes ASR N-best lists as model input. By
transferring knowledge from the pre-trained language model and obtaining richer
information from the ASR decoding space, the proposed approach outperforms a
strong Conformer-Transducer baseline. Another issue with standard error
correction is that the generation process is not well-guided. To address this, a
constrained decoding process, either based on the N-best list or an ASR
lattice, is used which allows additional information to be propagated.
Comment: submitted to INTERSPEEC
Adapting an Unadaptable ASR System
As speech recognition model sizes and training data requirements grow, it is
increasingly common for systems to be available only via APIs from online
service providers, without direct access to the models themselves. In
this scenario it is challenging to adapt systems to a specific target domain.
To address this problem we consider the recently released OpenAI Whisper ASR as
an example of a large-scale ASR system to assess adaptation methods. An error
correction based approach is adopted, as this does not require access to the
model, but can be trained from either 1-best or N-best outputs that are
normally available via the ASR API. LibriSpeech is used as the primary target
domain for adaptation. The generalization ability of the system is then
evaluated in two distinct dimensions: first, whether the form of correction
model is portable to other speech recognition domains, and second, whether it
can be used for ASR models having a different architecture.
Comment: submitted to INTERSPEEC
Towards spoken dialect identification of Irish
The Irish language is rich in its diversity of dialects and accents. This
compounds the difficulty of creating a speech recognition system for the
low-resource language, as such a system must contend with a high degree of
variability with limited corpora. A recent study investigating dialect bias in
Irish ASR found that balanced training corpora gave rise to unequal dialect
performance, with performance for the Ulster dialect being consistently worse
than for the Connacht or Munster dialects. Motivated by this, the present
experiments investigate spoken dialect identification of Irish, with a view to
incorporating such a system into the speech recognition pipeline. Two acoustic
classification models are tested, XLS-R and ECAPA-TDNN, in conjunction with a
text-based classifier using a pretrained Irish-language BERT model. The
ECAPA-TDNN, particularly a model pretrained for language identification on the
VoxLingua107 dataset, performed best overall, with an accuracy of 73%. This was
further improved to 76% by fusing the model's outputs with the text-based
model. The Ulster dialect was most accurately identified, with an accuracy of
94%; however, the model struggled to disambiguate between the Connacht and
Munster dialects, suggesting a more nuanced approach may be necessary to
robustly distinguish between the dialects of Irish.
Comment: Accepted to Interspeech 2023 Workshop of the 2nd Annual Meeting of
the Special Interest Group of Under-resourced Languages Workshop, Dublin
(SiGUL
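The abstract above reports a gain from fusing the acoustic model's outputs with the text-based model but does not say how the fusion is done. As a minimal sketch, assuming late fusion by a weighted average of per-dialect posterior probabilities, the function below combines two score dictionaries and returns the top-scoring dialect. The function name `fuse_posteriors`, the weight, and all score values are hypothetical.

```python
DIALECTS = ("Ulster", "Connacht", "Munster")


def fuse_posteriors(acoustic, text, acoustic_weight=0.5):
    """Linearly interpolate two posterior distributions over dialects.

    acoustic, text: dicts mapping dialect name to probability.
    acoustic_weight: share given to the acoustic model (0.0 to 1.0).
    Returns the predicted dialect and the fused scores.
    """
    if not 0.0 <= acoustic_weight <= 1.0:
        raise ValueError("acoustic_weight must be between 0 and 1")
    fused = {
        d: acoustic_weight * acoustic[d] + (1.0 - acoustic_weight) * text[d]
        for d in DIALECTS
    }
    return max(fused, key=fused.get), fused


# Invented scores: the acoustic model leans Connacht, the text model Munster.
acoustic_scores = {"Ulster": 0.2, "Connacht": 0.5, "Munster": 0.3}
text_scores = {"Ulster": 0.1, "Connacht": 0.3, "Munster": 0.6}

fused_label, fused_scores = fuse_posteriors(acoustic_scores, text_scores)
acoustic_only_label, _ = fuse_posteriors(acoustic_scores, text_scores, 1.0)
```

In this toy case the text model tips an ambiguous Connacht/Munster decision, which is the kind of confusion the abstract describes; in practice the weight would be tuned on held-out data.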
Towards dialect-inclusive recognition in a low-resource language: are balanced corpora the answer?
ASR systems are generally built for the spoken 'standard', and their
performance declines for non-standard dialects/varieties. This is a problem for
a language like Irish, where there is no single spoken standard, but rather
three major dialects: Ulster (Ul), Connacht (Co) and Munster (Mu). As a
diagnostic to quantify the effect of the speaker's dialect on recognition
performance, 12 ASR systems were trained, firstly using baseline
dialect-balanced training corpora, and then using modified versions of the
baseline corpora, where dialect-specific materials were either subtracted or
added. Results indicate that dialect-balanced corpora do not yield similar
performance across the dialects: the Ul dialect consistently underperforms,
whereas Mu yields the lowest WERs. There is a close relationship between Co and Mu
dialects, but one that is not symmetrical. These results will guide future
corpus collection and system building strategies to optimise for cross-dialect
performance equity.
Comment: Accepted to Interspeech 2023, Dubli
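The dialect comparison above is expressed in word error rate (WER). As a minimal sketch, assuming WER is the word-level edit distance divided by the number of reference words, pooled over all utterances in a group, the code below computes a per-dialect breakdown. The function names are hypothetical and the toy strings are invented placeholders, not data from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(pairs):
    """Pooled WER over (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("no reference words")
    return errors / words


def wer_by_group(utterances):
    """utterances: iterable of (group, reference, hypothesis) triples."""
    groups = {}
    for group, ref, hyp in utterances:
        groups.setdefault(group, []).append((ref, hyp))
    return {group: wer(pairs) for group, pairs in groups.items()}


# Placeholder tokens standing in for real transcripts.
results = wer_by_group([
    ("Ul", "a b c d", "a x c"),
    ("Mu", "a b", "a b"),
])
```

Reporting WER per dialect rather than a single pooled figure is what exposes the kind of imbalance the abstract describes.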