Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition
Automatic recognition of disordered speech remains a highly challenging task
to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of impaired speech required for ASR system development. This paper
presents novel variational auto-encoder generative adversarial network
(VAE-GAN) based personalized disordered speech augmentation approaches that
simultaneously learn to encode, generate and discriminate synthesized impaired
speech. Separate latent features are derived to learn dysarthric speech
characteristics and phoneme context representations. Self-supervised
pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments
conducted on the UASpeech corpus suggest that the proposed adversarial data augmentation approach consistently outperformed the baseline speed perturbation and non-VAE GAN augmentation methods when used to train hybrid TDNN and end-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN
based augmentation produced an overall WER of 27.78% on the UASpeech test set
of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset
of speakers with "Very Low" intelligibility.
Comment: Submitted to ICASSP 202
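As a rough illustration of the VAE-GAN idea summarised above, the PyTorch sketch below jointly trains an encoder, decoder and discriminator on spectral feature frames. The feature dimension, layer sizes and equal loss weighting are illustrative assumptions; the authors' actual system additionally derives separate latent features for dysarthric speech characteristics and phoneme context, which is not reproduced here.

```python
# Minimal VAE-GAN sketch for spectral feature augmentation (illustrative only).
# Feature dimension, layer sizes and loss weights are assumptions, not the
# configuration used in the paper.
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM = 80, 32  # e.g. 80-dim filterbank frames (assumed)

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU())
        self.mu = nn.Linear(256, LATENT_DIM)
        self.logvar = nn.Linear(256, LATENT_DIM)
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

encoder = Encoder()
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                        nn.Linear(256, FEAT_DIM))
discriminator = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def vae_gan_losses(real):
    """Compute generator and discriminator losses for one batch of frames;
    in a full training loop each loss would drive its own optimizer step."""
    mu, logvar = encoder(real)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
    fake = decoder(z)
    # VAE terms: reconstruction + KL divergence to the unit Gaussian prior
    recon = nn.functional.mse_loss(fake, real)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # GAN terms: discriminator separates real/generated, generator fools it
    d_loss = bce(discriminator(real), torch.ones(real.size(0), 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(real.size(0), 1))
    g_loss = recon + kl + bce(discriminator(fake), torch.ones(real.size(0), 1))
    return g_loss, d_loss

g_loss, d_loss = vae_gan_losses(torch.randn(16, FEAT_DIM))  # dummy batch
```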
Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Automatic recognition of dysarthric speech remains a highly challenging task
to date. Neuro-motor conditions and co-occurring physical disabilities create
difficulty in large-scale data collection for ASR system development. Adapting
SSL pre-trained ASR models to limited dysarthric speech via data-intensive
parameter fine-tuning leads to poor generalization. To this end, this paper
presents an extensive comparative study of various data augmentation approaches
to improve the robustness of pre-trained ASR model fine-tuning to dysarthric
speech. These include: a) conventional speaker-independent perturbation of
impaired speech; b) speaker-dependent speed perturbation, or GAN-based
adversarial perturbation of normal, control speech based on their time
alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based
adversarial data augmentation operating on non-parallel data. Experiments
conducted on the UASpeech corpus suggest that GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT baselines using no data augmentation or speed perturbation across different data expansion operating points, with statistically significant word error rate (WER) reductions of up to 2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the UASpeech test set of 16 dysarthric speakers. After cross-system output rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
Comment: To appear at IEEE ICASSP 202
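For concreteness, a minimal sketch of speaker-dependent speed perturbation (part of approach b above) follows. The use of plain linear interpolation and of a duration ratio taken from parallel alignments are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative speaker-dependent speed perturbation (not the paper's exact recipe).
# The perturbation factor is assumed to come from the duration ratio between a
# control speaker's utterance and the parallel dysarthric utterance.
import numpy as np

def speed_perturb(wav: np.ndarray, factor: float) -> np.ndarray:
    """Resample the waveform so that it is 1/factor times its original length,
    emulating sox-style speed perturbation (tempo and pitch both change)."""
    n_out = int(round(len(wav) / factor))
    old_idx = np.linspace(0, len(wav) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(wav)), wav)

# Hypothetical example: slow a control utterance down towards the dysarthric
# speaker's rate, using the ratio of aligned utterance durations.
control = np.random.randn(16000)                 # 1 s of 16 kHz audio (dummy data)
control_duration, dysarthric_duration = 1.0, 1.8
factor = control_duration / dysarthric_duration  # < 1 => slow down
augmented = speed_perturb(control, factor)
```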
Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition
Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to data scarcity. Parameter fine-tuning is often used to exploit models pre-trained on large quantities of non-aged and healthy speech, while neural architecture hyper-parameters are set using expert
knowledge and remain unchanged. This paper investigates hyper-parameter
adaptation for Conformer ASR systems that are pre-trained on the Librispeech
corpus before being domain adapted to the DementiaBank elderly and UASpeech
dysarthric speech datasets. Experimental results suggest that hyper-parameter
adaptation produced word error rate (WER) reductions of 0.45% and 0.67% over
parameter-only fine-tuning on DBank and UASpeech tasks respectively. An
intuitive correlation is found between the performance improvements by
hyper-parameter domain adaptation and the relative utterance length ratio
between the source and target domain data.
Comment: 5 pages, 3 figures, 3 tables, accepted by Interspeech 202
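A hypothetical sketch of what hyper-parameter adaptation can look like in practice is given below: a small search over Conformer architecture hyper-parameters, scored by development-set WER after fine-tuning. The search space, the plain grid search, and the train_and_score callable are all illustrative assumptions, not the procedure used in the paper.

```python
# Hypothetical sketch of hyper-parameter adaptation as a small architecture search.
import itertools

SEARCH_SPACE = {                          # illustrative hyper-parameters only
    "num_encoder_layers": [8, 12, 16],
    "conv_kernel_size": [7, 15, 31],
}

def adapt_hyper_parameters(train_and_score):
    """train_and_score(cfg) stands in for fine-tuning the pre-trained Conformer
    with the given hyper-parameters on target-domain data and returning dev WER."""
    best_cfg, best_wer = None, float("inf")
    for values in itertools.product(*SEARCH_SPACE.values()):
        cfg = dict(zip(SEARCH_SPACE.keys(), values))
        wer = train_and_score(cfg)
        if wer < best_wer:
            best_cfg, best_wer = cfg, wer
    return best_cfg, best_wer

# Toy usage with a made-up scoring function standing in for real fine-tuning runs:
best_cfg, best_wer = adapt_hyper_parameters(
    lambda cfg: 30.0 - 0.1 * cfg["num_encoder_layers"]
                     - 0.05 * cfg["conv_kernel_size"])
```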
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping
speakers, noise and reverberation remains a highly challenging task to date.
Motivated by the invariance of visual modality to acoustic signal corruption,
an audio-visual multi-channel speech separation, dereverberation and
recognition approach featuring a full incorporation of visual information into
all system components is proposed in this paper. The efficacy of the video
input is consistently demonstrated in mask-based MVDR speech separation,
DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and
Conformer ASR back-end. Audio-visual integrated front-end architectures
performing speech separation and dereverberation in a pipelined or joint
fashion via mask-based WPD are investigated. The error cost mismatch between
the speech enhancement front-end and ASR back-end components is minimized by
end-to-end jointly fine-tuning using either the ASR cost function alone, or its
interpolation with the speech enhancement loss. Experiments were conducted on
overlapped and reverberant speech mixtures constructed using simulation
or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel
speech separation, dereverberation and recognition systems consistently
outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute
(41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech
enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
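As one concrete building block of the front-end described above, the following numpy sketch computes mask-based MVDR beamforming weights from masked speech and noise spatial covariance matrices. Mask estimation, where the visual stream would enter, is abstracted away; the function names and diagonal loading value are illustrative.

```python
# Minimal numpy sketch of mask-based MVDR beamforming (illustrative only).
import numpy as np

def mvdr_weights(stft, speech_mask, noise_mask, ref_mic=0):
    """stft: (F, T, M) complex multi-channel spectrogram;
    masks: (F, T) speech/noise time-frequency masks in [0, 1]."""
    F, T, M = stft.shape
    weights = np.zeros((F, M), dtype=complex)
    for f in range(F):
        X = stft[f]                                   # (T, M)
        outer = np.einsum("tm,tn->tmn", X, X.conj())  # per-frame outer products
        phi_s = (speech_mask[f, :, None, None] * outer).mean(0)
        phi_n = (noise_mask[f, :, None, None] * outer).mean(0)
        phi_n += 1e-6 * np.eye(M)                     # diagonal loading
        num = np.linalg.solve(phi_n, phi_s)           # phi_n^{-1} phi_s
        weights[f] = num[:, ref_mic] / np.trace(num)  # MVDR solution
    return weights

def apply_beamformer(stft, weights):
    # Enhanced single-channel STFT: w^H x at every time-frequency bin
    return np.einsum("fm,ftm->ft", weights.conj(), stft)
```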
Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition
Automatic recognition of disordered and elderly speech remains a highly
challenging task to date due to the difficulty in collecting such data in large
quantities. This paper explores a series of approaches to integrate domain
adapted SSL pre-trained models into TDNN and Conformer ASR systems for
dysarthric and elderly speech recognition: a) input feature fusion between
standard acoustic frontends and domain adapted wav2vec2.0 speech
representations; b) frame-level joint decoding of TDNN systems separately
trained using standard acoustic features alone and with additional wav2vec2.0
features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain adapted wav2vec2.0 models. In addition,
domain adapted wav2vec2.0 representations are utilized in
acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric
and elderly speech recognition systems. Experiments conducted on the UASpeech
dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and Conformer ASR systems integrating domain adapted wav2vec2.0 models consistently
outperform the standalone wav2vec2.0 models by statistically significant WER
reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two
tasks respectively. The lowest published WERs of 22.56% (52.53% on very low
intelligibility, 39.09% on unseen words) and 18.17% are obtained on the
UASpeech test set of 16 dysarthric speakers, and the DementiaBank Pitt test set
respectively.
Comment: accepted by ICASSP 202
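A minimal sketch of approach (a), input feature fusion, is given below: frame-level concatenation of a standard filterbank front-end with wav2vec2.0 representations after aligning frame rates. The dimensions and frame-rate ratio are typical values assumed for illustration, not quoted from the paper.

```python
# Illustrative input feature fusion: upsample wav2vec2.0 outputs to the fbank
# frame rate (assumed ~10 ms vs ~20 ms frames) and concatenate per frame.
import numpy as np

def fuse_features(fbank: np.ndarray, w2v: np.ndarray) -> np.ndarray:
    """fbank: (T_f, 80) filterbank features; w2v: (T_w, 768) wav2vec2.0 outputs."""
    idx = np.minimum(
        (np.arange(len(fbank)) * len(w2v) / len(fbank)).astype(int),
        len(w2v) - 1)
    w2v_upsampled = w2v[idx]                                 # (T_f, 768)
    return np.concatenate([fbank, w2v_upsampled], axis=1)   # (T_f, 848)

fused = fuse_features(np.random.randn(200, 80), np.random.randn(100, 768))
```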
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Fundamental modelling differences between hybrid and end-to-end (E2E)
automatic speech recognition (ASR) systems create large diversity and
complementarity among them. This paper investigates multi-pass rescoring and
cross adaptation based system combination approaches for hybrid TDNN and
Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs before being rescored by the speaker-adapted Conformer system using 2-way cross-system score interpolation. In cross
adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the
Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus
suggest that the combined systems derived using either of the two system
combination approaches outperformed the individual systems. The best combined
system obtained using multi-pass rescoring produced statistically significant
word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9%
relative) over the stand-alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
Comment: Accepted to ISCA 202
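The following sketch illustrates the 2-way cross-system score interpolation used in multi-pass rescoring: hybrid N-best hypotheses are rescored by the Conformer and re-ranked under an interpolated score. The interpolation weight and toy scores are illustrative assumptions.

```python
# Hypothetical sketch of N-best rescoring with 2-way cross-system score
# interpolation; the weight value and toy scores are illustrative only.
def rescore_nbest(nbest, conformer_score, weight=0.5):
    """nbest: list of (hypothesis, hybrid_log_score);
    conformer_score: callable returning the Conformer log score of a hypothesis."""
    rescored = [
        (hyp, (1.0 - weight) * hybrid_lp + weight * conformer_score(hyp))
        for hyp, hybrid_lp in nbest
    ]
    return max(rescored, key=lambda pair: pair[1])[0]   # best hypothesis

# Toy usage with made-up scores:
nbest = [("oh yeah i see", -12.3), ("oh yes i see", -12.9)]
best = rescore_nbest(nbest, conformer_score=lambda hyp: -10.0 - 0.1 * len(hyp))
```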
Bayesian neural network language modeling for speech recognition
State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. They are prone to overfitting and poor generalization when given limited training data. To this end, an overarching full Bayesian learning framework encompassing three methods is proposed in this paper to account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, choice of neural activations and hidden output representations is modeled using Bayesian, Gaussian Process and variational LSTM-RNN or Transformer LMs respectively. Efficient inference approaches were used to automatically select the optimal network internal components to be Bayesian learned using neural architecture search. A minimal number of Monte Carlo parameter samples, as low as one, was also used. These allow the computational costs incurred in Bayesian NNLM training and evaluation to be minimized. Experiments are conducted on two tasks: AMI meeting transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech recognition using state-of-the-art LF-MMI trained factored TDNN systems featuring data augmentation, speaker adaptation and audio-visual multi-channel beamforming for overlapped speech. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs with point-estimated model parameters and dropout regularization were obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 data, statistically significant WER reductions of up to 1.3% and 1.2% absolute (12.1% and 11.3% relative) were obtained over the baseline LSTM-RNN and Transformer LMs respectively after model combination between Bayesian NNLMs and their respective baselines.
This work was supported in part by Hong Kong Research Council GRF under Grants 14200218, 14200220, and 14200021 and in part by Innovation and Technology Fund under Grants ITS/254/19 and InP/057/21.
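As an illustration of the kind of Bayesian parameter modelling described above, the PyTorch sketch below implements a linear layer with a Gaussian variational posterior over its weights, a single Monte Carlo sample per forward pass, and a KL regularisation term. Layer sizes and priors are assumptions, and the paper's full framework (Gaussian Process and variational LMs, architecture search) is not reproduced.

```python
# Minimal sketch of a Bayesian linear layer: Gaussian variational posterior over
# weights, one Monte Carlo sample per forward pass (reparameterisation trick),
# and a KL term against a standard Gaussian prior. Sizes/priors are illustrative.
import torch
import torch.nn as nn

class BayesianLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.w_logvar = nn.Parameter(torch.full((d_out, d_in), -6.0))
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        # One-sample Monte Carlo estimate via the reparameterisation trick
        std = torch.exp(0.5 * self.w_logvar)
        w = self.w_mu + std * torch.randn_like(std)
        return x @ w.t() + self.bias

    def kl(self):
        # KL divergence of N(mu, sigma^2) from the N(0, 1) prior over weights
        return 0.5 * torch.sum(
            self.w_mu.pow(2) + self.w_logvar.exp() - self.w_logvar - 1.0)

layer = BayesianLinear(512, 512)
out = layer(torch.randn(8, 512))          # different weight sample each call
loss_reg = layer.kl()                     # added to the LM training loss
```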