Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

Authors: Deng Jiajun, Geng Mengzhe, Hu Shujie, Jin Zengrui, Li Guinan, Liu Xunying, Wang Tianzi, Xie Xurong
Publication date: 03/11/2022

Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest the proposed adversarial data augmentation approach consistently outperformed the baseline speed perturbation and non-VAE GAN augmentation methods with trained hybrid TDNN and End-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility.
Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Authors: Geng Mengzhe, Hu Shujie, Jin Zengrui, Li Guinan, Liu Xunying, Wang Huimeng, Wang Tianzi, Xu Haoning
Publication date: 31/12/2023

Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation of normal, control speech based on their time alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using no data augmentation and speed perturbation across different data expansion operating points by statistically significant word error rate (WER) reductions up to 2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the UASpeech test set of 16 dysarthric speakers. After cross-system outputs rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
Comment: To appear at IEEE ICASSP 202

arXiv.org e-Print Archive

Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition

Authors: Deng Jiajun, Geng Mengzhe, Hu Shoukang, Jin Zengrui, Liu Xunying, Meng Helen, Wang Tianzi, Wang Yi
Publication date: 27/06/2023

Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to data scarcity. Parameter fine-tuning is often used to exploit models pre-trained on large quantities of non-aged, healthy speech, while neural architecture hyper-parameters are set using expert knowledge and remain unchanged. This paper investigates hyper-parameter adaptation for Conformer ASR systems that are pre-trained on the Librispeech corpus before being domain adapted to the DementiaBank elderly and UASpeech dysarthric speech datasets. Experimental results suggest that hyper-parameter adaptation produced word error rate (WER) reductions of 0.45% and 0.67% over parameter-only fine-tuning on DBank and UASpeech tasks respectively. An intuitive correlation is found between the performance improvements by hyper-parameter domain adaptation and the relative utterance length ratio between the source and target domain data.
Comment: 5 pages, 3 figures, 3 tables, accepted by Interspeech202

arXiv.org e-Print Archive

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Authors: Cui Mingyu, Deng Jiajun, Geng Mengzhe, Hu Shujie, Jin Zengrui, Li Guinan, Liu Xunying, Meng Helen, Wang Tianzi
Publication date: 06/07/2023

Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processin

arXiv.org e-Print Archive

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Authors: Cui Mingyu, Deng Jiajun, Geng Mengzhe, Hu Shujie, Jin Zengrui, Liu Xunying, Meng Helen, Wang Yi, Xie Xurong
Publication date: 22/06/2023

Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain adapted wav2vec2.0 models. In addition, domain adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest TDNN and Conformer ASR systems integrating domain adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models by statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers, and the DementiaBank Pitt test set respectively.
Comment: accepted by ICASSP 202

arXiv.org e-Print Archive

Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Authors: Cui Mingyu, Deng Jiajun, Geng Mengzhe, Hu Shoukang, Hu Shujie, Liu Xunying, Meng Helen, Wang Tianzi, Xie Xurong, Xue Boyang
Publication venue: 'International Speech Communication Association'
Publication date: 23/06/2022

Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs before being rescored by the speaker adapted Conformer system using a 2-way cross system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system obtained using multi-pass rescoring produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the standalone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
Comment: It's accepted to ISCA 202

arXiv.org e-Print Archive

Rapid and Simple Detection of Viable Foodborne Pathogen Staphylococcus aureus

Authors: Caiyan Liu, Chao Shi, Cuiping Ma, Mengyuan Wang, Mengzhe Li, Zonghua Wang
Publication venue: 'Frontiers Media SA'
Publication date: 01/03/2019

Staphylococcus aureus (S. aureus) contamination of food has become a worldwide health problem. In this work, we used a one-step RNA detection method based on denaturation bubble-mediated Strand Exchange Amplification (SEA) to detect the viable foodborne pathogen S. aureus. A pair of S. aureus specific primers was designed for the SEA reaction by targeting the hypervariable V2 region of 16S rDNA, and the amplification reaction was finished in about 1 h. The result of the amplification reaction could be observed by the naked eye as a significant color change from light yellow to red, enabling colorimetric detection of S. aureus. The method therefore required only an isothermal water bath, which makes it well suited to areas with limited resources. In real sample testing, although SEA detection was much faster than the traditional plating method, its results showed great consistency with that method. In view of these advantages, we provide a simple, rapid and equipment-free detection method that has great potential for point-of-care testing (POCT) applications. The method reported here will also provide a POCT detection platform for other foodborne pathogens in food, and even for pathogenic bacteria from other fields.

Directory of Open Access Journals