Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition
Automatic recognition of disordered and elderly speech remains a highly
challenging task to date due to data scarcity. Parameter fine-tuning is often
used to exploit models pre-trained on large quantities of non-aged and healthy
speech, while neural architecture hyper-parameters are set using expert
knowledge and remain unchanged. This paper investigates hyper-parameter
adaptation for Conformer ASR systems that are pre-trained on the Librispeech
corpus before being domain adapted to the DementiaBank elderly and UASpeech
dysarthric speech datasets. Experimental results suggest that hyper-parameter
adaptation produced word error rate (WER) reductions of 0.45% and 0.67% over
parameter-only fine-tuning on DBank and UASpeech tasks respectively. An
intuitive correlation is found between the performance improvements by
hyper-parameter domain adaptation and the relative utterance length ratio
between the source and target domain data.
Comment: 5 pages, 3 figures, 3 tables, accepted by Interspeech202
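The word error rate (WER) quoted in the abstract above is conventionally the word-level edit distance between a reference transcript and the recogniser output, divided by the number of reference words. The following is a minimal sketch of that conventional definition, not the authors' scoring pipeline; the function name and the whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Multiplying the returned fraction by 100 gives the percentage form used in the abstracts.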
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping
speakers, noise and reverberation remains a highly challenging task to date.
Motivated by the invariance of visual modality to acoustic signal corruption,
an audio-visual multi-channel speech separation, dereverberation and
recognition approach featuring a full incorporation of visual information into
all system components is proposed in this paper. The efficacy of the video
input is consistently demonstrated in mask-based MVDR speech separation,
DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and
Conformer ASR back-end. Audio-visual integrated front-end architectures
performing speech separation and dereverberation in a pipelined or joint
fashion via mask-based WPD are investigated. The error cost mismatch between
the speech enhancement front-end and ASR back-end components is minimized by
end-to-end joint fine-tuning using either the ASR cost function alone, or its
interpolation with the speech enhancement loss. Experiments were conducted on
overlapped and reverberant speech mixtures constructed using simulation
or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel
speech separation, dereverberation and recognition systems consistently
outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute
(41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech
enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
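The abstract above reports each WER reduction twice, as an absolute and as a relative figure. The relationship between the two can be sketched as follows; the function name and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
from typing import Tuple


def wer_reductions(baseline_wer: float, system_wer: float) -> Tuple[float, float]:
    """Return (absolute, relative) WER reduction, both in percent.

    absolute = baseline - system           (percentage points)
    relative = 100 * absolute / baseline   (percent of the baseline WER)
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    absolute = baseline_wer - system_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative
```

With a hypothetical baseline WER of 20.0% and a system WER of 15.0%, the call wer_reductions(20.0, 15.0) gives a 5.0 point absolute reduction and a 25.0% relative reduction.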
Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition
Automatic recognition of disordered and elderly speech remains a highly
challenging task to date due to the difficulty in collecting such data in large
quantities. This paper explores a series of approaches to integrate domain
adapted SSL pre-trained models into TDNN and Conformer ASR systems for
dysarthric and elderly speech recognition: a) input feature fusion between
standard acoustic frontends and domain adapted wav2vec2.0 speech
representations; b) frame-level joint decoding of TDNN systems separately
trained using standard acoustic features alone and with additional wav2vec2.0
features; and c) multi-pass decoding in which the TDNN/Conformer system
outputs are rescored using domain adapted wav2vec2.0 models. In addition,
domain adapted wav2vec2.0 representations are utilized in
acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric
and elderly speech recognition systems. Experiments conducted on the UASpeech
dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and
Conformer ASR systems integrating domain adapted wav2vec2.0 models consistently
outperform the standalone wav2vec2.0 models by statistically significant WER
reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two
tasks respectively. The lowest published WERs of 22.56% (52.53% on very low
intelligibility, 39.09% on unseen words) and 18.17% are obtained on the
UASpeech test set of 16 dysarthric speakers, and the DementiaBank Pitt test set
respectively.
Comment: accepted by ICASSP 202
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Fundamental modelling differences between hybrid and end-to-end (E2E)
automatic speech recognition (ASR) systems create large diversity and
complementarity among them. This paper investigates multi-pass rescoring and
cross adaptation based system combination approaches for hybrid TDNN and
Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid
LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and
Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used
to produce initial N-best outputs before being rescored by the speaker adapted
Conformer system using a 2-way cross system score interpolation. In cross
adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the
Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus
suggest that the combined systems derived using either of the two system
combination approaches outperformed the individual systems. The best combined
system obtained using multi-pass rescoring produced statistically significant
word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9%
relative) over the standalone Conformer system on the NIST Hub5'00, Rt03 and
Rt02 evaluation data.
Comment: It's accepted to ISCA 202
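The multi-pass rescoring described in the abstract above combines two systems' scores for each N-best hypothesis. The sketch below shows one common way to do this, a linear interpolation of log-scores with a single weight. The abstract does not give the interpolation formula, so the weighting scheme, the function name and the data layout are all assumptions made for illustration.

```python
from typing import List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float, float]],
    weight: float = 0.5,
) -> str:
    """Pick the hypothesis with the best interpolated score.

    Each entry is (hypothesis, first_pass_log_score, second_pass_log_score).
    Assumed combination: (1 - weight) * first_pass + weight * second_pass,
    where higher log-scores are better.
    """
    if not nbest:
        raise ValueError("N-best list must not be empty")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    def combined(entry: Tuple[str, float, float]) -> float:
        _, first_pass, second_pass = entry
        return (1.0 - weight) * first_pass + weight * second_pass

    return max(nbest, key=combined)[0]
```

A weight of 0.0 keeps the first-pass ranking and a weight of 1.0 uses the second-pass scores alone; in practice such a weight would be tuned on held-out data.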
Enhancing Literacy Education with Narrative Richness in the Metaverse
Through an education-centric metaverse learning application, this research aims to assess the use of narrative richness to deliver media, language, and sustainability literacy education. Twenty-first-century learning needs require teaching and learning resources to be shared and managed more effectively across institutions. The use of metaverse features can help to manage varying narrative richness to boost learning reflection and attitude. Despite its potential, it is unclear how narrative richness in the metaverse can enhance teaching and learning. The study proposed in this research, which includes institutions from four Asian countries, is driven by this knowledge and evidence gap. Module leaders conceptualize and evaluate a purpose-built metaverse-learning application to produce rich and realistic learning experiences. We utilize narratives to enhance the realism of learning experiences and will assess the effects of narrative richness on learning reflection and attitude.
Investigating a Potential Causal Relationship Between Maternal Blood Pressure During Pregnancy and Future Offspring Cardiometabolic Health
Observational epidemiological studies have reported that higher maternal blood pressure (BP) during pregnancy is associated with increased future risk of offspring cardiometabolic disease. However, it is unclear whether this association represents a causal relationship through intrauterine mechanisms. We used a Mendelian randomization (MR) framework to examine the relationship between unweighted maternal genetic scores for systolic BP and diastolic BP and a range of cardiometabolic risk factors in the offspring of up to 29 708 genotyped mother-offspring pairs from the UKB study (UK Biobank) and the HUNT study (Trøndelag Health). We conducted similar analyses in up to 21 423 father-offspring pairs from the same cohorts. We confirmed that the BP-associated genetic variants from the general population sample also had similar effects on maternal BP during pregnancy in independent cohorts. We did not detect any association between maternal (or paternal) unweighted genetic scores and cardiometabolic offspring outcomes in the meta-analysis of UKB and HUNT after adjusting for offspring genotypes at the same loci. We find little evidence to support the notion that maternal BP is a major causal risk factor for adverse offspring cardiometabolic outcomes in later life.
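An unweighted genetic score, as used in the abstract above, is usually a simple count of trait-increasing alleles across the selected variants, with every variant contributing equally. The sketch below illustrates that usual definition only; the function name and the 0/1/2 genotype coding are assumptions, and the study's actual variant selection and quality control are not shown.

```python
from typing import Sequence


def unweighted_genetic_score(allele_counts: Sequence[int]) -> int:
    """Sum BP-increasing allele counts over variants for one individual.

    Assumption: each entry is the number of BP-increasing alleles carried
    at one variant, coded 0, 1 or 2. No per-variant effect-size weights
    are applied, which is what makes the score "unweighted".
    """
    for count in allele_counts:
        if count not in (0, 1, 2):
            raise ValueError("allele counts must be 0, 1 or 2")
    return sum(allele_counts)
```

For a hypothetical individual carrying 0, 1, 2 and 1 BP-increasing alleles at four variants, the score is their sum, 4.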