NAaLoss: Rethinking the objective of speech enhancement
Reducing noise interference is crucial for automatic speech recognition (ASR)
in real-world scenarios. However, most single-channel speech enhancement (SE)
methods generate "processing artifacts" that negatively affect ASR performance. Hence,
in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss,
to ameliorate the influence of artifacts from a novel perspective. NAaLoss
considers the loss of estimation, de-artifact, and noise ignorance, enabling
the learned SE to individually model speech, artifacts, and noise. We examine
two SE models (simple/advanced) learned with NAaLoss under various input
scenarios (clean/noisy) using two configurations of the ASR system
(with/without noise robustness). Experiments reveal that NAaLoss significantly
improves the ASR performance of most setups while preserving the quality of SE
toward perception and intelligibility. Furthermore, we visualize artifacts
through waveforms and spectrograms, and explain their impact on ASR.
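The abstract names three loss terms (estimation, de-artifact, and noise ignorance) without giving their definitions. The sketch below is a hypothetical decomposition, assuming the artifact term penalizes any change the SE model makes to already-clean speech and the noise-ignorance term penalizes residual output for noise-only input; the weights and exact formulations are illustrative assumptions, not the paper's.

```python
import numpy as np

def naaloss_sketch(se_fn, noisy, clean, noise, w_est=1.0, w_art=0.5, w_noise=0.1):
    """Illustrative noise- and artifact-aware loss (assumed formulation)."""
    est = se_fn(noisy)
    # 1) estimation loss: enhanced output should match the clean target
    loss_est = np.mean(np.abs(est - clean))
    # 2) de-artifact loss: clean input should pass through unchanged,
    #    so any residual is treated as a processing artifact
    loss_art = np.mean(np.abs(se_fn(clean) - clean))
    # 3) noise-ignorance loss: a noise-only input should be suppressed
    loss_noise = np.mean(np.abs(se_fn(noise)))
    return w_est * loss_est + w_art * loss_art + w_noise * loss_noise
```

With an identity "model", the artifact term vanishes and the other two reduce to mean absolute amplitudes, which makes the role of each term easy to inspect.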
Comparative analysis for data-driven temporal filters obtained via principal component analysis (PCA) and linear discriminant analysis (LDA) in speech recognition
Abstract: Linear Discriminant Analysis (LDA) has been widely used to derive data-driven temporal filters for speech feature vectors. In this paper, we propose that Principal Component Analysis (PCA) can also be used in the optimization process, just as LDA is, to obtain temporal filters, and a detailed comparative analysis between the two approaches is presented and discussed. It is found that the PCA-derived temporal filters significantly improve the recognition performance of the original MFCC features, as the LDA-derived filters do. Moreover, when the PCA/LDA filters are combined with conventional temporal filters such as RASTA or CMS, recognition performance is further improved, regardless of whether the training and testing environments are matched or mismatched, compressed or noise-corrupted.
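A common way to derive a PCA-based temporal filter, consistent with the abstract's description, is to collect windows of each cepstral coefficient's time trajectory and take the principal eigenvector as an FIR filter. The sketch below is a minimal illustration under that assumption; function names, the window length, and the single-coefficient framing are ours, not the paper's.

```python
import numpy as np

def pca_temporal_filter(feature_trajs, filter_len=10):
    """Derive a temporal FIR filter via PCA (illustrative sketch).

    feature_trajs: array-like of shape (num_utts, num_frames), the time
    trajectories of ONE cepstral coefficient across training utterances.
    """
    # collect all overlapping windows of length filter_len
    windows = []
    for traj in feature_trajs:
        for t in range(len(traj) - filter_len + 1):
            windows.append(traj[t:t + filter_len])
    X = np.asarray(windows)
    X = X - X.mean(axis=0)              # center before PCA
    cov = X.T @ X / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh sorts eigenvalues ascending: the last eigenvector is principal
    return eigvecs[:, -1]

def apply_temporal_filter(traj, h):
    # filter one coefficient trajectory; same-length output
    return np.convolve(traj, h, mode="same")
```

An LDA-derived filter would replace the covariance eigen-decomposition with a generalized eigenproblem on between-class and within-class scatter, which is what makes the two approaches directly comparable.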
Time-Domain Multi-modal Bone/air Conducted Speech Enhancement
Previous studies have proven that integrating video signals, as a
complementary modality, can facilitate improved performance for speech
enhancement (SE). However, video clips usually contain large amounts of data
and pose a high cost in terms of computational resources and thus may
complicate the SE system. As an alternative source, a bone-conducted speech
signal has a moderate data size while manifesting speech-phoneme structures,
and thus complements its air-conducted counterpart. In this study, we propose a
novel multi-modal SE structure in the time domain that leverages bone- and
air-conducted signals. In addition, we examine two ensemble-learning-based
strategies, early fusion (EF) and late fusion (LF), to integrate the two types
of speech signals, and adopt a deep learning-based fully convolutional network
to conduct the enhancement. The experiment results on the Mandarin corpus
indicate that this newly presented multi-modal (integrating bone- and
air-conducted signals) SE structure significantly outperforms the single-source
SE counterparts (with a bone- or air-conducted signal only) in various speech
evaluation metrics. In addition, adopting the LF strategy rather than the
EF strategy in this novel multi-modal SE structure achieves better results. Comment: multi-modal, bone/air-conducted signals, speech enhancement, fully
convolutional network
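The distinction between the two fusion strategies can be made concrete with a small sketch. Here `enhance` is a stand-in for the trained fully convolutional SE model (an identity placeholder for illustration); combining by channel stacking (EF) and by averaging separate outputs (LF) are our assumed instantiations of the two strategies.

```python
import numpy as np

def enhance(x):
    # placeholder for a fully convolutional SE model (identity here)
    return x

def early_fusion(ac, bc):
    # EF: stack air- and bone-conducted waveforms as input channels,
    # then run a single model on the fused input
    fused_input = np.stack([ac, bc], axis=0)
    return enhance(fused_input).mean(axis=0)

def late_fusion(ac, bc):
    # LF: enhance each modality separately, then merge the two outputs
    return 0.5 * (enhance(ac) + enhance(bc))
```

The practical difference is where the integration happens: EF lets one network learn cross-modal cues jointly, while LF keeps per-modality models independent and merges only their outputs.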
Speech Enhancement Guided by Contextual Articulatory Information
Previous studies have confirmed the effectiveness of leveraging articulatory
information to attain improved speech enhancement (SE) performance. By
augmenting the original acoustic features with the place/manner of articulatory
features, the SE process can be guided to consider the articulatory properties
of the input speech when performing enhancement. Hence, we believe that the
contextual information of articulatory attributes also carries useful
cues and can further benefit SE in different languages. In this study,
we propose an SE system that improves its performance through optimizing the
contextual articulatory information in enhanced speech for both English and
Mandarin. We optimize the contextual articulatory information by jointly
training the SE model with an end-to-end automatic speech recognition (E2E
ASR) model, predicting the sequence of broad phone classes (BPC) instead of the
word sequences. Meanwhile, two training strategies are developed to train the
SE system based on the BPC-based ASR: multitask-learning and deep-feature
training strategies. Experimental results on the TIMIT and TMHINT datasets
confirm that the contextual articulatory information helps an SE system
achieve better results than a traditional acoustic model (AM). Moreover,
in contrast to another SE system trained with a monophone-based ASR, the
BPC-based ASR (providing contextual articulatory information) can improve
SE performance more effectively under different signal-to-noise ratios (SNRs). Comment: Will be submitted to TASL