Time-Domain Multi-modal Bone/Air-Conducted Speech Enhancement
Previous studies have shown that integrating video signals, as a
complementary modality, can improve the performance of speech
enhancement (SE). However, video clips usually contain large amounts of data,
incur a high computational cost, and may thus
complicate the SE system. As an alternative source, a bone-conducted speech
signal has a moderate data size while preserving speech-phoneme structures,
and thus complements its air-conducted counterpart. In this study, we propose a
novel multi-modal SE structure in the time domain that leverages bone- and
air-conducted signals. In addition, we examine two ensemble-learning-based
strategies, early fusion (EF) and late fusion (LF), to integrate the two types
of speech signals, and adopt a deep learning-based fully convolutional network
to conduct the enhancement. The experimental results on a Mandarin corpus
indicate that the presented multi-modal SE structure (integrating bone- and
air-conducted signals) significantly outperforms its single-source
counterparts (with a bone- or air-conducted signal only) on various speech
evaluation metrics. In addition, adopting the LF strategy rather than EF in
this multi-modal SE structure achieves better results.

Comment: multi-modal, bone/air-conducted signals, speech enhancement, fully
convolutional network
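The early-fusion/late-fusion distinction can be illustrated with a toy numpy sketch. This is not the paper's fully convolutional network: the weights are random, the "network" is a single cross-correlation layer, and averaging the two single-source outputs is just one simple choice of LF merge.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid-mode 1-D cross-correlation (a deep-learning-style 'conv' layer),
    summed over input channels. x: (in_ch, T), w: (out_ch, in_ch, k)."""
    out_ch, in_ch, k = w.shape
    t = x.shape[1] - k + 1
    y = np.zeros((out_ch, t))
    for o in range(out_ch):
        for i in range(in_ch):
            # np.convolve flips its kernel, so reversing w gives correlation
            y[o] += np.convolve(x[i], w[o, i][::-1], mode="valid")
    return y

air = rng.standard_normal((1, 64))   # air-conducted waveform (1 channel)
bone = rng.standard_normal((1, 64))  # bone-conducted waveform (1 channel)

# Early fusion (EF): stack both signals as input channels of one network.
w_ef = rng.standard_normal((1, 2, 5)) * 0.1
ef_out = conv1d(np.vstack([air, bone]), w_ef)

# Late fusion (LF): process each signal separately, then merge the outputs.
w_air = rng.standard_normal((1, 1, 5)) * 0.1
w_bone = rng.standard_normal((1, 1, 5)) * 0.1
lf_out = 0.5 * (conv1d(air, w_air) + conv1d(bone, w_bone))
```

The structural difference is where the modalities meet: EF fuses at the input (one model sees both channels), while LF runs two single-source models and combines their enhanced outputs.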
Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement
This paper investigates trade-offs between the number of model
parameters and enhanced speech quality by employing several deep
tensor-to-vector regression models for speech enhancement. We find that a
hybrid architecture, namely CNN-TT, is capable of maintaining good quality
performance with a reduced model parameter size. CNN-TT is composed of several
convolutional layers at the bottom for feature extraction to improve speech
quality and a tensor-train (TT) output layer on the top to reduce model
parameters. We first derive a new upper bound on the generalization power of
the convolutional neural network (CNN) based vector-to-vector regression
models. Then, we provide experimental evidence on the Edinburgh noisy speech
corpus to demonstrate that, in single-channel speech enhancement, CNN
outperforms DNN at the expense of a small increase in model size. Moreover,
CNN-TT slightly outperforms its CNN counterpart while using only 32% of the
CNN model parameters, and further performance improvement can be attained
if the number of CNN-TT parameters is increased to 44% of the CNN model size.
Finally, our experiments on multi-channel speech enhancement with a simulated
noisy WSJ0 corpus demonstrate that the proposed hybrid CNN-TT architecture
achieves better enhanced speech quality than both DNN and CNN models,
with smaller parameter sizes.

Comment: Accepted to InterSpeech 202