Configurable EBEN: Extreme Bandwidth Extension Network to enhance body-conducted speech capture
This paper presents a configurable version of Extreme Bandwidth Extension
Network (EBEN), a Generative Adversarial Network (GAN) designed to improve
audio captured with body-conduction microphones. We show that although these
microphones significantly reduce environmental noise, this insensitivity to
ambient noise comes at the expense of the bandwidth of the speech signal
acquired by the wearer of the device. The captured signals therefore require
signal enhancement techniques to recover the full-bandwidth speech. EBEN
leverages a configurable multiband decomposition of the raw captured signal.
This decomposition reduces the time-domain dimensions of the data and allows
finer control over the full-band signal. The multiband
representation of the captured signal is processed through a U-Net-like model,
which combines feature and adversarial losses to generate an enhanced speech
signal. This representation also benefits the proposed configurable
discriminator architecture. The configurable EBEN approach can
achieve state-of-the-art enhancement results on synthetic data with a
lightweight generator that allows real-time processing.
Comment: Accepted in IEEE/ACM Transactions on Audio, Speech and Language Processing on 14/08/202
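As a rough illustration of the multiband idea, here is a minimal Python sketch of a pseudo-QMF analysis bank; the band count, prototype filter length, and modulation form are generic textbook choices, not EBEN's actual configuration.

```python
# Minimal pseudo-QMF analysis bank: splits a waveform into n_bands
# critically-downsampled subbands, shrinking the time dimension by n_bands.
import numpy as np
from scipy.signal import firwin, lfilter

def pqmf_analysis(x, n_bands=4, taps=63):
    # Prototype lowpass with cutoff at half the band spacing (Nyquist units).
    proto = firwin(taps, 0.5 / n_bands)
    n = np.arange(taps)
    subbands = []
    for k in range(n_bands):
        # Cosine modulation shifts the prototype onto band k.
        mod = np.cos((2 * k + 1) * np.pi / (2 * n_bands)
                     * (n - (taps - 1) / 2) + (-1) ** k * np.pi / 4)
        band = lfilter(2 * proto * mod, [1.0], x)
        subbands.append(band[::n_bands])       # critical downsampling
    return np.stack(subbands)                  # (n_bands, ~len(x) / n_bands)
```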
Speaker Re-identification with Speaker Dependent Speech Enhancement
While the use of deep neural networks has significantly boosted speaker
recognition performance, it is still challenging to separate speakers in poor
acoustic environments. Speech enhancement methods have traditionally improved
performance in such conditions, and recent work has shown that adapting the
enhancement to the speaker can lead to further gains. This paper introduces a
novel approach
that cascades speech enhancement and speaker recognition. In the first step, a
speaker embedding vector is generated, which is used in the second step to
enhance the speech quality and re-identify the speakers. Models are trained in
an integrated framework with joint optimisation. The proposed approach is
evaluated on the VoxCeleb1 dataset, which aims to assess speaker recognition
in real-world situations. In addition, three types of noise at different
signal-to-noise ratios were added for this work. The results show that the
proposed approach using speaker-dependent speech enhancement can yield better
speaker recognition and speech enhancement performance than two baselines in
various noise conditions.
Comment: Accepted for presentation at Interspeech202
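A hedged PyTorch sketch of the two-step cascade follows; the layer sizes and the FiLM-style scale/shift conditioning used to inject the speaker embedding are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEnhancer(nn.Module):
    def __init__(self, n_mels=80, emb_dim=192):
        super().__init__()
        # Step 1: pool a speaker embedding from the noisy utterance.
        self.embed = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, emb_dim))
        # Step 2: enhance, conditioned on the embedding via scale and shift.
        self.film = nn.Linear(emb_dim, 2 * 128)
        self.enc = nn.Conv1d(n_mels, 128, 5, padding=2)
        self.dec = nn.Conv1d(128, n_mels, 5, padding=2)

    def forward(self, noisy):                 # noisy: (batch, n_mels, frames)
        emb = self.embed(noisy)               # speaker embedding vector
        scale, shift = self.film(emb).chunk(2, dim=-1)
        h = torch.relu(self.enc(noisy))
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        mask = torch.sigmoid(self.dec(h))     # per-bin suppression mask
        return noisy * mask, emb              # enhanced speech + embedding
```

Training both steps with a joint loss on the enhanced output and the re-identification decision mirrors the integrated optimisation described above.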
Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning
In this paper, we explore a continuous modeling approach for
deep-learning-based speech enhancement, focusing on the denoising process. We
use a state variable to index the denoising process: the starting state is
noisy speech, the ending state is clean speech, and the noise component of the
state variable decreases as the state index advances until it reaches zero.
During training, a U-Net-like neural network learns to estimate
every state variable sampled from the continuous denoising process. In testing,
we feed the neural network a controlling factor, embedded as a value ranging
from zero to one, which lets us control the level of noise reduction. This
approach enables controllable speech enhancement and is adaptable to various
application scenarios. Experimental results indicate that preserving a small
amount of noise in the clean target benefits speech enhancement, as evidenced
by improvements in both objective speech measures and automatic speech
recognition performance.
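A small sketch of the state-variable idea, assuming a linear interpolation between noisy and clean speech as the continuous denoising trajectory; the exact form of the state variable and the sampling scheme are assumptions.

```python
import torch

def sample_state(noisy, clean, t):
    """State at index t: t=0 is the noisy input, t=1 is clean speech; the
    noise component shrinks as t grows (linear form assumed here)."""
    return (1.0 - t) * noisy + t * clean

def training_step(model, loss_fn, noisy, clean):
    t = torch.rand(noisy.shape[0], 1)     # random state index per utterance
    target = sample_state(noisy, clean, t)
    pred = model(noisy, t)                # t doubles as the control embedding
    return loss_fn(pred, target)
```

At test time, fixing t slightly below one deliberately leaves a small residual noise component, which matches the observation above that preserving a little noise can help.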
Self-Training for End-to-End Speech Recognition
We revisit self-training in the context of end-to-end speech recognition. We
demonstrate that training with pseudo-labels can substantially improve the
accuracy of a baseline model. Key to our approach are a strong baseline
acoustic and language model used to generate the pseudo-labels, filtering
mechanisms tailored to common errors from sequence-to-sequence models, and a
novel ensemble approach to increase pseudo-label diversity. Experiments on the
LibriSpeech corpus show that with an ensemble of four models and label
filtering, self-training yields a 33.9% relative improvement in WER compared
with a baseline trained on 100 hours of labelled data in the noisy speech
setting. In the clean speech setting, self-training recovers 59.3% of the gap
between the baseline and an oracle model, which is at least 93.8% higher, in
relative terms, than what previous approaches achieve.
Comment: To be published in the 45th IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP) 202
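A hedged sketch of the filtering step; decode_with_score is an assumed stand-in for beam-search decoding with the combined acoustic and language model score, and the n-gram loop check mimics one common sequence-to-sequence failure mode.

```python
def has_ngram_loop(tokens, n=4):
    """Flag the repeated-n-gram looping typical of seq2seq decoders."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(ngrams) != len(set(ngrams))

def filter_pseudo_labels(model, unlabeled_audio, score_threshold=-1.0):
    kept = []
    for audio in unlabeled_audio:
        hyp, log_prob = model.decode_with_score(audio)   # assumed API
        normalized = log_prob / max(len(hyp), 1)         # length-normalised
        if normalized > score_threshold and not has_ngram_loop(hyp):
            kept.append((audio, hyp))     # keep only confident, non-looping
    return kept                           # pseudo-labelled pairs for training
```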
Power scalable implementation of artificial neural networks
As the use of Artificial Neural Networks (ANNs) in mobile embedded devices becomes more pervasive, the power consumption of ANN hardware is becoming a major limiting factor. Although considerable research effort is now directed towards low-power implementations of ANNs, the issue of dynamic power scalability of the implemented design has been largely overlooked. In this paper, we discuss the motivation and basic principles for implementing power scaling in ANN hardware. With the help of a simple example, we demonstrate how power scaling can be achieved with dynamic pruning techniques.
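As a toy illustration of the dynamic pruning knob, the Python sketch below zeroes the smallest-magnitude weights at runtime; the magnitude criterion and the single keep_fraction parameter are assumptions for illustration, not the paper's method.

```python
import numpy as np

def prune_for_budget(weights, keep_fraction):
    """Keep only the largest-magnitude fraction of weights; lowering
    keep_fraction trades accuracy for fewer active multiplies, and hence
    lower switching power in the hardware."""
    flat = np.abs(weights).ravel()
    k = max(1, int(len(flat) * keep_fraction))
    threshold = np.partition(flat, -k)[-k]   # k-th largest magnitude
    return np.where(np.abs(weights) >= threshold, weights, 0.0)
```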
End-to-end speech enhancement based on discrete cosine transform
Previous speech enhancement methods focus on estimating the short-time
spectrum of speech signals due to its short-term stability. However, these
methods often estimate only the clean magnitude spectrum and reuse the noisy
phase when resynthesizing speech signals, which is unlikely to form a valid
short-time Fourier transform (STFT). Recently, DNN-based speech enhancement
methods have moved towards joint estimation of the magnitude and phase
spectra. These methods usually outperform magnitude-only estimation but incur
much larger computation and memory overhead. In this paper, we propose using
the Discrete
Cosine Transform (DCT) to reconstruct a valid short-time spectrum. Under the
U-Net structure, we enhance the real-valued spectrogram and achieve strong
performance.
Comment: 5 pages, 5 figures, ICASSP 202
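For context, a minimal sketch of the real-valued short-time spectrum the paper builds on: because each frame's DCT is real, an enhanced spectrogram inverted frame by frame is automatically consistent, with no phase to reuse. The framing parameters are assumptions.

```python
import numpy as np
from scipy.fft import dct

def stdct(x, frame=512, hop=128):
    """Short-time DCT: a real-valued spectrogram, shape (frames, bins)."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop: i * hop + frame] * win
                       for i in range(n_frames)])
    return dct(frames, type=2, norm='ortho', axis=-1)
```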
Speech Enhancement for Virtual Meetings on Cellular Networks
We study speech enhancement using deep learning (DL) for virtual meetings on
cellular devices, where transmitted speech has background noise and
transmission loss that affects speech quality. Since the Deep Noise Suppression
(DNS) Challenge dataset does not contain such practical disturbances, we
collect a transmitted DNS (t-DNS) dataset using Zoom Meetings over a T-Mobile
network. We select two baseline models: Demucs and FullSubNet. Demucs is an
end-to-end
model that takes time-domain inputs and outputs time-domain denoised speech,
and FullSubNet takes time-frequency-domain inputs and outputs the energy
ratio of the target speech in the inputs. The goal of this project is to
enhance speech transmitted over cellular networks using deep learning
models.
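For context, a minimal sketch of how a FullSubNet-style output is applied; this magnitude-ratio form is a simplification (the actual model predicts a complex ratio mask), shown only to make the input/output contract concrete.

```python
import numpy as np

def apply_ratio_mask(noisy_stft, ratio_mask):
    """ratio_mask in [0, 1] estimates the target-speech share per T-F bin;
    scaling the complex bins keeps the noisy phase."""
    return noisy_stft * np.clip(ratio_mask, 0.0, 1.0)
```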