Objective Assessment of Machine Learning Algorithms for Speech Enhancement in Hearing Aids
Speech enhancement in assistive hearing devices has been an area of research for many decades. Noise reduction is particularly challenging because of the wide variety of noise sources and the non-stationarity of speech and noise. Digital signal processing (DSP) algorithms deployed in modern hearing aids for noise reduction rely on certain assumptions about the statistical properties of undesired signals. These assumptions can hinder accurate estimation of different noise types, which subsequently leads to suboptimal noise reduction. In this research, a relatively unexplored deep learning technique, the Recurrent Neural Network (RNN), is used to perform noise reduction and dereverberation for assisting hearing-impaired listeners. For noise reduction, the performance of the deep learning model was evaluated objectively and compared with that of open Master Hearing Aid (openMHA), a conventional signal-processing-based framework, and a Deep Neural Network (DNN)-based model. It was found that the RNN model can suppress noise and improve speech understanding better than the conventional hearing aid noise reduction algorithm and the DNN model. The same RNN model was shown to reduce reverberation components with proper training. A real-time implementation of the deep learning model is also discussed.
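The abstract gives no implementation details, but the core mechanism, an RNN predicting a time-frequency mask over noisy spectral magnitudes, can be sketched as follows. This is a minimal illustration in PyTorch under stated assumptions; the bin count, GRU size, and sigmoid mask formulation are choices made here, not the authors' architecture.

# Minimal sketch of RNN-based spectral masking for noise reduction.
# Assumptions (not from the paper): 257-bin STFT magnitudes, a 2-layer
# GRU predicting a [0, 1] mask applied to the noisy magnitude spectrum.
import torch
import torch.nn as nn

class MaskingRNN(nn.Module):
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, noisy_mag):            # (batch, frames, bins)
        h, _ = self.rnn(noisy_mag)
        return self.mask(h) * noisy_mag      # masked (enhanced) magnitudes

model = MaskingRNN()
noisy = torch.rand(1, 100, 257)              # dummy magnitude spectrogram
enhanced = model(noisy)
print(enhanced.shape)                        # torch.Size([1, 100, 257])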
ICASSP 2023 Acoustic Echo Cancellation Challenge
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate
research in acoustic echo cancellation (AEC), which is an important area of
speech enhancement and is still a top issue in audio communication. This is the
fourth AEC challenge; it adds a second track for personalized acoustic echo
cancellation, reduces the algorithmic-plus-buffering latency to 20 ms, and
includes a full-band version of AECMOS. We open
source two large datasets to train AEC models under both single talk and double
talk scenarios. These datasets consist of recordings from more than 10,000 real
audio devices and human speakers in real environments, as well as a synthetic
dataset. We open source an online subjective test framework and provide an
objective metric for researchers to quickly test their results. The winners of
this challenge were selected based on the average mean opinion score (MOS)
achieved across all scenarios and the word accuracy (WAcc) rate.
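For illustration, the ranking criterion described above (average MOS across scenarios combined with WAcc) might be computed along these lines. The equal weighting and the MOS normalization below are assumptions made here; the challenge defines its own exact formula.

# Hedged sketch of a challenge-style ranking score: average the MOS
# values across scenarios, rescale from the 1-5 MOS range to [0, 1],
# and mix with word accuracy (WAcc). The 0.5/0.5 weighting is assumed.
def final_score(mos_by_scenario, wacc, mos_weight=0.5):
    avg_mos = sum(mos_by_scenario) / len(mos_by_scenario)
    return mos_weight * (avg_mos - 1) / 4 + (1 - mos_weight) * wacc

print(final_score([4.2, 3.9, 4.0], wacc=0.93))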
Deep speech inpainting of time-frequency masks
Transient loud intrusions, often occurring in noisy environments, can
completely overpower the speech signal and lead to an inevitable loss of
information. While existing algorithms for noise suppression can yield
impressive results, their efficacy remains limited for very low signal-to-noise
ratios or when parts of the signal are missing. To address these limitations,
here we propose an end-to-end framework for speech inpainting: the
context-based retrieval of missing or severely distorted parts of the
time-frequency representation of speech. The framework is based on a
convolutional U-Net trained via deep feature losses, obtained using speechVGG,
a deep speech feature extractor pre-trained on an auxiliary word classification
task. Our evaluation results demonstrate that the proposed framework can
recover large portions of the missing or distorted time-frequency
representation of speech, up to 400 ms in duration and 3.2 kHz in bandwidth.
In particular, our approach provided a substantial increase in the STOI and
PESQ objective metrics of the initially corrupted speech samples. Notably,
training the framework with deep feature losses led to the best results
compared to conventional approaches.
Comment: Accepted to InterSpeech 2020
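The deep feature loss mentioned above can be sketched as follows: compare the inpainted and reference spectrograms in the feature space of a pre-trained extractor. The toy extractor below merely stands in for speechVGG (its layers are hypothetical), and the per-layer L1 distance is an illustrative assumption.

# Sketch of a deep feature loss for training the inpainting network.
# FeatureExtractor is a stand-in for the pre-trained speechVGG model;
# in practice its weights would be loaded and frozen.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),
        ])

    def forward(self, x):                    # (batch, 1, freq, time)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)                  # collect intermediate features
        return feats

def deep_feature_loss(extractor, estimate, reference):
    loss = 0.0
    for f_est, f_ref in zip(extractor(estimate), extractor(reference)):
        loss = loss + torch.mean(torch.abs(f_est - f_ref))  # L1 per layer
    return loss

extractor = FeatureExtractor().eval()        # would load speechVGG weights
est = torch.rand(1, 1, 128, 128)             # inpainted spectrogram (dummy)
ref = torch.rand(1, 1, 128, 128)             # clean spectrogram (dummy)
print(deep_feature_loss(extractor, est, ref))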
On real-time multi-stage speech enhancement systems
Recently, multi-stage systems have stood out among deep learning-based speech
enhancement methods. However, these systems tend to be highly complex,
requiring millions of parameters and powerful computational resources, which
limits their application to real-time processing on low-power devices.
Moreover, the contribution of various influencing factors to the success of
multi-stage systems remains unclear, which makes it difficult to reduce the
size of these systems. In this paper, we extensively investigate a lightweight
two-stage network with only 560k total parameters. It consists of a Mel-scale
magnitude masking model in the first stage and a complex spectrum mapping model
in the second stage. We first provide a consolidated view of the roles of the
gain power factor, the post-filter, and the training labels for the Mel-scale
masking model.
Then, we explore several training schemes for the two-stage network and provide
some insights into the superiority of the two-stage network. We show that the
proposed two-stage network trained by an optimal scheme achieves a performance
similar to that of DeepFilterNet2, an open-source model four times its size.
Comment: To appear at ICASSP 202
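A minimal sketch of the two-stage layout described above: stage 1 masks Mel-scale magnitudes, stage 2 maps the complex spectrum. Layer types and sizes are illustrative assumptions and do not reproduce the paper's 560k-parameter design.

# Two-stage sketch in PyTorch (shapes and layer sizes assumed).
import torch
import torch.nn as nn

class Stage1MelMask(nn.Module):
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_mels), nn.Sigmoid())

    def forward(self, mel_mag):              # (batch, frames, mels)
        h, _ = self.rnn(mel_mag)
        return self.out(h) * mel_mag         # masked Mel-scale magnitudes

class Stage2ComplexMap(nn.Module):
    def __init__(self):
        super().__init__()
        # Real and imaginary parts stacked as two input channels.
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),
        )

    def forward(self, complex_spec):         # (batch, 2, frames, bins)
        return self.net(complex_spec)        # refined complex spectrum

mel = torch.rand(1, 100, 64)                 # dummy Mel magnitudes
spec = torch.rand(1, 2, 100, 257)            # dummy complex spectrum
print(Stage1MelMask()(mel).shape, Stage2ComplexMap()(spec).shape)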
Noise-Robust DSP-Assisted Neural Pitch Estimation with Very Low Complexity
Pitch estimation is an essential step of many speech processing algorithms,
including speech coding, synthesis, and enhancement. Recently, pitch estimators
based on deep neural networks (DNNs) have been outperforming
well-established DSP-based techniques. Unfortunately, these new estimators can
be impractical to deploy in real-time systems, both because of their relatively
high complexity, and the fact that some require significant lookahead. We show
that a hybrid estimator using a small deep neural network (DNN) with
traditional DSP-based features can match or exceed the performance of pure
DNN-based models, with a complexity and algorithmic delay comparable to
traditional DSP-based algorithms. We further demonstrate that this hybrid
approach can provide benefits for a neural vocoding task.
Comment: Submitted to ICASSP 2024, 5 pages
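The hybrid recipe, classic DSP features feeding a small DNN, can be sketched as below. The choice of lag-normalized autocorrelation features and the network size are assumptions for illustration; the paper's exact features may differ.

# Sketch of a hybrid pitch estimator: frame-wise autocorrelation (DSP)
# features scored by a small DNN over discrete pitch candidates.
import numpy as np
import torch
import torch.nn as nn

def autocorr_features(frame, max_lag=256):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac[:max_lag] / (ac[0] + 1e-8)       # lag-normalized autocorrelation
    return ac.astype(np.float32)

class PitchDNN(nn.Module):
    def __init__(self, max_lag=256, n_candidates=180):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(max_lag, 128), nn.ReLU(),
            nn.Linear(128, n_candidates),     # logits over pitch bins
        )

    def forward(self, feats):
        return self.net(feats)

frame = np.random.randn(512)                 # dummy audio frame
feats = torch.from_numpy(autocorr_features(frame)).unsqueeze(0)
print(PitchDNN()(feats).argmax(dim=-1))      # index of the best pitch bin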
Efficient Monaural Speech Enhancement using Spectrum Attention Fusion
Speech enhancement is a demanding task in automated speech processing
pipelines, focusing on separating clean speech from noisy channels.
Transformer-based models have recently surpassed RNN and CNN models in speech
enhancement, but they are much more computationally expensive and require far
more high-quality training data, which is hard to come by.
In this paper, we present an improvement for speech enhancement models that
maintains the expressiveness of self-attention while significantly reducing
model complexity, which we have termed Spectrum Attention Fusion. We carefully
construct a convolutional module to replace several self-attention layers in a
speech Transformer, allowing the model to more efficiently fuse spectral
features. Our proposed model achieves results comparable to or better than
SOTA models on the Voice Bank + DEMAND dataset, with significantly fewer
parameters (0.58M).
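The central idea, replacing self-attention layers with a convolutional module that produces attention-style weights over the spectrogram, might look roughly like this. Kernel sizes, channel counts, and the sigmoid gating are assumptions made here, not the paper's Spectrum Attention Fusion module.

# Sketch of a convolutional attention-style fusion block: a depthwise
# conv gathers local time-frequency context, a pointwise conv mixes
# channels, and a sigmoid gate reweights the spectral features.
import torch
import torch.nn as nn

class ConvSpectrumFusion(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels),                 # depthwise: local T-F context
            nn.Conv2d(channels, channels, kernel_size=1),  # pointwise: channel mixing
            nn.Sigmoid(),
        )

    def forward(self, x):                    # (batch, channels, frames, bins)
        return x * self.weights(x)           # fused spectral features

x = torch.rand(1, 32, 100, 257)
print(ConvSpectrumFusion()(x).shape)         # torch.Size([1, 32, 100, 257])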
CheapNET: Improving Light-weight speech enhancement network by projected loss function
Noise suppression and echo cancellation are critical in speech enhancement
and essential for smart devices and real-time communication. Deployed in voice
processing front-ends and edge devices, these algorithms must ensure efficient
real-time inference with low computational demands. Traditional edge-based
noise suppression often uses MSE-based amplitude spectrum mask training, but
this approach has limitations. We introduce a novel projection loss function,
diverging from MSE, to enhance noise suppression. This method uses projection
techniques to isolate key audio components from noise, significantly improving
model performance. For echo cancellation, the function enables direct
predictions on LAEC pre-processed outputs, substantially enhancing performance.
Our noise suppression model achieves near state-of-the-art results with only
3.1M parameters and a computational load of 0.4 GFlops/s. Moreover, our echo
cancellation model outperforms replicated industry-leading models, introducing
a new perspective on speech enhancement.
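A projection-based loss of the kind described, isolating the component of the estimate aligned with the clean target and penalizing the orthogonal residual, can be sketched as follows. The abstract does not give CheapNET's exact formulation, so this decomposition and the residual-to-signal ratio are assumptions.

# Sketch of a projection-style loss: project the estimate onto the
# clean target, treat the orthogonal residual as the noise component,
# and penalize residual energy relative to the projected signal.
import torch

def projection_loss(estimate, target, eps=1e-8):
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target * target, dim=-1, keepdim=True) + eps
    s_proj = (dot / energy) * target         # component aligned with target
    residual = estimate - s_proj             # noise-like component
    return torch.mean(residual ** 2) / (torch.mean(s_proj ** 2) + eps)

est = torch.rand(2, 16000)                   # dummy enhanced waveforms
tgt = torch.rand(2, 16000)                   # dummy clean references
print(projection_loss(est, tgt))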