Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes
Speech enhancement promises to be more effective in ad-hoc microphone arrays
than in constrained microphone arrays thanks to the wide spatial coverage of
the devices in the acoustic scene. However, speech enhancement in ad-hoc
microphone arrays still raises many challenges. In particular, the algorithms
should be able to handle a variable number of microphones, as some devices in
the array might appear or disappear. In this paper, we propose a solution that
can efficiently process the spatial information captured by the different
devices of the microphone array, while being robust to link failures. To do
this, we use an attention mechanism to put more weight on the relevant
signals sent throughout the array and to neglect redundant or empty
channels.
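As a rough illustration, the channel-weighting step might look like the
sketch below; the module name, feature dimensions, and mean-pooled query are
assumptions made for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Hypothetical attention that weights a variable number of channels."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)
        self.scale = feat_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, feat_dim); n_channels may differ per call,
        # which is what makes attention suited to ad-hoc arrays.
        q = self.query(x.mean(dim=1, keepdim=True))    # (batch, 1, feat)
        k = self.key(x)                                # (batch, n_ch, feat)
        scores = (q @ k.transpose(1, 2)) * self.scale  # (batch, 1, n_ch)
        weights = scores.softmax(dim=-1)               # per-channel relevance
        return (weights @ x).squeeze(1)                # fused representation

fused = ChannelAttention(feat_dim=64)(torch.randn(2, 5, 64))  # 5 devices
```

Because the softmax is taken over however many channels are present, a device
dropping out of the array simply shrinks the attention axis rather than
breaking the model.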
Learning to Rank Microphones for Distant Speech Recognition
Fully exploiting ad-hoc microphone networks for distant speech recognition is
still an open issue. Empirical evidence shows that being able to select the
best microphone leads to significant improvements in recognition without any
additional effort on front-end processing. Current channel selection techniques
either rely on signal-, decoder-, or posterior-based features. Signal-based
features are inexpensive to compute but do not always correlate with
recognition performance. Decoder- and posterior-based features, in contrast,
exhibit better correlation but require substantial computational resources. In this
work, we tackle the channel selection problem by proposing MicRank, a
learning-to-rank framework in which a neural network is trained to rank the
available channels directly using the recognition performance on the training set. The
proposed approach is agnostic with respect to the array geometry and type of
recognition back-end. We investigate different learning-to-rank strategies
using a purpose-built synthetic dataset and the CHiME-6 data. Results show
that the proposed approach considerably improves over previous selection
techniques, reaching performance comparable to, and in some instances better
than, oracle signal-based measures.
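A minimal sketch of the listwise learning-to-rank idea follows; the scorer,
feature dimensions, and WER values are made up for illustration and need not
match the actual MicRank recipe:

```python
import torch
import torch.nn as nn

# Hypothetical scorer: one relevance score per channel feature vector.
scorer = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 1))

def listwise_loss(scores: torch.Tensor, wer: torch.Tensor) -> torch.Tensor:
    # Lower WER means a better channel, so the target ranking distribution
    # puts more mass on low-WER channels (a generic listwise objective).
    target = torch.softmax(-wer, dim=0)
    return -(target * torch.log_softmax(scores, dim=0)).sum()

features = torch.randn(6, 40)                     # 6 candidate channels
wer = torch.tensor([0.31, 0.18, 0.44, 0.22, 0.39, 0.25])
scores = scorer(features).squeeze(-1)             # (6,)
listwise_loss(scores, wer).backward()             # one step, optimizer omitted

with torch.no_grad():                             # at test time, simply pick
    best = scorer(features).squeeze(-1).argmax()  # the top-ranked channel
```

Note that recognition performance is only needed to label the training set;
at test time the ranker runs without any decoder in the loop, which is what
keeps selection cheap.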
Low bit rate binaural link for improved ultra low-latency low-complexity multichannel speech enhancement in Hearing Aids
Speech enhancement in hearing aids is a challenging task since the hardware
limits the number of possible operations and the latency needs to be in the
range of only a few milliseconds. We propose a deep-learning model compatible
with these limitations, which we refer to as Group-Communication Filter-and-Sum
Network (GCFSnet). GCFSnet is a causal multiple-input single-output enhancement
model using filter-and-sum processing in the time-frequency domain and a
multi-frame deep post filter. All filters are complex-valued and are estimated
by a deep-learning model using weight-sharing through Group Communication and
quantization-aware training for reducing model size and computational
footprint. To further increase performance, a low-bit-rate binaural link
transmitting delayed binaural features is proposed, exploiting binaural
information while retaining a latency of 2 ms. In terms of objective metrics,
even a unilateral configuration of GCFSnet matches the performance of an
oracle binaural LCMV beamformer in a non-low-latency configuration.
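The core filter-and-sum operation in the time-frequency domain can be
sketched as follows; the shapes are illustrative and the random filters stand
in for the estimates the deep model would produce:

```python
import torch

n_mics, n_freq, n_frames = 2, 257, 100
X = torch.randn(n_mics, n_freq, n_frames, dtype=torch.cfloat)  # mic STFTs

# In GCFSnet these complex filters come from the deep model; random
# placeholders stand in for them here.
W = torch.randn(n_mics, n_freq, n_frames, dtype=torch.cfloat)

# Filter-and-sum: apply a complex filter per channel, then sum over mics,
# yielding a single-channel enhanced STFT.
Y = (W.conj() * X).sum(dim=0)   # (n_freq, n_frames)
```

Everything latency-critical happens as elementwise complex multiplies and a
sum, which is what makes the scheme compatible with hearing-aid compute and
millisecond-scale delay budgets.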
DeFT-AN: Dense Frequency-Time Attentive Network for Multichannel Speech Enhancement
In this study, we propose a dense frequency-time attentive network (DeFT-AN)
for multichannel speech enhancement. DeFT-AN is a mask estimation network that
predicts a complex spectral masking pattern for suppressing the noise and
reverberation embedded in the short-time Fourier transform (STFT) of an input
signal. The proposed mask estimation network incorporates three different types
of blocks for aggregating information in the spatial, spectral, and temporal
dimensions. It utilizes a spectral transformer with a modified feed-forward
network and a temporal conformer with sequential dilated convolutions. The use
of dense blocks and transformers dedicated to the three different
characteristics of audio signals enables more comprehensive enhancement in
noisy and reverberant environments. The strong performance of DeFT-AN over
state-of-the-art multichannel models is demonstrated on two popular noisy and
reverberant datasets in terms of various metrics for speech quality and
intelligibility.
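The complex-masking output stage the abstract describes can be sketched as
follows; the mask here is random rather than predicted by the network, and
the STFT parameters are assumptions:

```python
import torch

n_fft, hop = 512, 128
X = torch.randn(n_fft // 2 + 1, 200, dtype=torch.cfloat)  # noisy STFT

# DeFT-AN predicts this complex mask from multichannel input features;
# a random mask stands in for the network output here.
M = torch.randn_like(X)

S_hat = M * X   # complex masking suppresses noise and reverberation
s_hat = torch.istft(S_hat, n_fft=n_fft, hop_length=hop,
                    window=torch.hann_window(n_fft))  # back to waveform
```

A complex mask rescales both magnitude and phase of each time-frequency bin,
which is why it can address reverberation as well as additive noise, unlike a
real-valued magnitude mask.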