Customizable End-to-end Optimization of Online Neural Network-supported Dereverberation for Hearing Devices
This work focuses on online dereverberation for hearing devices using the
weighted prediction error (WPE) algorithm. WPE filtering requires an estimate
of the target speech power spectral density (PSD). Recently, deep neural
networks (DNNs) have been used for this task. However, these approaches
optimize the PSD estimate, which only indirectly affects the WPE output, thus
potentially resulting in limited dereverberation. In this paper, we propose an
end-to-end approach specialized for online processing that directly optimizes
the dereverberated output signal. In addition, we propose to adapt it to the
needs of different types of hearing-device users by modifying the optimization
target as well as the WPE algorithm characteristics used in training. We show
that the proposed end-to-end approach outperforms the traditional and
conventional DNN-supported WPEs on a noise-free version of the WHAMR! dataset.
Comment: © 2022 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works.
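The abstract above notes that WPE filtering depends on an estimate of the target speech PSD. As a rough illustration of that dependency, the following is a minimal sketch of classical offline, single-channel WPE for one frequency bin, where the PSD is re-estimated from the current output at each iteration. It is not the online, DNN-supported, end-to-end variant the paper proposes; the function name `wpe_single_bin` and the parameters `taps`, `delay`, `iterations` and `eps` are assumptions chosen for this example.

```python
import numpy as np

def wpe_single_bin(y, taps=10, delay=3, iterations=3, eps=1e-8):
    """Offline WPE sketch for one STFT frequency bin (hypothetical helper).

    y: complex array of shape (T,), the reverberant STFT coefficients.
    Returns the dereverberated coefficients, same shape as y.
    """
    T = y.shape[0]
    # Row t holds the delayed past frames y[t - delay - k], k = 0..taps-1.
    Y = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        Y[shift:, k] = y[:T - shift]

    d = y.astype(complex)
    for _ in range(iterations):
        # PSD estimate of the target speech, taken from the current output.
        lam = np.maximum(np.abs(d) ** 2, eps)
        # PSD-weighted correlation matrix and correlation vector.
        R = Y.T @ (Y.conj() / lam[:, None]) + eps * np.eye(taps)
        p = Y.T @ (y.conj() / lam)
        g = np.linalg.solve(R, p)
        # Subtract the predicted late reverberation, g^H applied to past frames.
        d = y - Y @ g.conj()
    return d
```

In this sketch the quality of `lam` directly shapes the filter `g`, which is the indirect coupling the abstract describes: a DNN-supported WPE would replace the line computing `lam` with a network prediction.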
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping
speakers, noise and reverberation remains a highly challenging task to date.
Motivated by the invariance of the visual modality to acoustic signal corruption,
an audio-visual multi-channel speech separation, dereverberation and
recognition approach featuring a full incorporation of visual information into
all system components is proposed in this paper. The efficacy of the video
input is consistently demonstrated in mask-based MVDR speech separation,
DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and
Conformer ASR back-end. Audio-visual integrated front-end architectures
performing speech separation and dereverberation in a pipelined or joint
fashion via mask-based WPD are investigated. The error cost mismatch between
the speech enhancement front-end and ASR back-end components is minimized by
joint end-to-end fine-tuning using either the ASR cost function alone or its
interpolation with the speech enhancement loss. Experiments were conducted on
overlapped and reverberant speech mixtures constructed using simulation
or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel
speech separation, dereverberation and recognition systems consistently
outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute
(41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech
enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processin
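The abstract above reports each WER reduction both in absolute and in relative terms. Since a relative reduction is the absolute reduction divided by the baseline WER, the two figures together imply an approximate baseline. The sketch below works this out; the helper name `baseline_wer` is hypothetical, and the resulting values are estimates derived from the rounded percentages in the abstract, not numbers stated by the authors.

```python
def baseline_wer(absolute_reduction, relative_reduction):
    """Baseline WER (in %) implied by an absolute and a relative reduction.

    relative_reduction = absolute_reduction / baseline, so
    baseline = absolute_reduction / relative_reduction.
    """
    return absolute_reduction / relative_reduction

# (absolute reduction in % points, relative reduction as a fraction)
reported = [(9.1, 0.417), (6.2, 0.360)]

for abs_red, rel_red in reported:
    base = baseline_wer(abs_red, rel_red)
    improved = base - abs_red
    print(f"implied baseline WER ~{base:.1f}%, audio-visual WER ~{improved:.1f}%")
```

Under these assumptions the implied audio-only baselines are roughly 21.8% and 17.2% WER, which would put the audio-visual systems at roughly 12.7% and 11.0%.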