Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains highly challenging. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach that fully incorporates visual information into all system components. The efficacy of the video
input is consistently demonstrated in mask-based MVDR speech separation,
the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and the Conformer ASR back-end. Audio-visual integrated front-end architectures
performing speech separation and dereverberation in a pipelined or joint
fashion via mask-based WPD are investigated. The error cost mismatch between
the speech enhancement front-end and ASR back-end components is minimized by end-to-end joint fine-tuning using either the ASR cost function alone or its interpolation with the speech enhancement loss. Experiments were conducted on overlapped and reverberant speech mixtures constructed by simulating or replaying the Oxford LRS2 dataset. The proposed audio-visual multi-channel
speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline, achieving 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
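For concreteness, the mask-based MVDR beamforming step can be sketched as below. This is a minimal illustration rather than the paper's implementation: the (audio-visual) mask estimator is abstracted away, and the function name, shapes and regularization are assumptions.

import numpy as np

def mvdr_from_mask(Y, speech_mask, eps=1e-8):
    # Y           : (C, T, F) complex STFT of the C-channel mixture
    # speech_mask : (T, F) DNN-estimated speech mask in [0, 1]
    # Returns     : (T, F) enhanced single-channel STFT
    C, T, F = Y.shape
    noise_mask = 1.0 - speech_mask
    X_hat = np.empty((T, F), dtype=complex)
    for f in range(F):
        Yf = Y[:, :, f]  # (C, T) observations in this frequency bin
        # Mask-weighted spatial covariance matrices of speech and noise
        Phi_s = (speech_mask[:, f] * Yf) @ Yf.conj().T / (speech_mask[:, f].sum() + eps)
        Phi_n = (noise_mask[:, f] * Yf) @ Yf.conj().T / (noise_mask[:, f].sum() + eps)
        # Steering vector: principal eigenvector of the speech covariance
        d = np.linalg.eigh(Phi_s)[1][:, -1]
        # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
        num = np.linalg.solve(Phi_n + eps * np.eye(C), d)
        w = num / (d.conj() @ num + eps)
        X_hat[:, f] = w.conj() @ Yf  # filter-and-sum in this bin
    return X_hat

The interpolated fine-tuning objective mentioned above can likewise be read as L_total = L_ASR + lambda * L_enh for some tunable weight lambda; the abstract does not specify the exact weighting.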
Time-domain Ad-hoc Array Speech Enhancement Using a Triple-path Network
Deep neural networks (DNNs) are very effective for multichannel speech
enhancement with fixed array geometries. However, it is not trivial to use DNNs
for ad-hoc arrays with unknown order and placement of microphones. We propose a
novel triple-path network for ad-hoc array processing in the time domain. The
key idea in the network design is to divide the overall processing into spatial
processing and temporal processing and use self-attention for spatial
processing. Using self-attention for spatial processing makes the network
invariant to the order and the number of microphones. The temporal processing
is done independently for all channels using a recently proposed dual-path
attentive recurrent network. The proposed network is a multiple-input
multiple-output architecture that can simultaneously enhance signals at all
microphones. Experimental results demonstrate the excellent performance of the
proposed approach. Further, we present an analysis demonstrating the effectiveness of the proposed network in utilizing multichannel information even from distant microphones.
Comment: Accepted for publication in INTERSPEECH 2022
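The claimed invariance is straightforward to make concrete: self-attention treats its inputs as a set, so a spatial-processing block that attends across the microphone axis is equivariant to channel order and accepts any channel count. Below is a minimal PyTorch sketch of such a block, assuming framed per-channel features of a fixed embedding size; it is not the authors' code.

import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    # Self-attention over the microphone axis: because attention is a
    # set operation, the block works for any number and ordering of mics.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, channels, frames, dim); attend across the channel axis
        B, C, T, D = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B * T, C, D)    # one set of mics per frame
        out, _ = self.attn(x, x, x)                       # mix information across mics
        x = self.norm(x + out)                            # residual connection + norm
        return x.reshape(B, T, C, D).permute(0, 2, 1, 3)  # back to (B, C, T, D)

# The same block runs unchanged for 2, 4 or 8 microphones in any order:
blk = SpatialSelfAttention(dim=64)
for C in (2, 4, 8):
    print(blk(torch.randn(1, C, 100, 64)).shape)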
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation
Recently, frequency domain all-neural beamforming methods have achieved
remarkable progress for multichannel speech separation. In parallel, the integration of time-domain network structures and beamforming has also gained significant attention. This study proposes a novel all-neural beamforming method in the time domain and attempts to unify the all-neural beamforming pipelines for time-domain and frequency-domain multichannel speech separation.
The proposed model consists of two modules: separation and beamforming. Both
modules perform temporal-spectral-spatial modeling and are trained end-to-end using a joint loss function. The novelty of this study is two-fold. Firstly, a time-domain directional feature conditioned on the direction
of the target speaker is proposed, which can be jointly optimized within the
time-domain architecture to enhance target signal estimation. Secondly, an all-neural beamforming network in the time domain is designed to refine the pre-separated results. This module features parametric time-variant beamforming coefficient estimation, without explicitly following the derivation of optimal filters, which may impose a performance upper bound. The proposed method is
evaluated on simulated reverberant overlapped speech data derived from the
AISHELL-1 corpus. Experimental results demonstrate significant performance
improvements over frequency-domain state-of-the-art methods, ideal magnitude masks, and existing time-domain neural beamforming methods.
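A minimal sketch of the second contribution, parametric time-variant filter estimation in the time domain, is given below: a small network predicts per-frame, per-channel FIR taps directly, instead of deriving them from a closed-form optimal-filter solution, and applies them by filter-and-sum. All sizes and the coefficient network here are illustrative assumptions; in the paper, the estimator would additionally be conditioned on the pre-separated signals and directional features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeVariantFilterAndSum(nn.Module):
    # Predict K FIR taps per channel for every frame, hold them constant
    # within the frame, and apply them to the waveform by filter-and-sum.
    def __init__(self, n_mics=4, taps=16, frame=64, hidden=128):
        super().__init__()
        self.taps, self.frame = taps, frame
        self.coef_net = nn.Sequential(
            nn.Linear(n_mics * frame, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mics * taps),  # one tap set per frame
        )

    def forward(self, x):
        # x: (B, C, S) multichannel waveform
        B, C, S = x.shape
        nF = S // self.frame
        x = x[..., : nF * self.frame]  # drop any ragged tail
        # Per-frame features -> per-frame, per-channel FIR taps
        feats = x.reshape(B, C, nF, self.frame).permute(0, 2, 1, 3).reshape(B, nF, -1)
        w = self.coef_net(feats).reshape(B, nF, C, self.taps)
        w = w.repeat_interleave(self.frame, dim=1)  # (B, S, C, K): taps per sample
        # Causal sliding windows of the last K samples at every position
        xp = F.pad(x, (self.taps - 1, 0))
        windows = xp.unfold(2, self.taps, 1)        # (B, C, S, K)
        # Filter-and-sum over taps and channels -> (B, S) beamformed output
        return (w.permute(0, 2, 1, 3) * windows).sum((1, -1))

bf = TimeVariantFilterAndSum()
print(bf(torch.randn(2, 4, 3200)).shape)  # torch.Size([2, 3200])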