ReZero: Region-customizable Sound Extraction
We introduce region-customizable sound extraction (ReZero), a general and
flexible framework for the multi-channel region-wise sound extraction (R-SE)
task. The R-SE task aims to extract all active target sounds (e.g., human
speech) within a specific, user-defined spatial region, which differs from
conventional tasks, where blind separation or a fixed, predefined spatial
region is typically assumed. The spatial region can be defined as an angular
window, a sphere, a cone, or another geometric pattern. As a solution to the
R-SE task, the proposed ReZero framework includes (1) definitions of
different types of spatial regions, (2) methods for region feature extraction
and aggregation, and (3) a multi-channel extension of the band-split RNN
(BSRNN) model tailored to the R-SE task. We design experiments for different
microphone array geometries, different types of spatial regions, and
comprehensive ablation studies on different system configurations. Experimental
results on both simulated and real-recorded data demonstrate the effectiveness
of ReZero. Demos are available at https://innerselfm.github.io/rezero/.
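As an illustration of the region feature extraction and aggregation step, below is a minimal PyTorch sketch of one possible way to encode an angular-window region: sample directions inside the window, embed each sampled direction, and mean-pool the embeddings into a region feature. The class name, layer sizes, and the (cos, sin) angle parameterization are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class AngularWindowRegionFeature(nn.Module):
    """Hypothetical sketch: represent an angular-window region by sampling
    directions inside the window, embedding each sampled direction, and
    mean-pooling the embeddings into a single region feature."""

    def __init__(self, embed_dim: int = 64, num_samples: int = 8):
        super().__init__()
        self.num_samples = num_samples
        # maps the (cos, sin) of a sampled angle to an embedding
        self.angle_embed = nn.Sequential(
            nn.Linear(2, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, center: torch.Tensor, width: torch.Tensor) -> torch.Tensor:
        # center, width: (batch,) window center and width in radians
        offsets = torch.linspace(-0.5, 0.5, self.num_samples, device=center.device)
        angles = center.unsqueeze(-1) + offsets * width.unsqueeze(-1)  # (batch, S)
        feats = self.angle_embed(torch.stack([angles.cos(), angles.sin()], dim=-1))
        return feats.mean(dim=1)  # aggregate over sampled directions -> (batch, D)


# usage: a roughly 60-degree window centered at about 30 degrees
region = AngularWindowRegionFeature()
emb = region(torch.tensor([0.52]), torch.tensor([1.05]))  # shape (1, 64)
```

Such a region embedding could then condition a multi-channel separation backbone (the BSRNN extension mentioned above); how that conditioning is done is not specified in this abstract.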
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation
Recently, frequency domain all-neural beamforming methods have achieved
remarkable progress for multichannel speech separation. In parallel, the
integration of time domain network structure and beamforming has also gained
significant attention. This study proposes a novel all-neural beamforming
method in the time domain and attempts to unify the all-neural beamforming
pipelines for time domain and frequency domain multichannel speech separation.
The proposed model consists of two modules: separation and beamforming. Both
modules perform temporal-spectral-spatial modeling and are trained end-to-end
with a joint loss function. The novelty of this study is two-fold. First, a
time domain directional feature conditioned on the direction
of the target speaker is proposed, which can be jointly optimized within the
time domain architecture to enhance target signal estimation. Second, an
all-neural beamforming network in the time domain is designed to refine the
pre-separated results. This module features parametric time-variant
beamforming coefficient estimation, without explicitly following the derivation
of optimal filters, which may impose a performance upper bound. The proposed method is
evaluated on simulated reverberant overlapped speech data derived from the
AISHELL-1 corpus. Experimental results demonstrate significant performance
improvements over frequency domain state-of-the-art systems, ideal magnitude
masks, and existing time domain neural beamforming methods.
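To make the parametric time-variant beamforming idea concrete, here is a minimal PyTorch sketch that estimates per-frame, per-channel filter-and-sum weights from framed multichannel time-domain input and applies them. The single-GRU coefficient estimator, the tensor shapes, and all names are hypothetical stand-ins; the abstract does not specify the actual network.

```python
import torch
import torch.nn as nn


def apply_time_variant_beamformer(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Filter-and-sum in the time domain with per-frame weights.

    x: multichannel frames, shape (batch, channels, frames, frame_len)
    w: estimated beamforming coefficients, same shape as x
    Returns beamformed frames of shape (batch, frames, frame_len).
    """
    return (w * x).sum(dim=1)  # weight each channel's frame, sum over channels


class CoefficientEstimator(nn.Module):
    """Hypothetical stand-in for the network that predicts time-variant
    beamforming coefficients from the (pre-separated) multichannel frames."""

    def __init__(self, frame_len: int, channels: int, hidden: int = 128):
        super().__init__()
        self.channels, self.frame_len = channels, frame_len
        self.rnn = nn.GRU(channels * frame_len, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, channels * frame_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        h, _ = self.rnn(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        return self.proj(h).reshape(b, t, c, f).permute(0, 2, 1, 3)


# usage with random frames: 4 mics, 100 frames of 256 samples
x = torch.randn(1, 4, 100, 256)
w = CoefficientEstimator(frame_len=256, channels=4)(x)
y = apply_time_variant_beamformer(x, w)  # shape (1, 100, 256)
```

Because the weights are predicted directly per frame rather than derived from an optimal-filter formula (e.g., MVDR), the estimator is not constrained by that derivation, which is the point the abstract makes about avoiding an upper bound.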
Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression
Echo cancellation and noise reduction are essential for full-duplex
communication, yet most existing neural networks have high computational costs
and are inflexible in tuning model complexity. In this paper, we introduce
time-frequency dual-path compression to achieve a wide range of compression
ratios of the computational cost. Specifically, for frequency compression,
trainable filters are used to replace manually designed filters for dimension
reduction. For time compression, using frame-skipping prediction alone causes a
large performance degradation, which can be alleviated by a post-processing
network with full sequence modeling. We find that, under fixed compression
ratios, dual-path compression combining the time and frequency methods yields
further performance improvement, covering compression ratios from 4x to 32x
with little change in model size. Moreover, the proposed models show
competitive performance compared with fast FullSubNet and DeepFilterNet. A demo
page can be found at
hangtingchen.github.io/ultra_dual_path_compression.github.io/.
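As an illustration of the time-frequency dual-path compression idea, here is a minimal PyTorch sketch, assuming a spectrogram-like input: a trainable linear filterbank compresses the frequency axis, and strided frame skipping compresses the time axis before the (omitted) enhancement network. All names, sizes, and the simple repeat-based time expansion are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn


class DualPathCompression(nn.Module):
    """Hypothetical sketch of time-frequency dual-path compression:
    a trainable filterbank reduces the frequency dimension and frame
    skipping (strided subsampling) reduces the time dimension."""

    def __init__(self, n_freq: int = 257, n_bands: int = 128, time_stride: int = 2):
        super().__init__()
        self.time_stride = time_stride
        # trainable filters replacing manually designed (e.g., mel-like) filters
        self.freq_compress = nn.Linear(n_freq, n_bands, bias=False)
        self.freq_expand = nn.Linear(n_bands, n_freq, bias=False)

    def compress(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, n_freq) spectrogram-like features
        banded = self.freq_compress(spec)        # frequency compression
        return banded[:, :: self.time_stride]    # time compression via frame skipping

    def expand(self, banded: torch.Tensor, n_frames: int) -> torch.Tensor:
        # crude inverse: repeat skipped frames, then map bands back to bins;
        # the paper instead alleviates frame skipping with a post-processing
        # network that models the full sequence
        full_time = banded.repeat_interleave(self.time_stride, dim=1)[:, :n_frames]
        return self.freq_expand(full_time)


# usage: roughly 4x compression (about 2x in frequency, 2x in time) on a toy input
dpc = DualPathCompression(n_freq=257, n_bands=128, time_stride=2)
spec = torch.randn(1, 200, 257)
out = dpc.expand(dpc.compress(spec), n_frames=200)  # shape (1, 200, 257)
```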