BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides typical character error rate
(CER), we introduce utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to the
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT model initialization further reduces
CER/UD-CER by 8.4%/19.9%.
Comment: Accepted by INTERSPEECH 2023
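For intuition, the following is a minimal sketch of how a plain SOT training target can be assembled: transcriptions are concatenated in first-in-first-out order of their start times and joined by a speaker-change token. The token name and data layout are illustrative assumptions, not taken from the paper, and BA-SOT's boundary constraint loss and two-stage CTC are not shown.

```python
# Minimal sketch of serialized output training (SOT) target construction.
# The "<sc>" token and the (start_time, transcript) layout are illustrative.

def build_sot_target(utterances):
    """utterances: list of (start_time, transcript) pairs, one per speaker turn."""
    ordered = sorted(utterances, key=lambda u: u[0])  # first-in-first-out order
    return " <sc> ".join(text for _, text in ordered)

mix = [(0.0, "hello how are you"), (1.3, "fine thanks"), (2.1, "good to hear")]
print(build_sot_target(mix))
# hello how are you <sc> fine thanks <sc> good to hear
```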
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
In reverberant conditions with multiple concurrent speakers, each microphone
acquires a mixture signal of multiple speakers at a different location. In
over-determined conditions where the microphones outnumber the speakers, we can
narrow down the solutions to speaker images and realize unsupervised speech
separation by leveraging each mixture signal as a constraint (i.e., the
estimated speaker images at a microphone should add up to the mixture).
Equipped with this insight, we propose UNSSOR, an algorithm for
unsupervised neural speech separation by leveraging over-determined training
mixtures. At each training step, we feed an input mixture to a deep
neural network (DNN) to produce an intermediate estimate for each speaker,
linearly filter the estimates, and optimize a loss so that, at each microphone,
the filtered estimates of all the speakers can add up to the mixture to satisfy
the above constraint. We show that this loss can promote unsupervised
separation of speakers. The linear filters are computed in each sub-band based
on the mixture and DNN estimates through the forward convolutive prediction
(FCP) algorithm. To address the frequency permutation problem incurred by using
sub-band FCP, a loss term based on minimizing intra-source magnitude scattering
is proposed. Although UNSSOR requires over-determined training mixtures, we can
train DNNs to achieve under-determined separation (e.g., unsupervised monaural
speech separation). Evaluation results on two-speaker separation in reverberant
conditions show the effectiveness and potential of UNSSOR.
Comment: In submission
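To make the mixture constraint concrete, here is a hedged numpy sketch of the core loss idea. Real UNSSOR uses multi-tap FCP filters estimated per sub-band; the single complex gain per (speaker, frequency) used below is a simplification, and all shapes and names are assumptions.

```python
import numpy as np

# Sketch of the "filtered estimates must sum to the mixture" loss in the
# STFT domain. mixtures: (M, F, T) multichannel observations;
# estimates: (S, F, T) DNN outputs, one per speaker.
def mixture_constraint_loss(mixtures, estimates, eps=1e-8):
    M = mixtures.shape[0]
    loss = 0.0
    for m in range(M):
        recon = np.zeros_like(mixtures[m])
        for est in estimates:
            # Closed-form least-squares gain projecting this estimate onto
            # microphone m, computed independently in each frequency bin
            # (a one-tap stand-in for the multi-tap FCP filter).
            num = np.sum(np.conj(est) * mixtures[m], axis=-1)
            den = np.sum(np.abs(est) ** 2, axis=-1) + eps
            recon += (num / den)[:, None] * est
        loss += np.mean(np.abs(mixtures[m] - recon) ** 2)
    return loss / M
```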
Complete and separate: Conditional separation with missing target source attribute completion
Recent approaches to source separation leverage semantic information about
their input mixtures and constituent sources, which, when used in conditional
separation models, can achieve impressive performance. Most approaches along
these lines have focused on simple descriptions, which are not always useful
for varying types of input mixtures. In this work, we present an approach in
which a model, given an input mixture and partial semantic information about a
target source, is trained to extract additional semantic data. We then leverage
this pre-trained model to improve the separation performance of an uncoupled
multi-conditional separation network. Our experiments demonstrate that the
separation performance of this multi-conditional model is significantly
improved, approaching the performance of an oracle model with complete semantic
information. Furthermore, our approach achieves performance levels that are
comparable to those of the best performing specialized single conditional
models, thus providing an easier-to-use alternative.
Comment: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023
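A hedged sketch of the two-stage inference described above, with placeholder modules (the names and interfaces are assumptions, not the paper's API): a pre-trained completion model infers the missing attributes, and the completed condition drives the multi-conditional separator.

```python
import torch

def complete_and_separate(completion_model, separator, mixture, partial_cond):
    # Stage 1: infer the target source's missing semantic attributes from
    # the mixture and the partial condition (completion model is frozen).
    with torch.no_grad():
        missing = completion_model(mixture, partial_cond)
    # Stage 2: condition the separator on the completed semantic vector.
    full_cond = torch.cat([partial_cond, missing], dim=-1)
    return separator(mixture, full_cond)
```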
A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
We introduce a monaural neural speaker embeddings extractor that computes an
embedding for each speaker present in a speech mixture. To allow for supervised
training, a teacher-student approach is employed: the teacher computes the
target embeddings from each speaker's utterance before the utterances are added
to form the mixture, and the student embedding extractor is then tasked to
reproduce those embeddings from the speech mixture at its input. The system
much more reliably verifies the presence or absence of a given speaker in a
mixture than a conventional speaker embedding extractor, and even exhibits
comparable performance to a multi-channel approach that exploits spatial
information for embedding extraction. Further, it is shown that a speaker
embedding computed from a mixture can be used to check for the presence of that
speaker in another mixture.
Comment: Accepted for Interspeech 2023
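A minimal sketch of the teacher-student objective, under the assumption of placeholder modules: the frozen teacher embeds each speaker's clean utterance, and the student is trained to reproduce those embeddings from the mixture. In practice the student's outputs may need permutation-invariant matching to the targets; a fixed order is assumed here for brevity.

```python
import torch
import torch.nn.functional as F

def teacher_student_loss(teacher, student, clean_utts, mixture):
    with torch.no_grad():                      # teacher is frozen
        targets = torch.stack([teacher(u) for u in clean_utts])  # (S, D)
    preds = student(mixture)                   # (S, D), one per speaker
    # One minus cosine similarity, averaged over the S speakers.
    return (1.0 - F.cosine_similarity(preds, targets, dim=-1)).mean()
```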
ReZero: Region-customizable Sound Extraction
We introduce region-customizable sound extraction (ReZero), a general and
flexible framework for the multi-channel region-wise sound extraction (R-SE)
task. The R-SE task aims to extract all active target sounds (e.g., human
speech) within a specific, user-defined spatial region, which differs from
conventional tasks, where either blind separation or a fixed, predefined
spatial region is typically assumed. The spatial region can be defined as an
angular window, a sphere, a cone, or other geometric patterns. Being a solution
to the R-SE task, the proposed ReZero framework includes (1) definitions of
different types of spatial regions, (2) methods for region feature extraction
and aggregation, and (3) a multi-channel extension of the band-split RNN
(BSRNN) model specified for the R-SE task. We design experiments for different
microphone array geometries, different types of spatial regions, and
comprehensive ablation studies on different system configurations. Experimental
results on both simulated and real-recorded data demonstrate the effectiveness
of ReZero. Demos are available at https://innerselfm.github.io/rezero/.
Comment: 13 pages, 11 figures
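As a toy illustration of one region type, the sketch below tests whether a source azimuth falls inside a user-defined angular window; ReZero's actual region parameterizations and feature aggregation are considerably richer, and the function here is only meant to show how a spatial region can gate which sources count as targets.

```python
def in_angular_window(source_az_deg, center_deg, width_deg):
    # Wrap the difference into [-180, 180) so the window works across +/-180.
    diff = (source_az_deg - center_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= width_deg / 2.0

print(in_angular_window(30.0, 20.0, 40.0))     # True: inside a 40-degree window
print(in_angular_window(-170.0, 175.0, 30.0))  # True: window wraps around 180
```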
Gray Jedi MVDR Post-filtering
Spatial filters can exploit deep-learning-based speech enhancement models to
increase their reliability in scenarios with multiple speech sources.
To further improve speech quality, it is common to perform postfiltering on the
estimated target speech obtained with spatial filtering. In this work, Minimum
Variance Distortionless Response (MVDR) is employed to provide the interference
estimation, along with the estimation of the target speech, to be later used
for postfiltering. This improves the enhancement performance over a
single-input baseline in a far more significant way than by increasing the
model's complexity. Results suggest that less computing resources are required
for postfiltering when provided with both target and interference signals,
which is a step forward in developing an online speech enhancement system for
multi-speech scenarios.
Comment: © 2023 IEEE.
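For reference, a hedged numpy sketch of the MVDR front-end this builds on: one beamformer output serves as the target estimate, and subtracting it from a reference microphone yields an interference estimate for the postfilter. Covariance estimation and the steering-vector choice are simplified, and the residual-based interference estimate is one plausible reading, not necessarily the paper's exact method.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (M, M) noise covariance; steering: (M,) steering vector."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (np.conj(steering) @ r_inv_d)   # distortionless constraint

def target_and_interference(mix_stft, noise_cov, steering, ref_mic=0):
    """mix_stft: (M, T) observations in one frequency bin across T frames."""
    w = mvdr_weights(noise_cov, steering)
    target = np.conj(w) @ mix_stft             # MVDR target estimate
    interference = mix_stft[ref_mic] - target  # residual at the reference mic
    return target, interference
```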
Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023)
This volume gathers the papers presented at the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), held in Tampere, Finland, on 21–22 September 2023.
RGI-Net: 3D Room Geometry Inference from Room Impulse Responses in the Absence of First-order Echoes
Room geometry is important prior information for implementing realistic 3D
audio rendering. For this reason, various room geometry inference (RGI) methods
have been developed by utilizing the time of arrival (TOA) or time difference
of arrival (TDOA) information in room impulse responses. However,
conventional RGI techniques rely on several assumptions, such as convex room
shapes, a number of walls known a priori, and the visibility of first-order
reflections. In this work, we introduce a deep neural network (DNN), RGI-Net,
which can estimate room geometries without the aforementioned assumptions.
RGI-Net learns and exploits complex relationships between high-order
reflections in room impulse responses (RIRs) and, thus, can estimate room
shapes even when the shape is non-convex or first-order reflections are missing
in the RIRs. The network takes RIRs measured from a compact audio device
equipped with a circular microphone array and a single loudspeaker, which
greatly improves its practical applicability. RGI-Net includes the evaluation
network that separately evaluates the presence probability of walls, so the
geometry inference is possible without prior knowledge of the number of walls.
Comment: 5 pages, 3 figures, 3 tables
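The wall-presence idea can be illustrated with a small post-processing sketch, where all names and shapes are assumptions: suppose the evaluation network emits, for each of K candidate walls, plane parameters and a presence probability; thresholding the probabilities then yields a geometry without knowing the wall count a priori.

```python
import numpy as np

def select_walls(plane_params, presence_prob, threshold=0.5):
    """plane_params: (K, 4) coefficients (a, b, c, d) of walls ax+by+cz=d;
    presence_prob: (K,) wall-presence probabilities from the network."""
    return plane_params[presence_prob >= threshold]

params = np.array([[1, 0, 0, 3.2], [0, 1, 0, 4.1], [0, 0, 1, 2.5], [1, 1, 0, 9.9]])
probs = np.array([0.97, 0.92, 0.88, 0.11])
print(select_walls(params, probs))  # the low-probability candidate is dropped
```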
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation
Recently, stunning improvements on multi-channel speech separation have been
achieved by neural beamformers when direction information is available.
However, most of them neglect the speakers' 2-dimensional (2D) location
cues contained in the mixture signal, which limits performance when two sources
come from close directions. In this paper, we propose an end-to-end beamforming
network for 2D location guided speech separation merely given mixture signal.
It first estimates discriminable direction and 2D location cues, which
indicate the directions the sources come from as seen from multiple microphone
views, together with their 2D coordinates. These cues are then integrated into
a location-aware neural beamformer, allowing accurate reconstruction of both
sources' speech signals. Experiments show that our proposed model not only
achieves a consistent improvement over baseline systems, but also avoids
degraded performance in spatially overlapped cases.
Comment: Accepted by Interspeech 2023. arXiv admin note: substantial text overlap with arXiv:2212.0340
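For intuition about why 2D location cues help when directions collide, the closed-form sketch below triangulates a source's 2D coordinates from azimuth estimates at two arrays with known positions; the paper learns such cues end-to-end, so this geometry is only illustrative.

```python
import numpy as np

def triangulate_2d(pos_a, az_a, pos_b, az_b):
    """pos_*: (2,) array positions; az_*: azimuths in radians."""
    d_a = np.array([np.cos(az_a), np.sin(az_a)])
    d_b = np.array([np.cos(az_b), np.sin(az_b)])
    # Intersect the two bearing lines: pos_a + t*d_a == pos_b + s*d_b.
    A = np.stack([d_a, -d_b], axis=1)
    t, _ = np.linalg.solve(A, np.asarray(pos_b, float) - np.asarray(pos_a, float))
    return np.asarray(pos_a, float) + t * d_a

print(triangulate_2d([0, 0], np.pi / 4, [2, 0], 3 * np.pi / 4))  # ~[1. 1.]
```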