Deep Variational Generative Models for Audio-visual Speech Separation
In this paper, we are interested in audio-visual speech separation given a
single-channel audio recording as well as visual information (lip movements)
associated with each speaker. We propose an unsupervised technique based on
audio-visual generative modeling of clean speech. More specifically, during
training, a latent variable generative model is learned from clean speech
spectrograms using a variational auto-encoder (VAE). To better utilize the
visual information, the posteriors of the latent variables are inferred from
mixed speech (instead of clean speech) as well as the visual data. The visual
modality also serves as a prior for latent variables, through a visual network.
At test time, the learned generative model (both for speaker-independent and
speaker-dependent scenarios) is combined with an unsupervised non-negative
matrix factorization (NMF) variance model for background noise. All the latent
variables and noise parameters are then estimated by a Monte Carlo
expectation-maximization algorithm. Our experiments show that the proposed
unsupervised VAE-based method yields better separation performance than
NMF-based approaches as well as a supervised deep learning-based technique.
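As a rough illustration of the NMF variance model mentioned in the abstract above, the sketch below factorizes a non-negative power spectrogram `V` into `W @ H` with multiplicative updates for the Itakura-Saito divergence. It is a standalone example, not the paper's Monte Carlo expectation-maximization procedure; the function names, the rank, the iteration count, and the `EPS` guard are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-10  # assumed numerical guard against division by zero


def is_divergence(V, V_hat):
    """Itakura-Saito divergence between a spectrogram and its model."""
    R = (V + EPS) / (V_hat + EPS)
    return float(np.sum(R - np.log(R) - 1.0))


def fit_nmf_variance(V, rank, n_iter=100, seed=0):
    """Fit V ~= W @ H with non-negative factors (hypothetical helper).

    V has shape (n_freq, n_frames); W @ H plays the role of a
    time-frequency variance map for the noise.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + EPS
    H = rng.random((rank, n_frames)) + EPS
    for _ in range(n_iter):
        # Multiplicative updates with exponent 1/2 (majorization-minimization form).
        V_hat = W @ H + EPS
        H *= ((W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat))) ** 0.5
        V_hat = W @ H + EPS
        W *= (((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T)) ** 0.5
    return W, H
```

Because the updates are multiplicative, factors that start non-negative stay non-negative, which is why no explicit projection step is needed.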
Multi-channel Conversational Speaker Separation via Neural Diarization
When dealing with overlapped speech, the performance of automatic speech
recognition (ASR) systems substantially degrades as they are designed for
single-talker speech. To enhance ASR performance in conversational or meeting
environments, continuous speaker separation (CSS) is commonly employed.
However, CSS requires a short separation window, to avoid having many speakers
inside the window, and sequential grouping of discontinuous speech segments. To address
these limitations, we introduce a new multi-channel framework called "speaker
separation via neural diarization" (SSND) for meeting environments. Our
approach utilizes an end-to-end diarization system to identify the speech
activity of each individual speaker. By leveraging estimated speaker
boundaries, we generate a sequence of embeddings, which in turn facilitate the
assignment of speakers to the outputs of a multi-talker separation model. SSND
addresses the permutation ambiguity issue of talker-independent speaker
separation during the diarization phase through location-based training, rather
than during the separation process. This unique approach allows multiple
non-overlapped speakers to be assigned to the same output stream, making it
possible to efficiently process long segments, a task impossible with CSS.
Additionally, SSND is naturally suitable for speaker-attributed ASR. We
evaluate our proposed diarization and separation methods on the open LibriCSS
dataset, advancing state-of-the-art diarization and ASR results by a large
margin.
Comment: 10 pages, 4 figures
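The idea that several non-overlapped speakers can share one output stream can be sketched as a small greedy assignment over diarization segments. This is only an illustration of that idea, not the SSND method itself, which relies on embeddings and location-based training. The function name `assign_streams`, the `(speaker, start, end)` segment format, and the default of two streams are assumptions.

```python
def assign_streams(segments, n_streams=2):
    """Map each speaker to an output stream (hypothetical helper).

    segments: list of (speaker, start, end) tuples from a diarizer.
    Speakers whose speech overlaps in time get different streams;
    speakers that never overlap may reuse the same stream.
    """
    by_speaker = {}
    for spk, start, end in segments:
        by_speaker.setdefault(spk, []).append((start, end))

    def overlaps(a, b):
        # True if any interval in a intersects any interval in b.
        return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)

    assignment = {}
    # Visit speakers in order of their earliest speech onset.
    for spk in sorted(by_speaker, key=lambda s: min(by_speaker[s])):
        for stream in range(n_streams):
            clash = any(
                assignment[other] == stream
                and overlaps(by_speaker[spk], by_speaker[other])
                for other in assignment
            )
            if not clash:
                assignment[spk] = stream
                break
        else:
            raise ValueError(f"no free stream for speaker {spk!r}")
    return assignment
```

With segments `[("A", 0, 5), ("B", 4, 8), ("C", 9, 12)]`, speakers A and B overlap and land on different streams, while C starts after both have finished and reuses A's stream.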