173,395 research outputs found
Using audio and visual information for single channel speaker separation
This work proposes a method to exploit both audio and vi- sual speech information to extract a target speaker from a mix- ture of competing speakers. The work begins by taking an ef- fective audio-only method of speaker separation, namely the soft mask method, and modifying its operation to allow visual speech information to improve the separation process. The au- dio input is taken from a single channel and includes the mix- ture of speakers, where as a separate set of visual features are extracted from each speaker. This allows modification of the separation process to include not only the audio speech but also visual speech from each speaker in the mixture. Experimen- tal results are presented that compare the proposed audio-visual speaker separation with audio-only and visual-only methods us- ing both speech quality and speech intelligibility metrics
FaceFilter: Audio-visual speech separation using still images
The objective of this paper is to separate a target speaker's speech from a
mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled
speaker information as an auxiliary conditional feature, we use a single face
image of the target speaker. In this task, the conditional feature is obtained
from facial appearance in cross-modal biometric task, where audio and visual
identity representations are shared in latent space. Learnt identities from
facial images enforce the network to isolate matched speakers and extract the
voices from mixed speech. It solves the permutation problem caused by swapped
channel outputs, frequently occurred in speech separation tasks. The proposed
method is far more practical than video-based speech separation since user
profile images are readily available on many platforms. Also, unlike
speaker-aware separation methods, it is applicable on separation with unseen
speakers who have never been enrolled before. We show strong qualitative and
quantitative results on challenging real-world examples.Comment: Under submission as a conference paper. Video examples:
https://youtu.be/ku9xoLh62
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper we propose the utterance-level Permutation Invariant Training
(uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning
based solution for speaker independent multi-talker speech separation.
Specifically, uPIT extends the recently proposed Permutation Invariant Training
(PIT) technique with an utterance-level cost function, hence eliminating the
need for solving an additional permutation problem during inference, which is
otherwise required by frame-level PIT. We achieve this using Recurrent Neural
Networks (RNNs) that, during training, minimize the utterance-level separation
error, hence forcing separated frames belonging to the same speaker to be
aligned to the same output stream. In practice, this allows RNNs, trained with
uPIT, to separate multi-talker mixed speech without any prior knowledge of
signal duration, number of speakers, speaker identity or gender. We evaluated
uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks
and found that uPIT outperforms techniques based on Non-negative Matrix
Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and
compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network
(DANet). Furthermore, we found that models trained with uPIT generalize well to
unseen speakers and languages. Finally, we found that a single model, trained
with uPIT, can handle both two-speaker, and three-speaker speech mixtures
Jointly Tracking and Separating Speech Sources Using Multiple Features and the generalized labeled multi-Bernoulli Framework
This paper proposes a novel joint multi-speaker tracking-and-separation
method based on the generalized labeled multi-Bernoulli (GLMB) multi-target
tracking filter, using sound mixtures recorded by microphones. Standard
multi-speaker tracking algorithms usually only track speaker locations, and
ambiguity occurs when speakers are spatially close. The proposed multi-feature
GLMB tracking filter treats the set of vectors of associated speaker features
(location, pitch and sound) as the multi-target multi-feature observation,
characterizes transitioning features with corresponding transition models and
overall likelihood function, thus jointly tracks and separates each
multi-feature speaker, and addresses the spatial ambiguity problem. Numerical
evaluation verifies that the proposed method can correctly track locations of
multiple speakers and meanwhile separate speech signals
- …