Improving Source Separation via Multi-Speaker Representations
Lately there have been novel developments in deep learning towards solving
the cocktail party problem. Initial results are very promising and allow for
more research in the domain. One technique that has not yet been explored in
the neural network approach to this task is speaker adaptation. Intuitively,
information on the speakers that we are trying to separate seems fundamentally
important for the speaker separation task. However, retrieving this speaker
information is challenging since the speaker identities are not known a priori
and multiple speakers are simultaneously active. This creates a chicken-and-egg
problem. To tackle this, source signals and i-vectors are
estimated alternately. We show that blind multi-speaker adaptation improves the
results of the network and that (in our case) the network is not capable of
adequately retrieving this useful speaker information itself.
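The alternating estimation mentioned in this abstract can be sketched as a plain control loop. This is only an illustration of the idea, not the authors' implementation: the names `alternate_estimation`, `separate`, and `extract_speaker_vec` are hypothetical, and the two `toy_*` functions are meaningless stand-ins for a separation network and an i-vector extractor, included only so the loop runs.

```python
from typing import Callable, List, Sequence

Signal = List[float]
SpeakerVec = List[float]


def alternate_estimation(
    mixture: Signal,
    separate: Callable[[Signal, Sequence[SpeakerVec]], List[Signal]],
    extract_speaker_vec: Callable[[Signal], SpeakerVec],
    initial_vecs: Sequence[SpeakerVec],
    num_iterations: int = 3,
) -> List[Signal]:
    """Alternate between separating sources and re-estimating speaker vectors."""
    speaker_vecs = list(initial_vecs)
    sources: List[Signal] = []
    for _ in range(num_iterations):
        # Step A: separate the mixture, conditioned on the current speaker estimates.
        sources = separate(mixture, speaker_vecs)
        # Step B: re-estimate one speaker vector per separated source.
        speaker_vecs = [extract_speaker_vec(s) for s in sources]
    return sources


def toy_separate(mixture: Signal, speaker_vecs: Sequence[SpeakerVec]) -> List[Signal]:
    # Hypothetical stand-in: split the mixture in proportion to each vector's first entry.
    weights = [v[0] for v in speaker_vecs]
    total = sum(weights)
    return [[w / total * x for x in mixture] for w in weights]


def toy_extract(signal: Signal) -> SpeakerVec:
    # Hypothetical stand-in: a 1-D "speaker vector" holding the mean absolute amplitude.
    return [sum(abs(x) for x in signal) / len(signal) + 1e-9]


if __name__ == "__main__":
    estimated = alternate_estimation([0.5, -1.0, 2.0], toy_separate, toy_extract, [[1.0], [3.0]])
    print(len(estimated), "sources estimated")
```

The point of the sketch is the ordering: each pass uses the latest speaker estimates to separate, then uses the separated signals to refresh the speaker estimates.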
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
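As a small illustration of the log-mel representation named in this abstract, the sketch below converts between Hz and mel, places mel-spaced band edges, and log-compresses band energies. It assumes one common variant of the mel formula (2595 * log10(1 + f / 700)); the review does not prescribe a specific formula, and the function names and the `floor` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel using one common variant of the mel scale (assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands: int, f_min: float, f_max: float) -> List[float]:
    """Return num_bands + 2 edge frequencies (Hz), equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (num_bands + 1)
    return [mel_to_hz(m_min + i * step) for i in range(num_bands + 2)]


def log_compress(band_energies: Sequence[float], floor: float = 1e-10) -> List[float]:
    """Natural-log compression with a small floor to avoid log(0)."""
    return [math.log(max(e, floor)) for e in band_energies]


if __name__ == "__main__":
    edges = mel_band_edges(num_bands=40, f_min=0.0, f_max=8000.0)
    print(len(edges), "band edges between", edges[0], "and", edges[-1], "Hz")
```

A full log-mel front end would additionally compute a short-time power spectrum and apply triangular filters between these edges; those steps are omitted here.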
Supervised Speaker Embedding De-Mixing in Two-Speaker Environment
Separating different speaker properties from a multi-speaker environment is
challenging. Instead of separating a two-speaker signal in signal space like
speech source separation, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker
signal in embedding space. The proposed approach contains two steps. In step
one, the clean speaker embeddings are learned and collected by a residual
TDNN-based network. In step two, the two-speaker signal and the embedding of one of
the speakers are both input to a speaker embedding de-mixing network. The
de-mixing network is trained to generate the embedding of the other speaker with a
reconstruction loss. Speaker identification accuracy and the cosine similarity
score between the clean embeddings and the de-mixed embeddings are used to
evaluate the quality of the obtained embeddings. Experiments are conducted on two
kinds of data: artificially augmented two-speaker data (TIMIT) and real-world
recordings of two-speaker data (MC-WSJ). Six different speaker embedding
de-mixing architectures are investigated. Compared with the performance of the
clean speaker embeddings, one of the proposed architectures comes close,
reaching 96.9% identification accuracy and 0.89 cosine similarity.
Comment: Published at SLT202
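The two evaluation measures named in this abstract can be sketched as follows. Cosine similarity is the standard definition; the identification step, however, is an assumption: it picks the enrolled speaker whose clean embedding is closest in cosine similarity, which may differ from the classifier used in the paper. All function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def identify(embedding: Sequence[float], enrolled: Dict[str, Sequence[float]]) -> str:
    """Assumed scoring rule: return the enrolled speaker with the most similar embedding."""
    return max(enrolled, key=lambda spk: cosine_similarity(embedding, enrolled[spk]))


def identification_accuracy(
    demixed: Sequence[Sequence[float]],
    true_ids: Sequence[str],
    enrolled: Dict[str, Sequence[float]],
) -> float:
    """Fraction of de-mixed embeddings assigned to the correct speaker."""
    correct = sum(identify(e, enrolled) == t for e, t in zip(demixed, true_ids))
    return correct / len(true_ids)


if __name__ == "__main__":
    clean = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}  # toy clean embeddings
    demixed = [[0.9, 0.1], [0.2, 0.8]]                  # toy de-mixed embeddings
    print(identification_accuracy(demixed, ["spk_a", "spk_b"], clean))
```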
Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor
We propose a novel speech separation model designed to separate mixtures with
an unknown number of speakers. The proposed model stacks 1) a dual-path
processing block that can model spectro-temporal patterns, 2) a transformer
decoder-based attractor (TDA) calculation module that can deal with an unknown
number of speakers, and 3) triple-path processing blocks that can model
inter-speaker relations. Given a fixed, small set of learned speaker queries
and the mixture embedding produced by the dual-path blocks, TDA infers the
relations of these queries and generates an attractor vector for each speaker.
The estimated attractors are then combined with the mixture embedding by
feature-wise linear modulation conditioning, creating a speaker dimension. The
mixture embedding, conditioned with speaker information produced by TDA, is fed
to the final triple-path blocks, which augment the dual-path blocks with an
additional pathway dedicated to inter-speaker processing. The proposed approach
outperforms the previous best results reported in the literature, achieving 24.0 and
23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2mix and WSJ0-3mix, respectively, with a
single model trained to separate 2- and 3-speaker mixtures. The proposed model
also exhibits strong performance and generalizability at counting sources and
separating mixtures with up to 5 speakers.
Comment: 5 pages, 4 figures, accepted by ICASSP 202
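For readers unfamiliar with the SI-SDRi figure quoted in this abstract, here is a minimal sketch of the usual scale-invariant SDR definition and its improvement over the unprocessed mixture. It assumes zero-mean, equal-length signals given as plain lists; the function names are hypothetical and this is not the evaluation code used in the paper.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant SDR in dB; signals are assumed zero-mean and equally long."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to get the scaled target component.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    if noise_energy == 0.0:
        return math.inf
    if target_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(target_energy / noise_energy)


def si_sdr_improvement(
    estimate: Sequence[float], mixture: Sequence[float], reference: Sequence[float]
) -> float:
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    interferer = [1.0, 1.0, -1.0, -1.0]  # orthogonal to the reference
    mixture = [r + i for r, i in zip(reference, interferer)]
    estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]
    print(si_sdr_improvement(estimate, mixture, reference))
```

Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is the sense in which the measure is scale-invariant.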