Does Phase Matter For Monaural Source Separation?
The "cocktail party" problem of fully separating multiple sources from a
single channel audio waveform remains unsolved. Current biological
understanding of neural encoding suggests that phase information is preserved
and utilized at every stage of the auditory pathway. However, current
computational approaches primarily discard phase information in order to mask
amplitude spectrograms of sound. In this paper, we seek to address whether
preserving phase information in spectral representations of sound provides
better results in monaural separation of vocals from a musical track by using a
neurally plausible sparse generative model. Our results demonstrate that
preserving phase information reduces artifacts in the separated tracks, as
quantified by the signal to artifact ratio (GSAR). Furthermore, our proposed
method achieves state-of-the-art performance for source separation, as
quantified by a mean signal to interference ratio (GSIR) of 19.46. Comment: 4 pages, 2 figures, NIPS format
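As a rough illustration of the distinction this abstract draws, the sketch below (plain numpy/scipy, not the paper's sparse generative model) contrasts a magnitude-domain mask, which discards phase and reuses the mixture phase at reconstruction, with a complex-domain operation that keeps phase. All signal names and STFT parameters are illustrative assumptions.

# Minimal sketch: magnitude masking (mixture phase reused) vs. a phase-aware
# complex-STFT operation. Toy sinusoids stand in for vocals and accompaniment.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
vocals = 0.5 * np.sin(2 * np.pi * 440 * t)   # toy "vocal" source
accomp = 0.3 * np.sin(2 * np.pi * 110 * t)   # toy accompaniment
mixture = vocals + accomp

_, _, X = stft(mixture, fs=fs, nperseg=1024)  # complex mixture STFT
_, _, V = stft(vocals, fs=fs, nperseg=1024)   # complex target STFT

# Magnitude-domain mask: phase information is discarded, mixture phase reused.
mag_mask = np.abs(V) / (np.abs(X) + 1e-8)
vocals_mag = istft(mag_mask * np.abs(X) * np.exp(1j * np.angle(X)), fs=fs)[1]

# Phase-aware operation on the complex STFT: target phase is preserved.
complex_mask = V / (X + 1e-8)
vocals_cplx = istft(complex_mask * X, fs=fs)[1]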
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much of
the overview is on separation algorithms where we review monaural methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source. Comment: 27 pages, 17 figures
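One of the training targets this overview discusses is the ideal ratio mask (IRM); a minimal generic sketch of how such a target is computed per time-frequency unit is given below (the function name and epsilon are assumptions, not code from the article).

# Ideal ratio mask: a common training target for supervised separation.
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM(t, f) = (S^2 / (S^2 + N^2))^beta, computed per T-F unit."""
    return (speech_mag**2 / (speech_mag**2 + noise_mag**2 + 1e-8))**beta

# During training, a DNN maps acoustic features of the mixture to this mask;
# at test time the estimated mask is applied to the mixture spectrogram.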
Music Source Separation Using Stacked Hourglass Networks
In this paper, we propose a simple yet effective method for multiple music
source separation using convolutional neural networks. Stacked hourglass
network, which was originally designed for human pose estimation in natural
images, is applied to a music source separation task. The network learns
features from a spectrogram image across multiple scales and generates masks
for each music source. The estimated mask is refined as it passes over stacked
hourglass modules. The proposed framework is able to separate multiple music
sources using a single network. Experimental results on MIR-1K and DSD100
datasets validate that the proposed method achieves competitive results
comparable to the state-of-the-art methods in multiple music source separation
and singing voice separation tasks. Comment: ISMIR 2018, source code:
https://github.com/sungheonpark/music_source_sepearation_SH_ne
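A hedged PyTorch sketch of the general idea follows: hourglass modules that downsample and upsample spectrogram features are stacked so that each stage refines the source masks. Layer widths, the stack depth of two, and the assumption of even spectrogram dimensions are illustrative choices, not the paper's architecture.

# Sketch of stacked hourglass mask estimation on a spectrogram image.
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.mid = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.skip = nn.Conv2d(ch, ch, 1)

    def forward(self, x):                    # assumes even F and T dimensions
        return self.up(self.mid(self.down(x))) + self.skip(x)

class StackedHourglassMasker(nn.Module):
    def __init__(self, ch=32, n_sources=2, n_stacks=2):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 3, padding=1)
        self.stages = nn.ModuleList(Hourglass(ch) for _ in range(n_stacks))
        self.heads = nn.ModuleList(nn.Conv2d(ch, n_sources, 1) for _ in range(n_stacks))

    def forward(self, spec):                  # spec: (batch, 1, F, T)
        x, masks = self.stem(spec), []
        for hg, head in zip(self.stages, self.heads):
            x = hg(x)
            masks.append(torch.sigmoid(head(x)))  # mask refined at each stack
        return masks                          # last entry is the final mask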
Learning with Learned Loss Function: Speech Enhancement with Quality-Net to Improve Perceptual Evaluation of Speech Quality
Utilizing a human-perception-related objective function to train a speech
enhancement model has become a popular topic recently. The main reason is that
the conventional mean squared error (MSE) loss cannot represent auditory
perception well. One typical human-perception-related metric, the perceptual evaluation of speech quality (PESQ), has been shown to correlate highly with quality scores rated by humans. Owing to its complex
and non-differentiable properties, however, the PESQ function may not be used
to optimize speech enhancement models directly. In this study, we propose
optimizing the enhancement model with an approximated PESQ function, which is
differentiable and learned from the training data. The experimental results
show that the learned surrogate function can guide the enhancement model to
further boost the PESQ score (an increase of 0.18 points compared to the results trained with the MSE loss) and maintain speech intelligibility. Comment: Accepted by IEEE Signal Processing Letters (SPL)
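The training setup described above can be sketched as follows: a frozen, pretrained quality predictor stands in for PESQ, and the enhancement model is updated to maximize its output. All module shapes, the toy mask estimator, and the optimizer settings are assumptions for illustration only.

# Sketch: a frozen learned quality surrogate used as the enhancement loss.
import torch
import torch.nn as nn

quality_net = nn.Sequential(          # pretrained PESQ approximator (frozen)
    nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 1))
for p in quality_net.parameters():
    p.requires_grad_(False)

enhancer = nn.Sequential(             # toy enhancement model (mask estimator)
    nn.Linear(257, 257), nn.Sigmoid())
opt = torch.optim.Adam(enhancer.parameters(), lr=1e-3)

noisy_mag = torch.rand(8, 100, 257)   # (batch, frames, frequency bins)
enhanced = enhancer(noisy_mag) * noisy_mag
# Maximize predicted quality, i.e. minimize its negative.
loss = -quality_net(enhanced).mean()
loss.backward()
opt.step()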
Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation
The goal of this work is to investigate what singing voice separation
approaches based on neural networks learn from the data. We examine the mapping
functions of neural networks based on the denoising autoencoder (DAE) model
that are conditioned on the mixture magnitude spectra. To approximate the
mapping functions, we propose an algorithm inspired by knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a
matrix that expresses the mapping of the mixture to the target source magnitude
information. Using the NCA, we examine the mapping functions of three
fundamental DAE-based models in music source separation: one with a single-layer encoder and decoder, one with a multi-layer encoder and a single-layer decoder, and one using skip-filtering connections (SF) with single-layer encoding and decoding. We first train these models with realistic data to estimate the
singing voice magnitude spectra from the corresponding mixture. We then use the
optimized models and test spectral data as input to the NCA. Our experimental
findings show that approaches based on the DAE model learn scalar filtering
operators, exhibiting a predominant diagonal structure in their corresponding
mapping functions, limiting the exploitation of inter-frequency structure of
music data. In contrast, skip-filtering connections are shown to assist the DAE
model in learning filtering operators that exploit richer inter-frequency
structures.
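The NCA itself is not reproduced here, but a related (and much cruder) way to probe the same question is to take the Jacobian of a trained DAE's output with respect to its input spectrum: a strongly diagonal Jacobian indicates scalar per-frequency filtering, while off-diagonal mass indicates inter-frequency structure. The stand-in network and frame below are purely illustrative.

# Inspect the locally linear mapping of a DAE via its input-output Jacobian.
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

n_freq = 257
dae = nn.Sequential(nn.Linear(n_freq, n_freq), nn.ReLU(),
                    nn.Linear(n_freq, n_freq), nn.ReLU())   # stand-in DAE

mixture_frame = torch.rand(n_freq)        # one magnitude-spectrum frame
J = jacobian(dae, mixture_frame)          # (n_freq, n_freq) coupling matrix

diag_energy = J.diag().abs().sum()
total_energy = J.abs().sum()
print(f"diagonal share of coupling energy: {diag_energy / total_energy:.2f}")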
Complex ratio masking for singing voice separation
Music source separation is important for applications such as karaoke and
remixing. Much of previous research focuses on estimating short-time Fourier
transform (STFT) magnitude and discarding phase information. We observe that, for singing voice separation, phase can yield a considerable improvement in separation quality. This paper proposes a complex ratio masking method for
voice and accompaniment separation. The proposed method employs DenseUNet with
self attention to estimate the real and imaginary components of STFT for each
sound source. A simple ensemble technique is introduced to further improve
separation performance. Evaluation results demonstrate that the proposed method
outperforms recent state-of-the-art models for both separated voice and
accompaniment.
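Complex ratio masking, in its generic form, applies an estimated complex mask to the mixture STFT by complex multiplication, so both magnitude and phase are modified; a minimal sketch is shown below (the paper's DenseUNet mask estimator is not reproduced, and the helper function is illustrative).

# Apply a complex ratio mask (Mr + j*Mi) to the mixture STFT (Xr + j*Xi).
import torch

def apply_complex_mask(mix_real, mix_imag, mask_real, mask_imag):
    """Complex multiplication, returned as separate real/imaginary tensors."""
    est_real = mask_real * mix_real - mask_imag * mix_imag
    est_imag = mask_real * mix_imag + mask_imag * mix_real
    return est_real, est_imag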
A Unified Framework for Speech Separation
Speech separation refers to extracting each individual speech source in a
given mixed signal. Recent advancements and ongoing research in speech separation have made these approaches promising techniques for pre-processing naturalistic audio streams. Since deep learning was incorporated into speech separation, the performance of these systems has improved rapidly. The initial deep learning based speech separation solutions analyzed the speech signals in the time-frequency domain with the STFT, and the encoded mixed signals were then fed into a deep neural network based separator. More recently, new methods have been introduced that separate the waveform of the mixed signal directly, without an STFT analysis. Here, we introduce
a unified framework that includes both spectrogram and waveform separation in a single structure, the two differing only in the kernel function used to encode and decode the data; both can achieve competitive performance. The new framework provides flexibility: depending on the characteristics of the data, or on memory and latency constraints, the hyper-parameters can be set so that the resulting pipeline fits the task properly. We extend single-channel speech separation into a multi-channel framework with end-to-end training of the network while directly optimizing the speech separation criterion (i.e., Si-SNR). We emphasize how tied kernel functions for computing spatial features, the encoder, and the decoder can be effective in the multi-channel framework. We simulate spatialized reverberant data for both the WSJ0 and LibriSpeech corpora, and, although these two datasets differ in size and duration, the effect of capturing shorter and longer dependencies on previous/future samples is studied in detail. We report SDR, Si-SNR, and PESQ to evaluate the performance of the developed solutions.
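For reference, the Si-SNR criterion named above is commonly written as the projection of the estimate onto the target followed by an energy ratio; a generic sketch (tensor shapes and the epsilon are assumptions) is given below.

# Scale-invariant signal-to-noise ratio on zero-mean waveforms, shape (..., T).
import torch

def si_snr(estimate, target, eps=1e-8):
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to remove scale differences.
    s_target = (estimate * target).sum(-1, keepdim=True) * target \
               / (target.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))

# Training maximizes Si-SNR, i.e. minimizes its negative:
# loss = -si_snr(model(mixture), clean).mean()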
Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speaker Separation
We propose multi-microphone complex spectral mapping, a simple way of
applying deep learning for time-varying non-linear beamforming, for offline
utterance-wise and block-online continuous speaker separation in reverberant
conditions, aiming at both speaker separation and dereverberation. Assuming a
fixed array geometry between training and testing, we train deep neural
networks (DNN) to predict the real and imaginary (RI) components of target
speech at a reference microphone from the RI components of multiple
microphones. We then integrate multi-microphone complex spectral mapping with
beamforming and post-filtering to further improve separation, and combine it
with frame-level speaker counting for block-online continuous speaker
separation (CSS). Although our system is trained on simulated room impulse
responses (RIR) based on a fixed number of microphones arranged in a given
geometry, it generalizes well to a real array with the same geometry.
State-of-the-art separation performance is obtained on the simulated two-talker
SMS-WSJ corpus and the real-recorded LibriCSS dataset. Comment: 10 pages, in submission
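A hedged sketch of the input/output convention described in the abstract follows: the real and imaginary (RI) STFT components of all microphones are stacked as input channels, and a network predicts the RI components of the target speech at a reference microphone. The convolutional body, array size, and tensor shapes below are placeholders, not the paper's DNN.

# Multi-microphone complex spectral mapping: RI components in, target RI out.
import torch
import torch.nn as nn

n_mics, n_freq, n_frames = 7, 257, 100
mix_stft = torch.randn(1, n_mics, n_freq, n_frames, dtype=torch.complex64)

# Stack RI components of all microphones: (batch, 2 * n_mics, F, T)
x = torch.cat([mix_stft.real, mix_stft.imag], dim=1)

net = nn.Sequential(                        # placeholder for the trained DNN
    nn.Conv2d(2 * n_mics, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 2, 3, padding=1))         # target RI at the reference mic

out = net(x)
target_ri = torch.complex(out[:, 0], out[:, 1])   # complex target estimate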
Monaural Speech Enhancement Using a Multi-Branch Temporal Convolutional Network
Deep learning has achieved substantial improvement on single-channel speech
enhancement tasks. However, the performance of multi-layer perceptron (MLP) based methods is limited by their ability to capture long-term effective history information. Recurrent neural networks (RNNs), e.g., the long
short-term memory (LSTM) model, are able to capture the long-term temporal
dependencies, but come with the issues of high latency and training complexity. To address these issues, the temporal convolutional network (TCN)
was proposed to replace the RNNs in various sequence modeling tasks. In this
paper we propose a novel TCN model that employs a multi-branch structure, called the multi-branch TCN (MB-TCN), for monaural speech enhancement. The MB-TCN exploits a split-transform-aggregate design, which is expected to obtain strong representational power at low computational complexity. Inspired by the TCN, the MB-TCN model incorporates one-dimensional causal dilated CNNs and residual
learning to expand receptive fields for capturing long-term temporal contextual
information. Our extensive experimental investigation suggests that the MB-TCNs
outperform the residual long short-term memory networks (ResLSTMs), temporal
convolutional networks (TCNs), and the CNN networks that employ dense
aggregations in terms of speech intelligibility and quality, while providing
superior parameter efficiency. Furthermore, our experimental results
demonstrate that our proposed MB-TCN model is able to outperform multiple
state-of-the-art deep learning-based speech enhancement methods in terms of
five widely used objective metrics. Comment: There are some inappropriate descriptions. These descriptions exist on many pages.
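A rough PyTorch sketch of a split-transform-aggregate TCN block, in the spirit of the MB-TCN described above, is given below: the input channels are split across parallel branches of causal dilated one-dimensional convolutions, the branch outputs are concatenated, and a residual connection is added. Channel counts, kernel size, and the number of branches are illustrative assumptions.

# Multi-branch causal dilated convolution block with a residual connection.
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    def __init__(self, ch, dilation):
        super().__init__()
        self.pad = (3 - 1) * dilation                 # left-pad only => causal
        self.conv = nn.Conv1d(ch, ch, kernel_size=3, dilation=dilation)

    def forward(self, x):                             # x: (batch, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class MultiBranchTCNBlock(nn.Module):
    def __init__(self, ch=64, n_branches=4, dilation=2):
        super().__init__()
        assert ch % n_branches == 0
        self.n_branches = n_branches
        self.branches = nn.ModuleList(
            nn.Sequential(CausalDilatedConv(ch // n_branches, dilation), nn.PReLU())
            for _ in range(n_branches))

    def forward(self, x):
        chunks = torch.chunk(x, self.n_branches, dim=1)            # split
        out = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return x + out                                             # aggregate + residual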
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progresses made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combination with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research. Comment: This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 2017.
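Among the end-to-end criteria mentioned above, CTC is the most mechanical to demonstrate; a minimal usage sketch with PyTorch's built-in nn.CTCLoss follows (sequence lengths, batch size, and vocabulary size are illustrative).

# Connectionist temporal classification loss on random toy data.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
T, B, V = 50, 4, 29                      # time steps, batch, vocab (incl. blank)
log_probs = torch.randn(T, B, V).log_softmax(-1)   # network outputs, (T, B, V)
targets = torch.randint(1, V, (B, 12))             # label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)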