Incorporating Symbolic Sequential Modeling for Speech Enhancement
In a noisy environment, a degraded speech signal can often be restored automatically by a listener who knows the language well. That is, with the built-in knowledge of a "language model", a listener can effectively suppress noise interference and retrieve the target speech signal. Accordingly, we argue that
familiarity with the underlying linguistic content of spoken utterances
benefits speech enhancement (SE) in noisy environments. In this study, in
addition to the conventional modeling for learning the acoustic noisy-clean
speech mapping, an abstract symbolic sequential modeling is incorporated into
the SE framework. This symbolic sequential modeling can be regarded as a
"linguistic constraint" in learning the acoustic noisy-clean speech mapping
function. Specifically, the symbolic sequences for the acoustic signals are obtained as discrete representations with a Vector Quantized Variational Autoencoder (VQ-VAE); the resulting symbols capture high-level, phoneme-like content from speech signals. The experimental results demonstrate that the proposed framework achieves notable improvements in terms
of perceptual evaluation of speech quality (PESQ) and short-time objective
intelligibility (STOI) on the TIMIT dataset.
Comment: Accepted to Interspeech 201
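The discretization step described above can be pictured with a short sketch. The following is a minimal, illustrative VQ-VAE quantizer in PyTorch (class and parameter names such as num_codes and code_dim are placeholders, not the paper's configuration): each continuous encoder frame is snapped to its nearest codebook entry, the resulting index serves as the "symbol", and a straight-through estimator lets gradients flow back to the encoder.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient,
    in the spirit of the standard VQ-VAE; sizes are illustrative only."""

    def __init__(self, num_codes=256, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encoder outputs
        flat = z_e.reshape(-1, z_e.size(-1))
        # squared distance from every frame to every codebook entry
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        indices = dists.argmin(dim=1)                    # the discrete "symbols"
        z_q = self.codebook(indices).view_as(z_e)        # quantized vectors
        # codebook and commitment terms of the VQ-VAE objective
        loss = ((z_q - z_e.detach()).pow(2).mean()
                + self.beta * (z_e - z_q.detach()).pow(2).mean())
        # straight-through estimator: copy gradients from z_q to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss
```

In such a setup, the index sequence produced for each utterance would play the role of the symbolic sequence that constrains the noisy-to-clean mapping.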
Speech Enhancement using a Deep Mixture of Experts
In this study we present a Deep Mixture of Experts (DMoE) neural-network
architecture for single-microphone speech enhancement. In contrast to most speech enhancement algorithms, which overlook the speech variability mainly
caused by phoneme structure, our framework comprises a set of deep neural
networks (DNNs), each one of which is an 'expert' in enhancing a given speech
type corresponding to a phoneme. A gating DNN determines which expert is
assigned to a given speech segment. A speech presence probability (SPP) is then
obtained as a weighted average of the expert SPP decisions, with the weights
determined by the gating DNN. A soft spectral attenuation, based on the SPP, is
then applied to enhance the noisy speech signal. The experts and the gating
components of the DMoE network are trained jointly. As part of the training,
speech clustering into different subsets is performed in an unsupervised
manner. Therefore, unlike previous methods, a phoneme-labeled database is not
required for the training procedure. A series of experiments with different
noise types verified the applicability of the new algorithm to the task of
speech enhancement. The proposed scheme outperforms other schemes that either
do not consider phoneme structure or use a simpler training methodology.
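To make the gating-and-averaging step concrete, here is a minimal sketch of a mixture-of-experts SPP estimator in PyTorch (the layer sizes, number of experts, and function names are illustrative assumptions, not the paper's architecture): each expert DNN maps noisy features to a per-frequency SPP, a gating DNN produces per-frame expert weights, the final SPP is their weighted average, and that SPP drives a soft spectral gain.

```python
import torch
import torch.nn as nn

class DeepMixtureOfExpertsSPP(nn.Module):
    """Illustrative mixture-of-experts SPP estimator; sizes and depths are
    placeholders rather than the paper's exact configuration."""

    def __init__(self, feat_dim=257, num_experts=10, hidden=512):
        super().__init__()
        # each expert maps noisy features to a per-frequency SPP estimate
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, feat_dim), nn.Sigmoid())
            for _ in range(num_experts)])
        # the gating network assigns a weight to each expert per frame
        self.gate = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, num_experts),
                                  nn.Softmax(dim=-1))

    def forward(self, noisy_feats):
        # noisy_feats: (frames, feat_dim) log-spectral features
        spp_per_expert = torch.stack(
            [expert(noisy_feats) for expert in self.experts], dim=1)
        weights = self.gate(noisy_feats).unsqueeze(-1)   # (frames, experts, 1)
        return (weights * spp_per_expert).sum(dim=1)     # weighted-average SPP

def soft_attenuation(noisy_mag, spp, gain_floor=0.1):
    # SPP-driven soft spectral gain applied to the noisy magnitude spectrum
    return torch.clamp(spp, min=gain_floor) * noisy_mag
```

Joint training of the experts and the gate, as described in the abstract, would amount to back-propagating a single SPP loss through both branches so that the unsupervised clustering of speech types emerges from the gating weights.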
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning-based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much of the overview is devoted to separation algorithms, where we review monaural methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source.
Comment: 27 pages, 17 figures
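As a small illustration of the "training targets" component mentioned above, one target widely used in this literature is the ideal ratio mask (IRM); the sketch below (NumPy, with illustrative function names) shows how such a mask is computed from parallel clean and noise spectrograms and how an estimated mask is applied for masking-based enhancement.

```python
import numpy as np

def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-8):
    """Ideal ratio mask: per time-frequency ratio of speech energy to
    speech-plus-noise energy, a common supervised training target."""
    clean_pow = np.abs(clean_spec) ** 2
    noise_pow = np.abs(noise_spec) ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

def apply_mask(noisy_spec, mask):
    """Masking-based enhancement: scale the noisy magnitude by the estimated
    mask and reuse the noisy phase for resynthesis."""
    return mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
```

A learning machine (e.g., a DNN) trained on acoustic features would predict such a mask for unseen noisy input; generalization to unseen noises and speakers is exactly the issue the overview highlights.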