Multichannel Speech Enhancement by Raw Waveform-mapping using Fully Convolutional Networks
In recent years, waveform-mapping-based speech enhancement (SE) methods have
garnered significant attention. These methods generally use a deep learning
model to directly process and reconstruct speech waveforms. Because both the
input and output are in waveform format, the waveform-mapping-based SE methods
can overcome the distortion caused by imperfect phase estimation, which may be
encountered in spectral-mapping-based SE systems. So far, most
waveform-mapping-based SE methods have focused on single-channel tasks. In this
paper, we propose a novel fully convolutional network (FCN) with Sinc and
dilated convolutional layers (termed SDFCN) for multichannel SE that operates
in the time domain. We also propose an extended version of SDFCN, called the
residual SDFCN (termed rSDFCN). The proposed methods are evaluated on two
multichannel SE tasks, namely the dual-channel inner-ear microphones SE task
and the distributed microphones SE task. The experimental results confirm the
outstanding denoising capability of the proposed SE systems on both tasks and
the benefits of using the residual architecture on the overall SE performance.
Comment: Accepted to IEEE/ACM Transactions on Audio, Speech and Language Processing
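As a rough illustration of the kind of model described above, the sketch below builds a waveform-in, waveform-out fully convolutional network with exponentially dilated 1-D convolutions in PyTorch. It is not the authors' SDFCN: the channel counts and kernel sizes are assumptions, and the Sinc-constrained first layer is replaced by an ordinary convolution for brevity.

    import torch
    import torch.nn as nn

    class DilatedWaveFCN(nn.Module):
        """Toy waveform-to-waveform FCN with exponentially dilated 1-D convs.
        The paper's SDFCN would additionally constrain the first layer to
        Sinc-parameterized band-pass filters."""
        def __init__(self, in_ch=2, channels=32, layers=5):
            super().__init__()
            blocks = [nn.Conv1d(in_ch, channels, kernel_size=11, padding=5), nn.PReLU()]
            for i in range(layers):
                d = 2 ** i
                blocks += [nn.Conv1d(channels, channels, kernel_size=11,
                                     dilation=d, padding=5 * d), nn.PReLU()]
            blocks += [nn.Conv1d(channels, 1, kernel_size=11, padding=5)]
            self.net = nn.Sequential(*blocks)

        def forward(self, noisy):                 # noisy: (batch, mics, samples)
            return self.net(noisy).squeeze(1)     # enhanced waveform (batch, samples)

    enhanced = DilatedWaveFCN()(torch.randn(4, 2, 16000))   # dual-channel input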
RHR-Net: A Residual Hourglass Recurrent Neural Network for Speech Enhancement
Most current speech enhancement models use spectrogram features that require
an expensive transformation and result in phase information loss. Previous work
has overcome these issues by using convolutional networks to learn long-range
temporal correlations across high-resolution waveforms. These models, however,
are limited by memory-intensive dilated convolution and aliasing artifacts from
upsampling. We introduce an end-to-end fully-recurrent hourglass-shaped neural
network architecture with residual connections for waveform-based
single-channel speech enhancement. Our model can efficiently capture long-range
temporal dependencies by reducing the feature resolution without information
loss. Experimental results show that our model outperforms state-of-the-art
approaches in six evaluation metrics.
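The hourglass idea can be illustrated with a minimal two-level recurrent sketch: the time resolution is halved by folding adjacent time steps into the channel dimension (so nothing is discarded), processed at the lower rate, unfolded back, and merged with the high-resolution path through a residual connection. The layer sizes and two-level depth here are assumptions, not the published RHR-Net configuration.

    import torch
    import torch.nn as nn

    class HourglassGRU(nn.Module):
        """Two-level recurrent hourglass with a residual skip connection."""
        def __init__(self, channels=32):
            super().__init__()
            self.inp = nn.GRU(1, channels, batch_first=True)
            self.down = nn.GRU(2 * channels, 2 * channels, batch_first=True)
            self.up = nn.GRU(channels, channels, batch_first=True)
            self.out = nn.Linear(channels, 1)

        def forward(self, wav):                       # wav: (batch, T), T even
            x = wav.unsqueeze(-1)
            h, _ = self.inp(x)                        # (batch, T, C)
            b, t, c = h.shape
            low, _ = self.down(h.reshape(b, t // 2, 2 * c))   # half resolution, lossless
            y, _ = self.up(low.reshape(b, t, c) + h)          # unfold + residual skip
            return self.out(y).squeeze(-1)            # enhanced waveform (batch, T)

    est = HourglassGRU()(torch.randn(4, 16000))       # one second at 16 kHz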
RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification
Recently, direct modeling of raw waveforms using deep neural networks has
been widely studied for a number of tasks in audio domains. In speaker
verification, however, utilization of raw waveforms is in its preliminary
phase, requiring further investigation. In this study, we explore end-to-end
deep neural networks that input raw waveforms to improve various aspects:
front-end speaker embedding extraction including model architecture,
pre-training scheme, additional objective functions, and back-end
classification. Adjustment of model architecture using a pre-training scheme
can extract speaker embeddings, giving a significant improvement in
performance. Additional objective functions simplify the process of extracting
speaker embeddings by merging conventional two-phase processes: extracting
utterance-level features such as i-vectors or x-vectors and the feature
enhancement phase, e.g., linear discriminant analysis. Effective back-end
classification models that suit the proposed speaker embedding are also
explored. We propose an end-to-end system that comprises two deep neural
networks, one front-end for utterance-level speaker embedding extraction and
the other for back-end classification. Experiments conducted on the VoxCeleb1
dataset demonstrate that the proposed model achieves state-of-the-art
performance among systems without data augmentation. The proposed system is
also comparable to the state-of-the-art x-vector system that adopts data
augmentation.
Comment: Accepted for oral presentation at Interspeech 2019, code available at http://github.com/Jungjee/RawNe
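For orientation only, the sketch below shows the two-network layout the abstract describes: a raw-waveform front-end that produces an utterance-level speaker embedding, and a separate back-end classifier. The layer shapes and speaker count are placeholders, not the RawNet configuration.

    import torch
    import torch.nn as nn

    class RawEmbedder(nn.Module):
        """Toy front-end: raw waveform -> utterance-level speaker embedding."""
        def __init__(self, emb_dim=128):
            super().__init__()
            self.conv = nn.Sequential(nn.Conv1d(1, 64, 251, stride=5), nn.ReLU(),
                                      nn.Conv1d(64, 64, 5, stride=3), nn.ReLU())
            self.gru = nn.GRU(64, emb_dim, batch_first=True)

        def forward(self, wav):                    # wav: (batch, samples)
            f = self.conv(wav.unsqueeze(1))        # (batch, 64, frames)
            _, h = self.gru(f.transpose(1, 2))     # last hidden state
            return h.squeeze(0)                    # (batch, emb_dim) embedding

    backend = nn.Linear(128, 1000)                 # scores over training speakers (placeholder count)
    logits = backend(RawEmbedder()(torch.randn(2, 32000)))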
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much of
the overview is on separation algorithms where we review monaural methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source.
Comment: 27 pages, 17 figures
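One training target that overviews of this kind commonly cover is the ideal ratio mask (IRM). A minimal sketch of computing it from clean and noise magnitude spectrograms follows; the exponent and flooring constant are common choices rather than values taken from this article.

    import torch

    def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
        """IRM: per time-frequency unit, speech energy over total energy."""
        return clean_mag**2 / (clean_mag**2 + noise_mag**2 + eps)

    # At inference, the estimated mask is applied to the noisy magnitude:
    #   enhanced_mag = mask * noisy_mag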
Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation
We propose an end-to-end joint optimization framework of a multi-channel
neural speech extraction and deep acoustic model without mel-filterbank (FBANK)
extraction for overlapped speech recognition. First, based on a multi-channel
convolutional TasNet with STFT kernel, we unify the multi-channel target speech
enhancement front-end network and a convolutional, long short-term memory and
fully connected deep neural network (CLDNN) based acoustic model (AM) with the
FBANK extraction layer to build a hybrid neural network, which is thus jointly
updated only by the recognition loss. The proposed framework achieves a 28% word
error rate reduction (WERR) over a separately optimized system on AISHELL-1 and
shows consistent robustness to signal-to-interference ratio (SIR) and angle
difference between overlapping speakers. Next, a further exploration shows that
the speech recognition is improved with a simplified structure by replacing the
FBANK extraction layer in the joint model with a learnable feature projection.
Finally, we also perform objective measurements of speech quality on the
waveform reconstructed by the enhancement network in the joint model.
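To make "jointly updated only by the recognition loss" concrete, here is a toy sketch in which a single cross-entropy recognition loss backpropagates through both a stand-in acoustic model and a stand-in multi-channel enhancement front-end. All module shapes and the output-unit count are hypothetical and far smaller than a real system.

    import torch
    import torch.nn as nn

    # Toy stand-ins; the point is that one recognition loss updates everything.
    enhancer = nn.Conv1d(2, 1, kernel_size=5, padding=2)        # 2 mics -> 1 waveform
    frontend = nn.Conv1d(1, 40, kernel_size=400, stride=160)    # learnable features
    am = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 3000))

    noisy = torch.randn(8, 2, 16000)            # batch of two-channel waveforms
    labels = torch.randint(0, 3000, (8, 98))    # frame-level targets (98 frames here)

    feats = frontend(enhancer(noisy)).transpose(1, 2)            # (batch, frames, 40)
    loss = nn.functional.cross_entropy(am(feats).flatten(0, 1), labels.flatten())
    loss.backward()                              # gradients reach the enhancer as well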
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize the recent progress made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," vol. 4, no. 3, IEEE/CAA Journal of Automatica Sinica, 201
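As one concrete example of a criterion mentioned above, the snippet below shows a minimal use of PyTorch's built-in CTC loss on random tensors; the sequence lengths and vocabulary size are arbitrary.

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)
    log_probs = torch.randn(50, 4, 30, requires_grad=True).log_softmax(-1)  # (time, batch, vocab)
    targets = torch.randint(1, 30, (4, 12))                                 # label sequences
    loss = ctc(log_probs, targets,
               torch.full((4,), 50, dtype=torch.long),                      # input lengths
               torch.full((4,), 12, dtype=torch.long))                      # target lengths
    loss.backward()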
Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks
In recent decades, neural network based methods have significantly improved
the performance of speech enhancement. Most of them estimate the time-frequency
(T-F) representation of the target speech directly or indirectly, and then
resynthesize the waveform from the estimated T-F representation. In this work,
we propose the temporal convolutional recurrent network (TCRN), an end-to-end
model that directly maps a noisy waveform to a clean waveform. The TCRN, which
combines convolutional and recurrent neural networks, is able to efficiently
and effectively leverage both short-term and long-term information.
Furthermore, we present an architecture that repeatedly downsamples and
upsamples the speech signal during forward propagation. We show that our model
improves performance compared with existing convolutional recurrent networks.
We also present several key techniques to stabilize the training process. The
experimental results show that our model consistently outperforms existing
speech enhancement approaches in terms of both speech intelligibility and quality.
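A minimal sketch of the general pattern described, a strided convolutional encoder, a recurrent bottleneck, and a transposed-convolution decoder mapping a noisy waveform to an enhanced waveform, is given below. It is a toy stand-in rather than the TCRN itself, and every hyperparameter is assumed.

    import torch
    import torch.nn as nn

    class TinyConvRecurrentNet(nn.Module):
        """Strided conv (downsample) -> LSTM -> transposed conv (upsample)."""
        def __init__(self, ch=32):
            super().__init__()
            self.enc = nn.Conv1d(1, ch, kernel_size=16, stride=8, padding=4)
            self.rnn = nn.LSTM(ch, ch, batch_first=True)
            self.dec = nn.ConvTranspose1d(ch, 1, kernel_size=16, stride=8, padding=4)

        def forward(self, wav):                               # wav: (batch, samples)
            z = torch.relu(self.enc(wav.unsqueeze(1)))        # 8x fewer time steps
            r, _ = self.rnn(z.transpose(1, 2))                # short- and long-term context
            return self.dec(r.transpose(1, 2)).squeeze(1)     # back to (batch, samples)

    print(TinyConvRecurrentNet()(torch.randn(2, 16000)).shape)   # torch.Size([2, 16000])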
Distributed Microphone Speech Enhancement based on Deep Learning
Speech-related applications deliver inferior performance in complex noise
environments. Therefore, this study primarily addresses this problem by
introducing speech-enhancement (SE) systems based on deep neural networks
(DNNs) applied to a distributed microphone architecture, and then investigates
the effectiveness of three different DNN-model structures. The first system
constructs a DNN model for each microphone to enhance the recorded noisy speech
signal, and the second system combines all the noisy recordings into a large
feature structure that is then enhanced through a DNN model. As for the third
system, a channel-dependent DNN is first used to enhance the corresponding
noisy input, and all the channel-wise enhanced outputs are fed into a DNN
fusion model to construct a nearly clean signal. All three DNN SE systems
operate in the acoustic frequency domain of speech signals in a
diffuse-noise field environment. Evaluation experiments were conducted on the
Taiwan Mandarin Hearing in Noise Test (TMHINT) database, and the results
indicate that all three DNN-based SE systems improve the speech quality and
intelligibility of the original noise-corrupted signals, with the third system
delivering the highest signal-to-noise ratio (SNR) improvement and the best
speech intelligibility.
Comment: deep neural network, multi-channel speech enhancement, distributed microphone architecture, diffuse noise environment
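The third configuration can be sketched roughly as one small DNN per microphone channel followed by a fusion DNN over the concatenated channel-wise outputs. The feature dimension, layer widths, and microphone count below are placeholder assumptions.

    import torch
    import torch.nn as nn

    n_mics, dim = 4, 257                      # e.g. spectral bins per frame (assumed)
    per_channel = nn.ModuleList(
        nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))
        for _ in range(n_mics))               # one enhancement DNN per microphone
    fusion = nn.Sequential(nn.Linear(n_mics * dim, 512), nn.ReLU(),
                           nn.Linear(512, dim))

    noisy = torch.randn(8, n_mics, dim)       # a batch of frames from all microphones
    channel_out = [per_channel[m](noisy[:, m]) for m in range(n_mics)]
    clean_est = fusion(torch.cat(channel_out, dim=-1))   # fused nearly-clean estimate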
MIMO Speech Compression and Enhancement Based on Convolutional Denoising Autoencoder
For speech-related applications in IoT environments, identifying effective
methods to handle interference noises and compress the amount of data in
transmissions is essential to achieve high-quality services. In this study, we
propose a novel multi-input multi-output speech compression and enhancement
(MIMO-SCE) system based on a convolutional denoising autoencoder (CDAE) model
to simultaneously improve speech quality and reduce the dimensions of
transmission data. Compared with conventional single-channel and multi-input
single-output systems, MIMO systems can be employed in applications where
multiple acoustic signals need to be handled. We investigated two CDAE models,
a fully convolutional network (FCN) and a Sinc FCN, as the core models in MIMO
systems. The experimental results confirm that the proposed MIMO-SCE framework
effectively improves speech quality and intelligibility while reducing the
amount of recording data by a factor of 7 for transmission.
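To illustrate only the compression aspect, the toy sketch below pairs a strided convolutional encoder with a transposed-convolution decoder so that the transmitted representation holds roughly one seventh of the input samples. It is not the paper's CDAE, and the kernel and stride choices are assumptions.

    import torch
    import torch.nn as nn

    encoder = nn.Conv1d(2, 2, kernel_size=21, stride=7, padding=7)          # ~7x fewer samples
    decoder = nn.ConvTranspose1d(2, 2, kernel_size=21, stride=7, padding=7)

    noisy = torch.randn(4, 2, 14000)          # two noisy input channels
    code = encoder(noisy)                     # compact representation to transmit
    restored = decoder(code)                  # enhanced/reconstructed channels
    print(code.shape, restored.shape)         # (4, 2, 2000) and (4, 2, 14000)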
Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition
We investigate the effectiveness of generative adversarial networks (GANs)
for speech enhancement, in the context of improving noise robustness of
automatic speech recognition (ASR) systems. Prior work demonstrates that GANs
can effectively suppress additive noise in raw waveform speech signals,
improving perceptual quality metrics; however, this technique was not justified
in the context of ASR. In this work, we conduct a detailed study to measure the
effectiveness of GANs in enhancing speech contaminated by both additive and
reverberant noise. Motivated by recent advances in image processing, we propose
operating GANs on log-Mel filterbank spectra instead of waveforms, which
requires less computation and is more robust to reverberant noise. While GAN
enhancement improves the performance of a clean-trained ASR system on noisy
speech, it falls short of the performance achieved by conventional multi-style
training (MTR). By appending the GAN-enhanced features to the noisy inputs and
retraining, we achieve a 7% WER improvement relative to the MTR system.
Comment: Published as a conference paper at ICASSP 201
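A small sketch of the two feature-level ideas mentioned, operating on log-Mel filterbank spectra and appending enhanced features to the noisy inputs, is given below using torchaudio; the GAN enhancement network itself is omitted and replaced by a placeholder, and the frame parameters are assumptions.

    import torch
    import torchaudio

    mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                               hop_length=160, n_mels=80)

    def log_mel(wav):
        return torch.log(mel(wav) + 1e-6)      # log-Mel filterbank spectra

    noisy = torch.randn(16000)                 # one second of noisy audio
    enhanced = noisy                           # placeholder for the GAN's output
    stacked = torch.cat([log_mel(noisy), log_mel(enhanced)], dim=0)   # (160, frames)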