An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales
This paper presents an improved deep embedding learning method based on
convolutional neural network (CNN) for text-independent speaker verification.
Two improvements are proposed for x-vector embedding learning: (1) Multi-scale
convolution (MSCNN) is adopted in frame-level layers to capture complementary
speaker information in different receptive fields. (2) A Baum-Welch statistics attention (BWSA) mechanism is applied in the pooling layer, which integrates more useful long-term speaker characteristics during temporal pooling.
Experiments are carried out on the NIST SRE16 evaluation set. The results
demonstrate the effectiveness of MSCNN and show that the proposed BWSA can further improve the performance of the DNN embedding system.
Comment: 5 pages, 2 figures
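For concreteness, a minimal PyTorch sketch of the two ideas in generic form follows: parallel 1-D convolutions with different kernel sizes stand in for the multi-scale frame-level convolution, and a generic attentive statistics pooling layer stands in for the Baum-Welch statistics attention, whose exact formulation is not reproduced here. Class names, kernel sizes, and layer widths are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes, so each branch
    sees a different receptive field; outputs are concatenated per frame."""
    def __init__(self, in_dim, out_dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):                       # x: (batch, features, frames)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

class AttentiveStatsPooling(nn.Module):
    """Generic attention-weighted mean/std pooling over frames (a stand-in for
    the paper's Baum-Welch statistics attention, not its actual definition)."""
    def __init__(self, in_dim, attn_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(in_dim, attn_dim, 1), nn.Tanh(), nn.Conv1d(attn_dim, in_dim, 1)
        )

    def forward(self, x):                       # x: (batch, features, frames)
        w = torch.softmax(self.attn(x), dim=2)  # per-frame attention weights
        mean = (w * x).sum(dim=2)
        std = ((w * (x - mean.unsqueeze(2)) ** 2).sum(dim=2)).clamp(min=1e-6).sqrt()
        return torch.cat([mean, std], dim=1)    # utterance-level statistics

frames = torch.randn(4, 40, 200)                # (batch, fbank dim, frames)
pooled = AttentiveStatsPooling(3 * 64)(MultiScaleConvBlock(40, 64)(frames))  # (4, 384)
```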
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 201
A general-purpose deep learning approach to model time-varying audio effects
Audio processors whose parameters are modified periodically over time are often referred to as time-varying or modulation-based audio effects. Most existing methods for modeling these types of effect units are optimized for a very specific circuit and cannot be efficiently generalized to other time-varying effects. Based on convolutional and recurrent neural networks, we propose a
deep learning architecture for generic black-box modeling of audio processors
with long-term memory. We explore the capabilities of deep neural networks to
learn such long temporal dependencies, and we show that the network can model various linear and nonlinear, time-varying and time-invariant audio effects. In order
to measure the performance of the model, we propose an objective metric based
on the psychoacoustics of modulation frequency perception. We also analyze what
the model is actually learning and how the given task is accomplished.
Comment: audio files: https://mchijmma.github.io/modeling-time-varying
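As a rough illustration of such black-box modeling, here is a minimal PyTorch sketch of a convolutional front end followed by a recurrent block with long-term memory, mapping dry audio to processed audio. The layer types, sizes, and the class name BlackBoxEffectModel are assumptions for illustration, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class BlackBoxEffectModel(nn.Module):
    """Sketch: frame the dry waveform with a strided convolution, model slow
    modulation with an LSTM, and resynthesise the wet waveform."""
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.Conv1d(1, hidden, kernel_size=64, stride=32)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)   # long-term (modulation) memory
        self.decoder = nn.ConvTranspose1d(hidden, 1, kernel_size=64, stride=32)

    def forward(self, x):                       # x: (batch, 1, samples), the dry signal
        z = torch.tanh(self.encoder(x))         # (batch, hidden, frames)
        z, _ = self.rnn(z.transpose(1, 2))      # (batch, frames, hidden)
        return self.decoder(z.transpose(1, 2))  # (batch, 1, samples), the modelled wet signal

model = BlackBoxEffectModel()
wet = model(torch.randn(1, 1, 4096))            # trained, e.g., with a time/frequency-domain loss
```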
Machine learning in acoustics: theory and applications
Acoustic data provide scientific and engineering insights in fields ranging
from biology and communications to ocean and Earth science. We survey the
recent advances and transformative potential of machine learning (ML),
including deep learning, in the field of acoustics. ML is a broad family of
techniques, which are often based in statistics, for automatically detecting
and utilizing patterns in data. Relative to conventional acoustics and signal
processing, ML is data-driven. Given sufficient training data, ML can discover
complex relationships between features and desired labels or actions, or
between features themselves. With large volumes of training data, ML can
discover models describing complex acoustic phenomena such as human speech and
reverberation. ML in acoustics is rapidly developing with compelling results
and significant future promise. We first introduce ML, then highlight ML
developments in four acoustics research areas: source localization in speech
processing, source localization in ocean acoustics, bioacoustics, and
environmental sounds in everyday scenes.
Comment: Published with free access in Journal of the Acoustical Society of America, 27 Nov. 201
Improved TDNNs using Deep Kernels and Frequency Dependent Grid-RNNs
Time delay neural networks (TDNNs) are an effective acoustic model for large
vocabulary speech recognition. The strength of the model can be attributed to
its ability to effectively model long temporal contexts. However, current TDNN
models are relatively shallow, which limits the modelling capability. This
paper proposes a method of increasing the network depth by deepening the kernel
used in the TDNN temporal convolutions. The best performing kernel consists of
three fully connected layers with a residual (ResNet) connection from the
output of the first to the output of the third. We also investigate adding spectro-temporal processing at the input of the TDNN, in the form of a convolutional neural network (CNN) and a newly designed Grid-RNN. The Grid-RNN strongly outperforms a CNN if different sets of
parameters for different frequency bands are used and can be further enhanced
by using a bi-directional Grid-RNN. Experiments using the multi-genre broadcast (MGB3) English data (275h) show that deep kernel TDNNs reduce the word error rate (WER) by 6% relative and, when combined with the frequency dependent Grid-RNN, give a relative WER reduction of 9%.
Comment: 5 pages, 3 figures, 2 tables, to appear in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)
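A hedged PyTorch sketch of the kernel described above, three fully connected layers with a residual connection from the output of the first to the output of the third, together with one simple way of splicing temporal context before applying it. The hidden sizes, context offsets, activations, and the wrap-around edge handling are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DeepKernel(nn.Module):
    """Three fully connected layers; the output of layer 1 is added (residually)
    to the output of layer 3, as described in the abstract."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)

    def forward(self, x):
        h1 = torch.relu(self.fc1(x))
        h3 = self.fc3(torch.relu(self.fc2(h1)))
        return torch.relu(h1 + h3)

class DeepKernelTDNNLayer(nn.Module):
    """One TDNN layer: splice a window of frames, then apply the deep kernel.
    torch.roll wraps around at the utterance edges; a real implementation would pad."""
    def __init__(self, feat_dim, hidden, context=(-2, -1, 0, 1, 2)):
        super().__init__()
        self.context = context
        self.kernel = DeepKernel(feat_dim * len(context), hidden)

    def forward(self, x):                    # x: (batch, frames, features)
        spliced = torch.cat([torch.roll(x, -c, dims=1) for c in self.context], dim=2)
        return self.kernel(spliced)          # (batch, frames, hidden)

layer = DeepKernelTDNNLayer(feat_dim=40, hidden=512)
out = layer(torch.randn(4, 200, 40))         # (4, 200, 512)
```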
Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation is studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised learning problem, where the discriminative patterns of speech,
speakers, and background noise are learned from training data. Over the past
decade, many supervised separation algorithms have been put forward. In
particular, the recent introduction of deep learning to supervised speech
separation has dramatically accelerated progress and boosted separation
performance. This article provides a comprehensive overview of the research on
deep learning based supervised speech separation in the last several years. We
first introduce the background of speech separation and the formulation of
supervised separation. Then we discuss three main components of supervised
separation: learning machines, training targets, and acoustic features. Much of
the overview is on separation algorithms where we review monaural methods,
including speech enhancement (speech-nonspeech separation), speaker separation
(multi-talker separation), and speech dereverberation, as well as
multi-microphone techniques. The important issue of generalization, unique to
supervised learning, is discussed. This overview provides a historical
perspective on how advances are made. In addition, we discuss a number of
conceptual issues, including what constitutes the target source.
Comment: 27 pages, 17 figures
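To make the notion of a training target concrete, the toy Python snippet below computes one simple variant of an ideal ratio mask, a target commonly used in mask-based enhancement. The exact mask definition, the additive-magnitude mixing assumption, and the spectrogram sizes are illustrative, not a prescription from the overview.

```python
import torch

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """One simple variant of the ideal ratio mask, computed from the magnitude
    spectrograms of the clean speech and the noise."""
    return speech_mag / (speech_mag + noise_mag + eps)

# A separation DNN would be trained to predict `irm` from features of the mixture;
# at test time, the predicted mask is applied to the mixture spectrogram.
speech_mag = torch.rand(257, 100)       # (frequency bins, frames)
noise_mag = torch.rand(257, 100)
mixture_mag = speech_mag + noise_mag    # simplistic additive-magnitude assumption
irm = ideal_ratio_mask(speech_mag, noise_mag)
enhanced_mag = irm * mixture_mag        # masked (enhanced) magnitude spectrogram
```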
Representation Learning with Contrastive Predictive Coding
While supervised learning has enabled great progress in many applications,
unsupervised learning has not seen such widespread adoption, and remains an
important and challenging endeavor for artificial intelligence. In this work,
we propose a universal unsupervised learning approach to extract useful
representations from high-dimensional data, which we call Contrastive
Predictive Coding. The key insight of our model is to learn such
representations by predicting the future in latent space by using powerful
autoregressive models. We use a probabilistic contrastive loss which induces
the latent space to capture information that is maximally useful to predict
future samples. It also makes the model tractable by using negative sampling.
While most prior work has focused on evaluating representations for a
particular modality, we demonstrate that our approach is able to learn useful
representations achieving strong performance on four distinct domains: speech,
images, text, and reinforcement learning in 3D environments.
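A minimal sketch of the contrastive objective described above: a context vector produced by the autoregressive model predicts a future latent, and the true future is scored against negative samples with a cross-entropy over the resulting logits. The bilinear predictor, tensor shapes, and the function name info_nce are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, negatives, W):
    """Contrastive loss sketch.
    context:   (batch, c_dim)        output of the autoregressive model
    future:    (batch, z_dim)        encoder latent at a future time step
    negatives: (batch, n_neg, z_dim) latents sampled from other positions/sequences
    W:         (c_dim, z_dim)        bilinear prediction matrix
    """
    pred = context @ W                                         # predicted future latent
    pos = (pred * future).sum(dim=1, keepdim=True)             # score of the true future
    neg = torch.bmm(negatives, pred.unsqueeze(2)).squeeze(2)   # scores of the negatives
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)     # the true future is class 0
    return F.cross_entropy(logits, labels)

W = torch.randn(256, 128, requires_grad=True)
loss = info_nce(torch.randn(8, 256), torch.randn(8, 128), torch.randn(8, 10, 128), W)
```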
Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN
Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion.
Previous studies on cross-lingual voice conversion mainly focus on spectral
conversion with a linear transformation for F0 transfer. However, as an
important prosodic factor, F0 is inherently hierarchical, so a linear method alone is insufficient for its conversion. We propose the use of
continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides
a way to decompose a signal into different temporal scales that explain prosody
in different time resolutions. We also propose to train two CycleGAN pipelines
for spectrum and prosody mapping respectively. In this way, we eliminate the
need for parallel data of any two languages and any alignment techniques.
Experimental results show that our proposed Spectrum-Prosody-CycleGAN framework
outperforms the Spectrum-CycleGAN baseline in subjective evaluation. To the best of our knowledge, this is the first study of prosody in cross-lingual voice conversion.
Comment: Accepted to APSIPA ASC 202
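As an illustration of the CWT decomposition of F0 into different temporal scales, here is a hedged Python sketch using PyWavelets. The number of scales, the choice of the Mexican hat wavelet, and the normalisation are assumptions for illustration, not necessarily the configuration used in the paper.

```python
import numpy as np
import pywt  # PyWavelets

# Stand-in for a continuous (interpolated) log-F0 contour of one utterance.
f0 = np.random.rand(300) + 5.0
f0 = (f0 - f0.mean()) / f0.std()               # normalise before decomposition

scales = 2.0 ** np.arange(1, 11)               # ten dyadic temporal scales
coeffs, freqs = pywt.cwt(f0, scales, "mexh")   # coeffs: (n_scales, n_frames)

# Each row of `coeffs` describes F0 variation at one temporal resolution; these
# multi-scale components, rather than raw F0, would be mapped by the prosody CycleGAN.
print(coeffs.shape)                            # (10, 300)
```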
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
MelNet: A Generative Model for Audio in the Frequency Domain
Capturing high-level structure in audio waveforms is challenging because a
single second of audio spans tens of thousands of timesteps. While long-range
dependencies are difficult to model directly in the time domain, we show that
they can be more tractably modelled in two-dimensional time-frequency
representations such as spectrograms. By leveraging this representational
advantage, in conjunction with a highly expressive probabilistic model and a
multiscale generation procedure, we design a model capable of generating
high-fidelity audio samples which capture structure at timescales that
time-domain models have yet to achieve. We apply our model to a variety of
audio generation tasks, including unconditional speech generation, music
generation, and text-to-speech synthesis, showing improvements over previous approaches in both density estimates and human judgments.
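To illustrate autoregressive modelling in the time-frequency domain, here is a deliberately tiny PyTorch toy that predicts each spectrogram bin from previously generated bins in a fixed raster ordering. MelNet itself is far more expressive (multiscale generation, a mixture-density output), so this is only a stand-in for the factorisation idea; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ToySpectrogramAR(nn.Module):
    """Toy autoregressive model over a mel spectrogram: bins are generated one at
    a time in time-major order, each conditioned on all previously generated bins."""
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, spec):                            # spec: (batch, frames, mel bins)
        b, t, m = spec.shape
        flat = spec.reshape(b, t * m, 1)                # raster ordering of the bins
        shifted = torch.cat([torch.zeros(b, 1, 1), flat[:, :-1]], dim=1)  # shift by one bin
        h, _ = self.rnn(shifted)                        # condition on the previous bins
        return self.out(h).reshape(b, t, m)             # prediction for every bin

model = ToySpectrogramAR()
pred = model(torch.randn(2, 20, 80))                    # train by minimising, e.g., MSE/NLL
```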