Deep Griffin-Lim Iteration
This paper presents a novel method for reconstructing the phase only from a
given amplitude spectrogram by combining a signal-processing-based approach and
a deep neural network (DNN). To retrieve a time-domain signal from its amplitude
spectrogram, the corresponding phase is required. One of the popular phase
reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on
the redundancy of the short-time Fourier transform. However, GLA often involves
many iterations and produces low-quality signals owing to the lack of prior
knowledge of the target signal. In order to address these issues, in this
study, we propose an architecture which stacks a sub-block including two
GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is
adjustable, so that performance can be traded off against computational load
according to the requirements of the application. The effectiveness of the
proposed method is investigated by reconstructing phases from amplitude
spectrograms of speech.
Comment: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1,
Session: Source Separation and Speech Enhancement I)
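For context, the classic Griffin-Lim algorithm alternates between projecting a complex spectrogram onto the set of consistent STFTs (an iSTFT followed by an STFT) and restoring the known magnitude. The sketch below, assuming librosa and illustrative STFT parameters, shows that fixed iteration; the proposed architecture interleaves trainable DNN sub-blocks with such fixed GLA-inspired layers.

```python
# Minimal sketch of the plain Griffin-Lim algorithm (GLA), assuming librosa.
# Parameter values are illustrative; `magnitude` must have n_fft // 2 + 1 rows.
import numpy as np
import librosa

def griffin_lim(magnitude, n_iter=60, n_fft=1024, hop_length=256):
    """Estimate a phase for `magnitude` (freq bins x frames) by alternating projections."""
    # Start from a random phase.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    spectrogram = magnitude * angles
    for _ in range(n_iter):
        # Project onto the set of consistent spectrograms (iSTFT, then STFT) ...
        signal = librosa.istft(spectrogram, hop_length=hop_length)
        rebuilt = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
        # ... then restore the known magnitude, keeping only the updated phase.
        spectrogram = magnitude * np.exp(1j * np.angle(rebuilt))
    return librosa.istft(spectrogram, hop_length=hop_length)
```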
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
We present Deep Voice 3, a fully-convolutional attention-based neural
text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural
speech synthesis systems in naturalness while training ten times faster. We
scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more
than eight hundred hours of audio from over two thousand speakers. In addition,
we identify common error modes of attention-based speech synthesis networks,
demonstrate how to mitigate them, and compare several different waveform
synthesis methods. We also describe how to scale inference to ten million
queries per day on one single-GPU server.
Comment: Published as a conference paper at ICLR 2018. (v3 changed paper title)
Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq
We present OpenSeq2Seq - a TensorFlow-based toolkit for training
sequence-to-sequence models that features distributed and mixed-precision
training. Benchmarks on machine translation and speech recognition tasks show
that models built using OpenSeq2Seq give state-of-the-art performance in 1.5-3x
less training time. OpenSeq2Seq currently provides building blocks for models
that solve a wide range of tasks including neural machine translation,
automatic speech recognition, and speech synthesis.
Comment: Presented at Workshop for Natural Language Processing Open Source
Software (NLP-OSS), co-located with ACL 2018
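The core of mixed-precision training is computing in FP16 while keeping FP32 master weights and scaling the loss so that small gradients do not underflow. OpenSeq2Seq itself is TensorFlow-based; the sketch below only illustrates the same loss-scaling idea using PyTorch's AMP utilities, with a placeholder model and data.

```python
# Loss-scaled mixed-precision training loop, a hedged illustration in PyTorch
# (not the OpenSeq2Seq API). The model and data are placeholders.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()             # scale the loss before backprop
    scaler.step(optimizer)                    # unscale gradients, skip step on inf/NaN
    scaler.update()                           # adapt the scale factor
```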
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel
speaker-independent multi-speaker speech separation, where time-frequency (T-F)
masking, the short-time Fourier transform (STFT), and its inverse are
represented as layers within a deep network. Previous approaches, rather than
computing a loss on the reconstructed signal, used a surrogate loss based on
the target STFT magnitudes. This ignores reconstruction error introduced by
phase inconsistency. In our approach, the loss function is directly defined on
the reconstructed signals, which are optimized for best separation. In
addition, we train through unfolded iterations of a phase reconstruction
algorithm, represented as a series of STFT and inverse STFT layers. While mask
values are typically limited to lie between zero and one for approaches using
the mixture phase for reconstruction, this limitation is less relevant if the
estimated magnitudes are to be used together with phase reconstruction. We thus
propose several novel activation functions for the output layer of the T-F
masking, to allow mask values beyond one. On the publicly-available wsj0-2mix
dataset, our approach achieves state-of-the-art 12.6 dB scale-invariant
signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new
possibilities for deep learning based phase reconstruction and representing
fundamental progress towards solving the notoriously hard cocktail party
problem.
Comment: Submitted to Interspeech 2018
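A rough sketch of the two ingredients above, assuming PyTorch: a loss defined directly on the reconstructed time-domain signal (SI-SDR) and a few unfolded phase-reconstruction iterations expressed through STFT and iSTFT layers. Iteration count, STFT parameters, and the use of the mixture phase as initialization are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: negative SI-SDR as a time-domain loss, plus unfolded
# Griffin-Lim-style iterations built from differentiable STFT/iSTFT layers.
import torch

def si_sdr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR between two 1-D signals of equal length."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    scale = (estimate * target).sum() / (target.pow(2).sum() + eps)
    projection = scale * target
    noise = estimate - projection
    return -10.0 * torch.log10(projection.pow(2).sum() / (noise.pow(2).sum() + eps))

def unfolded_reconstruction(est_mag, mix_phase, n_fft=512, hop=128, n_iter=3):
    """Start from the mixture phase and run a few fixed GLA-style iterations."""
    window = torch.hann_window(n_fft)
    spec = torch.polar(est_mag, mix_phase)                 # estimated magnitude, mixture phase
    for _ in range(n_iter):
        wave = torch.istft(spec, n_fft, hop_length=hop, window=window)
        rebuilt = torch.stft(wave, n_fft, hop_length=hop, window=window,
                             return_complex=True)
        spec = torch.polar(est_mag, torch.angle(rebuilt))  # keep magnitude, update phase
    return torch.istft(spec, n_fft, hop_length=hop, window=window)
```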
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation
WaveCycleGAN has recently been proposed to bridge the gap between natural and
synthesized speech waveforms in statistical parametric speech synthesis; it
provides fast inference with a moving-average model rather than an
autoregressive model, and high-quality speech synthesis through adversarial
training. However, the human ear can still distinguish the processed speech
waveforms from natural ones. One possible cause of this distinguishability is
the aliasing observed in the processed speech waveform via down/up-sampling
modules. To address the aliasing and provide higher-quality speech synthesis, we
propose WaveCycleGAN2, which 1) uses generators without down/up-sampling
modules and 2) combines discriminators of the waveform domain and acoustic
parameter domain. The results show that the proposed method 1) alleviates the
aliasing well, 2) is useful for both speech waveforms generated by
analysis-and-synthesis and statistical parametric speech synthesis, and 3)
achieves a mean opinion score comparable to those of natural speech and speech
synthesized by WaveNet (open WaveNet) and WaveGlow while processing speech
samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.
Comment: Submitted to INTERSPEECH 2019
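To illustrate the dual-domain discrimination described above, the sketch below (assuming PyTorch and torchaudio) sums an adversarial generator loss from a waveform-domain discriminator and from an acoustic-feature discriminator operating on a mel-spectrogram. The placeholder discriminators and the least-squares GAN formulation are assumptions, not the paper's architectures or objective.

```python
# Hedged sketch of a generator loss combining waveform-domain and
# acoustic-feature-domain discriminators (a mel-spectrogram stands in
# for the paper's acoustic parameters).
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

# Tiny placeholder discriminators, not the paper's architectures.
disc_wave = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4), nn.LeakyReLU(),
                          nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1))
disc_acoustic = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2), nn.LeakyReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

def generator_adversarial_loss(fake_wave):
    """Least-squares GAN loss summed over the two domains; fake_wave: (batch, samples)."""
    score_wave = disc_wave(fake_wave.unsqueeze(1))               # judge the raw waveform
    score_acoustic = disc_acoustic(mel(fake_wave).unsqueeze(1))  # judge the mel features
    return (score_wave - 1.0).pow(2).mean() + (score_acoustic - 1.0).pow(2).mean()
```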
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
We introduce a technique for augmenting neural text-to-speech (TTS) with
low-dimensional trainable speaker embeddings to generate different voices from a
single model. As a starting point, we show improvements over the two
state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and
Tacotron. We introduce Deep Voice 2, which is based on a pipeline similar to
that of Deep Voice 1 but constructed with higher-performance building blocks,
and demonstrates a significant audio quality improvement over Deep Voice 1. We
improve Tacotron by introducing a post-processing neural vocoder, and
demonstrate a significant audio quality improvement. We then demonstrate our
technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron
on two multi-speaker TTS datasets. We show that a single neural TTS system can
learn hundreds of unique voices from less than half an hour of data per
speaker, while achieving high audio quality synthesis and preserving the
speaker identities almost perfectly.
Comment: Accepted in NIPS 2017
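The speaker-embedding technique can be pictured as conditioning hidden activations on a learned low-dimensional vector per speaker. The sketch below, assuming PyTorch, shows one hypothetical conditioning site with a sigmoid gate; the paper applies the embeddings at multiple, model-specific locations.

```python
# Hedged sketch of low-dimensional trainable speaker embeddings used to
# condition a hidden layer; the fusion point and gating are illustrative.
import torch
import torch.nn as nn

class SpeakerConditionedLayer(nn.Module):
    def __init__(self, hidden_dim=256, n_speakers=2000, speaker_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(n_speakers, speaker_dim)  # one trainable vector per speaker
        self.proj = nn.Linear(speaker_dim, hidden_dim)
        self.core = nn.Linear(hidden_dim, hidden_dim)            # stand-in for a conv/recurrent block

    def forward(self, x, speaker_id):
        # x: (batch, time, hidden_dim); speaker_id: (batch,) integer IDs
        gate = torch.sigmoid(self.proj(self.embedding(speaker_id)))  # per-speaker gate
        return self.core(x) * gate.unsqueeze(1)                      # broadcast over time
```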
Discriminant Projection Representation-based Classification for Vision Recognition
Representation-based classification methods such as sparse
representation-based classification (SRC) and linear regression classification
(LRC) have attracted considerable attention. In order to obtain a better
representation, a novel method called projection representation-based
classification (PRC) is proposed for image recognition in this paper. PRC is
based on a new mathematical model, which states that the 'ideal projection' of
a sample point onto the hyper-space can be obtained by iteratively computing
the projection of the sample onto a line of the hyper-space with a proper
strategy. Therefore, PRC is able to iteratively approximate the
'ideal representation' of each subject for classification. Moreover, the
discriminant PRC (DPRC) is further proposed, which obtains the discriminant
information by maximizing the ratio of the between-class reconstruction error
over the within-class reconstruction error. Experimental results on five
typical databases show that the proposed PRC and DPRC are effective and
outperform other state-of-the-art methods on several vision recognition tasks.Comment: Accepted by the Thirty-Second AAAI Conference on Artificial
Intelligence (AAAI-18
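For background, classifiers in this family assign a test sample to the class whose training subspace reconstructs it with the smallest error. The NumPy sketch below uses a closed-form least-squares projection as a stand-in for that step; PRC instead approximates the 'ideal projection' through iterative projections onto lines, and DPRC adds the discriminant reconstruction-error criterion.

```python
# Hedged sketch of class-wise reconstruction-error classification (LRC-style);
# the iterative projection strategy of PRC is not reproduced here.
import numpy as np

def classify_by_reconstruction(x, class_dicts):
    """x: (d,) test vector; class_dicts: list of (d, n_c) matrices of training samples."""
    errors = []
    for D in class_dicts:
        coeffs, *_ = np.linalg.lstsq(D, x, rcond=None)    # project x onto span(D)
        errors.append(np.linalg.norm(x - D @ coeffs))     # class-wise reconstruction error
    return int(np.argmin(errors))                         # label of the best-reconstructing class
```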
Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
This paper describes a novel text-to-speech (TTS) technique based on deep
convolutional neural networks (CNN), without use of any recurrent units.
Recurrent neural networks (RNN) have recently become a standard technique for
modeling sequential data and have been used in some cutting-edge neural TTS
techniques. However, training RNN components often requires a very powerful
computer or a very long time, typically several days or weeks. Other recent
studies have shown that CNN-based sequence synthesis can be much faster than
RNN-based techniques because of its high parallelizability. The objective of
this paper is to show that an alternative neural TTS system based only on CNNs
can alleviate these economic costs of training. In our
experiment, the proposed Deep Convolutional TTS was sufficiently trained
overnight (15 hours), using an ordinary gaming PC equipped with two GPUs, while
the quality of the synthesized speech was almost acceptable.
Comment: 5 pages, 3 figures, IEEE ICASSP 2018
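The guided attention mentioned in the title penalizes attention weights that stray from the near-diagonal alignment between text and spectrogram positions, so that a monotonic alignment is learned quickly. The sketch below, assuming PyTorch, uses a 1 - exp(-(n/N - t/T)^2 / (2 g^2)) weighting; the tensor shapes and the width g are illustrative.

```python
# Hedged sketch of a guided attention loss over a soft alignment matrix.
import torch

def guided_attention_loss(attention, g=0.2):
    """attention: (batch, text_len, mel_len) soft alignment weights."""
    _, N, T = attention.shape
    n = torch.arange(N, dtype=torch.float32).unsqueeze(1) / N   # normalized text positions
    t = torch.arange(T, dtype=torch.float32).unsqueeze(0) / T   # normalized spectrogram positions
    weight = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g ** 2))  # small near the diagonal
    return (attention * weight.to(attention.device)).mean()
```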
WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
Tacotron-based text-to-speech (TTS) systems directly synthesize speech from
text input. Such frameworks typically consist of a feature prediction network
that maps character sequences to frequency-domain acoustic features, followed
by a waveform reconstruction algorithm or a neural vocoder that generates the
time-domain waveform from acoustic features. As the loss function is usually
calculated only on frequency-domain acoustic features, it does not directly
control the quality of the generated time-domain waveform. To address this
problem, we propose a new training scheme for Tacotron-based TTS, referred to
as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the
waveform loss, that measures the distortion between the natural and generated
waveforms; and 2) a frequency-domain loss that measures the Mel-scale acoustic
feature loss between the natural and generated acoustic features. WaveTTS
ensures both the quality of the acoustic features and that of the resulting
speech waveform. To the best of our knowledge, this is the first implementation
of Tacotron
with joint time-frequency domain loss. Experimental results show that the
proposed framework outperforms the baselines and achieves high-quality
synthesized speech.
Comment: To appear at Odyssey 2020, Tokyo, Japan
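A minimal sketch of such a joint time-frequency objective, assuming PyTorch and torchaudio: an L1 waveform-domain loss plus an L1 Mel-scale frequency-domain loss, combined with illustrative weights. The exact distance measures and weighting used by WaveTTS are not reproduced here.

```python
# Hedged sketch of a joint time-domain and frequency-domain training loss.
import torch
import torch.nn.functional as F
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_mels=80)

def joint_tf_loss(gen_wave, ref_wave, alpha=1.0, beta=1.0):
    """Waveform loss plus Mel-scale acoustic feature loss; weights are illustrative."""
    wave_loss = F.l1_loss(gen_wave, ref_wave)             # time-domain distortion
    mel_loss = F.l1_loss(mel(gen_wave), mel(ref_wave))    # frequency-domain distortion
    return alpha * wave_loss + beta * mel_loss
```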
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
Recent advances in neural network-based text-to-speech have reached human-level
naturalness in synthetic speech. Current sequence-to-sequence models
can directly map text to mel-spectrogram acoustic features, which are
convenient for modeling, but present additional challenges for vocoding (i.e.,
waveform generation from the acoustic features). High-quality synthesis can be
achieved with neural vocoders, such as WaveNet, but such autoregressive models
suffer from slow sequential inference. Meanwhile, their existing parallel
inference counterparts are difficult to train and require increasingly large
model sizes. In this paper, we propose an alternative training strategy for a
parallel neural vocoder utilizing generative adversarial networks, and
integrate a linear predictive synthesis filter into the model. Results show
that the proposed model achieves significant improvement in inference speed,
while outperforming a WaveNet in copy-synthesis quality.
Comment: Interspeech 2019 accepted version
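The linear-predictive synthesis step can be sketched as filtering a generated excitation signal with an all-pole filter 1/A(z). The NumPy/SciPy illustration below uses toy LP coefficients and white noise standing in for the GAN generator's excitation; in GELP the envelope is derived from the mel-spectrogram rather than hard-coded.

```python
# Hedged sketch of linear-predictive (LP) synthesis filtering of an excitation.
import numpy as np
from scipy.signal import lfilter

def lp_synthesis(excitation, lp_coeffs):
    """Filter the excitation with the all-pole synthesis filter 1 / A(z)."""
    a = np.concatenate(([1.0], lp_coeffs))   # A(z) = 1 + a1*z^-1 + ... + ap*z^-p
    return lfilter([1.0], a, excitation)

# Hypothetical usage: white-noise excitation in place of the GAN output.
excitation = np.random.randn(16000)
lp_coeffs = np.array([-1.3, 0.6])            # toy, stable 2nd-order coefficients
speech = lp_synthesis(excitation, lp_coeffs)
```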