Deep Griffin-Lim Iteration
This paper presents a novel method for reconstructing the phase only from a
given amplitude spectrogram by combining a signal-processing-based approach and
a deep neural network (DNN). To retrieve a time-domain signal from its amplitude
spectrogram, the corresponding phase is required. One of the popular phase
reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on
the redundancy of the short-time Fourier transform. However, GLA often involves
many iterations and produces low-quality signals owing to the lack of prior
knowledge of the target signal. In order to address these issues, in this
study, we propose an architecture which stacks a sub-block including two
GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is
adjustable, so that performance can be traded off against computational load
according to the requirements of the application. The effectiveness of the
proposed method is investigated by reconstructing phases from amplitude
spectrograms of speech.
Comment: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1,
Session: Source Separation and Speech Enhancement I)
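For context, the classic Griffin-Lim algorithm alternates between projecting a complex spectrogram onto the set of consistent STFTs (an iSTFT followed by an STFT) and restoring the known magnitude. The sketch below, assuming librosa and illustrative STFT parameters, shows that fixed iteration; the proposed architecture interleaves trainable DNN sub-blocks with such fixed GLA-inspired layers.

```python
# Minimal sketch of the plain Griffin-Lim algorithm (GLA), assuming librosa.
# Parameter values are illustrative; `magnitude` must have n_fft // 2 + 1 rows.
import numpy as np
import librosa

def griffin_lim(magnitude, n_iter=60, n_fft=1024, hop_length=256):
    """Estimate a phase for `magnitude` (freq bins x frames) by alternating projections."""
    # Start from a random phase.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    spectrogram = magnitude * angles
    for _ in range(n_iter):
        # Project onto the set of consistent spectrograms (iSTFT, then STFT) ...
        signal = librosa.istft(spectrogram, hop_length=hop_length)
        rebuilt = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
        # ... then restore the known magnitude, keeping only the updated phase.
        spectrogram = magnitude * np.exp(1j * np.angle(rebuilt))
    return librosa.istft(spectrogram, hop_length=hop_length)
```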
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
We present Deep Voice 3, a fully-convolutional attention-based neural
text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural
speech synthesis systems in naturalness while training ten times faster. We
scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more
than eight hundred hours of audio from over two thousand speakers. In addition,
we identify common error modes of attention-based speech synthesis networks,
demonstrate how to mitigate them, and compare several different waveform
synthesis methods. We also describe how to scale inference to ten million
queries per day on one single-GPU server.
Comment: Published as a conference paper at ICLR 2018. (v3 changed paper title)
Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq
We present OpenSeq2Seq - a TensorFlow-based toolkit for training
sequence-to-sequence models that features distributed and mixed-precision
training. Benchmarks on machine translation and speech recognition tasks show
that models built using OpenSeq2Seq give state-of-the-art performance in 1.5-3x
less training time. OpenSeq2Seq currently provides building blocks for models
that solve a wide range of tasks including neural machine translation,
automatic speech recognition, and speech synthesis.
Comment: Presented at Workshop for Natural Language Processing Open Source
Software (NLP-OSS), co-located with ACL 2018
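The core of mixed-precision training is computing in FP16 while keeping FP32 master weights and scaling the loss so that small gradients do not underflow. OpenSeq2Seq itself is TensorFlow-based; the sketch below only illustrates the same loss-scaling idea using PyTorch's AMP utilities, with a placeholder model and data.

```python
# Loss-scaled mixed-precision training loop, a hedged illustration in PyTorch
# (not the OpenSeq2Seq API). The model and data are placeholders.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()             # scale the loss before backprop
    scaler.step(optimizer)                    # unscale gradients, skip step on inf/NaN
    scaler.update()                           # adapt the scale factor
```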
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel
speaker-independent multi-speaker speech separation, where time-frequency (T-F)
masking, the short-time Fourier transform (STFT), and its inverse are
represented as layers within a deep network. Previous approaches, rather than
computing a loss on the reconstructed signal, used a surrogate loss based on
the target STFT magnitudes. This ignores reconstruction error introduced by
phase inconsistency. In our approach, the loss function is directly defined on
the reconstructed signals, which are optimized for best separation. In
addition, we train through unfolded iterations of a phase reconstruction
algorithm, represented as a series of STFT and inverse STFT layers. While mask
values are typically limited to lie between zero and one for approaches using
the mixture phase for reconstruction, this limitation is less relevant if the
estimated magnitudes are to be used together with phase reconstruction. We thus
propose several novel activation functions for the output layer of the T-F
masking, to allow mask values beyond one. On the publicly-available wsj0-2mix
dataset, our approach achieves state-of-the-art 12.6 dB scale-invariant
signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new
possibilities for deep learning based phase reconstruction and representing
fundamental progress towards solving the notoriously hard cocktail party
problem.
Comment: Submitted to Interspeech 2018
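A rough sketch of the two ingredients above, assuming PyTorch: a loss defined directly on the reconstructed time-domain signal (SI-SDR) and a few unfolded phase-reconstruction iterations expressed through STFT and iSTFT layers. Iteration count, STFT parameters, and the use of the mixture phase as initialization are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: negative SI-SDR as a time-domain loss, plus unfolded
# Griffin-Lim-style iterations built from differentiable STFT/iSTFT layers.
import torch

def si_sdr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR between two 1-D signals of equal length."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    scale = (estimate * target).sum() / (target.pow(2).sum() + eps)
    projection = scale * target
    noise = estimate - projection
    return -10.0 * torch.log10(projection.pow(2).sum() / (noise.pow(2).sum() + eps))

def unfolded_reconstruction(est_mag, mix_phase, n_fft=512, hop=128, n_iter=3):
    """Start from the mixture phase and run a few fixed GLA-style iterations."""
    window = torch.hann_window(n_fft)
    spec = torch.polar(est_mag, mix_phase)                 # estimated magnitude, mixture phase
    for _ in range(n_iter):
        wave = torch.istft(spec, n_fft, hop_length=hop, window=window)
        rebuilt = torch.stft(wave, n_fft, hop_length=hop, window=window,
                             return_complex=True)
        spec = torch.polar(est_mag, torch.angle(rebuilt))  # keep magnitude, update phase
    return torch.istft(spec, n_fft, hop_length=hop, window=window)
```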
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation
WaveCycleGAN has recently been proposed to bridge the gap between natural and
synthesized speech waveforms in statistical parametric speech synthesis; it
provides fast inference with a moving-average model rather than an
autoregressive model, and high-quality speech synthesis through adversarial
training. However, the human ear can still distinguish the processed speech
waveforms from natural ones. One possible cause of this distinguishability is
the aliasing observed in the processed speech waveform via down/up-sampling
modules. To address the aliasing and provide higher-quality speech synthesis, we
propose WaveCycleGAN2, which 1) uses generators without down/up-sampling
modules and 2) combines discriminators of the waveform domain and acoustic
parameter domain. The results show that the proposed method 1) alleviates the
aliasing well, 2) is useful for both speech waveforms generated by
analysis-and-synthesis and statistical parametric speech synthesis, and 3)
achieves a mean opinion score comparable to those of natural speech and speech
synthesized by WaveNet (open WaveNet) and WaveGlow while processing speech
samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.
Comment: Submitted to INTERSPEECH 2019
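To illustrate the dual-domain discrimination described above, the sketch below (assuming PyTorch and torchaudio) sums an adversarial generator loss from a waveform-domain discriminator and from an acoustic-feature discriminator operating on a mel-spectrogram. The placeholder discriminators and the least-squares GAN formulation are assumptions, not the paper's architectures or objective.

```python
# Hedged sketch of a generator loss combining waveform-domain and
# acoustic-feature-domain discriminators (a mel-spectrogram stands in
# for the paper's acoustic parameters).
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

# Tiny placeholder discriminators, not the paper's architectures.
disc_wave = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4), nn.LeakyReLU(),
                          nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1))
disc_acoustic = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2), nn.LeakyReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

def generator_adversarial_loss(fake_wave):
    """Least-squares GAN loss summed over the two domains; fake_wave: (batch, samples)."""
    score_wave = disc_wave(fake_wave.unsqueeze(1))               # judge the raw waveform
    score_acoustic = disc_acoustic(mel(fake_wave).unsqueeze(1))  # judge the mel features
    return (score_wave - 1.0).pow(2).mean() + (score_acoustic - 1.0).pow(2).mean()
```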
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
We introduce a technique for augmenting neural text-to-speech (TTS) with
low-dimensional trainable speaker embeddings to generate different voices from a
single model. As a starting point, we show improvements over the two
state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and
Tacotron. We introduce Deep Voice 2, which is based on a pipeline similar to
that of Deep Voice 1 but constructed with higher-performance building blocks,
and demonstrates a significant audio quality improvement over Deep Voice 1. We
improve Tacotron by introducing a post-processing neural vocoder, and
demonstrate a significant audio quality improvement. We then demonstrate our
technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron
on two multi-speaker TTS datasets. We show that a single neural TTS system can
learn hundreds of unique voices from less than half an hour of data per
speaker, while achieving high audio quality synthesis and preserving the
speaker identities almost perfectly.
Comment: Accepted in NIPS 2017
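The speaker-embedding technique can be pictured as conditioning hidden activations on a learned low-dimensional vector per speaker. The sketch below, assuming PyTorch, shows one hypothetical conditioning site with a sigmoid gate; the paper applies the embeddings at multiple, model-specific locations.

```python
# Hedged sketch of low-dimensional trainable speaker embeddings used to
# condition a hidden layer; the fusion point and gating are illustrative.
import torch
import torch.nn as nn

class SpeakerConditionedLayer(nn.Module):
    def __init__(self, hidden_dim=256, n_speakers=2000, speaker_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(n_speakers, speaker_dim)  # one trainable vector per speaker
        self.proj = nn.Linear(speaker_dim, hidden_dim)
        self.core = nn.Linear(hidden_dim, hidden_dim)            # stand-in for a conv/recurrent block

    def forward(self, x, speaker_id):
        # x: (batch, time, hidden_dim); speaker_id: (batch,) integer IDs
        gate = torch.sigmoid(self.proj(self.embedding(speaker_id)))  # per-speaker gate
        return self.core(x) * gate.unsqueeze(1)                      # broadcast over time
```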
Discriminant Projection Representation-based Classification for Vision Recognition
Representation-based classification methods such as sparse
representation-based classification (SRC) and linear regression classification
(LRC) have attracted considerable attention. In order to obtain a better
representation, a novel method called projection representation-based
classification (PRC) is proposed for image recognition in this paper. PRC is
based on a new mathematical model, which states that the 'ideal projection' of
a sample point onto the hyper-space can be obtained by iteratively computing
the projection of the sample onto a line of the hyper-space with a proper
strategy. Therefore, PRC is able to iteratively approximate the
'ideal representation' of each subject for classification. Moreover, the
discriminant PRC (DPRC) is further proposed, which obtains the discriminant
information by maximizing the ratio of the between-class reconstruction error
over the within-class reconstruction error. Experimental results on five
typical databases show that the proposed PRC and DPRC are effective and
outperform other state-of-the-art methods on several vision recognition tasks.Comment: Accepted by the Thirty-Second AAAI Conference on Artificial
Intelligence (AAAI-18
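For background, classifiers in this family assign a test sample to the class whose training subspace reconstructs it with the smallest error. The NumPy sketch below uses a closed-form least-squares projection as a stand-in for that step; PRC instead approximates the 'ideal projection' through iterative projections onto lines, and DPRC adds the discriminant reconstruction-error criterion.

```python
# Hedged sketch of class-wise reconstruction-error classification (LRC-style);
# the iterative projection strategy of PRC is not reproduced here.
import numpy as np

def classify_by_reconstruction(x, class_dicts):
    """x: (d,) test vector; class_dicts: list of (d, n_c) matrices of training samples."""
    errors = []
    for D in class_dicts:
        coeffs, *_ = np.linalg.lstsq(D, x, rcond=None)    # project x onto span(D)
        errors.append(np.linalg.norm(x - D @ coeffs))     # class-wise reconstruction error
    return int(np.argmin(errors))                         # label of the best-reconstructing class
```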
Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
This paper describes a novel text-to-speech (TTS) technique based on deep
convolutional neural networks (CNN), without use of any recurrent units.
Recurrent neural networks (RNN) have recently become a standard technique for
modeling sequential data and have been used in some cutting-edge neural TTS
techniques. However, training RNN components often requires a very powerful
computer or a very long time, typically several days or weeks. Other recent
studies have shown that CNN-based sequence synthesis can be much faster than
RNN-based techniques because of its high parallelizability. The objective of
this paper is to show that an alternative neural TTS system based only on CNNs
can alleviate these economic costs of training. In our
experiment, the proposed Deep Convolutional TTS was sufficiently trained
overnight (15 hours), using an ordinary gaming PC equipped with two GPUs, while
the quality of the synthesized speech was almost acceptable.
Comment: 5 pages, 3 figures, IEEE ICASSP 2018
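The guided attention mentioned in the title penalizes attention weights that stray from the near-diagonal alignment between text and spectrogram positions, so that a monotonic alignment is learned quickly. The sketch below, assuming PyTorch, uses a 1 - exp(-(n/N - t/T)^2 / (2 g^2)) weighting; the tensor shapes and the width g are illustrative.

```python
# Hedged sketch of a guided attention loss over a soft alignment matrix.
import torch

def guided_attention_loss(attention, g=0.2):
    """attention: (batch, text_len, mel_len) soft alignment weights."""
    _, N, T = attention.shape
    n = torch.arange(N, dtype=torch.float32).unsqueeze(1) / N   # normalized text positions
    t = torch.arange(T, dtype=torch.float32).unsqueeze(0) / T   # normalized spectrogram positions
    weight = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g ** 2))  # small near the diagonal
    return (attention * weight.to(attention.device)).mean()
```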
WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
Tacotron-based text-to-speech (TTS) systems directly synthesize speech from
text input. Such frameworks typically consist of a feature prediction network
that maps character sequences to frequency-domain acoustic features, followed
by a waveform reconstruction algorithm or a neural vocoder that generates the
time-domain waveform from acoustic features. As the loss function is usually
calculated only on frequency-domain acoustic features, it does not directly
control the quality of the generated time-domain waveform. To address this
problem, we propose a new training scheme for Tacotron-based TTS, referred to
as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the
waveform loss, that measures the distortion between the natural and generated
waveforms; and 2) a frequency-domain loss that measures the Mel-scale acoustic
feature loss between the natural and generated acoustic features. WaveTTS
ensures both the quality of the acoustic features and that of the resulting
speech waveform. To the best of our knowledge, this is the first implementation
of Tacotron
with joint time-frequency domain loss. Experimental results show that the
proposed framework outperforms the baselines and achieves high-quality
synthesized speech.
Comment: To appear at Odyssey 2020, Tokyo, Japan
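A minimal sketch of such a joint time-frequency objective, assuming PyTorch and torchaudio: an L1 waveform-domain loss plus an L1 Mel-scale frequency-domain loss, combined with illustrative weights. The exact distance measures and weighting used by WaveTTS are not reproduced here.

```python
# Hedged sketch of a joint time-domain and frequency-domain training loss.
import torch
import torch.nn.functional as F
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_mels=80)

def joint_tf_loss(gen_wave, ref_wave, alpha=1.0, beta=1.0):
    """Waveform loss plus Mel-scale acoustic feature loss; weights are illustrative."""
    wave_loss = F.l1_loss(gen_wave, ref_wave)             # time-domain distortion
    mel_loss = F.l1_loss(mel(gen_wave), mel(ref_wave))    # frequency-domain distortion
    return alpha * wave_loss + beta * mel_loss
```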
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
Recent advances in neural network-based text-to-speech have reached human-level
naturalness in synthetic speech. Current sequence-to-sequence models
can directly map text to mel-spectrogram acoustic features, which are
convenient for modeling, but present additional challenges for vocoding (i.e.,
waveform generation from the acoustic features). High-quality synthesis can be
achieved with neural vocoders, such as WaveNet, but such autoregressive models
suffer from slow sequential inference. Meanwhile, their existing parallel
inference counterparts are difficult to train and require increasingly large
model sizes. In this paper, we propose an alternative training strategy for a
parallel neural vocoder utilizing generative adversarial networks, and
integrate a linear predictive synthesis filter into the model. Results show
that the proposed model achieves significant improvement in inference speed,
while outperforming a WaveNet in copy-synthesis quality.
Comment: Interspeech 2019 accepted version
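The linear-predictive synthesis step can be sketched as filtering a generated excitation signal with an all-pole filter 1/A(z). The NumPy/SciPy illustration below uses toy LP coefficients and white noise standing in for the GAN generator's excitation; in GELP the envelope is derived from the mel-spectrogram rather than hard-coded.

```python
# Hedged sketch of linear-predictive (LP) synthesis filtering of an excitation.
import numpy as np
from scipy.signal import lfilter

def lp_synthesis(excitation, lp_coeffs):
    """Filter the excitation with the all-pole synthesis filter 1 / A(z)."""
    a = np.concatenate(([1.0], lp_coeffs))   # A(z) = 1 + a1*z^-1 + ... + ap*z^-p
    return lfilter([1.0], a, excitation)

# Hypothetical usage: white-noise excitation in place of the GAN output.
excitation = np.random.randn(16000)
lp_coeffs = np.array([-1.3, 0.6])            # toy, stable 2nd-order coefficients
speech = lp_synthesis(excitation, lp_coeffs)
```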