RawNet: Fast End-to-End Neural Vocoder
Neural network-based vocoders have recently demonstrated a powerful ability to synthesize high-quality speech. These models usually generate samples conditioned on spectral features, such as the Mel-spectrum. However, these features are extracted by a speech analysis module that relies on human-designed processing. In this work, we propose RawNet, a truly end-to-end neural vocoder, which uses a coder network to learn a higher-level representation of the signal and an autoregressive voder network to generate speech sample by sample. Together, the coder and voder act like an autoencoder and can be trained jointly, directly on the raw waveform, without any human-designed features. Experiments on copy-synthesis tasks show that RawNet achieves synthesized speech quality comparable to LPCNet, with a smaller model and faster speech generation at inference time.
Comment: Submitted to Interspeech 2019, Graz, Austria
Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system
Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to improve quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used as excitation model input differ from the original features on which the model was trained. Furthermore, due to errors in predicting the vocal tract filter, the original excitation waveforms do not provide perfect reconstruction of the speech waveform even if predicted without error. To address these issues and make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both modifications improve performance measured in MSE and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.
Peer reviewed
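The following is a minimal sketch of one connected training step as described above. The model classes, feature dimensions, and the glottal inverse filtering routine are hypothetical stand-ins; the paper uses DNN acoustic and excitation models and a real glottal inverse filtering method.

```python
# Sketch of the "connected" training scheme: the excitation model is
# trained on acoustic features *predicted* by the acoustic model, and the
# target glottal waveform is re-estimated with the predicted vocal tract
# filter. All models and sizes here are illustrative placeholders.
import torch
import torch.nn as nn

acoustic_model = nn.Linear(10, 20)     # linguistic -> acoustic features (stand-in)
excitation_model = nn.Linear(20, 400)  # acoustic features -> glottal frame (stand-in)

def inverse_filter(speech, acoustic):
    """Placeholder for glottal inverse filtering with the vocal tract
    filter encoded in `acoustic`; real code would filter `speech` by 1/V(z)."""
    return speech

optimizer = torch.optim.Adam(excitation_model.parameters())

linguistic = torch.randn(8, 10)        # batch of linguistic features
speech = torch.randn(8, 400)           # corresponding natural speech frames

# 1) Generate inputs with the acoustic model, as at synthesis time.
with torch.no_grad():
    pred_acoustic = acoustic_model(linguistic)

# 2) Re-estimate targets by inverse filtering natural speech with the
#    *predicted* vocal tract filter, compensating for acoustic-model errors.
target_glottal = inverse_filter(speech, pred_acoustic)

# 3) Train the excitation model on the matched input/target pair.
loss = nn.functional.mse_loss(excitation_model(pred_acoustic), target_glottal)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The key point is that both the inputs (step 1) and the targets (step 2) reflect the acoustic model's actual behavior, so the excitation model no longer sees a distribution at synthesis time that it never encountered during training.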