106 research outputs found

    Thousands of Voices for HMM-Based Speech Synthesis-Analysis and Application of TTS Systems Built on Various ASR Corpora

    In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high-quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data recorded under varying conditions and with varying microphones, data that are not perfectly clean, and/or data that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have built from several popular ASR corpora, such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, GlobalPhone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

    Analysis of Unsupervised and Noise-Robust Speaker-Adaptive HMM-Based Speech Synthesis Systems toward a Unified ASR and TTS Framework

    For the 2009 Blizzard Challenge we have built an unsupervised version of the HTS-2008 speaker-adaptive HMM-based speech synthesis system for English, and a noise-robust version of the system for Mandarin. They are designed from a multidisciplinary application point of view, in that we attempt to integrate the components of the TTS system with other technologies such as ASR. All the average voice models are trained exclusively on recognized, publicly available ASR databases. Multi-pass LVCSR and confidence scores calculated from confusion networks are used for the unsupervised system, and noisy data recorded in cars or public spaces are used for the noise-robust system. We believe the developed systems form solid benchmarks and provide good connections to the ASR field. This paper describes the development of the systems and reports the results and analysis of their evaluation.

    Speech Synthesis Based on Hidden Markov Models


    Roles of the Average Voice in Speaker-adaptive HMM-based Speech Synthesis

    In speaker-adaptive HMM-based speech synthesis, there are typically a few speakers for whom the output synthetic speech sounds worse than that of other speakers, despite their having the same amount of adaptation data from within the same corpus. This paper investigates these fluctuations in quality and concludes that as the mel-cepstral distance from the average voice becomes larger, the MOS naturalness scores generally become worse. Although this negative correlation is not especially strong, it suggests a way to improve the training and adaptation strategies. We also draw comparisons between our findings and the work of other researchers regarding "vocal attractiveness".
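    The analysis above rests on two quantities: a per-speaker mel-cepstral distance from the average voice, and its correlation with MOS naturalness scores. A minimal sketch of both computations follows; the function names and the toy numbers are illustrative only and do not come from the paper.

```python
import math

def mel_cepstral_distance(c_ref, c_test):
    """Average mel-cepstral distance (in dB) between two aligned cepstral
    sequences, given as equal-length lists of coefficient vectors.
    Uses the conventional 10*sqrt(2)/ln(10) scaling per frame."""
    scale = 10.0 * math.sqrt(2.0) / math.log(10.0)
    total = 0.0
    for ref, test in zip(c_ref, c_test):
        total += scale * math.sqrt(sum((r - t) ** 2 for r, t in zip(ref, test)))
    return total / len(c_ref)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists,
    e.g. per-speaker distances vs. per-speaker MOS scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

    With per-speaker distances and MOS scores in hand, a negative `pearson_r` value would reflect the reported trend: larger distance from the average voice, lower naturalness.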

    Template-based ASR using Posterior features and synthetic references: comparing different TTS systems

    In recent work, the use of phone class-conditional posterior probabilities (posterior features) directly as features has provided successful results in template-based ASR systems. In this paper, motivated by the high quality of current text-to-speech systems and the robustness of posterior features toward undesired variability, we investigate the use of synthetic speech to generate reference templates. The use of synthetic speech in template-based ASR not only allows us to address the issue of in-domain data collection but also the expansion of the vocabulary. On 75- and 600-word task-independent, speaker-independent setups of the Phonebook corpus, we show the feasibility of this approach by investigating different synthetic voices produced by an HTS-based synthesizer trained on two different databases. Our study shows that synthetic speech templates can yield performance comparable to natural speech templates, especially with synthetic voices that have high intelligibility.
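    The core mechanics of a template-based recognizer over posterior features can be sketched as follows: each reference word (here synthesized) is a sequence of phone-posterior vectors, a query is matched to each template by dynamic time warping, and the closest template's label wins. This is a generic sketch, not the paper's exact system; the frame distance (symmetrised KL divergence) is one common choice for posterior features, and all names here are hypothetical.

```python
import math

def kl_symmetric(p, q, eps=1e-10):
    """Symmetrised KL divergence between two posterior distributions."""
    def kl(a, b):
        return sum(ai * math.log((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

def dtw_distance(query, template, frame_dist):
    """Length-normalised dynamic time warping cost between two
    sequences of feature frames."""
    n, m = len(query), len(template)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(query[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def recognize(query, templates):
    """Return the label of the template closest to the query under DTW.
    `templates` maps word labels to posterior-feature sequences."""
    return min(templates, key=lambda w: dtw_distance(query, templates[w], kl_symmetric))
```

    Whether the templates come from natural recordings or from a TTS system, the matching machinery is unchanged, which is what makes swapping in synthetic references straightforward.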