31 research outputs found

Multisyn voices from ARCTIC data for the Blizzard challenge.

Author: Clark Robert A J
King Simon
Richmond Korin
Publication venue: 'International Speech Communication Association'
Publication date: 01/01/2005

This paper describes the process of building unit selection voices for the Festival Multisyn engine using four ARCTIC datasets, as part of the Blizzard evaluation challenge. The build process is almost entirely automatic, with very little need for human intervention. We discuss the difference in the evaluation results for each voice and evaluate the suitability of the ARCTIC datasets for building this type of voice

CiteSeerX

Edinburgh Research Archive

Edinburgh Research Explorer

Analysis of Unsupervised and Noise-Robust Speaker-Adaptive HMM-Based Speech Synthesis Systems toward a Unified ASR and TTS Framework

Author: Dines John
Gibson Matthew
Guan Yong
King Simon
Lincoln Mike
Tian Jilei
Yamagishi Junichi
Publication date: 01/01/2009

For the 2009 Blizzard Challenge we have built an unsupervised version of the HTS-2008 speaker-adaptive HMM-based speech synthesis system for English, and a noise-robust version of the system for Mandarin. They are designed from a multidisciplinary application point of view in that we attempt to integrate the components of the TTS system with other technologies such as ASR. All the average voice models are trained exclusively from recognized, publicly available ASR databases. Multi-pass LVCSR and confidence scores calculated from confusion networks are used for the unsupervised systems, and noisy data recorded in cars or public spaces is used for the noise-robust system. We believe the developed systems form solid benchmarks and provide good connections to ASR fields. This paper describes the development of the systems and reports the results and analysis of their evaluation.

CiteSeerX

Edinburgh Research Archive

Edinburgh Research Explorer

Festival multisyn voices for the 2007 blizzard challenge.

Author: Clark Robert A J
Fitt Susan
Richmond Korin
Strom Volker
Yamagishi Junichi
Publication date: 01/01/2007

This paper describes selected aspects of the Festival Multisyn entry to the Blizzard Challenge 2007. We provide an overview of the process of building the three required voices from the speech data provided. This paper focuses on new features of Multisyn which are currently under development and which have been employed in the system used for this Blizzard Challenge. These differences are the application of a more flexible phonetic lattice representation during forced alignment labelling and the use of a pitch accent target cost component. Finally, we also examine aspects of the speech data provided for this year's Blizzard Challenge and raise certain issues for discussion concerning the aim of comparing voices made with differing subsets of the data provided

CiteSeerX

Edinburgh Research Archive

Edinburgh Research Explorer

Multisyn Voices for the Blizzard Challenge 2006

Author: Clark R.
King S.
Richmond K.
Strom V.
Publication date: 01/09/2006

Edinburgh Research Explorer

The HTS-2008 System: Yet Another Evaluation of the Speaker-Adaptive HMM-based Speech Synthesis System in The 2008 Blizzard Challenge

Author: Toda Tomoki
Tokuda Keiichi
Wu Yi-Jian
Yamagishi Junichi
Zen Heiga
Publication date: 01/01/2008

For the 2008 Blizzard Challenge, we used the same speaker-adaptive approach to HMM-based speech synthesis that was used in the HTS entry to the 2007 challenge, but an improved system was built in which the multi-accented English average voice model was trained on 41 hours of speech data with high-order mel-cepstral analysis using an efficient forward-backward algorithm for the HSMM. The listener evaluation scores for the synthetic speech generated from this system were much better than in 2007: the system had the equal best naturalness on the small English data set and the equal best intelligibility on both small and large data sets for English, and had the equal best naturalness on the Mandarin data. In fact, the English system was found to be as intelligible as human speech.

NAIST Academic Repository

CiteSeerX

Edinburgh Research Archive

Edinburgh Research Explorer

Robust Speaker-Adaptive HMM-based Text-to-Speech Synthesis

Author: Heiga Zen
Junichi Yamagishi
Keiichi Tokuda
Simon King
Steve Renals
Takashi Nose
Tomoki Toda
Zhen-hua Ling
Publication date: 01/01/2009

This paper describes a speaker-adaptive HMM-based speech synthesis system. The new system, called "HTS-2007," employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available. In addition, a comparison study with several speech synthesis techniques shows the new system is very robust: It is able to build voices from less-than-ideal speech data and synthesize good-quality speech even for out-of-domain sentences.

CiteSeerX

Edinburgh Research Archive

Edinburgh Research Explorer

Simple methods for improving speaker-similarity of HMM-based speech synthesis

Author: King Simon
Yamagishi Junichi
Publication date: 01/01/2010

In this paper we revisit some basic configuration choices of HMM-based speech synthesis, such as waveform sampling rate, auditory frequency warping scale and the logarithmic scaling of F0, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. All of the techniques investigated are simple but, as we demonstrate using perceptual tests, can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling rates can offer enhanced feature extraction and improved speaker similarity for speech synthesis. In addition, a generalized logarithmic transform of F0 results in larger intra-utterance variance of F0 trajectories and hence more dynamic and natural-sounding prosody.

CiteSeerX

Crossref

Edinburgh Research Archive

Edinburgh Research Explorer

Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis

Author: Yamagishi J.
Nose T.
Zen H.
Ling Z. H.
Toda T.
Tokuda K.
King S.
Renals S.
Publication date: 01/08/2009

We present an algorithm for solving the radiative transfer problem on massively parallel computers using adaptive mesh refinement and domain decomposition. The solver is based on the method of characteristics which requires an adaptive raytracer that integrates the equation of radiative transfer. The radiation field is split into local and global components which are handled separately to overcome the non-locality problem. The solver is implemented in the framework of the magneto-hydrodynamics code FLASH and is coupled by an operator splitting step. The goal is the study of radiation in the context of star formation simulations with a focus on early disc formation and evolution. This requires a proper treatment of radiation physics that covers both the optically thin as well as the optically thick regimes and the transition region in particular. We successfully show the accuracy and feasibility of our method in a series of standard radiative transfer problems and two 3D collapse simulations resembling the early stages of protostar and disc formation.

Elsevier - Publisher Connector

Edinburgh Research Explorer

MPG.PuRe