Robust Speaker-Adaptive HMM-based Text-to-Speech Synthesis
This paper describes a speaker-adaptive HMM-based speech synthesis system. The new system, called "HTS-2007," employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available. In addition, a comparison study with several speech synthesis techniques shows the new system is very robust: it is able to build voices from less-than-ideal speech data and synthesize good-quality speech even for out-of-domain sentences.
Analysis of Speaker Adaptation Algorithms for HMM-based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm
In this paper we analyze the effects of several factors and configuration choices encountered during training and model construction when we want to obtain better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR) whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here we investigate six major aspects of speaker adaptation: initial models, transform functions, estimation criteria, and sensitivity of several linear regression adaptation algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of the gender-dependent models to single use of the gender-dependent models. Analyzing the effect of the transform functions, we compare the transform function for only mean vectors with that for mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and take up methods combining MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.
The HTS-2008 System: Yet Another Evaluation of the Speaker-Adaptive HMM-based Speech Synthesis System in The 2008 Blizzard Challenge
For the 2008 Blizzard Challenge, we used the same speaker-adaptive approach to HMM-based speech synthesis that was used in the HTS entry to the 2007 challenge, but an improved system was built in which the multi-accented English average voice model was trained on 41 hours of speech data with high-order mel-cepstral analysis using an efficient forward-backward algorithm for the HSMM. The listener evaluation scores for the synthetic speech generated from this system were much better than in 2007: the system had the equal best naturalness on the small English data set and the equal best intelligibility on both small and large data sets for English, and had the equal best naturalness on the Mandarin data. In fact, the English system was found to be as intelligible as human speech.
Performance Evaluation of The Speaker-Independent HMM-based Speech Synthesis System "HTS-2007" for the Blizzard Challenge 2007
This paper describes a speaker-independent/adaptive HMM-based speech synthesis system developed for the Blizzard Challenge 2007. The new system, named HTS-2007, employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than that of speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available.
Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis
We present an algorithm for solving the radiative transfer problem on massively parallel computers using adaptive mesh refinement and domain decomposition. The solver is based on the method of characteristics which requires an adaptive raytracer that integrates the equation of radiative transfer. The radiation field is split into local and global components which are handled separately to overcome the non-locality problem. The solver is implemented in the framework of the magneto-hydrodynamics code FLASH and is coupled by an operator splitting step. The goal is the study of radiation in the context of star formation simulations with a focus on early disc formation and evolution. This requires a proper treatment of radiation physics that covers both the optically thin as well as the optically thick regimes and the transition region in particular. We successfully show the accuracy and feasibility of our method in a series of standard radiative transfer problems and two 3D collapse simulations resembling the early stages of protostar and disc formation.
Improved Average-Voice-based Speech Synthesis Using Gender-Mixed Modeling and a Parameter Generation Algorithm Considering GV
To construct a speech synthesis system that can achieve diverse voices, we have been developing a speaker-independent approach to HMM-based speech synthesis in which statistical average voice models are adapted to a target speaker using a small amount of speech data. In this paper, we incorporate a high-quality speech vocoding method, STRAIGHT, and a parameter generation algorithm with global variance into the system to improve the quality of synthetic speech. Furthermore, we introduce a feature-space speaker adaptive training algorithm and a gender-mixed modeling technique for conducting further normalization of the average voice model. We build an English text-to-speech system using these techniques and show the performance of the system.
Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech
In this paper, we evaluate the vulnerability of a speaker verification (SV) system to synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both SV and speech synthesis have renewed interest in this problem. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and a GMM-UBM-based SV system. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV system has a 0.4% EER. When the system is tested with synthetic speech generated from speaker models derived from the WSJ corpus, 90% of the matched claims are accepted. This result suggests a possible vulnerability in SV systems to synthetic speech. In order to detect synthetic speech prior to recognition, we investigate the use of an automatic speech recognizer (ASR), dynamic time warping (DTW) distance of mel-frequency cepstral coefficients (MFCC), and the previously proposed average inter-frame difference of log-likelihood (IFDLL). Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech can lead to an unacceptably high acceptance rate of synthetic speakers.
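As a rough illustration of the two figures quoted in this abstract (the equal error rate and the acceptance rate of synthetic speech), the sketch below shows how such numbers can be computed from verification scores. It is not the authors' evaluation code: the function names `equal_error_rate` and `acceptance_rate`, the convention that a higher score means "accept", and the toy scores are all assumptions made for this example.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumes a claim is accepted when score >= threshold.
    Returns (eer, threshold) at the point where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # true speakers rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def acceptance_rate(scores, threshold):
    """Fraction of trials (e.g. synthetic-speech claims) accepted at a threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy scores, invented for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
synthetic = [0.7, 0.65, 0.5]

eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
print(acceptance_rate(synthetic, threshold))
```

Fixing the threshold at the EER operating point and then scoring the synthetic trials separately mirrors the kind of comparison the abstract reports, where a system with a low EER on human speech can still accept a large share of synthetic claims.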
Speaker-Independent HMM-based Speech Synthesis System
This paper describes an HMM-based speech synthesis system developed by the HTS working group for the Blizzard Challenge 2007. To further explore the potential of HMM-based speech synthesis, we incorporate new features in our conventional system which underpin a speaker-independent approach: speaker adaptation techniques; adaptive training for HSMMs; and full covariance modeling using the CSMAPLR transforms
- …