194 research outputs found
Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism
This paper proposes a novel sequence-to-sequence (seq2seq) model with a
musical note position-aware attention mechanism for singing voice synthesis
(SVS). A seq2seq modeling approach that can simultaneously perform acoustic and
temporal modeling is attractive. However, due to the difficulty of the temporal
modeling of singing voices, many recent SVS systems with an
encoder-decoder-based model still rely explicitly on duration information
generated by additional modules. Although some studies perform simultaneous
modeling using seq2seq models with an attention mechanism, they are
insufficiently robust in temporal modeling. The proposed attention
mechanism is designed to estimate the attention weights by considering the
rhythm given by the musical score. Furthermore, several techniques are also
introduced to improve the modeling performance of the singing voice.
Experimental results indicated that the proposed model is effective in terms of
both naturalness and robustness of timing.
Comment: 5 pages, 4 figures, 2 tables, submitted to ICASSP 202
Performance Evaluation of The Speaker-Independent HMM-based Speech Synthesis System "HTS-2007" for the Blizzard Challenge 2007
This paper describes a speaker-independent/adaptive HMM-based speech synthesis system developed for the Blizzard Challenge 2007. The new system, named HTS-2007, employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than that of speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available.
Unsupervised Cross-lingual Speaker Adaptation for HMM-based Speech Synthesis
In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer and cross-lingual speaker adaptation for HMM-based TTS, into a single architecture. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small.
The HTS-2008 System: Yet Another Evaluation of the Speaker-Adaptive HMM-based Speech Synthesis System in The 2008 Blizzard Challenge
For the 2008 Blizzard Challenge, we used the same speaker-adaptive approach to HMM-based speech synthesis that was used in the HTS entry to the 2007 challenge, but an improved system was built in which the multi-accented English average voice model was trained on 41 hours of speech data with high-order mel-cepstral analysis using an efficient forward-backward algorithm for the HSMM. The listener evaluation scores for the synthetic speech generated from this system were much better than in 2007: the system had the equal best naturalness on the small English data set and the equal best intelligibility on both small and large data sets for English, and had the equal best naturalness on the Mandarin data. In fact, the English system was found to be as intelligible as human speech.
- …