
    Improving TTS with corpus-specific pronunciation adaptation

    Text-to-speech (TTS) systems are built on speech corpora whose phoneme labels are carefully checked and segmented. However, the phoneme sequences generated by automatic grapheme-to-phoneme converters during synthesis are often inconsistent with those of the corpus, which degrades the quality of the synthetic speech. To solve this problem, the present work adapts automatically generated pronunciations to the corpus. The main idea is to train corpus-specific phoneme-to-phoneme conditional random fields (CRFs) with a large set of linguistic, phonological, articulatory and acoustic-prosodic features. Features are first selected under cross-validation, then combined to produce the final feature set. Pronunciation models are evaluated in terms of phoneme error rate and through perceptual tests. Experiments carried out on a French speech corpus show an improvement in the quality of speech synthesis when pronunciation models are included in the phonetization process. Apart from improving TTS quality, the presented pronunciation adaptation method also opens interesting perspectives for expressive speech synthesis.
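    The corpus-specific phoneme-to-phoneme CRF can be viewed as a sequence-labelling problem: each canonical phoneme produced by the grapheme-to-phoneme converter, together with contextual features, is mapped to the phoneme actually realized in the corpus. Below is a minimal sketch of that idea using sklearn-crfsuite; the feature names and toy data are illustrative, not the paper's actual feature set, and real data would first need one-to-one alignment (e.g. with insertion/deletion symbols).

```python
# Minimal sketch of a phoneme-to-phoneme CRF for pronunciation adaptation.
# Assumes sklearn-crfsuite; features and toy data are illustrative only.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def phoneme_features(seq, i):
    """Features for position i of a canonical phoneme sequence."""
    return {
        "phone": seq[i],
        "prev": seq[i - 1] if i > 0 else "<s>",
        "next": seq[i + 1] if i < len(seq) - 1 else "</s>",
        "position": i / len(seq),  # relative position in the word
    }

# Toy parallel data: canonical (G2P) phonemes vs. phonemes realized in the corpus.
canonical = [["b", "o~", "Z", "u", "R"], ["Z", "@", "s", "H", "i"]]
realized  = [["b", "o~", "Z", "u", "R"], ["Z", "Z", "s", "H", "i"]]  # hypothetical

X = [[phoneme_features(seq, i) for i in range(len(seq))] for seq in canonical]
y = realized

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

pred = crf.predict(X)
print("phoneme accuracy:", metrics.flat_accuracy_score(y, pred))
```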

    AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

    Speaker adaptation in text-to-speech synthesis (TTS) fine-tunes a pre-trained TTS model to new target speakers with limited data. While much effort has been devoted to this task, little work has addressed low computational resource scenarios, owing to the challenges of building a lightweight model with low computational complexity. In this paper, a tiny VITS-based TTS model for low-computing-resource speaker adaptation, named AdaVITS, is proposed. To effectively reduce the parameters and computational complexity of VITS, an iSTFT-based waveform construction decoder is proposed to replace the upsampling-based decoder, which is resource-consuming in the original VITS. Besides, NanoFlow is introduced to share the density estimate across flow blocks and thus reduce the parameters of the prior encoder. Furthermore, to reduce the computational complexity of the textual encoder, scaled dot-product attention is replaced with linear attention. To deal with the instability caused by the simplified model, instead of using the original text encoder, a phonetic posteriorgram (PPG) is used as the linguistic feature via a text-to-PPG module and then fed to the encoder. Experiments show that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFLOPs of computational complexity.
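    The attention substitution mentioned above can be illustrated with a generic kernel-feature-map linear attention (in the style of Katharopoulos et al.), whose cost grows linearly rather than quadratically in sequence length. This is a PyTorch sketch of that general technique, not AdaVITS's actual module, and the tensor sizes are arbitrary.

```python
# Generic linear attention sketch in PyTorch: softmax attention replaced by a
# positive kernel feature map, giving O(N) complexity in sequence length.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, heads, seq_len, dim)."""
    q = F.elu(q) + 1.0  # feature map phi(x) = elu(x) + 1, always positive
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)               # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# Toy usage: batch of 2, 4 heads, 100 text tokens, 64-dim heads.
q = torch.randn(2, 4, 100, 64)
k = torch.randn(2, 4, 100, 64)
v = torch.randn(2, 4, 100, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 4, 100, 64])
```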

    Introducing nativization to Spanish TTS systems

    In the modern world, speech technologies must be flexible and adaptable to any framework. Mass-media globalization makes multilingualism a challenge for the most popular speech applications, such as text-to-speech synthesis and automatic speech recognition. Mixed-language texts vary in their nature, and some essential characteristics must be considered when processing them. In Spain and other Spanish-speaking countries, the use of Anglicisms and other words of foreign origin is constantly growing. A particularity of peninsular Spanish is the tendency to nativize the pronunciation of non-Spanish words so that they fit Spanish phonetic patterns. In our previous work, we proposed hand-crafted nativization tables that correctly nativized 24% of the words in the test data. In this work, our goal was to approach the nativization challenge with data-driven methods, because they are transferable to other languages and do not lose performance compared with explicit rules written manually by experts. The training and test corpora for nativization consisted of 1000 and 100 manually crafted words, respectively. Different specifications of nativization by analogy and learning from errors focused on finding the best nativized pronunciation of foreign words. The best objective results showed an improvement from 24% to 64% word accuracy over our previous work. Furthermore, a subjective evaluation of the synthesized speech showed that nativization by analogy is clearly the method preferred by listeners of different backgrounds when compared with previously proposed methods. These results are encouraging and show that even a small training corpus is sufficient to achieve significant improvements in naturalness for English inclusions of variable length in Spanish utterances.
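    A data-driven nativization step of this kind can be sketched as learning simple substitution behaviour from a small corpus of foreign words paired with their nativized pronunciations, with an exact-lookup table as the first resort. The sketch below is a deliberately naive illustration of that table-plus-learned-fallback idea; the data, function names and phoneme strings are hypothetical and do not reproduce the authors' analogy or learning-from-errors systems.

```python
# Naive sketch: exact nativization lookup plus a fallback rule mined from a
# toy training corpus. All data and names here are illustrative only.
from collections import Counter

# Toy parallel corpus: foreign spelling -> nativized, Spanish-like phoneme string.
train = {
    "google":   "g u g l e",
    "facebook": "f e j s b u k",
    "show":     "tS o u",
}

def mine_onset_rules(corpus):
    """Map the first letter of each training word to the first phoneme it
    received, keeping the most frequent choice per letter."""
    votes = {}
    for word, phones in corpus.items():
        votes.setdefault(word[0], Counter())[phones.split()[0]] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in votes.items()}

ONSET_RULES = mine_onset_rules(train)

def nativize(word):
    """Exact table lookup first; otherwise fall back to the mined onset rule."""
    if word in train:
        return train[word]
    onset = ONSET_RULES.get(word[0], word[0])
    return onset + " ..."  # remainder would go to a real analogy/G2P model

print(nativize("facebook"))  # exact match from the toy table
print(nativize("gmail"))     # falls back to the mined onset rule
```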