Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Polyphone disambiguation aims to capture accurate pronunciation knowledge
from natural text sequences for reliable text-to-speech (TTS) systems. However,
previous approaches require substantial annotated training data and additional
efforts from language experts, making it difficult to extend high-quality
neural TTS systems to out-of-domain daily conversations and countless languages
worldwide. This paper tackles the polyphone disambiguation problem from a
concise and novel perspective: we propose Dict-TTS, a semantic-aware generative
text-to-speech model that uses an online dictionary (prior
information that already exists in natural language). Specifically, we design a
semantics-to-pronunciation attention (S2PA) module to match the semantic
patterns between the input text sequence and the prior semantics in the
dictionary and obtain the corresponding pronunciations. The S2PA module can be
trained jointly with the end-to-end TTS model without any annotated phoneme
labels. Experimental results in three languages show that our model outperforms
several strong baseline models in terms of pronunciation accuracy and improves
the prosody modeling of TTS systems. Further extensive analyses demonstrate
that each design in Dict-TTS is effective. The code is available at
\url{https://github.com/Zain-Jiang/Dict-TTS}.
Comment: Accepted by NeurIPS 202
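The S2PA idea described in the abstract can be caricatured as attention over a polyphone's dictionary senses. Below is a minimal sketch of that intuition, not the paper's implementation; the sense embeddings, context vector, and pinyin-style labels are all invented for illustration:

```python
import numpy as np

def s2pa_sketch(context_vec, sense_vecs, sense_prons):
    """Toy semantics-to-pronunciation attention (illustrative only).

    Scores each dictionary sense of a polyphonic character against the
    sentence context and returns the pronunciation of the best-matching
    sense, plus the soft attention weights over senses.
    """
    scores = sense_vecs @ context_vec            # dot-product similarity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over senses
    return sense_prons[int(np.argmax(weights))], weights

# Hypothetical dictionary entry for a polyphone with two senses.
senses = np.array([[1.0, 0.0], [0.0, 1.0]])     # sense gloss embeddings
prons = ["hang2", "xing2"]                      # candidate pronunciations
context = np.array([0.1, 0.9])                  # sentence context embedding

pron, w = s2pa_sketch(context, senses, prons)
print(pron)  # the sense closest to the context wins
```

The point of the sketch is only that pronunciation selection falls out of semantic matching, so no phoneme labels are needed to supervise it.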
Good Neighbors Are All You Need for Chinese Grapheme-to-Phoneme Conversion
Most Chinese Grapheme-to-Phoneme (G2P) systems employ a three-stage framework
that first transforms input sequences into character embeddings, obtains
linguistic information using language models, and then predicts the phonemes
based on global context about the entire input sequence. However, linguistic
knowledge alone is often inadequate: language models tend to encode overly
general sentence structures and miss the specific cases where phonetic
knowledge is required. In addition, a handcrafted post-processing system is
needed to handle tone-related problems for individual characters, and this
system segments word boundaries inconsistently, which in turn degrades the
performance of the G2P system. To address these
issues, we propose the Reinforcer that provides strong inductive bias for
language models by emphasizing the phonological information between neighboring
characters to help disambiguate pronunciations. Experimental results show that
the Reinforcer boosts the cutting-edge architectures by a large margin. We also
combine the Reinforcer with a large-scale pre-trained model and demonstrate the
validity of using neighboring context in knowledge transfer scenarios.
Comment: Accepted to ICASSP 202
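The Reinforcer itself is a trained module, but the underlying inductive bias — emphasizing phonological information shared between neighboring characters — can be approximated by augmenting each character's representation with those of its neighbors. A toy sketch of that idea (not the authors' method; shapes and padding scheme are assumptions):

```python
import numpy as np

def with_neighbors(char_embs, window=1):
    """Concatenate each character embedding with its neighbours.

    A stand-in for biasing a model toward neighbouring-character
    context; sentence boundaries are zero-padded.
    """
    n, d = char_embs.shape
    pad = np.zeros((window, d))
    padded = np.vstack([pad, char_embs, pad])
    # One slice per offset in the window, stacked along the feature axis.
    return np.hstack([padded[i:i + n] for i in range(2 * window + 1)])

embs = np.arange(12, dtype=float).reshape(4, 3)  # 4 characters, dim 3
feats = with_neighbors(embs)
print(feats.shape)  # (4, 9): left neighbour | self | right neighbour
```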
Multi-Module G2P Converter for Persian Focusing on Relations between Words
In this paper, we investigate the application of end-to-end and multi-module
frameworks for G2P conversion for the Persian language. The results demonstrate
that our proposed multi-module G2P system outperforms our end-to-end systems in
terms of both accuracy and speed. The system consists of a pronunciation dictionary
used as a look-up table, together with separate GRU- and Transformer-based models
that handle homographs, out-of-vocabulary (OOV) words, and ezafe in Persian. The system is
sequence-level rather than word-level, which allows it to effectively capture
the unwritten relations between words (cross-word information) necessary for
homograph disambiguation and ezafe recognition without the need for any
pre-processing. After evaluation, our system achieved a 94.48% word-level
accuracy, outperforming previous G2P systems for Persian.
Comment: 10 pages, 4 figures
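The control flow of such a multi-module system can be sketched as a simple dispatcher: the lexicon answers unambiguous words, and homographs and OOV words are routed to dedicated models. This is an illustrative skeleton only; the lexicon entries, romanization, and the `pick_first`/`spell_out` stand-ins are hypothetical, not the paper's components:

```python
def persian_g2p_sketch(words, lexicon, homograph_model, oov_model):
    """Toy dispatcher for a multi-module G2P pipeline (illustrative only)."""
    phones = []
    for i, w in enumerate(words):
        entries = lexicon.get(w, [])
        if len(entries) == 1:
            phones.append(entries[0])            # unambiguous lexicon hit
        elif len(entries) > 1:
            # Homographs get sentence-level context for disambiguation.
            phones.append(homograph_model(words, i, entries))
        else:
            phones.append(oov_model(w))          # unknown word
    return phones

lexicon = {"ketab": ["k e t A b"], "mard": ["m a r d", "m o r d"]}
pick_first = lambda ws, i, es: es[0]   # stand-in homograph disambiguator
spell_out = lambda w: " ".join(w)      # stand-in OOV model
phones = persian_g2p_sketch(["ketab", "mard", "zzz"], lexicon, pick_first, spell_out)
print(phones)
```

Operating at the sequence level, as the abstract describes, corresponds to the homograph model seeing the whole `words` list rather than a single token.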
Pronunciation modelling in end-to-end text-to-speech synthesis
Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve
high-quality naturalness scores without extensive processing of text-input. Since S2S
models have been proposed for multiple stages of the TTS pipeline, the field has moved
toward collapsing the pipeline into End-to-End (E2E) TTS, where a waveform
is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS
in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation
(lexicon-lookup and/or G2P modelling) could be implicitly learnt in a text-encoder
during training. The benefits of a learned text encoding include improved modelling
of phonetic context, which makes the contextual linguistic features traditionally used in
TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar
naturalness scores with text- or phone-input (e.g. as in [4]). Successful modelling
of phonetic context has led some to question the benefit of using phone- instead of
text-input altogether (see [5]).
The use of text-input brings into question the value of the pronunciation lexicon
in E2E-TTS. Without phone-input, an S2S encoder learns an implicit grapheme-to-phoneme
(G2P) model from text-audio pairs during training. With common datasets
for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates
compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation
is difficult for some words (e.g. foreign words and proper names) since
the knowledge to disambiguate their pronunciations may not be provided by the local
grapheme context and may require knowledge beyond that contained in sentence-level
text-audio sequences. When test stimuli were selected according to G2P difficulty,
increased mispronunciations in E2E-TTS with text-input were observed. Following
the proposed benefits of subword decomposition in S2S modelling in other language
tasks (e.g. neural machine translation), the effects of morphological decomposition
were investigated on pronunciation modelling. Learning of the French post-lexical
phenomenon liaison was also evaluated.
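The comparison described above — an implicit G2P model measured against a lexicon-based one — reduces, at the word level, to checking predicted pronunciations against reference lexicon entries. A minimal sketch of that metric, with invented ARPAbet-style entries (not the thesis's data or tooling):

```python
def g2p_word_error_rate(predictions, lexicon):
    """Toy word-level G2P error rate against a reference lexicon.

    A predicted pronunciation counts as wrong unless it matches one of
    the lexicon's reference pronunciations for that word.
    """
    errors = sum(
        1 for word, pron in predictions.items()
        if pron not in lexicon.get(word, [])
    )
    return errors / len(predictions)

# "colonel" is exactly the kind of word where local grapheme context
# is insufficient, so an implicit G2P model plausibly misses it.
lexicon = {"read": ["r iy d", "r eh d"], "colonel": ["k er n ah l"]}
predicted = {"read": "r iy d", "colonel": "k ow l ow n eh l"}
wer = g2p_word_error_rate(predicted, lexicon)
print(wer)  # 0.5
```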
With the goal of an inexpensive, large-scale evaluation of pronunciation modelling,
the reliability of automatic speech recognition (ASR) to measure TTS intelligibility
was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge
was conducted. ASR reliably found similar significant differences between systems
as paid listeners in controlled conditions in English. An analysis of transcriptions for
words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR
Transformer model used was found to be unreliable in its transcription of difficult G2P
relations due to homophonic transcription and incorrect transcription of words with
difficult G2P relations. A further evaluation of representation mixing in Tacotron finds
pronunciation correction is possible when mixing text- and phone-inputs. The thesis
concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a
pronunciation guide since it can provide assurances that G2P generalisation cannot
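The ASR-based intelligibility testing discussed above rests on the standard word error rate: the number of edits needed to turn the ASR transcript into the reference text, normalised by reference length. A self-contained implementation of that metric (the example sentence is invented, and deliberately contains the homophone confusions the thesis flags as an ASR weakness):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn hyp[:j] into ref[:i].
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# Two substitutions out of five reference words: homophone-style errors
# ("read"/"red") that an ASR-based score cannot tell apart from real
# mispronunciations.
wer = word_error_rate("the colonel read the book", "the kernel red the book")
print(wer)  # 0.4
```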
Label Imputation for Homograph Disambiguation: Theoretical and Practical Approaches
This dissertation presents the first implementation of label imputation for the task of homograph disambiguation using 1) transcribed audio, and 2) parallel, or translated, corpora. For label imputation from parallel corpora, a hypothesis of interlingual alignment between homograph pronunciations and text word forms is developed and formalized. Both the audio-based and parallel corpus-based label imputation techniques are tested empirically in experiments that compare homograph disambiguation model performance using: 1) hand-labeled training data, and 2) hand-labeled training data augmented with label-imputed data.

Regularized multinomial logistic regression and pre-trained ALBERT, BERT, and XLNet language models fine-tuned as token classifiers are developed for homograph disambiguation. Model performance after training on parallel corpus-based, label-imputed augmented data shows improvement over training on hand-labeled data alone in classes with low-prevalence samples.

Four homograph disambiguation data sets generated during the work on the dissertation are made available to the research community. In addition, this dissertation offers a novel typology of homographs with practical implications for both the label imputation process and homograph disambiguation.
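The interlingual-alignment hypothesis can be caricatured very simply: if the translation of a sentence uses a target-language word that is unambiguous for one pronunciation of the homograph, that pronunciation label can be imputed to the source sentence. The sketch below is an illustration of that idea only, not the dissertation's formalization; the cue-word mapping, Spanish translations, and ARPAbet-style labels are all invented:

```python
def impute_labels(sentences, translations, sense_cues):
    """Toy parallel-corpus label imputation for one homograph.

    `sense_cues` maps each pronunciation label to cue words in the
    other language (hypothetical). Sentences whose translation contains
    no cue stay unlabeled (None).
    """
    labels = []
    for sent, trans in zip(sentences, translations):
        label = None
        for pron, cues in sense_cues.items():
            if any(c in trans.split() for c in cues):
                label = pron
                break
        labels.append(label)
    return labels

# English homograph "bass": its Spanish translations disambiguate it.
cues = {"b ae s": ["lubina"], "b ey s": ["bajo"]}
sents = ["he caught a bass", "she plays bass"]
trans = ["pescó una lubina", "ella toca el bajo"]
labels = impute_labels(sents, trans, cues)
print(labels)  # ['b ae s', 'b ey s']
```

Imputed labels produced this way would then augment the hand-labeled training data, which is the comparison the experiments above describe.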