    Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech

    Grapheme-to-Phoneme (G2P) conversion is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. First, the lexicons are built on a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Second, producing such an expert lexicon requires a very large number of man-hours. In this paper, we address both issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that use a well-crafted lexicon. Furthermore, we show that our data-driven, lexicon-free method performs as well as, or even marginally better than, conventional rule-based or lexicon-based neural G2P systems in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise. Comment: Accepted at ICASSP 202

    Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction

    The Text-to-Text Transfer Transformer (T5) has recently been considered for Grapheme-to-Phoneme (G2P) transduction. As a follow-up, ByT5, a tokenizer-free byte-level model based on T5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or paragraph-level G2P can improve usability in real-world applications, since it is better suited to handling heteronyms and linking sounds between words, we find that using ByT5 for these scenarios is nontrivial. Because ByT5 operates at the character level, it requires more decoding steps, which degrades performance due to the exposure bias commonly observed in auto-regressive generation models. This paper shows that the performance of sentence-level and paragraph-level G2P can be improved by mitigating such exposure bias using our proposed loss-based sampling method. Comment: INTERSPEECH 202
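    To make the sequence-length point concrete, the following minimal sketch shows how a byte-level input representation grows when moving from a word to a sentence. It uses only Python's built-in UTF-8 encoding; the helper name `byte_tokens` is hypothetical, and the sketch ignores any special tokens or ID offsets that an actual ByT5 tokenizer may add.

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text`, one integer per byte.

    Hypothetical helper for illustration only; it is not the ByT5 tokenizer.
    """
    return list(text.encode("utf-8"))


word = "read"
sentence = "I read the book yesterday."
accented = "naïve"  # 'ï' is outside ASCII and takes two bytes in UTF-8

word_ids = byte_tokens(word)
sentence_ids = byte_tokens(sentence)
accented_ids = byte_tokens(accented)

# One position per byte: the sentence is several times longer than the word,
# so an auto-regressive decoder has correspondingly more steps to get right.
print(len(word_ids), len(sentence_ids), len(accented_ids))
```

    For ASCII text the number of positions equals the number of characters, while non-ASCII characters occupy more than one position each, so the gap between word-level and sentence-level sequence lengths can be even larger for non-Latin scripts.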

    Multi-Module G2P Converter for Persian Focusing on Relations between Words

    In this paper, we investigate the application of end-to-end and multi-module frameworks to G2P conversion for the Persian language. The results demonstrate that our proposed multi-module G2P system outperforms our end-to-end systems in terms of accuracy and speed. The system consists of a pronunciation dictionary used as a look-up table, along with separate GRU- and Transformer-based models that handle homographs, out-of-vocabulary (OOV) words, and ezafe in Persian. The system is sequence-level rather than word-level, which allows it to effectively capture the unwritten relations between words (cross-word information) necessary for homograph disambiguation and ezafe recognition without the need for any pre-processing. In evaluation, our system achieved 94.48% word-level accuracy, outperforming previous G2P systems for Persian. Comment: 10 pages, 4 figure
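    The multi-module idea described above can be sketched roughly as follows. This is an illustrative assumption about how such a pipeline might be wired together, not the authors' implementation: the function `g2p_sentence`, the three handler callables, and the toy English lexicon are all hypothetical stand-ins, and the trained GRU and Transformer models are replaced by trivial stubs.

```python
from typing import Callable, Dict, List

Lexicon = Dict[str, List[str]]  # word -> list of candidate pronunciations


def g2p_sentence(
    words: List[str],
    lexicon: Lexicon,
    pick_homograph: Callable[[List[str], int, List[str]], str],
    oov_model: Callable[[str], str],
    add_ezafe: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Convert a whole sentence, routing each word to the module it needs."""
    prons: List[str] = []
    for i, word in enumerate(words):
        candidates = lexicon.get(word)
        if not candidates:
            # Out-of-vocabulary word: fall back to a learned model (stubbed).
            prons.append(oov_model(word))
        elif len(candidates) == 1:
            # Unambiguous dictionary entry: plain look-up.
            prons.append(candidates[0])
        else:
            # Homograph: the handler sees the full sentence for context.
            prons.append(pick_homograph(words, i, candidates))
    # Final pass over the whole sequence (ezafe handling in the Persian case).
    return add_ezafe(words, prons)


# Toy stand-ins, all hypothetical; real modules would be trained models.
toy_lexicon: Lexicon = {"the": ["DH AH"], "read": ["R IY D", "R EH D"]}
first_candidate = lambda words, i, cands: cands[0]
spell_out = lambda word: " ".join(word.upper())
no_change = lambda words, prons: prons

result = g2p_sentence(
    ["read", "the", "xyz"], toy_lexicon, first_candidate, spell_out, no_change
)
print(result)
```

    Passing the full word sequence to the homograph and final-pass handlers reflects the sequence-level design mentioned in the abstract: both decisions depend on neighbouring words rather than on a single word in isolation.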

    Towards a unified model for speech and language processing

    The present work explores the use of deep learning methods applied to a variety of areas in speech and language processing, including speech recognition, grapheme-to-phoneme conversion (and vice versa), speech synthesis, and generative models for speech. It aims to build toward a unified approach that reframes these individual tasks as a more general problem: finding a universal representation of the information encoded in different modalities, seamlessly transferring a signal from one modality to another by converting it to these universal representations, and generating samples in multiple modalities. It consists of two main research projects: 1) SoundChoice, a context-aware sentence-level Grapheme-to-Phoneme model that achieves solid performance on the task and a significant improvement in phoneme disambiguation over baseline models, and 2) MAdmixture, a novel approach to learning a variety of speech representations in a common latent space.
