1,020 research outputs found
New Grapheme Generation Rules for Two-Stage Model-based Grapheme-to-Phoneme Conversion
The precise conversion of arbitrary text into its corresponding phoneme sequence (grapheme-to-phoneme or G2P conversion) is used in speech synthesis and recognition, pronunciation learning software, spoken term detection and spoken document retrieval systems. Because the quality of this module plays an important role in the performance of such systems and many problems regarding G2P conversion have been reported, we propose a novel two-stage model-based approach, implemented using an existing weighted finite-state transducer-based G2P conversion framework, to improve the performance of the G2P conversion model. The first-stage model is built for automatic conversion of words to phonemes, while the second-stage model uses the input graphemes and the output phonemes obtained from the first stage to determine the best final output phoneme sequence. Additionally, we designed new grapheme generation rules, which add extra detail for the vowel and consonant graphemes appearing within a word. When compared with previous approaches, the evaluation results indicate that our approach using rules focusing on the vowel graphemes slightly improved accuracy on the out-of-vocabulary dataset and consistently increased accuracy on the in-vocabulary dataset.
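The two-stage structure described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper builds both stages with a weighted finite-state transducer framework, whereas this toy swaps in a hand-written lookup table and a single hand-written rule. The table `STAGE1_TABLE`, the phoneme symbols, and the functions `stage1`, `stage2` and `g2p` are all hypothetical names chosen for illustration.

```python
# Toy stage 1: a context-free letter-to-phoneme table.
# Hypothetical data for illustration only, not taken from the paper.
STAGE1_TABLE = {"c": "K", "a": "AE", "k": "K", "e": "EH", "t": "T"}


def stage1(graphemes):
    """First stage: map each grapheme to a phoneme independently."""
    return [STAGE1_TABLE[g] for g in graphemes]


def stage2(graphemes, phonemes):
    """Second stage: revise the stage-1 phonemes using BOTH the input
    graphemes and the stage-1 output.

    Toy rule (an assumption, standing in for a trained model): in a word
    ending in 'a' + one letter + 'e', the 'a' becomes EY and the final
    'e' is silent.
    """
    out = list(phonemes)
    if (
        len(graphemes) >= 3
        and graphemes[-1] == "e"
        and graphemes[-3] == "a"
        and out[-3] == "AE"
    ):
        out[-3] = "EY"
        out[-1] = None  # mark the final 'e' as silent
    return [p for p in out if p is not None]


def g2p(word):
    """Run both stages in sequence on a single word."""
    graphemes = list(word)
    return stage2(graphemes, stage1(graphemes))


print(g2p("cake"))  # ['K', 'EY', 'K']
print(g2p("cat"))   # ['K', 'AE', 'T']
```

The point of the sketch is only the data flow: the second stage sees the grapheme sequence together with the first-stage phonemes, so it can correct errors that a single context-free pass would make.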
Comparison between rule-based and data-driven natural language processing algorithms for Brazilian Portuguese speech synthesis
Due to the exponential growth in the use of computers, personal digital assistants and smartphones, Text-to-Speech (TTS) systems have been in high demand in recent years. An important part of these systems is the Text Analysis block, which converts the input text into linguistic specifications that are used to generate the final speech waveform. The Natural Language Processing (NLP) algorithms in this block are crucial to the quality of the speech generated by synthesizers. These algorithms are responsible for important tasks such as Grapheme-to-Phoneme Conversion, Syllabification and Stress Determination. For Brazilian Portuguese (BP), solutions for the algorithms in the Text Analysis block have focused on rule-based approaches. These algorithms perform well for BP but have many disadvantages. On the other hand, there is still no research evaluating and analyzing, for BP, the performance of data-driven approaches that reach state-of-the-art results for complex languages such as English. In this work, we therefore compare different data-driven and rule-based approaches for the NLP algorithms in a TTS system. Moreover, we propose, as a novel application, the use of Sequence-to-Sequence models as a solution for the Syllabification and Stress Determination problems. In summary, we show that data-driven algorithms can achieve state-of-the-art performance for the NLP algorithms in the Text Analysis block of a BP TTS system.
Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction
The Text-to-Text Transfer Transformer (T5) has recently been considered for
Grapheme-to-Phoneme (G2P) transduction. As a follow-up, ByT5, a tokenizer-free
byte-level model based on T5, recently gave promising
results on word-level G2P conversion by representing each input character with
its corresponding UTF-8 encoding. Although it is generally understood that
sentence-level or paragraph-level G2P can improve usability in real-world
applications, as it is better suited to handling heteronyms and linking sounds
between words, we find that using ByT5 for these scenarios is nontrivial. Since
ByT5 operates on the character level, it requires more decoding steps, which
degrades performance due to the exposure bias commonly observed in
auto-regressive generation models. This paper shows that the performance of
sentence-level and paragraph-level G2P can be improved by mitigating such
exposure bias using our proposed loss-based sampling method.
Comment: INTERSPEECH 202
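As a rough illustration of why byte-level input lengthens sequences, the sketch below turns text into its UTF-8 byte values using only the Python standard library. It does not use the ByT5 tokenizer itself, and the helper name `utf8_byte_tokens` and the example strings are assumptions made for illustration.

```python
def utf8_byte_tokens(text: str) -> list:
    """Return the UTF-8 byte values of `text`, one integer per byte."""
    return list(text.encode("utf-8"))


word = "caf\u00e9"  # "café": 4 characters, the last one non-ASCII
sentence = "I read the caf\u00e9 menu"

word_bytes = utf8_byte_tokens(word)
sentence_bytes = utf8_byte_tokens(sentence)

print(word_bytes)                  # [99, 97, 102, 195, 169]
print(len(word), len(word_bytes))  # 4 5
print(len(sentence), len(sentence_bytes))  # 20 21
```

Each ASCII character costs one step and each non-ASCII character costs two or more, so a sentence-level input is at least as long as its character count. An auto-regressive decoder producing byte-level output faces correspondingly long sequences, which is the setting in which the abstract reports exposure bias becoming a problem.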
Methods of testing and diagnosing model error: dual and single route cascaded models of reading aloud
Models of visual word recognition have been assessed by both factorial and regression approaches. Factorial approaches tend to provide a relatively weak test of models, and regression approaches give little indication of the sources of models’ mispredictions, especially when parameters are not optimal. A new alternative method, involving regression on model error, combines these two approaches with parameter optimization. The method is illustrated with respect to the dual route cascaded model of reading aloud. In contrast to previous investigations, this method provides clear evidence that there are parameter-independent problems with the model, and identifies two specific sources of misprediction made by the model.
A Comparison of Different Machine Transliteration Models
Machine transliteration is a method for automatically converting words in one
language into phonetically equivalent ones in another language. Machine
transliteration plays an important role in natural language applications such
as information retrieval and machine translation, especially for handling
proper nouns and technical terms. Four machine transliteration models --
grapheme-based transliteration model, phoneme-based transliteration model,
hybrid transliteration model, and correspondence-based transliteration model --
have been proposed by several researchers. To date, however, there has been
little research on a framework in which multiple transliteration models can
operate simultaneously. Furthermore, there has been no comparison of the four
models within the same framework and using the same data. We addressed these
problems by 1) modeling the four models within the same framework, 2) comparing
them under the same conditions, and 3) developing a way to improve machine
transliteration through this comparison. Our comparison showed that the hybrid
and correspondence-based models were the most effective and that the four
models can be used in a complementary manner to improve machine transliteration
performance.
DATA-BASE RULE-SYSTEM FOR THE MULTIVOX TEXT-TO-SPEECH CONVERTER APPLICATION FOR ARABIC LANGUAGE
The MULTIVOX multilingual text-to-speech converter system is adapted to Modern
Standard Arabic. In this system, Arabic speech is generated from the concatenation
of a set of acoustic building units (ABUs). A 3-dimensional data-base rule-system for the
synthesis of unlimited vocabulary Arabic text is organized to concatenate the appropriate
ABUs for all possible phone-code pairs that may exist in the input text. The main
functions of the MULTIVOX are explained. Illustrative examples are given to show the
conversion of Arabic graphemes into phone-codes and the use of the data-base rule-system
in the concatenation of the ABUs. Hearing tests have been carried out to test the quality
of the synthesized speech.
ACOUSTIC BUILDING UNITS FOR FORMANT SYNTHESIS TEXT-TO-SPEECH CONVERTER SYSTEM FOR MODERN STANDARD ARABIC
In this paper, an inventory of acoustic building units (ABUs) used for the synthesis of
Arabic speech is presented. The ABUs are generated for the freely programmable PCF-8200
formant synthesizer chip, which has been used in the development of the real-time
multilingual text-to-speech system, the MULTIVOX. To utilize these ABUs for the synthesis of
Arabic speech, a set of 36 Arabic sounds and all their possible combinations are defined.
The inventory of 255 ABUs is designed so that each sound combination can be built up by
using some of those ABUs. A grapheme-to-phone-code converter is designed to convert
the written input text into its equivalent phone-codes. Furthermore, it contains solutions
for the difficult phonetic problems in the Arabic input text.