202 research outputs found
Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction
Text-to-Text Transfer Transformer (T5) has recently been considered for
Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free
byte-level model based on T5, referred to as ByT5, recently gave promising
results on word-level G2P conversion by representing each input character with
its corresponding UTF-8 encoding. Although it is generally understood that
sentence-level or paragraph-level G2P can improve usability in real-world
applications as it is better suited to perform on heteronyms and linking sounds
between words, we find that using ByT5 for these scenarios is nontrivial. Since
ByT5 operates on the character level, it requires longer decoding steps, which
deteriorates the performance due to the exposure bias commonly observed in
auto-regressive generation models. This paper shows that the performance of
sentence-level and paragraph-level G2P can be improved by mitigating such
exposure bias using our proposed loss-based sampling method.
Comment: INTERSPEECH 202
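The byte-level input scheme the abstract describes is easy to reproduce: each character is replaced by its UTF-8 byte values, so non-ASCII characters expand into several tokens and sequences grow long at the sentence level, which is what lengthens decoding. A minimal sketch (the helper name `to_byte_ids` is illustrative, not from the paper):

```python
def to_byte_ids(text: str) -> list[int]:
    """Map each character to its UTF-8 byte values, ByT5-style."""
    return list(text.encode("utf-8"))

print(to_byte_ids("cat"))  # ASCII: one byte per character -> [99, 97, 116]
print(to_byte_ids("é"))    # non-ASCII: two bytes for one character -> [195, 169]
```

Because every character costs at least one decoding step, sentence- and paragraph-level inputs produce far longer target sequences than word-level G2P, compounding any exposure bias in the auto-regressive decoder.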
Multi-Module G2P Converter for Persian Focusing on Relations between Words
In this paper, we investigate the application of end-to-end and multi-module
frameworks for G2P conversion for the Persian language. The results demonstrate
that our proposed multi-module G2P system outperforms our end-to-end systems in
terms of accuracy and speed. The system consists of a pronunciation dictionary
as our look-up table, along with separate models to handle homographs, OOVs and
ezafe in Persian created using GRU and Transformer architectures. The system is
sequence-level rather than word-level, which allows it to effectively capture
the unwritten relations between words (cross-word information) necessary for
homograph disambiguation and ezafe recognition without the need for any
pre-processing. After evaluation, our system achieved a 94.48% word-level
accuracy, outperforming previous G2P systems for Persian.
Comment: 10 pages, 4 figure
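The dictionary-first design described above can be sketched as a look-up with a model fallback for out-of-vocabulary words. The lexicon entry and the `fallback_g2p` stub below are hypothetical stand-ins for the paper's pronunciation dictionary and neural OOV module, not its actual code:

```python
# Hypothetical look-up table; real systems load a full pronunciation dictionary.
LEXICON = {"salam": "s a l a m"}

def fallback_g2p(word: str) -> str:
    # Stand-in for the paper's neural OOV model (GRU/Transformer):
    # here, a naive one-letter-to-one-sound mapping.
    return " ".join(word)

def g2p(word: str) -> str:
    # Dictionary first; fall back to the model for OOV words.
    return LEXICON.get(word, fallback_g2p(word))
```

A real multi-module system would add the separate homograph and ezafe modules on top of this, operating over the whole sequence so cross-word context is available.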
An overview of text-to-speech systems and media applications
Producing synthetic voices that sound human-like is an emerging capability
of modern interactive media systems. Text-To-Speech (TTS) systems try
to generate synthetic yet authentic-sounding voices from text input. Moreover,
well-known and familiar dubbing, announcing and narrating voices, as valuable
assets of any media organization, can be preserved indefinitely by utilizing TTS
and Voice Conversion (VC) algorithms. The emergence of deep learning approaches has made
such TTS systems more accurate and accessible. To understand TTS systems
better, this paper investigates the key components of such systems including
text analysis, acoustic modelling and vocoding. The paper then provides details
of important state-of-the-art TTS systems based on deep learning. Finally, a
comparison is made between recently released systems in terms of backbone
architecture, type of input and conversion, vocoder used and subjective
assessment (MOS). Accordingly, Tacotron 2, Transformer TTS, WaveNet and
FastSpeech 1 are among the most successful TTS systems ever released. In the
discussion section, some suggestions are made to develop a TTS system with
regard to the intended application.
Comment: Accepted in ABU Technical Review journal 2023/
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly
monotonic in the alignment between source and target sequence, and previous
work has facilitated or enforced learning of monotonic attention behavior via
specialized attention functions or pretraining. In this work, we introduce a
monotonicity loss function that is compatible with standard attention
mechanisms and test it on several sequence-to-sequence tasks:
grapheme-to-phoneme conversion, morphological inflection, transliteration, and
dialect normalization. Experiments show that we can achieve largely monotonic
behavior. Performance is mixed, with larger gains on top of RNN baselines.
General monotonicity does not benefit transformer multi-head attention; however,
we see isolated improvements when only a subset of heads is biased towards
monotonic behavior.
Comment: To be published in: Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies (NAACL-HLT 2021
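A loss of this kind can be illustrated by penalizing backward movement of the expected source position across target steps, which works with any standard attention distribution. This is a minimal NumPy sketch of the general idea, not the paper's exact formulation:

```python
import numpy as np

def monotonicity_loss(attn: np.ndarray) -> float:
    """Penalize decreases in expected source position across target steps.

    attn: (T, S) attention weights; each row sums to 1.
    """
    # Expected source index attended to at each target step.
    positions = attn @ np.arange(attn.shape[1])
    # ReLU of backward jumps: zero when attention moves strictly forward.
    backward = np.maximum(positions[:-1] - positions[1:], 0.0)
    return float(backward.mean())

mono = np.eye(4)          # perfectly monotonic alignment
rev = np.eye(4)[::-1]     # fully reversed alignment
print(monotonicity_loss(mono))  # 0.0
print(monotonicity_loss(rev))   # 1.0
```

Added to the task loss with a small weight, a penalty like this nudges attention towards monotone alignments without replacing the attention mechanism itself, and can be applied to only a subset of heads as the abstract suggests.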