Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction
Text-to-Text Transfer Transformer (T5) has recently been considered for
Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free
byte-level model based on T5, referred to as ByT5, recently gave promising
results on word-level G2P conversion by representing each input character with
its corresponding UTF-8 encoding. Although it is generally understood that
sentence-level or paragraph-level G2P can improve usability in real-world
applications, as it is better suited to handling heteronyms and linking sounds
between words, we find that using ByT5 for these scenarios is nontrivial. Since
ByT5 operates on the character level, it requires longer decoding steps, which
degrades performance due to the exposure bias commonly observed in
auto-regressive generation models. This paper shows that the performance of
sentence-level and paragraph-level G2P can be improved by mitigating such
exposure bias using our proposed loss-based sampling method.
Comment: INTERSPEECH 202
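Since the mechanism at issue is each character being expanded into its UTF-8 bytes, a minimal Python sketch of that input representation may help; the heteronym "read" and the example sentence are illustrative choices, not taken from the paper, and the sketch does not reproduce the paper's loss-based sampling method.

```python
# Minimal sketch of the byte-level input representation described above.
# Expanding every character into its UTF-8 bytes makes sentence-level inputs
# (and their pronunciations) far longer than single words, which is where the
# longer autoregressive decoding, and hence the exposure bias, comes from.

def to_bytes(text: str) -> list[int]:
    """UTF-8 byte sequence a ByT5-style model would consume."""
    return list(text.encode("utf-8"))

word = "read"                             # word-level G2P input (a heteronym)
sentence = "I read the book yesterday."   # sentence-level G2P input

print(len(to_bytes(word)), to_bytes(word))
print(len(to_bytes(sentence)), to_bytes(sentence)[:10], "...")
# ByT5 then maps each byte value to a token id (shifted by a small offset
# reserved for special tokens) before feeding the sequence to the encoder.
```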
Applying the Transformer to Character-level Transduction
The transformer has been shown to outperform recurrent neural network-based
sequence-to-sequence models in various word-level NLP tasks. Yet for
character-level transduction tasks, e.g. morphological inflection generation
and historical text normalization, there are few works that outperform
recurrent models using the transformer. In an empirical study, we uncover that,
in contrast to recurrent sequence-to-sequence models, the batch size plays a
crucial role in the performance of the transformer on character-level tasks,
and we show that with a large enough batch size, the transformer does indeed
outperform recurrent models. We also introduce a simple technique to handle
feature-guided character-level transduction that further improves performance.
With these insights, we achieve state-of-the-art performance on morphological
inflection and historical text normalization. We also show that the transformer
outperforms a strong baseline on two other character-level transduction tasks:
grapheme-to-phoneme conversion and transliteration.
Comment: EACL 202
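The abstract mentions a simple technique for feature-guided character-level transduction without spelling it out here, so the sketch below shows only a common baseline formulation: prepending the morphological feature tags to the character sequence as extra source symbols. The tag names and the lemma "walk" are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of feature-guided character-level transduction as used in
# morphological inflection: feature tags become extra source tokens placed
# before the lemma's characters, and the target is the inflected form spelled
# out character by character.

def build_source(lemma: str, features: list[str]) -> list[str]:
    """Source tokens for a character-level inflection model."""
    return features + list(lemma)

def build_target(inflected: str) -> list[str]:
    """Target tokens: the inflected form, one character per token."""
    return list(inflected)

print(build_source("walk", ["V", "PST"]))   # ['V', 'PST', 'w', 'a', 'l', 'k']
print(build_target("walked"))               # ['w', 'a', 'l', 'k', 'e', 'd']
```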
Learning cross-lingual phonological and orthographic adaptations: a case study in improving neural machine translation between low-resource languages
Out-of-vocabulary (OOV) words can pose serious challenges for machine
translation (MT) tasks, and in particular, for low-resource language (LRL)
pairs, i.e., language pairs for which few or no parallel corpora exist. Our
work adapts variants of seq2seq models to perform transduction of such words
from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs
built from a bilingual dictionary of Hindi--Bhojpuri words. We demonstrate that
our models can be effectively used for language pairs that have limited
parallel corpora; our models work at the character level to grasp phonetic and
orthographic similarities across multiple types of word adaptations, whether
synchronic or diachronic, loan words or cognates. We describe the training
aspects of several character-level NMT systems that we adapted to this task and
characterize their typical errors. Our method improves the BLEU score by 6.3 on
the Hindi-to-Bhojpuri translation task. Further, we show that such
transductions can generalize well to other languages, as we demonstrate by
applying our method successfully to Hindi--Bangla cognate pairs. Our work can
be seen as an important step in the process
of: (i) resolving the OOV words problem arising in MT tasks, (ii) creating
effective parallel corpora for resource-constrained languages, and (iii)
leveraging the enhanced semantic knowledge captured by word-level embeddings to
perform character-level tasks.
Comment: 47 pages, 4 figures, 21 tables (including Appendices)
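The abstract states that the models transduce cognates at the character level, learned from a bilingual Hindi--Bhojpuri dictionary, but it does not fix a toolkit or data format, so the sketch below only illustrates one common preparation step: rendering each word pair as space-separated characters, the input format many seq2seq toolkits expect for character-level training. The file names and the example pair are assumptions for illustration.

```python
# Hedged sketch: turn dictionary word pairs into character-level parallel
# files for a generic seq2seq toolkit. Paths and the example pair are
# illustrative, not taken from the paper.

def to_char_line(word: str) -> str:
    """Render a word as a space-separated character sequence."""
    return " ".join(word)

def write_parallel(pairs, src_path, tgt_path):
    """Write source/target files with one character-separated word per line."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for src_word, tgt_word in pairs:
            src.write(to_char_line(src_word) + "\n")
            tgt.write(to_char_line(tgt_word) + "\n")

# One illustrative Hindi/Bhojpuri pair ("boy"); a real run would use the
# full cognate list extracted from the bilingual dictionary.
write_parallel([("लड़का", "लइका")], "train.hi", "train.bho")
```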
- …