Pinyin as subword unit for Chinese-sourced neural machine translation
Unknown word (UNK), or open vocabulary, is a challenging problem for neural machine translation (NMT). For alphabetic languages such as English, German and French, transforming a word into subwords is an effective way to alleviate the UNK problem, for example with the Byte Pair Encoding (BPE) algorithm. For stroke-based languages such as Chinese, however, this method is not effective enough to improve translation quality. In this paper, we propose to utilize Pinyin, a romanization system for Chinese characters, to convert Chinese characters into subword units and thereby alleviate the UNK problem. We first investigate how Pinyin and its four tone diacritics affect the translation performance of NMT systems, and then propose different strategies for utilizing Pinyin and tones as input factors for Chinese-English NMT. Extensive experiments on Chinese-English translation demonstrate that the proposed methods remarkably improve translation quality and effectively alleviate the UNK problem for Chinese-sourced translation.
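To make the preprocessing concrete, here is a minimal sketch of converting Chinese characters to Pinyin syllables with a separate tone factor, using the open-source pypinyin library. The (syllable, tone) factor layout is an illustrative assumption, not the paper's exact factored-input pipeline.

```python
# Sketch: Chinese characters -> Pinyin subword units with tone factors,
# via the open-source pypinyin library (not the authors' exact pipeline).
from pypinyin import pinyin, Style

def to_pinyin_factors(sentence):
    """Return (syllable, tone) pairs, one per character."""
    # Style.TONE3 appends the tone digit to each syllable, e.g. "中" -> "zhong1";
    # errors="default" passes non-Chinese text through unchanged.
    syllables = pinyin(sentence, style=Style.TONE3, errors="default")
    factors = []
    for [syl] in syllables:
        if syl and syl[-1].isdigit():
            factors.append((syl[:-1], syl[-1]))  # split off the tone factor
        else:
            factors.append((syl, "0"))           # neutral tone or non-Chinese
    return factors

print(to_pinyin_factors("中国话"))
# -> [('zhong', '1'), ('guo', '2'), ('hua', '4')]
```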
Character-level Chinese-English Translation through ASCII Encoding
Character-level Neural Machine Translation (NMT) models have recently
achieved impressive results on many language pairs. They mainly do well for
Indo-European language pairs, where the languages share the same writing
system. However, for translating between Chinese and English, the gap between
the two different writing systems poses a major challenge because of a lack of
systematic correspondence between the individual linguistic units. In this
paper, we enable character-level NMT for Chinese, by breaking down Chinese
characters into linguistic units similar to those of Indo-European languages. We
use the Wubi encoding scheme, which preserves the original shape and semantic
information of the characters, while also being reversible. We show promising
results from training Wubi-based models on the character- and subword-level
with recurrent as well as convolutional models.
Comment: 7 pages, 3 figures, 3rd Conference on Machine Translation (WMT18), 2018
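The preprocessing the abstract describes can be illustrated with a small sketch: each character is replaced by its Wubi keystroke code, with a separator that keeps the encoding reversible. The three-entry table and the "_" separator below are assumptions for demonstration; a real system would load a complete character-to-code dictionary, and the sample codes should be verified against one.

```python
# Illustrative sketch: replace each Chinese character with its Wubi
# keystroke code so the text becomes an ASCII sequence of sub-character
# units. WUBI_TABLE is a hand-picked excerpt, not a complete dictionary;
# verify the codes against a full Wubi table before relying on them.
WUBI_TABLE = {"中": "khk", "国": "lgyi", "话": "ytdg"}

def encode_wubi(sentence, sep="_"):
    """Encode a sentence as Wubi codes; the separator after each code
    keeps the mapping reversible, as the abstract requires."""
    out = []
    for ch in sentence:
        code = WUBI_TABLE.get(ch)
        out.append(code + sep if code else ch)  # pass unknowns through
    return " ".join(out)

print(encode_wubi("中国话"))  # -> "khk_ lgyi_ ytdg_"
```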
On Romanization for Model Transfer Between Scripts in Neural Machine Translation
Transfer learning is a popular strategy to improve the quality of low-resource machine translation. For an optimal transfer of the embedding layer, the child and parent model should share a substantial part of the vocabulary. This is not the case when transferring to languages with a different script. We explore the benefit of romanization in this scenario. Our results show that romanization entails information loss and is thus not always superior to simpler vocabulary transfer methods, but it can improve the transfer between related languages with different scripts. We compare two romanization tools and find that they exhibit different degrees of information loss, which affects translation quality. Finally, we extend romanization to the target side, showing that this can be a successful strategy when coupled with a simple deromanization model.
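A toy sketch of the transfer setup the abstract studies: romanize child-language tokens into Latin script and measure how much of a parent model's vocabulary they then hit, which determines how much of the embedding layer transfers. Here the unidecode library stands in for the romanization tools compared in the paper, and parent_vocab is invented for the example.

```python
# Toy sketch of cross-script vocabulary transfer via romanization.
# unidecode is a stand-in romanizer, not one of the paper's tools;
# parent_vocab is a made-up Latin-script parent vocabulary.
from unidecode import unidecode

parent_vocab = {"privet", "kak", "dela"}  # hypothetical parent vocab

def romanize(tokens):
    return [unidecode(t).lower() for t in tokens]

child_tokens = ["Привет", "мир"]          # Russian for "hello world"
romanized = romanize(child_tokens)
print(romanized)                          # ['privet', 'mir']

# Romanization is lossy: distinct source characters can map to the same
# ASCII string, so the mapping cannot be inverted exactly -- one reason
# the target side needs a separate deromanization model.
overlap = sum(t in parent_vocab for t in romanized) / len(romanized)
print(f"parent-vocabulary overlap: {overlap:.0%}")  # 50%
```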