Handling unknown words (UNK), also known as the open-vocabulary problem, is a challenge
for neural machine translation (NMT). For alphabetic languages such as English,
German and French, splitting words into subword units, for example with the byte pair encoding (BPE) algorithm, is an effective way to alleviate the UNK problem. However, for stroke-based languages such as Chinese, this approach is
not effective enough in terms of translation quality. In this paper, we propose to utilize
Pinyin, a romanization system for Chinese characters, to convert Chinese characters into subword units and thereby alleviate the UNK problem. We first investigate
how Pinyin and its four tone diacritics affect the translation performance
of NMT systems, and then propose different strategies to utilize Pinyin and tones
as input factors for Chinese–English NMT. Extensive experiments conducted on
Chinese–English translation demonstrate that the proposed methods markedly improve translation quality and effectively alleviate the UNK problem when Chinese is the source language.