In this paper, we apply a method of unsupervised morphology learning to a state-of-the-art phrase-based statistical machine translation (SMT) system. In SMT, words are traditionally used as the smallest units of translation. Such a system generalizes poorly to word forms that do not occur in the training data. This is particularly problematic for languages that are highly compounding, highly inflecting, or both. An alternative is to use sub-word units, such as morphemes. We use the Morfessor algorithm to find statistical morpheme-like units (called morphs) that reduce the size of the lexicon and improve the ability to generalize. Translation and language models are trained directly on morphs instead of words. The approach is tested on three Nordic languages (Danish, Finnish, and Swedish) included in the Europarl corpus, which consists of the Proceedings of the European Parliament. However, in our experiments we did not obtain higher BLEU scores for the morph model than for the standard word-based approach. Nonetheless, the proposed morph-based solution has clear benefits: morphologically well-motivated structures (phrases) are learned, and the proportion of words left untranslated is clearly reduced.
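To illustrate the idea of sub-word translation units, the sketch below segments inflected word forms into morphs with a toy greedy longest-match segmenter. This is not the Morfessor algorithm itself (Morfessor induces its morph inventory from raw text with a probabilistic, MDL-style model); the morph inventory here is hand-specified purely for illustration, showing how word forms never seen whole in training can still be covered by known morphs.

```python
# Toy sketch of morph segmentation (NOT Morfessor): greedy longest-match
# over a hypothetical, hand-specified morph inventory. It shows how
# sub-word units cover inflected forms absent from the training lexicon.

def segment(word, morphs):
    """Greedily split `word` into known morphs, preferring longest matches."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in morphs:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # fall back to a single character
            i += 1
    return out

# Hypothetical morph inventory for a few Finnish forms (illustrative only).
morphs = {"talo", "ssa", "i", "sta"}

print(segment("talossa", morphs))   # ['talo', 'ssa']      "in the house"
print(segment("taloista", morphs))  # ['talo', 'i', 'sta'] "from the houses"
```

Translation and language models can then be trained on these morph sequences rather than full word forms, shrinking the lexicon and reducing the number of out-of-vocabulary words.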