PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora
In this paper, we attempt to improve Statistical Machine Translation (SMT)
systems for a diverse set of language pairs (in both directions): Czech -
English, Vietnamese - English, French - English, and German - English. To
accomplish this, we trained translation models, adapted the training settings
for each language pair, and obtained comparable corpora for our SMT systems,
employing innovative tools and data adaptation techniques. The TED parallel
text corpora for the IWSLT 2015 evaluation campaign were used to train
language models, and to develop, tune, and test the system.
In addition, we prepared Wikipedia-based comparable corpora for use with our
SMT system. This data was specified as permissible for the IWSLT 2015
evaluation. We explored the use of domain adaptation techniques, symmetrized
word alignment models, unsupervised transliteration models, and the KenLM
language modeling tool. To evaluate the effects of these different
preparations on translation results, we conducted experiments using the BLEU,
NIST, and TER metrics. Our results indicate that our approach had a positive
impact on SMT quality.
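
As a rough illustration of how a KenLM model of the kind named above is
typically queried, here is a minimal sketch using the kenlm Python binding.
The model path en.arpa is a hypothetical placeholder (not from the paper);
such a file would first be built with KenLM's lmplz tool, e.g.
lmplz -o 5 < corpus.txt > en.arpa.

import kenlm

# Load a pre-built n-gram model (hypothetical path, not from the paper).
model = kenlm.Model('en.arpa')

sentence = 'this is a test sentence'

# Log10 probability with begin/end-of-sentence markers included.
print(model.score(sentence, bos=True, eos=True))

# Per-sentence perplexity, a common sanity check when selecting LM data.
print(model.perplexity(sentence))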
Multitask Learning For Different Subword Segmentations In Neural Machine Translation
In Neural Machine Translation (NMT), the use of subwords and characters as
source and target units offers a simple and flexible solution for translating
rare and unseen words. However, selecting the optimal subword segmentation
involves a trade-off between expressiveness and flexibility, and is language-
and dataset-dependent. We present Block Multitask Learning (BMTL), a novel NMT
architecture that predicts multiple targets of different granularities
simultaneously, removing the need to search for the optimal segmentation
strategy. Our multi-task model exhibits improvements of up to 1.7 BLEU points
on each decoder over single-task baseline models with the same number of
parameters on datasets from two language pairs of IWSLT15 and one from IWSLT19.
The multiple hypotheses generated at different granularities can be combined as
a post-processing step to give better translations, which improves over
hypothesis combination from baseline models while using substantially fewer
parameters.

Comment: Accepted to the 16th International Workshop on Spoken Language
Translation (IWSLT) 2019.
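
To make the granularity trade-off described above concrete, here is a minimal
sketch (not the paper's BMTL code) using the SentencePiece library as a
stand-in segmenter; corpus.txt and the two vocabulary sizes are illustrative
assumptions.

import sentencepiece as spm

# Train BPE models at two granularities on the same (hypothetical) corpus.
for vocab_size in (500, 8000):
    spm.SentencePieceTrainer.train(
        input='corpus.txt',
        model_prefix=f'bpe_{vocab_size}',
        vocab_size=vocab_size,
        model_type='bpe',
    )

# The same sentence segments into finer or coarser units depending on the
# vocabulary size: small vocabularies drift toward characters (flexible for
# rare words), large ones toward whole words (more expressive).
for vocab_size in (500, 8000):
    sp = spm.SentencePieceProcessor(model_file=f'bpe_{vocab_size}.model')
    print(vocab_size, sp.encode('unseen words segment differently', out_type=str))

Per the abstract, BMTL's contribution is to predict targets at several such
granularities simultaneously, rather than committing to one segmentation model
in advance.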