The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)
The University of Edinburgh participated in the WMT22 shared task on code-mixed translation (MixMT). The task consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences, and ii) machine translation from Hinglish to English. As both subtasks are considered low-resource, we focused our efforts on careful data generation and curation, especially the use of backtranslation from monolingual resources. For subtask 1 we explored the effects of constrained decoding on English and transliterated subwords in order to produce Hinglish. For subtask 2, we investigated different pretraining techniques, namely comparing simple initialisation from existing machine translation models and aligned augmentation. For both subtasks, we found that our baseline systems worked best. Our systems for both subtasks were among the overall top-performing submissions.
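Constrained decoding of the kind described for subtask 1 can be pictured as masking the decoder's output distribution so that only permitted subwords (here, English plus transliterated-Hindi units) can ever be emitted. The snippet below is a minimal greedy-decoding sketch, not the submission's actual code; `model.decode_step`, `model.vocab_size`, and the token ids are all assumed names.

```python
import torch

def constrained_greedy_decode(model, src_ids, allowed_ids,
                              bos_id, eos_id, max_len=128):
    # Additive mask: 0 for allowed subword ids, -inf for everything else,
    # so disallowed tokens can never win the argmax.
    mask = torch.full((model.vocab_size,), float("-inf"))
    mask[list(allowed_ids)] = 0.0

    out = [bos_id]
    for _ in range(max_len):
        # Assumed interface: one decoder step returning [vocab_size] logits.
        logits = model.decode_step(src_ids, torch.tensor(out))
        next_id = int(torch.argmax(logits + mask))
        out.append(next_id)
        if next_id == eos_id:
            break
    return out
```

The same masking idea carries over to beam search by adding the mask to the logits before each beam expansion.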
The University of Edinburgh’s Submissions to the WMT19 News Translation Task
The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-to-Gujarati, Gujarati-to-English, English-to-Chinese, Chinese-to-English, German-to-English, and English-to-Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English-Gujarati, we also explored semi-supervised MT with cross-lingual language model pre-training, and translation pivoting through Hindi. For translation to and from Chinese, we investigated character-based tokenisation vs. sub-word segmentation of Chinese text. For German-to-English, we studied the impact of vast amounts of back-translated training data on translation quality, gaining a few additional insights over Edunov et al. (2018). For English-to-Czech, we compared different pre-processing and tokenisation regimes.
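Back-translation, used across all six directions above, pairs genuine target-language sentences with machine-translated synthetic sources. A minimal sketch, assuming a trained target-to-source model exposed as `reverse_model.translate` (an illustrative interface, not the systems' actual one):

```python
def backtranslate(monolingual_target_sentences, reverse_model):
    # Translate each monolingual target sentence back into the source
    # language; the genuine sentence stays on the target side.
    synthetic_pairs = []
    for tgt in monolingual_target_sentences:
        src = reverse_model.translate(tgt)   # synthetic source side
        synthetic_pairs.append((src, tgt))   # real target side kept
    return synthetic_pairs

# The synthetic pairs are then mixed with the genuine parallel data
# when training the source-to-target system.
```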
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness for creating machine translation systems.
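Sentence pair filtering of the kind benchmarked here is often built from simple heuristics; the sketch below shows two common ones, a length-ratio check and removal of untranslated copies. The thresholds and the function name are illustrative assumptions, not ParaCrawl's published configuration.

```python
def keep_pair(src, tgt, max_ratio=2.0, min_len=1, max_len=200):
    # Reject pairs that are empty, overlong, or badly length-mismatched,
    # plus pairs where the "translation" is just a copy of the source.
    ns, nt = len(src.split()), len(tgt.split())
    if not (min_len <= ns <= max_len and min_len <= nt <= max_len):
        return False
    if src.strip() == tgt.strip():          # drop untranslated copies
        return False
    ratio = max(ns, nt) / max(1, min(ns, nt))
    return ratio <= max_ratio

# Usage: filtered = [(s, t) for s, t in candidate_pairs if keep_pair(s, t)]
```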