2 research outputs found

Egyptian Arabic to English Statistical Machine Translation System for NIST OpenMT'2015

Authors: Abdelali Ahmed; Durrani Nadir; Guzman Francisco; Habash Nizar; Kholy Ahmed El; Nakov Preslav; Sajjad Hassan; Salloum Wael; Vogel Stephan
Publication date: 18/06/2016

The paper describes the Egyptian Arabic-to-English statistical machine translation (SMT) system that the QCRI-Columbia-NYUAD (QCN) group submitted to the NIST OpenMT'2015 competition. The competition focused on informal dialectal Arabic, as used in SMS, chat, and speech. Thus, our efforts focused on processing and standardizing Arabic, e.g., using tools such as 3arrib and MADAMIRA. We further trained a phrase-based SMT system using state-of-the-art features and components such as an operation sequence model, a class-based language model, sparse features, a neural network joint model, a genre-based hierarchically-interpolated language model, unsupervised transliteration mining, phrase-table merging, and hypothesis combination. Our system ranked second on all three genres.

arXiv.org e-Print Archive

Synthetic Data for Neural Machine Translation of Spoken-Dialects

Authors: Elaraby Mostafa; Hassan Hany; Tawfik Ahmed
Publication date: 28/11/2017

In this paper, we introduce a novel approach to generating synthetic data for training Neural Machine Translation systems. The proposed approach transforms a given parallel corpus between a written language and a target language into a parallel corpus between a spoken dialect variant and the target language. Our approach is language independent and can be used to generate data for any variant of the source language, such as slang or a spoken dialect, or even for a different language that is closely related to the source language. The proposed approach is based on local embedding projection of distributed representations, which uses monolingual embeddings to transform parallel data across language variants. We report experimental results on Levantine-to-English translation using Neural Machine Translation. We show that the generated synthetic spoken data can improve a very large-scale system by more than 2.8 BLEU points, which shows that it can be used to provide a reliable translation system for a spoken dialect that does not have sufficient parallel data.

arXiv.org e-Print Archive
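
The second abstract describes turning a standard-to-English parallel corpus into a dialect-to-English one by projecting words between monolingual embedding spaces. The abstract gives no implementation details, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a small seed dictionary of standard/dialect word pairs and approximates the "local" projection by averaging the offsets of the nearest seed pairs. All names (`project_locally`, `make_synthetic_pair`), the placeholder tokens, and the 2-D toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def project_locally(word, src_emb, tgt_emb, seed_pairs, k=2):
    """Map a standard-language word to a dialect word (assumed simplification).

    The offset applied to the word vector is the mean offset of the k seed
    pairs whose standard-side vectors are most similar to the word.
    """
    vec = src_emb[word]
    neighbours = sorted(
        seed_pairs, key=lambda p: cosine(vec, src_emb[p[0]]), reverse=True
    )[:k]
    offset = [
        sum(tgt_emb[t][i] - src_emb[s][i] for s, t in neighbours) / len(neighbours)
        for i in range(len(vec))
    ]
    projected = [a + b for a, b in zip(vec, offset)]
    # Pick the dialect word closest to the projected vector.
    return max(tgt_emb, key=lambda w: cosine(projected, tgt_emb[w]))

def make_synthetic_pair(src_sentence, tgt_sentence, src_emb, tgt_emb, seed_pairs):
    """Rewrite the source side word by word; the target side is left unchanged."""
    tokens = [
        project_locally(tok, src_emb, tgt_emb, seed_pairs) if tok in src_emb else tok
        for tok in src_sentence.split()
    ]
    return " ".join(tokens), tgt_sentence

# Toy, hand-made 2-D embeddings with placeholder tokens (not real data).
SRC_EMB = {"std_want": [1.0, 0.0], "std_now": [0.0, 1.0], "std_go": [0.9, 0.2]}
TGT_EMB = {"dia_want": [1.0, 0.5], "dia_now": [0.0, 1.5], "dia_go": [0.9, 0.7]}
SEED_PAIRS = [("std_want", "dia_want"), ("std_now", "dia_now")]

if __name__ == "__main__":
    print(make_synthetic_pair("std_want std_go std_now", "i want to go now",
                              SRC_EMB, TGT_EMB, SEED_PAIRS))
```

In this toy setup, `std_go` is absent from the seed dictionary and is mapped only through the offsets of its neighbours, which is the case a projection-based approach is meant to cover. Tokens without an embedding are copied through unchanged.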