Egyptian Arabic to English Statistical Machine Translation System for NIST OpenMT'2015
The paper describes the Egyptian Arabic-to-English statistical machine
translation (SMT) system that the QCRI-Columbia-NYUAD (QCN) group submitted to
the NIST OpenMT'2015 competition. The competition focused on informal dialectal
Arabic, as used in SMS, chat, and speech. Thus, our efforts focused on
processing and standardizing Arabic using tools such as 3arrib and
MADAMIRA. We further trained a phrase-based SMT system using state-of-the-art
features and components such as operation sequence model, class-based language
model, sparse features, neural network joint model, genre-based
hierarchically-interpolated language model, unsupervised transliteration
mining, phrase-table merging, and hypothesis combination. Our system ranked
second on all three genres.
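
The abstract names a genre-based hierarchically-interpolated language model without giving details. As a rough illustration of the general idea only, and not the QCN system's implementation, the sketch below linearly interpolates toy unigram distributions in two stages: first within a genre, then across genres. The function name `interpolate`, the toy models, and all weights are assumptions made for this example.

```python
def interpolate(models, weights):
    """Linearly interpolate word distributions: P(w) = sum_i weight_i * P_i(w)."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # The vocabulary is the union of all component vocabularies.
    vocab = set().union(*models)
    return {
        w: sum(lam * m.get(w, 0.0) for m, lam in zip(models, weights))
        for w in vocab
    }


# Hypothetical toy unigram models (illustrative values only).
sms_a = {"lol": 0.5, "ok": 0.5}
sms_b = {"ok": 1.0}
chat = {"hi": 1.0}

# Stage 1: combine the models that belong to one genre.
sms = interpolate([sms_a, sms_b], [0.5, 0.5])
# Stage 2: combine the genre-level models into the final model.
final = interpolate([sms, chat], [0.8, 0.2])
```

In a real system the weights would typically be tuned on held-out data for each genre rather than fixed by hand.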
Synthetic Data for Neural Machine Translation of Spoken-Dialects
In this paper, we introduce a novel approach to generate synthetic data for
training Neural Machine Translation systems. The proposed approach transforms a
given parallel corpus between a written language and a target language to a
parallel corpus between a spoken dialect variant and the target language. Our
approach is language-independent and can be used to generate data for any
variant of the source language, such as slang or a spoken dialect, or even for a
different language that is closely related to the source language.
The proposed approach is based on local embedding projection of distributed
representations, which uses monolingual embeddings to transform parallel
data across language variants. We report experimental results on
Levantine-to-English translation using Neural Machine Translation. We show that
the generated synthetic spoken data can improve a very large-scale system by
more than 2.8 BLEU points, which shows that it can be used to provide a
reliable translation system for a spoken dialect that does not have sufficient
parallel data.
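
The abstract does not spell out how the local embedding projection works. The sketch below shows one simple way such a projection could be realized, under stated assumptions: each written-variant word vector is shifted by the average offset of its k nearest seed pairs, and the closest spoken-variant word replaces it while the target side of the corpus stays unchanged. The names `project_locally` and `transform_corpus`, the seed dictionary, and the toy 2-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def project_locally(word, written_emb, spoken_emb, seed_pairs, k=2):
    """Map a written-variant word to a spoken-variant word (assumed scheme)."""
    v = written_emb[word]
    # Pick the k seed pairs whose written side lies closest to the word.
    nearest = sorted(seed_pairs, key=lambda p: sq_dist(v, written_emb[p[0]]))[:k]
    # Local projection: average written-to-spoken offset of those seed pairs.
    offset = [
        sum(spoken_emb[s][i] - written_emb[w][i] for w, s in nearest) / len(nearest)
        for i in range(len(v))
    ]
    projected = [a + b for a, b in zip(v, offset)]
    # Return the spoken-variant word nearest to the projected point.
    return min(spoken_emb, key=lambda s: sq_dist(projected, spoken_emb[s]))


def transform_corpus(pairs, written_emb, spoken_emb, seed_pairs, k=2):
    """Rewrite source sentences word by word; target sentences are kept as-is."""
    out = []
    for source, target in pairs:
        tokens = [
            project_locally(t, written_emb, spoken_emb, seed_pairs, k)
            if t in written_emb else t  # unknown words pass through unchanged
            for t in source.split()
        ]
        out.append((" ".join(tokens), target))
    return out


# Hypothetical toy embeddings and seed dictionary (illustrative only).
written_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
spoken_emb = {"A": [2.0, 1.0], "B": [1.0, 2.0], "C": [1.9, 1.1]}
seed_pairs = [("a", "A"), ("b", "B")]

synthetic = transform_corpus([("a c z", "x y")], written_emb, spoken_emb, seed_pairs, k=1)
```

A word-by-word substitution like this ignores reordering and multi-word expressions; it is meant only to make the idea of projecting through monolingual embeddings concrete.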