Simple Automatic Post-editing for Arabic-Japanese Machine Translation
A common bottleneck for developing machine translation (MT) systems for some
language pairs is the lack of direct parallel translation data sets, in general
and in certain domains. Alternative solutions such as zero-shot models or
pivoting techniques succeed in providing a strong baseline, but often fall
short of systems for better-supported language pairs. In this paper, we focus on
Arabic-Japanese machine translation, a less studied language pair; and we work
with a unique parallel corpus of Arabic news articles that were manually
translated to Japanese. We use this parallel corpus to adapt a state-of-the-art
domain/genre agnostic neural MT system via a simple automatic post-editing
technique. Our results and detailed analysis suggest that this approach is
quite viable for less supported language pairs in specific domains.
Comment: Machine translation, Automatic Post editing, Arabic, Japanese
Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus
Machine translation has been a major motivation of development in natural
language processing. Despite the burgeoning achievements in creating more
efficient machine translation systems thanks to deep learning methods, parallel
corpora have remained indispensable for progress in the field. In an attempt to
create parallel corpora for the Kurdish language, in this paper, we describe
our approach to retrieving potentially alignable news articles from
multilingual websites and manually aligning them across dialects and languages
based on lexical similarity and transliteration of scripts. We present a corpus
containing 12,327 translation pairs in the two major dialects of Kurdish,
Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in
English-Kurmanji and English-Sorani. The corpus is publicly available under the
CC BY-NC-SA 4.0 license.
Comment: 11 pages, under review in the ACM Transactions on Asian and
Low-Resource Language Information Processing (TALLIP). Corpus available at
https://github.com/KurdishBLARK/InterdialectCorpu
Arabic natural language processing: An overview
Arabic is recognised as the 4th most used language of the Internet. Arabic
has three main varieties: (1) classical Arabic (CA), (2) Modern Standard Arabic
(MSA), (3) Arabic Dialect (AD). MSA and AD could be written either in Arabic or
in Roman script (Arabizi), which corresponds to Arabic written with Latin
letters, numerals and punctuation. Due to the complexity of this language and
the number of corresponding challenges for NLP, many surveys have been
conducted in order to synthesise the work done on Arabic. However, these
surveys principally focus on two varieties of Arabic (MSA and AD, written in
Arabic letters only); they are also somewhat dated (no such survey since 2015) and
therefore do not cover recent resources and tools. To bridge the gap, we
propose a survey focusing on 90 recent research papers (74% of which were
published after 2015). Our study presents and classifies the work done on the
three varieties of Arabic, by concentrating on both Arabic and Arabizi, and
associates each work with its publicly available resources, where these exist.
Amharic-Arabic Neural Machine Translation
Much work on automatic translation has addressed major European
language pairs by taking advantage of large-scale parallel corpora, but very
little research has been conducted on the Amharic-Arabic language pair because
parallel data is scarce. Two Neural Machine Translation (NMT) models, one based
on Long Short-Term Memory (LSTM) and one on Gated Recurrent Units (GRU), are
developed using an attention-based encoder-decoder architecture adapted from the
open-source OpenNMT system. To perform the experiment, a small
parallel Quranic text corpus is constructed by modifying the existing
monolingual Arabic text and its equivalent Amharic translation
available on Tanzile. The LSTM-based OpenNMT model, the GRU-based OpenNMT model,
and the Google Translation system are compared; they achieve BLEU scores of
12%, 11%, and 6% respectively, so the LSTM-based model outperforms both the
GRU-based model and Google Translation.
Comment: 15 pages