4 research outputs found

Simple Automatic Post-editing for Arabic-Japanese Machine Translation

Authors: Habash Nizar; Noll Ella; Oudah Mai
Publication date: 14/07/2019

A common bottleneck in developing machine translation (MT) systems for some language pairs is the lack of direct parallel translation data sets, both in general and in certain domains. Alternative solutions such as zero-shot models or pivoting techniques succeed in producing a strong baseline, but often fall short of systems for better-supported language pairs. In this paper, we focus on Arabic-Japanese machine translation, a less studied language pair, and we work with a unique parallel corpus of Arabic news articles that were manually translated into Japanese. We use this parallel corpus to adapt a state-of-the-art domain/genre-agnostic neural MT system via a simple automatic post-editing technique. Our results and detailed analysis suggest that this approach is quite viable for less supported language pairs in specific domains.

Comment: Machine translation, Automatic Post editing, Arabic, Japanese

arXiv.org e-Print Archive

Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

Authors: Ahmadi Sina; Hassani Hossein; Jaff Daban Q.
Publication date: 04/10/2020

Machine translation has been a major motivation for development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems thanks to deep learning methods, parallel corpora have remained indispensable for progress in the field. In an attempt to create parallel corpora for the Kurdish language, we describe in this paper our approach to retrieving potentially alignable news articles from multi-language websites and manually aligning them across dialects and languages based on lexical similarity and transliteration of scripts. We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani, respectively. The corpus is publicly available under the CC BY-NC-SA 4.0 license.

Comment: 11 pages, under review in the ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). Corpus available at https://github.com/KurdishBLARK/InterdialectCorpu
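To make the idea of retrieving alignable articles by lexical similarity concrete, here is a minimal sketch. It is not the authors' pipeline: the function names `jaccard` and `candidate_pairs`, the token-set Jaccard measure, and the `threshold` value are all assumptions chosen for illustration, and the sketch assumes both sides have already been transliterated into a common script so that shared tokens are comparable.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two whitespace-tokenized strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_pairs(source_texts, target_texts, threshold=0.3):
    """For each source text, propose the most similar target text.

    Returns (source_index, target_index, score) tuples whose score
    reaches the (hypothetical) threshold. The proposals are only
    candidates; a human annotator would still confirm each alignment.
    """
    pairs = []
    for i, src in enumerate(source_texts):
        best_j, best_score = None, 0.0
        for j, tgt in enumerate(target_texts):
            score = jaccard(src, tgt)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs
```

A scheme like this only narrows the search space; the abstract states that the final alignment was done manually.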

arXiv.org e-Print Archive

Arabic natural language processing: An overview

Authors: Azouaou Faical; Guellil Imane; Gueni Billel; Nouvel Damien; Saâdane Houda
Publication venue: Elsevier BV
Publication date: 07/03/2019

Arabic is recognised as the 4th most used language of the Internet. Arabic has three main varieties: (1) Classical Arabic (CA), (2) Modern Standard Arabic (MSA), and (3) Arabic Dialect (AD). MSA and AD can be written either in Arabic script or in Roman script (Arabizi), that is, Arabic written with Latin letters, numerals and punctuation. Owing to the complexity of the language and the number of challenges it poses for NLP, many surveys have been conducted to synthesise the work done on Arabic. However, these surveys focus principally on two varieties of Arabic (MSA and AD, written in Arabic letters only) and are somewhat dated (no such survey since 2015), so they do not cover recent resources and tools. To bridge the gap, we propose a survey focusing on 90 recent research papers (74% of which were published after 2015). Our study presents and classifies the work done on the three varieties of Arabic, covering both Arabic script and Arabizi, and associates each work with its publicly available resources where they exist.

arXiv.org e-Print Archive

Amharic-Arabic Neural Machine Translation

Authors: Gashaw Ibrahim; Shashirekha H L
Publication date: 26/12/2019

Much work on automatic translation has addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair because of the scarcity of parallel data. Two Neural Machine Translation (NMT) models, one based on Long Short-Term Memory (LSTM) and one on Gated Recurrent Units (GRU), are developed using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. To perform the experiment, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent Amharic translation, both available on Tanzile. The LSTM-based and GRU-based NMT models and the Google Translation system are compared, and the LSTM-based OpenNMT model is found to outperform the GRU-based OpenNMT model and the Google Translation system, with BLEU scores of 12%, 11%, and 6%, respectively.

Comment: 15 pages
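For readers unfamiliar with the BLEU metric cited above, the following is a minimal sentence-level sketch, not the evaluation script used in the paper. It assumes a single reference, whitespace tokenization, uniform weights over 1- to 4-grams, and no smoothing; the function names `ngram_counts` and `sentence_bleu` are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU in the range [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / max_n)
```

Multiplying the result by 100 gives the percentage form in which scores such as those in the abstract are usually reported. Published results are normally computed at corpus level with a standard tool, so values from this sketch are not directly comparable.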

arXiv.org e-Print Archive