In this paper, we make freely accessible ANETAC, our English-Arabic named entity transliteration and classification dataset, which we built from freely available parallel translation corpora. The dataset contains 79,924 instances. Each instance is a triplet (e, a, c), where e is the English named entity, a is its Arabic transliteration, and c is its class, which can be Person, Location, or Organization. The ANETAC dataset is mainly intended for researchers working on Arabic named entity transliteration, but it can also be used for named entity classification. This dataset was developed and used as part of a previous research study by Hadj Ameur et al. [1].

Guessoum, A

Hadj Ameur, MS

Meziane, F

arXiv

University of Salford Institutional Repository

ANETAC: ARABIC NAMED ENTITY TRANSLITERATION AND CLASSIFICATION DATASET

A PREPRINT

Mohamed Seghir Hadj Ameur∗
Department of Computer Science
USTHB University
Bab-Ezzouar, Algiers, Algeria
mhadjameur@usthb.dz

Farid Meziane
Informatics Research Centre
University of Salford
M5 4WT, United Kingdom
f.meziane@salford.ac.uk

Ahmed Guessoum
Department of Computer Science
USTHB University
Bab-Ezzouar, Algiers, Algeria
aguessoum@usthb.dz

July 9, 2019

Abstract

In this paper, we make freely accessible ANETAC¹, our English-Arabic named entity transliteration and classification dataset, which we built from freely available parallel translation corpora. The dataset contains 79,924 instances. Each instance is a triplet (e, a, c), where e is the English named entity, a is its Arabic transliteration, and c is its class, which can be Person, Location, or Organization. The ANETAC dataset is mainly intended for researchers working on Arabic named entity transliteration, but it can also be used for named entity classification. This dataset was developed and used as part of a previous research study by Hadj Ameur et al. [1].

Keywords: Natural Language Processing · Arabic Language · Arabic Transliteration · Named Entity Transliteration · Arabic Named Entity · Arabic Transliteration Dataset

1 Introduction

Transliteration is the process of converting words (e.g. named entities) written in the alphabet of one language into the alphabet of another language while preserving the phonetics of the transliterated words. One of the main difficulties when transliterating named entities from a source language to a target language is the lack of some phonetic character correspondences. For example, in named entity transliteration between Arabic and English, several Arabic letters do not have direct single-letter correspondences in the English alphabet.
Table 1 presents some English named entities and their transliterations in the Arabic language.

Table 1: English named entities and their equivalent Arabic transliterations

English     Arabic
Brandes     برانديس (Brandees)
Mayhawk     مايهوك (Mayhouk)
Cressner    كريسنير (Crissneer)
Husseini    حسيني (Husseini)

Accurate transliteration of named entities is useful for several applications such as machine translation [2, 3] and cross-lingual information retrieval [4, 5]. Although a great deal of attention has been devoted to improving this task for many languages such as English, only a limited number of studies have addressed Arabic, mainly due to the lack of transliteration datasets. In this paper, we make accessible ANETAC, an English-Arabic named entity transliteration and classification dataset that we built from freely available parallel translation corpora. It contains 79,924 English-Arabic named entities along with their respective classes, which can be Person, Location, or Organization. Table 2 shows statistics about the ANETAC named entity classes.

∗ Corresponding author. Feel free to contact me via my personal email: mohamedhadjameur@gmail.com
¹ The ANETAC dataset is freely available on GitHub: https://github.com/MohamedHadjAmeur/ANETAC.
arXiv:1907.03110v1 [cs.CL] 6 Jul 2019

Table 2: Statistics about the number of named entities belonging to each class [1]

Named entity    Count
Person          61,662
Location        12,679
Organization    5,583
All             79,924

To make it easier for other researchers to train and compare their own models, the ANETAC dataset is divided into training, development, and test sets as shown in Table 3.

Table 3: Instance counts in the train, development and test datasets of our transliteration corpus [1]

Sets                    Train     Dev      Test
Named entities count    75,898    1,004    3,013

As pointed out by many recent studies [1, 6], there is a lack of Arabic machine transliteration datasets.
To the best of our knowledge, there is only one freely available English-Arabic transliteration dataset, and it contains no more than 12,877 pairs². We therefore believe that our dataset will be a valuable addition. The importance of the ANETAC dataset can be summarized as follows:

• This dataset is useful for many applications such as (1) training state-of-the-art English-Arabic machine transliteration models, (2) training Arabic named entity classification models, (3) handling Out-Of-Vocabulary (OOV) words in machine translation, and (4) dealing with proper names in cross-lingual information retrieval.
• This dataset is mainly intended for researchers working on Arabic named entity transliteration, but it can also be used for named entity classification.
• This dataset also contains a test set that can be used as a benchmark to compare the results of English-Arabic transliteration systems. The first transliteration results on this test set have already been reported by Hadj Ameur et al. [1] and are shown in Section 3.

In the remainder of this paper, Section 2 presents the corpus construction methodology that we adopted in the development of this dataset. Section 3 presents the baseline transliteration results that have been obtained using the ANETAC dataset. Finally, Section 4 concludes the paper.

2 Building a Transliteration Corpus

As stated in the original work of Hadj Ameur et al. [1]³, the extraction system (see Fig. 1) uses freely available parallel corpora⁴ in order to automatically extract bilingual named entities. The English-Arabic corpora that we used are listed in Table 4.

Table 4: Statistics about the used English-Arabic parallel corpora [1]

Corpus              Sentences (in millions)
United Nations      10.6M
Open Subtitles      24.4M
News Commentary     0.2M
IWSLT2016           0.2M
All                 35.4M

² https://github.com/google/transliteration
³ We note that this description of the extraction system is mostly based on the original paper of Hadj Ameur et al. [1].
⁴ The English-Arabic parallel corpora that we used are available on the OPUS website: http://opus.nlpl.eu.

As shown in Fig. 1, the system starts with a preprocessing phase in which the English and Arabic sentences are tokenized and normalized. Then, the English named entities are identified in each sentence on the English side of the parallel corpus. A set of Arabic transliteration candidates is then associated with each English named entity. Finally, the best Arabic transliteration candidate is selected for each English named entity. The details of these steps are provided in the remainder of this section.

Figure 1: Architecture of our parallel English-Arabic named entity extraction system [1]

2.1 Parallel Named Entity Extraction

The ultimate goal is to extract the correct Arabic transliteration of each English named entity. Given a corpus of English-Arabic parallel sentences S = {(e_1, a_1), ..., (e_m, a_m)}, we use the Stanford English Named Entity Recognizer [7] to find all the English named entities that are present in the parallel corpus, E_ne = {n_1, n_2, ..., n_k}, where k is the total number of named entities. Since each single word belonging to a multi-word English named entity can always be transliterated on its own, without needing its context, we decomposed all the English named entities containing multiple words into several single-word entities. For each English named entity n_i belonging to an English sentence e_j, we end up with a list of pairs (n_i, a_j) denoting that the i-th English named entity (single word) is associated with the j-th Arabic sentence.

2.2 Candidates Extraction and Scoring

The previous step leaves us with a set of pairs (n_i, a_j), where n_i is the English named entity (word) and a_j is the Arabic sentence containing its transliteration.
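The pair construction described in Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build_entity_sentence_pairs` is hypothetical, the named entities are assumed to be already available as one list of strings per English sentence (in the paper they come from the Stanford Named Entity Recognizer [7]), and splitting on whitespace is a simplifying assumption for the decomposition into single words.

```python
from typing import List, Tuple


def build_entity_sentence_pairs(
    parallel_sentences: List[Tuple[str, str]],
    entities_per_sentence: List[List[str]],
) -> List[Tuple[str, str]]:
    """Return (n_i, a_j) pairs: one single-word English entity per Arabic sentence.

    parallel_sentences[j] is the pair (e_j, a_j); entities_per_sentence[j]
    holds the named entities recognized in e_j (assumed to be given).
    """
    pairs = []
    for (_english, arabic), entities in zip(parallel_sentences, entities_per_sentence):
        for entity in entities:
            # Decompose multi-word entities into single words (whitespace split
            # is an assumption of this sketch).
            for word in entity.split():
                pairs.append((word, arabic))
    return pairs


# Toy example made up for illustration; it is not taken from the corpus.
parallel = [("Farid Meziane visited Salford", "زار فريد مزيان سالفورد")]
entities = [["Farid Meziane", "Salford"]]
pairs = build_entity_sentence_pairs(parallel, entities)
for word, arabic_sentence in pairs:
    print(word, "->", arabic_sentence)
```

Each of the three single-word entities ends up paired with the same Arabic sentence, which is the input expected by the candidate extraction step.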
To find the correct transliteration of n_i in the Arabic sentence a_j, we first removed all the frequent Arabic words from a_j using a vocabulary containing the top n most frequent Arabic words, with n = 40,000, that we built automatically from our parallel corpus. This ensures that the remaining words in the Arabic sentence a_j are mostly rare words. All the remaining words in a_j are considered as transliteration candidates C(a_j) = {c_j1, c_j2, ..., c_jt}, where c_ji denotes the i-th candidate word found in the j-th Arabic sentence, and t is the total number of Arabic candidates in C(a_j). We used the transliteration tool available in the polyglot multilingual NLP library⁵ to obtain an approximate Arabic transliteration t_i of each English named entity n_i. For each English named entity n_i having the approximate transliteration t_i and the list of Arabic candidates C(a_j), the score of each Arabic candidate is estimated using the following three features:

1. The total number of shared characters: this feature takes into account the number of characters shared between each Arabic candidate in C(a_j) and the approximate transliteration t_i.
2. The longest shared sequence: this feature takes into account the length of the longest common sequence of characters between each Arabic candidate in C(a_j) and the approximate transliteration t_i.
3. Length difference penalty: this feature penalizes the candidates in C(a_j) according to how much their length differs from that of the approximate transliteration t_i.

The final score of each candidate is then estimated by averaging the scores of the three features. The candidate with the highest score is selected if its final score exceeds a certain confidence threshold. Some examples of the extracted English-Arabic named entities are provided in Table 5.
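The candidate filtering and scoring steps above can be sketched as follows. The paper does not give the exact formulas, so everything concrete here is an assumption of this sketch. Each feature is normalized to [0, 1] by the longer of the two strings, and the "longest shared sequence" is interpreted as the longest common contiguous substring. The threshold value 0.6 is hypothetical, and the approximate transliteration t_i is passed in as a plain string instead of being produced by polyglot. All function names are illustrative.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def shared_char_score(cand: str, approx: str) -> float:
    # Feature 1: number of shared characters (multiset overlap), normalized.
    overlap = sum((Counter(cand) & Counter(approx)).values())
    return overlap / max(len(cand), len(approx))


def longest_common_substring(a: str, b: str) -> int:
    # Dynamic programming over the lengths of common suffixes of prefixes.
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def lcs_score(cand: str, approx: str) -> float:
    # Feature 2: longest shared contiguous sequence, normalized.
    return longest_common_substring(cand, approx) / max(len(cand), len(approx))


def length_score(cand: str, approx: str) -> float:
    # Feature 3: 1 minus the relative length difference (the penalty).
    return 1.0 - abs(len(cand) - len(approx)) / max(len(cand), len(approx))


def score(cand: str, approx: str) -> float:
    # Final score: plain average of the three features.
    if not cand or not approx:
        return 0.0
    features = (
        shared_char_score(cand, approx),
        lcs_score(cand, approx),
        length_score(cand, approx),
    )
    return sum(features) / 3


def select_transliteration(
    arabic_tokens: Iterable[str],
    approx: str,
    frequent_words: Set[str],
    threshold: float = 0.6,  # hypothetical confidence threshold
) -> Optional[str]:
    # Candidates are the tokens left after removing frequent Arabic words.
    candidates = [w for w in arabic_tokens if w not in frequent_words]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(c, approx))
    return best if score(best, approx) >= threshold else None


# Toy example made up for illustration; it is not taken from the corpus.
tokens = ["وصل", "براندز", "اليوم"]
frequent = {"وصل", "اليوم"}
selected = select_transliteration(tokens, "برانديس", frequent)
print(selected)
```

Averaging keeps the three features on an equal footing. A weighted combination would be a natural variation, but the source only describes a plain average.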
The reader should recall that the Arabic language has no letters for the English sounds “v”, “p” and “g”.

⁵ https://github.com/aboSamoor/polyglot

Table 5: Some examples of the extracted English-Arabic named entities [1]

Entity class    English     Arabic
PERSON          Villalon    فيلالون (filaloun)
LOCATION        Nampa       نامبا (namba)
ORGANIZATION    Soogrim     سوغريم (soughrim)

3 Baseline Results

This section provides the English-to-Arabic and Arabic-to-English baseline transliteration results that we obtained when using the ANETAC dataset for both the training and testing of our models [1]. The baseline results (Table 6) are reported in terms of both Word Error Rate (WER) and Character Error Rate (CER) on the ANETAC test set⁶.

Table 6: Baseline transliteration results in terms of WER and CER reported on the ANETAC test set

Directions           WER      CER
English-to-Arabic    5.40     0.95
Arabic-to-English    65.16    16.35

⁶ https://github.com/MohamedHadjAmeur/ANETAC

As shown in Table 6, the results of the Arabic-to-English transliteration are still poor, so much work is still needed to improve them. We note that the baseline models we used are based on the attention-based encoder-decoder architecture [8] and were trained at the character level.

4 Conclusion

In this work, we have made accessible the ANETAC dataset, which we developed as part of our previous work [1]. We have shown how this dataset was built from parallel translation corpora by relying on several features and tools. We also presented the baseline results that we achieved on the tasks of English-to-Arabic and Arabic-to-English machine transliteration. We encourage all researchers who are interested in this task to try to achieve better results. Finally, we hope that this dataset will have a positive impact on the current state of Arabic-English named entity transliteration.

References

[1] Mohamed Seghir Hadj Ameur, Farid Meziane, and Ahmed Guessoum. Arabic machine transliteration using an attention-based encoder-decoder model. Procedia Computer Science, 117:287–297, 2017.
[2] Ulf Hermjakob, Kevin Knight, and Hal Daumé III. Name translation in statistical machine translation - learning when to transliterate. In ACL, pages 389–397, 2008.
[3] Nizar Habash. Four techniques for online handling of out-of-vocabulary words in Arabic-English statistical machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 57–60. Association for Computational Linguistics, 2008.
[4] Paola Virga and Sanjeev Khudanpur. Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition - Volume 15, pages 57–64. Association for Computational Linguistics, 2003.
[5] Atsushi Fujii and Tetsuya Ishikawa. Japanese/English cross-language information retrieval: Exploration of query translation and transliteration. Computers and the Humanities, 35(4):389–420, 2001.
[6] Mihaela Rosca and Thomas Breuel. Sequence-to-sequence neural network models for transliteration. arXiv preprint arXiv:1610.09565, 2016.
[7] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics, 2005.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

ANETAC: Arabic named entity transliteration and classification dataset

https://salford-repository.worktribe.com/file/1368215/1/1907.03110.pdf
