Nakdan: Professional Hebrew Diacritizer
We present a system for automatic diacritization of Hebrew text. The system
combines modern neural models with carefully curated declarative linguistic
knowledge and comprehensive manually constructed tables and dictionaries.
Besides providing state-of-the-art diacritization accuracy, the system also
supports an interface for manual editing and correction of the automatic
output, and has several features which make it particularly useful for
preparation of scientific editions of Hebrew texts. The system supports Modern
Hebrew, Rabbinic Hebrew and Poetic Hebrew. The system is freely accessible for
all use at http://nakdanpro.dicta.org.il.
Comment: Accepted to ACL 2020, System Demonstration
Arabic natural language processing: An overview
Arabic is recognised as the 4th most used language of the Internet. Arabic
has three main varieties: (1) classical Arabic (CA), (2) Modern Standard Arabic
(MSA), (3) Arabic Dialect (AD). MSA and AD could be written either in Arabic or
in Roman script (Arabizi), which corresponds to Arabic written with Latin
letters, numerals and punctuation. Due to the complexity of this language and
the number of corresponding challenges for NLP, many surveys have been
conducted to synthesise the work done on Arabic. However, these surveys
principally focus on two varieties of Arabic (MSA and AD, written in Arabic
letters only) and are now somewhat dated (no such survey has appeared since
2015), so they do not cover recent resources and tools. To bridge this gap,
we propose a survey focusing on 90 recent research papers (74% of which were
published after 2015). Our study presents and classifies the work done on all
three varieties of Arabic, covering both Arabic and Arabizi script, and
associates each work with its resources whenever these are publicly available.
Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles
We describe a transfer method based on annotation projection to develop a
dependency-based semantic role labeling system for languages for which no
supervised linguistic information other than parallel data is available. Unlike
previous work that presumes the availability of supervised features such as
lemmas, part-of-speech tags, and dependency parse trees, we only make use of
word and character features. Our deep model considers using character-based
representations as well as unsupervised stem embeddings to alleviate the need
for supervised features. Our model outperforms a state-of-the-art method
that uses supervised lexico-syntactic features on 6 out of 7 languages in the
Universal Proposition Bank.
Comment: Accepted at the 13th International Conference on Computational Semantics (IWCS 2019)
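The core of annotation projection can be sketched in a few lines: semantic
role labels on source tokens are copied onto their word-aligned target
tokens. The function name, role labels, and alignment data below are
illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical sketch of annotation projection for cross-lingual SRL:
# labels assigned on the source side of a parallel sentence are carried
# across word alignments to the target side.

def project_roles(src_roles, alignments):
    """Map each labeled source token to its aligned target token.

    src_roles:  {src_index: role_label}
    alignments: list of (src_index, tgt_index) word-alignment pairs
    Returns     {tgt_index: role_label}
    """
    tgt_roles = {}
    for src_i, tgt_i in alignments:
        if src_i in src_roles:
            tgt_roles[tgt_i] = src_roles[src_i]
    return tgt_roles

# "She opened the door" with PropBank-style labels, aligned to a
# hypothetical target sentence with different word order.
src_roles = {0: "A0", 1: "PRED", 3: "A1"}
alignments = [(0, 1), (1, 0), (2, 2), (3, 3)]
print(project_roles(src_roles, alignments))
# {1: 'A0', 0: 'PRED', 3: 'A1'}
```

In practice the projected labels are noisy, which is why the paper trains a
target-side labeler on them rather than using them directly.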
Multi Task Deep Morphological Analyzer: Context Aware Joint Morphological Tagging and Lemma Prediction
The ambiguities introduced by the recombination of morphemes, which yields
several possible inflections for a word, make the prediction of syntactic
traits in Morphologically Rich Languages (MRLs) a notoriously complicated task.
We propose the Multi Task Deep Morphological analyzer (MT-DMA), a
character-level neural morphological analyzer based on multitask learning of
word-level tag markers for Hindi and Urdu. MT-DMA predicts a set of six
morphological tags for words of Indo-Aryan languages: Parts-of-speech (POS),
Gender (G), Number (N), Person (P), Case (C), Tense-Aspect-Modality (TAM)
marker as well as the Lemma (L) by jointly learning all these in one trainable
framework. We show the effectiveness of training such deep neural networks
by simultaneously optimizing multiple loss functions and sharing initial
parameters for context-aware morphological analysis. Exploiting
character-level features in a phonological space optimized for each tag using
a multi-objective genetic algorithm, our model establishes a new
state-of-the-art accuracy score on all seven tasks for both languages. MT-DMA is
publicly accessible: code, models and data are available at
https://github.com/Saurav0074/morph_analyzer.
Comment: 28 pages, 8 figures, 11 tables
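The simultaneous optimization of multiple loss functions described above
amounts to minimizing a (possibly weighted) sum of per-task losses over
shared parameters. A minimal sketch, with hypothetical task weights and loss
values:

```python
# Hypothetical sketch of a multi-task objective: one loss per morphological
# tag plus the lemma, combined into one scalar that is backpropagated
# through the shared encoder. Values are illustrative.

def joint_loss(task_losses, weights=None):
    """Weighted sum of per-task losses; equal weights by default."""
    weights = weights or {task: 1.0 for task in task_losses}
    return sum(weights[task] * loss for task, loss in task_losses.items())

losses = {"POS": 0.8, "Gender": 0.3, "Number": 0.25, "Person": 0.4,
          "Case": 0.5, "TAM": 0.6, "Lemma": 1.1}
print(round(joint_loss(losses), 2))  # 3.95
```

Sharing the encoder across the seven tasks is what lets signal from, say,
Gender prediction regularize POS prediction and vice versa.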
Morphological Embeddings for Named Entity Recognition in Morphologically Rich Languages
In this work, we present new state-of-the-art results of 93.59% and 79.59%
for Turkish and Czech named entity recognition based on the model of (Lample et
al., 2016). We contribute by proposing several schemes for representing the
morphological analysis of a word in the context of named entity recognition. We
show that a concatenation of this representation with the word and character
embeddings improves the performance. The effect of these representation schemes
on the tagging performance is also investigated.
Comment: Working draft
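The concatenation scheme described above can be sketched as follows, with a
multi-hot encoding of the morphological analysis standing in for the learned
representation schemes; the tag set and vector values are illustrative
assumptions:

```python
# Hypothetical sketch: a word representation for NER built by concatenating
# a word embedding, a character-level embedding, and a vector encoding the
# word's morphological analysis.

TAGS = ["Noun", "Verb", "Gen", "Poss", "Plural"]  # toy morphological tag set

def morph_vector(analysis_tags):
    """Multi-hot encoding of a word's morphological analysis."""
    return [1.0 if tag in analysis_tags else 0.0 for tag in TAGS]

def word_representation(word_emb, char_emb, analysis_tags):
    """Concatenate the three feature vectors (plain-list concatenation)."""
    return word_emb + char_emb + morph_vector(analysis_tags)

rep = word_representation([0.1, 0.2], [0.3, 0.4], {"Noun", "Gen"})
print(rep)  # [0.1, 0.2, 0.3, 0.4, 1.0, 0.0, 1.0, 0.0, 0.0]
```

In the actual model the concatenated vector would feed a BiLSTM-CRF tagger
in the style of Lample et al. (2016).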
First Result on Arabic Neural Machine Translation
Neural machine translation has become a major alternative to widely used
phrase-based statistical machine translation. We notice, however, that much
of the research on neural machine translation has focused on European
languages despite its language-agnostic nature. In this paper, we apply
neural machine translation to the task of Arabic translation (Ar<->En) and
compare it against
a standard phrase-based translation system. We run extensive comparison using
various configurations in preprocessing Arabic script and show that the
phrase-based and neural translation systems perform comparably to each other
and that proper preprocessing of Arabic script has a similar effect on both of
the systems. We observe, however, that the neural machine translation system
significantly outperforms the phrase-based system on an out-of-domain test
set, making it attractive for real-world deployment.
Comment: EMNLP submission
Machine-Translation History and Evolution: Survey for Arabic-English Translations
As a result of rapid changes in information and communication technology
(ICT), the world has become a small village where people from all over the
world connect with each other in dialogue and communication via the Internet.
Communication has also become a daily routine activity under the new
globalization, in which companies and even universities become global,
residing across country borders. Translation has therefore become a needed
activity in this connected world: ICT makes it possible for a student in one
country to take a course, or even a degree, from another country anytime,
anywhere. The resulting communication still needs language as a means of
helping the receiver understand the contents of the sent message. People need
automated translation applications because human translators are hard to find
at all times, and human translation is very expensive compared to automated
translation. Several pieces of research describe the electronic process of
machine translation. In this paper, the authors study some of these previous
works and explore some of the tools needed for machine translation. This
research contributes to the machine-translation area by giving future
researchers a summary of the machine-translation research groups and by
shedding light on the importance of the translation mechanism.
Comment: 19 pages, 5 figures, 3 tables, survey article paper
Morphological analysis using a sequence decoder
We introduce Morse, a recurrent encoder-decoder model that produces
morphological analyses of each word in a sentence. The encoder turns the
relevant information about the word and its context into a fixed size vector
representation and the decoder generates the sequence of characters for the
lemma followed by a sequence of individual morphological features. We show that
generating morphological features individually rather than as a combined tag
allows the model to handle rare or unseen tags and outperform whole-tag models.
In addition, generating morphological features as a sequence rather than as, e.g.,
an unordered set allows our model to produce an arbitrary number of features
that represent multiple inflectional groups in morphologically complex
languages. We obtain state-of-the-art results in nine languages of different
morphological complexity under low-resource, high-resource and transfer
learning settings. We also introduce TrMor2018, a new high accuracy Turkish
morphology dataset. Our Morse implementation and the TrMor2018 dataset are
available online to support future research (see
https://github.com/ai-ku/Morse.jl for a Morse implementation in Julia/Knet
[knet2016mlsys] and https://github.com/ai-ku/TrMor2018 for the new Turkish
dataset).
Comment: Final TACL version
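The decoder's target can be pictured as one flat sequence: the lemma spelled
out character by character, followed by each morphological feature as its own
token, rather than one fused whole tag. A toy sketch with a hypothetical
Turkish example (names and feature inventory are illustrative):

```python
# Hypothetical sketch of Morse-style decoder output: lemma characters first,
# then individual morphological feature tokens. Emitting features one at a
# time lets the model produce tag combinations never seen in training.

def target_sequence(lemma, features):
    """Flatten lemma + features into one output token sequence."""
    return list(lemma) + features

# A plausible analysis of Turkish "gitti" ("he/she went"): lemma "git"
# plus verbal features.
seq = target_sequence("git", ["Verb", "Pos", "Past", "A3sg"])
print(seq)  # ['g', 'i', 't', 'Verb', 'Pos', 'Past', 'A3sg']
```

Because the output length is unconstrained, the same decoder can emit extra
feature groups for morphologically complex words without changing the
tag vocabulary.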
Analysis Methods in Neural Language Processing: A Survey
The field of natural language processing has seen impressive progress in
recent years, with neural network models replacing many of the traditional
systems. A plethora of new models have been proposed, many of which are thought
to be opaque compared to their feature-rich counterparts. This has led
researchers to analyze, interpret, and evaluate neural networks in novel and
more fine-grained ways. In this survey paper, we review analysis methods in
neural language processing, categorize them according to prominent research
trends, highlight existing limitations, and point to potential directions for
future work.
Comment: Version including the supplementary materials (3 tables), also available at https://boknilev.github.io/nlp-analysis-method
Improving Named Entity Recognition by Jointly Learning to Disambiguate Morphological Tags
Previous studies have shown that linguistic features of a word such as
possession, genitive or other grammatical cases can be employed in word
representations of a named entity recognition (NER) tagger to improve the
performance for morphologically rich languages. However, these taggers
require external morphological disambiguation (MD) tools, which are hard to
obtain or non-existent for many languages, in order to function. In this
work, we propose a model
which alleviates the need for such disambiguators by jointly learning NER and
MD taggers in languages for which one can provide a list of candidate
morphological analyses. We show that this can be done independent of the
morphological annotation schemes, which differ among languages. Our experiments
employing three different model architectures that join these two tasks show
that joint learning improves NER performance. Furthermore, the morphological
disambiguator's performance is shown to be competitive.
Comment: COLING 2018 (accepted)
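The setting described above, choosing one analysis per word from a candidate
list while also assigning an NER tag, can be sketched as below. The scoring
functions stand in for the jointly learned networks, and the Turkish example
and analysis strings are illustrative assumptions:

```python
# Hypothetical sketch of joint NER + morphological disambiguation: for each
# token, pick the best-scoring candidate analysis, then tag the entity using
# that analysis. Toy scorers replace the trained model.

def joint_tag(tokens, candidate_analyses, score_md, score_ner):
    """Return (token, chosen_analysis, ner_tag) triples."""
    output = []
    for tok in tokens:
        analysis = max(candidate_analyses[tok], key=lambda a: score_md(tok, a))
        ner = score_ner(tok, analysis)
        output.append((tok, analysis, ner))
    return output

# Turkish "Ankara'da" ("in Ankara") with two candidate analyses.
candidates = {"Ankara'da": ["Ankara+Noun+Prop+Loc", "ankara+Noun+Loc"]}
score_md = lambda tok, a: a.count("Prop")          # prefer proper-noun reading
score_ner = lambda tok, a: "B-LOC" if "Prop" in a else "O"
print(joint_tag(["Ankara'da"], candidates, score_md, score_ner))
# [("Ankara'da", 'Ankara+Noun+Prop+Loc', 'B-LOC')]
```

The payoff of training the two scorers jointly is visible even in the toy
example: the proper-noun reading and the LOC entity tag reinforce each other.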