Nakdan: Professional Hebrew Diacritizer
We present a system for automatic diacritization of Hebrew text. The system
combines modern neural models with carefully curated declarative linguistic
knowledge and comprehensive manually constructed tables and dictionaries.
Besides providing state-of-the-art diacritization accuracy, the system also
supports an interface for manual editing and correction of the automatic
output, and has several features which make it particularly useful for
preparation of scientific editions of Hebrew texts. The system supports Modern
Hebrew, Rabbinic Hebrew and Poetic Hebrew. The system is freely accessible for
all use at http://nakdanpro.dicta.org.il.
Comment: Accepted to ACL 2020, System Demonstration
Arabic natural language processing: An overview
Arabic is recognised as the 4th most used language of the Internet. Arabic
has three main varieties: (1) classical Arabic (CA), (2) Modern Standard Arabic
(MSA), (3) Arabic Dialect (AD). MSA and AD could be written either in Arabic or
in Roman script (Arabizi), which corresponds to Arabic written with Latin
letters, numerals and punctuation. Due to the complexity of this language and
the number of corresponding challenges for NLP, many surveys have been
conducted to synthesise the work done on Arabic. However, these surveys
principally focus on two varieties of Arabic (MSA and AD, written in Arabic
letters only) and are now somewhat dated (no such survey has appeared since
2015), so they do not cover recent resources and tools. To bridge this gap,
we propose a survey focusing on 90 recent research papers (74% of which were
published after 2015). Our study presents and classifies the work done on all
three varieties of Arabic, covering both Arabic and Arabizi script, and
associates each work with its resources whenever these are publicly available.
Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles
We describe a transfer method based on annotation projection to develop a
dependency-based semantic role labeling system for languages for which no
supervised linguistic information other than parallel data is available. Unlike
previous work that presumes the availability of supervised features such as
lemmas, part-of-speech tags, and dependency parse trees, we only make use of
word and character features. Our deep model considers using character-based
representations as well as unsupervised stem embeddings to alleviate the need
for supervised features. Our model outperforms a state-of-the-art method
that uses supervised lexico-syntactic features on 6 out of 7 languages in the
Universal Proposition Bank.
Comment: Accepted at the 13th International Conference on Computational Semantics (IWCS 2019)
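The core of annotation projection can be sketched in a few lines: semantic
role labels on source tokens are copied onto their word-aligned target
tokens. The function name, role labels, and alignment data below are
illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical sketch of annotation projection for cross-lingual SRL:
# labels assigned on the source side of a parallel sentence are carried
# across word alignments to the target side.

def project_roles(src_roles, alignments):
    """Map each labeled source token to its aligned target token.

    src_roles:  {src_index: role_label}
    alignments: list of (src_index, tgt_index) word-alignment pairs
    Returns     {tgt_index: role_label}
    """
    tgt_roles = {}
    for src_i, tgt_i in alignments:
        if src_i in src_roles:
            tgt_roles[tgt_i] = src_roles[src_i]
    return tgt_roles

# "She opened the door" with PropBank-style labels, aligned to a
# hypothetical target sentence with different word order.
src_roles = {0: "A0", 1: "PRED", 3: "A1"}
alignments = [(0, 1), (1, 0), (2, 2), (3, 3)]
print(project_roles(src_roles, alignments))
# {1: 'A0', 0: 'PRED', 3: 'A1'}
```

In practice the projected labels are noisy, which is why the paper trains a
target-side labeler on them rather than using them directly.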
Multi Task Deep Morphological Analyzer: Context Aware Joint Morphological Tagging and Lemma Prediction
The ambiguities introduced by the recombination of morphemes, which yields
several possible inflections for a word, make the prediction of syntactic
traits in Morphologically Rich Languages (MRLs) a notoriously complicated task.
We propose the Multi Task Deep Morphological analyzer (MT-DMA), a
character-level neural morphological analyzer based on multitask learning of
word-level tag markers for Hindi and Urdu. MT-DMA predicts a set of six
morphological tags for words of Indo-Aryan languages: Parts-of-speech (POS),
Gender (G), Number (N), Person (P), Case (C), Tense-Aspect-Modality (TAM)
marker as well as the Lemma (L) by jointly learning all these in one trainable
framework. We show the effectiveness of training such deep neural networks
by simultaneously optimizing multiple loss functions and sharing initial
parameters for context-aware morphological analysis. Exploiting
character-level features in a phonological space optimized for each tag using
a multi-objective genetic algorithm, our model establishes a new
state-of-the-art accuracy score on all seven tasks for both languages. MT-DMA is
publicly accessible: code, models and data are available at
https://github.com/Saurav0074/morph_analyzer.
Comment: 28 pages, 8 figures, 11 tables
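The simultaneous optimization of multiple loss functions described above
amounts to minimizing a (possibly weighted) sum of per-task losses over
shared parameters. A minimal sketch, with hypothetical task weights and loss
values:

```python
# Hypothetical sketch of a multi-task objective: one loss per morphological
# tag plus the lemma, combined into one scalar that is backpropagated
# through the shared encoder. Values are illustrative.

def joint_loss(task_losses, weights=None):
    """Weighted sum of per-task losses; equal weights by default."""
    weights = weights or {task: 1.0 for task in task_losses}
    return sum(weights[task] * loss for task, loss in task_losses.items())

losses = {"POS": 0.8, "Gender": 0.3, "Number": 0.25, "Person": 0.4,
          "Case": 0.5, "TAM": 0.6, "Lemma": 1.1}
print(round(joint_loss(losses), 2))  # 3.95
```

Sharing the encoder across the seven tasks is what lets signal from, say,
Gender prediction regularize POS prediction and vice versa.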
Morphological Embeddings for Named Entity Recognition in Morphologically Rich Languages
In this work, we present new state-of-the-art results of 93.59% and 79.59%
for Turkish and Czech named entity recognition based on the model of (Lample et
al., 2016). We contribute by proposing several schemes for representing the
morphological analysis of a word in the context of named entity recognition. We
show that a concatenation of this representation with the word and character
embeddings improves the performance. The effect of these representation schemes
on the tagging performance is also investigated.
Comment: Working draft
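The concatenation scheme described above can be sketched as follows, with a
multi-hot encoding of the morphological analysis standing in for the learned
representation schemes; the tag set and vector values are illustrative
assumptions:

```python
# Hypothetical sketch: a word representation for NER built by concatenating
# a word embedding, a character-level embedding, and a vector encoding the
# word's morphological analysis.

TAGS = ["Noun", "Verb", "Gen", "Poss", "Plural"]  # toy morphological tag set

def morph_vector(analysis_tags):
    """Multi-hot encoding of a word's morphological analysis."""
    return [1.0 if tag in analysis_tags else 0.0 for tag in TAGS]

def word_representation(word_emb, char_emb, analysis_tags):
    """Concatenate the three feature vectors (plain-list concatenation)."""
    return word_emb + char_emb + morph_vector(analysis_tags)

rep = word_representation([0.1, 0.2], [0.3, 0.4], {"Noun", "Gen"})
print(rep)  # [0.1, 0.2, 0.3, 0.4, 1.0, 0.0, 1.0, 0.0, 0.0]
```

In the actual model the concatenated vector would feed a BiLSTM-CRF tagger
in the style of Lample et al. (2016).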
First Result on Arabic Neural Machine Translation
Neural machine translation has become a major alternative to widely used
phrase-based statistical machine translation. We notice, however, that much
of the research on neural machine translation has focused on European
languages despite its language-agnostic nature. In this paper, we apply
neural machine translation to the task of Arabic translation (Ar<->En) and
compare it against
a standard phrase-based translation system. We run extensive comparison using
various configurations in preprocessing Arabic script and show that the
phrase-based and neural translation systems perform comparably to each other
and that proper preprocessing of Arabic script has a similar effect on both of
the systems. We observe, however, that the neural machine translation system
significantly outperforms the phrase-based system on an out-of-domain test
set, making it attractive for real-world deployment.
Comment: EMNLP submission
Machine-Translation History and Evolution: Survey for Arabic-English Translations
As a result of rapid changes in information and communication technology
(ICT), the world has become a small village where people from all over the
world connect with each other in dialogue and communication via the Internet.
Communication has also become a daily routine activity under the new
globalization, in which companies and even universities become global,
residing across country borders. Translation has therefore become a needed
activity in this connected world: ICT makes it possible for a student in one
country to take a course, or even a degree, from another country anytime,
anywhere. The resulting communication still needs language as a means of
helping the receiver understand the contents of the sent message. People need
automated translation applications because human translators are hard to find
at all times, and human translation is very expensive compared to automated
translation. Several pieces of research describe the electronic process of
machine translation. In this paper, the authors study some of these previous
works and explore some of the tools needed for machine translation. This
research contributes to the machine-translation area by giving future
researchers a summary of the machine-translation research groups and by
shedding light on the importance of the translation mechanism.
Comment: 19 pages, 5 figures, 3 tables, survey article paper
Morphological analysis using a sequence decoder
We introduce Morse, a recurrent encoder-decoder model that produces
morphological analyses of each word in a sentence. The encoder turns the
relevant information about the word and its context into a fixed size vector
representation and the decoder generates the sequence of characters for the
lemma followed by a sequence of individual morphological features. We show that
generating morphological features individually rather than as a combined tag
allows the model to handle rare or unseen tags and outperform whole-tag models.
In addition, generating morphological features as a sequence rather than as, e.g.,
an unordered set allows our model to produce an arbitrary number of features
that represent multiple inflectional groups in morphologically complex
languages. We obtain state-of-the-art results in nine languages of different
morphological complexity under low-resource, high-resource and transfer
learning settings. We also introduce TrMor2018, a new high accuracy Turkish
morphology dataset. Our Morse implementation and the TrMor2018 dataset are
available online to support future research (see
https://github.com/ai-ku/Morse.jl for a Morse implementation in Julia/Knet
[knet2016mlsys] and https://github.com/ai-ku/TrMor2018 for the new Turkish
dataset).
Comment: Final TACL version
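The decoder's target can be pictured as one flat sequence: the lemma spelled
out character by character, followed by each morphological feature as its own
token, rather than one fused whole tag. A toy sketch with a hypothetical
Turkish example (names and feature inventory are illustrative):

```python
# Hypothetical sketch of Morse-style decoder output: lemma characters first,
# then individual morphological feature tokens. Emitting features one at a
# time lets the model produce tag combinations never seen in training.

def target_sequence(lemma, features):
    """Flatten lemma + features into one output token sequence."""
    return list(lemma) + features

# A plausible analysis of Turkish "gitti" ("he/she went"): lemma "git"
# plus verbal features.
seq = target_sequence("git", ["Verb", "Pos", "Past", "A3sg"])
print(seq)  # ['g', 'i', 't', 'Verb', 'Pos', 'Past', 'A3sg']
```

Because the output length is unconstrained, the same decoder can emit extra
feature groups for morphologically complex words without changing the
tag vocabulary.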
Analysis Methods in Neural Language Processing: A Survey
The field of natural language processing has seen impressive progress in
recent years, with neural network models replacing many of the traditional
systems. A plethora of new models have been proposed, many of which are thought
to be opaque compared to their feature-rich counterparts. This has led
researchers to analyze, interpret, and evaluate neural networks in novel and
more fine-grained ways. In this survey paper, we review analysis methods in
neural language processing, categorize them according to prominent research
trends, highlight existing limitations, and point to potential directions for
future work.
Comment: Version including the supplementary materials (3 tables), also available at https://boknilev.github.io/nlp-analysis-method
Improving Named Entity Recognition by Jointly Learning to Disambiguate Morphological Tags
Previous studies have shown that linguistic features of a word such as
possession, genitive or other grammatical cases can be employed in word
representations of a named entity recognition (NER) tagger to improve the
performance for morphologically rich languages. However, these taggers
require external morphological disambiguation (MD) tools, which are hard to
obtain or non-existent for many languages, in order to function. In this
work, we propose a model
which alleviates the need for such disambiguators by jointly learning NER and
MD taggers in languages for which one can provide a list of candidate
morphological analyses. We show that this can be done independent of the
morphological annotation schemes, which differ among languages. Our experiments
employing three different model architectures that join these two tasks show
that joint learning improves NER performance. Furthermore, the morphological
disambiguator's performance is shown to be competitive.
Comment: COLING 2018 (accepted)
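The setting described above, choosing one analysis per word from a candidate
list while also assigning an NER tag, can be sketched as below. The scoring
functions stand in for the jointly learned networks, and the Turkish example
and analysis strings are illustrative assumptions:

```python
# Hypothetical sketch of joint NER + morphological disambiguation: for each
# token, pick the best-scoring candidate analysis, then tag the entity using
# that analysis. Toy scorers replace the trained model.

def joint_tag(tokens, candidate_analyses, score_md, score_ner):
    """Return (token, chosen_analysis, ner_tag) triples."""
    output = []
    for tok in tokens:
        analysis = max(candidate_analyses[tok], key=lambda a: score_md(tok, a))
        ner = score_ner(tok, analysis)
        output.append((tok, analysis, ner))
    return output

# Turkish "Ankara'da" ("in Ankara") with two candidate analyses.
candidates = {"Ankara'da": ["Ankara+Noun+Prop+Loc", "ankara+Noun+Loc"]}
score_md = lambda tok, a: a.count("Prop")          # prefer proper-noun reading
score_ner = lambda tok, a: "B-LOC" if "Prop" in a else "O"
print(joint_tag(["Ankara'da"], candidates, score_md, score_ner))
# [("Ankara'da", 'Ankara+Noun+Prop+Loc', 'B-LOC')]
```

The payoff of training the two scorers jointly is visible even in the toy
example: the proper-noun reading and the LOC entity tag reinforce each other.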