284 research outputs found
Establishing a New State-of-the-Art for French Named Entity Recognition
The French TreeBank developed at the University Paris 7 is the main source of
morphosyntactic and syntactic annotations for French. However, it does not
include explicit information related to named entities, which are among the
most useful information for several natural language processing tasks and
applications. Moreover, no large-scale French corpus with named entity
annotations contain referential information, which complement the type and the
span of each mention with an indication of the entity it refers to. We have
manually annotated the French TreeBank with such information, after an
automatic pre-annotation step. We sketch the underlying annotation guidelines
and we provide a few figures about the resulting annotations
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Supersense Tagging for Arabic: the MT-in-the-Middle Attack
We consider the task of tagging Arabic nouns with WordNet supersenses. Three approaches are evaluated. The first uses an expertcrafted but limited-coverage lexicon, Arabic WordNet, and heuristics. The second uses unsupervised sequence modeling. The third and most successful approach uses machine translation to translate the Arabic into English, which is automatically tagged with English supersenses, the results of which are then projected back into Arabic. Analysis shows gains and remaining obstacles in four Wikipedia topical domains.
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Joint Dropout: Improving Generalizability in Low-Resource Neural Machine Translation through Phrase Pair Variables
Despite the tremendous success of Neural Machine Translation (NMT), its
performance on low-resource language pairs still remains subpar, partly due to
the limited ability to handle previously unseen inputs, i.e., generalization.
In this paper, we propose a method called Joint Dropout, that addresses the
challenge of low-resource neural machine translation by substituting phrases
with variables, resulting in significant enhancement of compositionality, which
is a key aspect of generalization. We observe a substantial improvement in
translation quality for language pairs with minimal resources, as seen in BLEU
and Direct Assessment scores. Furthermore, we conduct an error analysis, and
find Joint Dropout to also enhance generalizability of low-resource NMT in
terms of robustness and adaptability across different domainsComment: Accepted at MT Summit 202
- …