95 research outputs found
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Recurrent Memory Networks for Language Modeling
Recurrent Neural Networks (RNN) have obtained excellent result in many
natural language processing (NLP) tasks. However, understanding and
interpreting the source of this success remains a challenge. In this paper, we
propose Recurrent Memory Network (RMN), a novel RNN architecture, that not only
amplifies the power of RNN but also facilitates our understanding of its
internal functioning and allows us to discover underlying patterns in data. We
demonstrate the power of RMN on language modeling and sentence completion
tasks. On language modeling, RMN outperforms Long Short-Term Memory (LSTM)
network on three large German, Italian, and English dataset. Additionally we
perform in-depth analysis of various linguistic dimensions that RMN captures.
On Sentence Completion Challenge, for which it is essential to capture sentence
coherence, our RMN obtains 69.2% accuracy, surpassing the previous
state-of-the-art by a large margin.Comment: 8 pages, 6 figures. Accepted at NAACL 201
Learning Topic-Sensitive Word Representations
Distributed word representations are widely used for modeling words in NLP
tasks. Most of the existing models generate one representation per word and do
not consider different meanings of a word. We present two approaches to learn
multiple topic-sensitive representations per word by using Hierarchical
Dirichlet Process. We observe that by modeling topics and integrating topic
distributions for each document we obtain representations that are able to
distinguish between different meanings of a given word. Our models yield
statistically significant improvements for the lexical substitution task
indicating that commonly used single word representations, even when combined
with contextual information, are insufficient for this task.Comment: 5 pages, 1 figure, Accepted at ACL 201
Examining the Tip of the Iceberg: A Data Set for Idiom Translation
Neural Machine Translation (NMT) has been widely used in recent years with
significant improvements for many language pairs. Although state-of-the-art NMT
systems are generating progressively better translations, idiom translation
remains one of the open challenges in this field. Idioms, a category of
multiword expressions, are an interesting language phenomenon where the overall
meaning of the expression cannot be composed from the meanings of its parts. A
first important challenge is the lack of dedicated data sets for learning and
evaluating idiom translation. In this paper we address this problem by creating
the first large-scale data set for idiom translation. Our data set is
automatically extracted from a widely used German-English translation corpus
and includes, for each language direction, a targeted evaluation set where all
sentences contain idioms and a regular training corpus where sentences
including idioms are marked. We release this data set and use it to perform
preliminary NMT experiments as the first step towards better idiom translation.Comment: Accepted at LREC 201
Data Augmentation for Low-Resource Neural Machine Translation
The quality of a Neural Machine Translation system depends substantially on
the availability of sizable parallel corpora. For low-resource language pairs
this is not the case, resulting in poor translation quality. Inspired by work
in computer vision, we propose a novel data augmentation approach that targets
low-frequency words by generating new sentence pairs containing rare words in
new, synthetically created contexts. Experimental results on simulated
low-resource settings show that our method improves translation quality by up
to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.Comment: 5 pages, 2 figures, Accepted at ACL 201
An Open-Domain Dialog Act Taxonomy
This document defines the taxonomy of dialog acts that are necessary to encode domain-independent dialog moves in the context of a task-oriented, open-domain dialog. Such taxonomy is formulated to satisfy two complementary requirements: on the one hand, domain independence, i.e. the power to cover all the range of possible interactions in any type of conversation (particularly conversation oriented to the performance of tasks). On the other hand, the ability to instantiate a concrete set of tasks as defined by a specific knowledge base (such as an ontology of domain concepts and actions) and within a particular language. For the modeling of dialog acts, inspiration is taken from several well-known dialog annotation schemes, such as DAMSL (Core & Allen, 1997), TRAINS (Traum, 1996) and VERBMOBIL (Alexandersson et al., 1997)
dynamically shaping the reordering search space of phrase based statistical machine translation
Defining the reordering search space is a crucial issue in phrase-based SMT between distant languages. In fact, the optimal trade-off between accuracy and complexity of decoding is nowadays reached by harshly limiting the input permutation space. We propose a method to dynamically shape such space and, thus, capture long-range word movements without hurting translation quality nor decoding time. The space defined by loose reordering constraints is dynamically pruned through a binary classifier that predicts whether a given input word should be translated right after another. The integration of this model into a phrase-based decoder improves a strong Arabic-English baseline already including state-of-the-art early distortion cost (Moore and Quirk, 2007) and hierarchical phrase orientation models (Galley and Manning, 2008). Significant improvements in the reordering of verbs are achieved by a system that is notably faster than the baseline, while bleu and meteor remain stable, or even increase, at a very high distortion limit
Zero-shot Dependency Parsing with Pre-trained Multilingual Sentence Representations
We investigate whether off-the-shelf deep bidirectional sentence
representations trained on a massively multilingual corpus (multilingual BERT)
enable the development of an unsupervised universal dependency parser. This
approach only leverages a mix of monolingual corpora in many languages and does
not require any translation data making it applicable to low-resource
languages. In our experiments we outperform the best CoNLL 2018
language-specific systems in all of the shared task's six truly low-resource
languages while using a single system. However, we also find that (i) parsing
accuracy still varies dramatically when changing the training languages and
(ii) in some target languages zero-shot transfer fails under all tested
conditions, raising concerns on the 'universality' of the whole approach.Comment: DeepLo workshop, EMNLP 201
- …