5,774 research outputs found
Example-based machine translation of the Basque language
Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus
(270, 000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art
approaches according to several common automatic evaluation metrics
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Detecting and Explaining Causes From Text For a Time Series Event
Explaining underlying causes or effects about events is a challenging but
valuable task. We define a novel problem of generating explanations of a time
series event by (1) searching cause and effect relationships of the time series
with textual data and (2) constructing a connecting chain between them to
generate an explanation. To detect causal features from text, we propose a
novel method based on the Granger causality of time series between features
extracted from text such as N-grams, topics, sentiments, and their composition.
The generation of the sequence of causal entities requires a commonsense
causative knowledge base with efficient reasoning. To ensure good
interpretability and appropriate lexical usage we combine symbolic and neural
representations, using a neural reasoning algorithm trained on commonsense
causal tuples to predict the next cause step. Our quantitative and human
analysis show empirical evidence that our method successfully extracts
meaningful causality relationships between time series with textual features
and generates appropriate explanation between them.Comment: Accepted at EMNLP 201
Weakly-Supervised Alignment of Video With Text
Suppose that we are given a set of videos, along with natural language
descriptions in the form of multiple sentences (e.g., manual annotations, movie
scripts, sport summaries etc.), and that these sentences appear in the same
temporal order as their visual counterparts. We propose in this paper a method
for aligning the two modalities, i.e., automatically providing a time stamp for
every sentence. Given vectorial features for both video and text, we propose to
cast this task as a temporal assignment problem, with an implicit linear
mapping between the two feature modalities. We formulate this problem as an
integer quadratic program, and solve its continuous convex relaxation using an
efficient conditional gradient algorithm. Several rounding procedures are
proposed to construct the final integer solution. After demonstrating
significant improvements over the state of the art on the related task of
aligning video with symbolic labels [7], we evaluate our method on a
challenging dataset of videos with associated textual descriptions [36], using
both bag-of-words and continuous representations for text.Comment: ICCV 2015 - IEEE International Conference on Computer Vision, Dec
2015, Santiago, Chil
Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection
Grammatical error correction, like other machine learning tasks, greatly
benefits from large quantities of high quality training data, which is
typically expensive to produce. While writing a program to automatically
generate realistic grammatical errors would be difficult, one could learn the
distribution of naturallyoccurring errors and attempt to introduce them into
other datasets. Initial work on inducing errors in this way using statistical
machine translation has shown promise; we investigate cheaply constructing
synthetic samples, given a small corpus of human-annotated data, using an
off-the-rack attentive sequence-to-sequence model and a straight-forward
post-processing procedure. Our approach yields error-filled artificial data
that helps a vanilla bi-directional LSTM to outperform the previous state of
the art at grammatical error detection, and a previously introduced model to
gain further improvements of over 5% score. When attempting to
determine if a given sentence is synthetic, a human annotator at best achieves
39.39 score, indicating that our model generates mostly human-like
instances.Comment: Accepted as a short paper at EMNLP 201
- …