Induction of Word and Phrase Alignments for Automatic Document Summarization
Current research in automatic single document summarization is dominated by
two effective, yet naive approaches: summarization by sentence extraction, and
headline generation via bag-of-words models. While successful in some tasks,
neither of these models is able to adequately capture the large set of
linguistic devices utilized by humans when they produce summaries. One possible
explanation for the widespread use of these models is that good techniques have
been developed to extract appropriate training data for them from existing
document/abstract and document/headline corpora. We believe that future
progress in automatic summarization will be driven both by the development of
more sophisticated, linguistically informed models and by more effective
leveraging of document/abstract corpora. In order to open the doors to
simultaneously achieving both of these goals, we have developed techniques for
automatically producing word-to-word and phrase-to-phrase alignments between
documents and their human-written abstracts. These alignments make explicit the
correspondences that exist in such document/abstract pairs, and create a
potentially rich data source from which complex summarization algorithms may
learn. This paper describes experiments we have carried out to analyze the
ability of humans to perform such alignments, and based on these analyses, we
describe experiments for creating them automatically. Our model for the
alignment task is based on an extension of the standard hidden Markov model,
and learns to create alignments in a completely unsupervised fashion. We
describe our model in detail and present experimental results that show that
our model is able to learn to reliably identify word- and phrase-level
alignments in a corpus of document/abstract pairs.
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.
Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Bayesian reordering model with feature selection
In phrase-based statistical machine translation systems, variation in grammatical structures between source and target languages can cause large movements of phrases. Modeling such movements is crucial to producing translations of long sentences that read naturally in the target language. We explore a generative learning approach to phrase reordering for Arabic-to-English translation. By formulating reordering as a classification problem and using naive Bayes with feature selection, we achieve an improvement in BLEU score over a lexicalized reordering model. The proposed model is compact, fast, and scalable to a large corpus.
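As a rough illustration of treating reordering as classification, the sketch below trains a small naive Bayes classifier that predicts a phrase orientation label from categorical features. It is not the paper's model: the labels (`monotone`, `swap`), the feature names, and the toy data are assumptions made for the example, add-one smoothing is an arbitrary choice, and the feature selection step mentioned in the abstract is omitted.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count labels and per-label features from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        for feature in features:
            feature_counts[label][feature] += 1
            vocabulary.add(feature)
    return label_counts, feature_counts, vocabulary


def predict(model, features):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features:
            if feature in vocabulary:  # ignore features never seen in training
                score += math.log((feature_counts[label][feature] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: features describe a source phrase, labels its orientation.
toy_examples = [
    (["src_first=NOUN", "src_last=ADJ"], "swap"),
    (["src_first=NOUN", "src_last=ADJ"], "swap"),
    (["src_first=VERB", "src_last=NOUN"], "monotone"),
    (["src_first=PREP", "src_last=NOUN"], "monotone"),
    (["src_first=VERB", "src_last=NOUN"], "monotone"),
]
model = train_naive_bayes(toy_examples)
print(predict(model, ["src_first=NOUN", "src_last=ADJ"]))
```

Because the model stores only counts, it stays small and trains in a single pass over the data, which is consistent with the compactness and speed the abstract claims for a naive Bayes formulation.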
Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input
Non-autoregressive translation (NAT) models, which remove the dependence on
previous target tokens from the inputs of the decoder, achieve significant
inference speedup but at the cost of inferior accuracy compared to
autoregressive translation (AT) models. Previous work shows that the quality of
the inputs of the decoder is important and largely impacts the model accuracy.
In this paper, we propose two methods to enhance the decoder inputs so as to
improve NAT models. The first one directly leverages a phrase table generated
by conventional SMT approaches to translate source tokens to target tokens,
which are then fed into the decoder as inputs. The second one transforms
source-side word embeddings to target-side word embeddings through
sentence-level alignment and word-level adversary learning, and then feeds the
transformed word embeddings into the decoder as inputs. Experimental results
show that our method substantially outperforms the NAT baseline (Gu et al., 2017)
in BLEU score on the WMT14 English-German and WMT16 English-Romanian tasks.
Comment: AAAI 201
A memory-based classification approach to marker-based EBMT
We describe a novel approach to example-based machine translation that makes use of marker-based chunks, in which the decoder is a memory-based classifier. The classifier is trained to map trigrams of source-language chunks onto trigrams of target-language chunks; then, in a second decoding step, the predicted trigrams are rearranged according to their overlap. We present the first results of this method on a Dutch-to-English translation system using Europarl data. Sparseness of the class space causes the results to lag behind a baseline phrase-based SMT system. In a further comparison, we also apply the method to a word-aligned version of the same data, and report a smaller difference with a word-based SMT system. We explore the scaling abilities of the memory-based approach, and observe linear scaling behavior in training and classification speed and memory costs, and log-linear BLEU improvements in the number of training examples.
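The abstract does not spell out how predicted trigrams are rearranged by overlap, so the following is only one plausible reading: greedily chain trigrams whose first two items match the last two items of the sequence built so far. The function name `assemble_by_overlap` and the word-level toy input are assumptions for illustration; the paper works with chunk trigrams.

```python
def assemble_by_overlap(trigrams):
    """Greedily chain trigrams that overlap in two positions."""
    remaining = list(trigrams)
    if not remaining:
        return []
    # Prefer a starting trigram whose first two items are not the tail of another.
    tails = {t[1:] for t in remaining}
    start = next((t for t in remaining if t[:2] not in tails), remaining[0])
    remaining.remove(start)
    sequence = list(start)
    while remaining:
        # Find a trigram that continues the last two items of the sequence.
        following = next(
            (t for t in remaining if t[:2] == tuple(sequence[-2:])), None
        )
        if following is None:
            break  # no overlapping trigram left; stop with a partial sequence
        remaining.remove(following)
        sequence.append(following[2])
    return sequence


# Hypothetical predicted trigrams, given in arbitrary order.
predicted = [
    ("sat", "on", "the"),
    ("the", "cat", "sat"),
    ("cat", "sat", "on"),
]
print(assemble_by_overlap(predicted))
```

A greedy chain like this stops as soon as no overlapping trigram remains, so a real decoder would need a fallback for gaps and for competing overlaps.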
Detecting and Explaining Causes From Text For a Time Series Event
Explaining the underlying causes or effects of events is a challenging but
valuable task. We define a novel problem of generating explanations of a time
series event by (1) searching cause and effect relationships of the time series
with textual data and (2) constructing a connecting chain between them to
generate an explanation. To detect causal features from text, we propose a
novel method based on the Granger causality of time series between features
extracted from text such as N-grams, topics, sentiments, and their composition.
The generation of the sequence of causal entities requires a commonsense
causative knowledge base with efficient reasoning. To ensure good
interpretability and appropriate lexical usage we combine symbolic and neural
representations, using a neural reasoning algorithm trained on commonsense
causal tuples to predict the next cause step. Our quantitative and human
analyses provide empirical evidence that our method successfully extracts
meaningful causality relationships between time series with textual features
and generates an appropriate explanation between them.
Comment: Accepted at EMNLP 201