263 research outputs found
Semi-Supervised Learning for Neural Keyphrase Generation
We study the problem of generating keyphrases that summarize the key points
for a given document. While sequence-to-sequence (seq2seq) models have achieved
remarkable performance on this task (Meng et al., 2017), model training often
relies on large amounts of labeled data, which is only applicable to
resource-rich domains. In this paper, we propose semi-supervised keyphrase
generation methods by leveraging both labeled data and large-scale unlabeled
samples for learning. Two strategies are proposed. First, unlabeled documents
are first tagged with synthetic keyphrases obtained from unsupervised keyphrase
extraction methods or a selflearning algorithm, and then combined with labeled
samples for training. Furthermore, we investigate a multi-task learning
framework to jointly learn to generate keyphrases as well as the titles of the
articles. Experimental results show that our semi-supervised learning-based
methods outperform a state-of-the-art model trained with labeled data only.Comment: To appear in EMNLP 2018 (12 pages, 7 figures, 6 tables
Neural Automatic Post-Editing Using Prior Alignment and Reranking
We present a second-stage machine translation (MT) system based on a neural machine translation (NMT) approach to automatic post-editing (APE) that improves the translation quality provided by a firststage MT system. Our APE system (AP ESym) is an extended version of an attention based NMT model with bilingual
symmetry employing bidirectional models, mt → pe and pe → mt. APE translations produced by our system show statistically significant improvements over the first-stage MT, phrase-based APE and the best reported score on the WMT 2016 APE dataset by a previous neural APE system. Re-ranking (AP ERerank) of the
n-best translations from the phrase-based APE and AP ESym systems provides further substantial improvements over the symmetric neural APE model. Human evaluation confirms that the AP ERerank
generated PE translations improve on the previous best neural APE system at WMT 2016.Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union’s Framework Programme (FP7/2007-2013) under REA grant agreement no 317471. Sudip Kumar Naskar is supported by Media Lab Asia, MeitY, Government of India, under the Young Faculty Research Fellowship of the Visvesvaraya PhD Scheme for Electronics & IT.
Qun Liu and Josef van Genabith is supported by funding from the European Union Horizon 2020
research and innovation programme under grant agreement no 645452 (QT21)
Towards String-to-Tree Neural Machine Translation
We present a simple method to incorporate syntactic information about the
target language in a neural machine translation system by translating into
linearized, lexicalized constituency trees. An experiment on the WMT16
German-English news translation task resulted in an improved BLEU score when
compared to a syntax-agnostic NMT baseline trained on the same dataset. An
analysis of the translations from the syntax-aware system shows that it
performs more reordering during translation in comparison to the baseline. A
small-scale human evaluation also showed an advantage to the syntax-aware
system.Comment: Accepted as a short paper in ACL 201
Distantly Supervised Morpho-Syntactic Model for Relation Extraction
The task of Information Extraction (IE) involves automatically converting
unstructured textual content into structured data. Most research in this field
concentrates on extracting all facts or a specific set of relationships from
documents. In this paper, we present a method for the extraction and
categorisation of an unrestricted set of relationships from text. Our method
relies on morpho-syntactic extraction patterns obtained by a distant
supervision method, and creates Syntactic and Semantic Indices to extract and
classify candidate graphs. We evaluate our approach on six datasets built on
Wikidata and Wikipedia. The evaluation shows that our approach can achieve
Precision scores of up to 0.85, but with lower Recall and F1 scores. Our
approach allows to quickly create rule-based systems for Information Extraction
and to build annotated datasets to train machine-learning and deep-learning
based classifiers.Comment: Preprin
Reordering in statistical machine translation
PhDMachine translation is a challenging task that its difficulties arise from several characteristics
of natural language. The main focus of this work is on reordering as one of
the major problems in MT and statistical MT, which is the method investigated in this
research. The reordering problem in SMT originates from the fact that not all the words
in a sentence can be consecutively translated. This means words must be skipped and
be translated out of their order in the source sentence to produce a fluent and grammatically
correct sentence in the target language. The main reason that reordering is
needed is the fundamental word order differences between languages. Therefore, reordering
becomes a more dominant issue, the more source and target languages are
structurally different.
The aim of this thesis is to study the reordering phenomenon by proposing new methods
of dealing with reordering in SMT decoders and evaluating the effectiveness of
the methods and the importance of reordering in the context of natural language processing
tasks. In other words, we propose novel ways of performing the decoding to
improve the reordering capabilities of the SMT decoder and in addition we explore
the effect of improving the reordering on the quality of specific NLP tasks, namely
named entity recognition and cross-lingual text association. Meanwhile, we go beyond
reordering in text association and present a method to perform cross-lingual text fragment
alignment, based on models of divergence from randomness.
The main contribution of this thesis is a novel method named dynamic distortion,
which is designed to improve the ability of the phrase-based decoder in performing
reordering by adjusting the distortion parameter based on the translation context. The
model employs a discriminative reordering model, which is combining several fea-
2
tures including lexical and syntactic, to predict the necessary distortion limit for each
sentence and each hypothesis expansion. The discriminative reordering model is also
integrated into the decoder as an extra feature. The method achieves substantial improvements
over the baseline without increase in the decoding time by avoiding reordering
in unnecessary positions.
Another novel method is also presented to extend the phrase-based decoder to dynamically
chunk, reorder, and apply phrase translations in tandem. Words inside the chunks
are moved together to enable the decoder to make long-distance reorderings to capture
the word order differences between languages with different sentence structures.
Another aspect of this work is the task-based evaluation of the reordering methods and
other translation algorithms used in the phrase-based SMT systems. With more successful
SMT systems, performing multi-lingual and cross-lingual tasks through translating
becomes more feasible. We have devised a method to evaluate the performance
of state-of-the art named entity recognisers on the text translated by a SMT decoder.
Specifically, we investigated the effect of word reordering and incorporating reordering
models in improving the quality of named entity extraction.
In addition to empirically investigating the effect of translation in the context of crosslingual
document association, we have described a text fragment alignment algorithm
to find sections of the two documents in different languages, that are content-wise related.
The algorithm uses similarity measures based on divergence from randomness
and word-based translation models to perform text fragment alignment on a collection
of documents in two different languages.
All the methods proposed in this thesis are extensively empirically examined. We have
tested all the algorithms on common translation collections used in different evaluation
campaigns. Well known automatic evaluation metrics are used to compare the
suggested methods to a state-of-the art baseline and results are analysed and discussed
- …