Multilingual Unsupervised Sentence Simplification
Progress in Sentence Simplification has been hindered by the lack of
supervised data, particularly in languages other than English. Previous work
has aligned sentences from original and simplified corpora such as English
Wikipedia and Simple English Wikipedia, but this limits corpus size, domain,
and language. In this work, we propose using unsupervised mining techniques to
automatically create training corpora for simplification in multiple languages
from raw Common Crawl web data. When coupled with a controllable generation
mechanism that can flexibly adjust attributes such as length and lexical
complexity, these mined paraphrase corpora can be used to train simplification
systems in any language. We further incorporate multilingual unsupervised
pretraining methods to create even stronger models and show that by training on
mined data rather than supervised corpora, we outperform the previous best
results. We evaluate our approach on English, French, and Spanish
simplification benchmarks and reach state-of-the-art performance with a totally
unsupervised approach. We will release our models and code to mine the data in
any language included in Common Crawl.
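
As a concrete illustration of the controllable generation mechanism described above, the sketch below prepends discrete control tokens encoding target attributes (compression ratio, lexical complexity) to the source sentence before feeding it to a sequence-to-sequence model. The checkpoint name and the control-token vocabulary are illustrative assumptions, not the authors' released models.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical fine-tuned checkpoint; any seq2seq model trained with
# control tokens prepended to its inputs would be used the same way.
model_name = "your-org/mbart-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def simplify(sentence, char_ratio=0.8, lex_complexity=0.75):
    # Encode the requested output attributes as discrete control tokens.
    controls = f"<NbChars_{char_ratio}> <WordComplexity_{lex_complexity}>"
    inputs = tokenizer(f"{controls} {sentence}", return_tensors="pt")
    output = model.generate(**inputs, num_beams=5, max_new_tokens=60)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(simplify("The legislation was enacted notwithstanding vociferous opposition."))

Varying the ratios at inference time trades off how aggressively the model compresses and simplifies, without retraining.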
A Unified Multilingual Semantic Representation of Concepts
Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach on two benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets.
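
The key property claimed above, a single semantic space shared across languages, means that cross-lingual comparison reduces to vector similarity between concept representations. The toy sketch below illustrates that reduction with random placeholder vectors; it does not reproduce how MUFFIN actually builds its representations.

import numpy as np

# Placeholder concept vectors; in a real system each word sense, in any
# language, would be mapped into the same shared space.
rng = np.random.default_rng(0)
concepts = {
    "bank#finance (en)": rng.normal(size=300),
    "banque#finance (fr)": rng.normal(size=300),
    "bank#river (en)": rng.normal(size=300),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Because every concept lives in one space, comparisons cross languages freely.
print(cosine(concepts["bank#finance (en)"], concepts["banque#finance (fr)"]))
print(cosine(concepts["bank#finance (en)"], concepts["bank#river (en)"]))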
Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search
We present Grid Beam Search (GBS), an algorithm which extends beam search to
allow the inclusion of pre-specified lexical constraints. The algorithm can be
used with any model that generates a sequence token by token by maximizing
the probability of the output sequence. Lexical
constraints take the form of phrases or words that must be present in the
output sequence. This is a very general way to incorporate additional knowledge
into a model's output without requiring any modification of the model
parameters or training data. We demonstrate the feasibility and flexibility of
Lexically Constrained Decoding by conducting experiments on Neural
Interactive-Predictive Translation, as well as Domain Adaptation for Neural
Machine Translation. Experiments show that GBS can provide large improvements
in translation quality in interactive scenarios, and that, even without any
user input, GBS can be used to achieve significant gains in performance in
domain adaptation scenarios.
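
For readers who want to try lexically constrained decoding without reimplementing GBS, the sketch below uses the constrained beam search built into Hugging Face transformers, which supports the same kind of must-appear phrase constraints (a related algorithm, not the authors' GBS code). The translation model and constraint phrase are arbitrary examples.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"  # example German-English MT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

source = "Das Haus ist wunderbar."
# Phrase that must appear somewhere in the decoded output.
force_words_ids = tokenizer(["wonderful"], add_special_tokens=False).input_ids

inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,  # enables constrained beam search
    num_beams=5,
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))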
Linking a domain thesaurus to WordNet and conversion to WordNet-LMF
We present a methodology to link domain thesauri to general-domain lexica. This is applied in the framework of the KYOTO project to link the Species2000 thesaurus to the synsets of the English WordNet. Moreover, we study the formalisation of this thesaurus according to the ISO LMF standard and its dialect WordNet-LMF. This conversion will allow Species2000 to communicate with the other resources available in the KYOTO architecture.
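
A minimal sketch of the first step of such a linking pipeline, retrieving candidate WordNet synsets for a thesaurus term, is shown below using NLTK. The naive lookup heuristic is an assumption for illustration only; the actual mapping of Species2000 entries in KYOTO involved more careful disambiguation.

import nltk

nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def candidate_synsets(term):
    """Return candidate WordNet noun synsets for a thesaurus term."""
    return [(s.name(), s.definition())
            for s in wn.synsets(term.replace(" ", "_"), pos=wn.NOUN)]

print(candidate_synsets("brown bear"))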