
    Multilingual Unsupervised Sentence Simplification

    Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a totally unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.
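
    As an illustration of the controllable generation mechanism this abstract describes, here is a minimal Python sketch of attribute conditioning via control tokens prepended to the source sentence. The token names and ratio values are illustrative assumptions for this sketch, not the released system's actual vocabulary.

```python
# Minimal sketch of control-token conditioning for simplification.
# The token names (<LENGTH_*>, <WORDRANK_*>) are hypothetical stand-ins.

def add_control_tokens(source: str, length_ratio: float, word_rank_ratio: float) -> str:
    """Prepend attribute controls so a seq2seq model trained on mined
    paraphrase pairs can be steered toward shorter, lexically simpler output."""
    controls = f"<LENGTH_{length_ratio:.2f}> <WORDRANK_{word_rank_ratio:.2f}>"
    return f"{controls} {source}"

# At training time, the ratios would be computed from each mined
# (complex, simple) pair; at inference time, they are set by hand.
print(add_control_tokens(
    "The committee reached a unanimous verdict after deliberation.",
    length_ratio=0.80,      # request output about 80% as long as the input
    word_rank_ratio=0.75,   # request more frequent (simpler) vocabulary
))
```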

    A Unified multilingual semantic representation of concepts

    Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach on two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets.
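
    The practical payoff of a unified space is that concepts from different languages can be compared with a single metric. Below is a toy Python sketch of that cross-lingual comparison; the concept identifiers and three-dimensional vectors are invented stand-ins, not MUFFIN's actual representations.

```python
# Toy sketch: concepts from different languages share one vector space,
# so similarity is a single cosine regardless of source language.
# Identifiers and vectors below are illustrative, not MUFFIN's.
import numpy as np

concept_space = {
    "car.n.01":     np.array([0.90, 0.10, 0.20]),  # English-labelled concept
    "voiture.n.01": np.array([0.88, 0.12, 0.18]),  # French-labelled concept
    "bank.n.02":    np.array([0.10, 0.80, 0.30]),
}

def similarity(c1: str, c2: str) -> float:
    v1, v2 = concept_space[c1], concept_space[c2]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Concepts for "car" in English and French should score near 1.0.
print(similarity("car.n.01", "voiture.n.01"))
print(similarity("car.n.01", "bank.n.02"))
```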

    Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search

    We present Grid Beam Search (GBS), an algorithm which extends beam search to allow the inclusion of pre-specified lexical constraints. The algorithm can be used with any model that generates a sequence $\mathbf{\hat{y}} = \{y_{0} \ldots y_{T}\}$ by maximizing $p(\mathbf{y} \mid \mathbf{x}) = \prod_{t} p(y_{t} \mid \mathbf{x}; \{y_{0} \ldots y_{t-1}\})$. Lexical constraints take the form of phrases or words that must be present in the output sequence. This is a very general way to incorporate additional knowledge into a model's output without requiring any modification of the model parameters or training data. We demonstrate the feasibility and flexibility of Lexically Constrained Decoding by conducting experiments on Neural Interactive-Predictive Translation, as well as Domain Adaptation for Neural Machine Translation. Experiments show that GBS can provide large improvements in translation quality in interactive scenarios, and that, even without any user input, GBS can be used to achieve significant gains in performance in domain adaptation scenarios. Accepted as a long paper at ACL 2017.
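
    The central idea is to arrange beams in a grid indexed by timestep and by how many constraint tokens a hypothesis has already emitted, so that only hypotheses in the top row satisfy all constraints. Here is a heavily simplified, self-contained Python sketch of that grid for single-token constraints, with a toy next-token scorer standing in for a real model; the full algorithm additionally tracks continuations of multi-token phrases, which this sketch omits.

```python
# Simplified sketch of Grid Beam Search for single-token constraints.
# grid[c] is a beam of (score, tokens) hypotheses that have emitted
# c constraint tokens so far; only row c == len(constraints) counts.

def grid_beam_search(step_logprobs, constraints, beam_size=2, max_len=5, eos="</s>"):
    """step_logprobs(prefix) -> {token: logprob} for the next step."""
    grid = {0: [(0.0, [])]}
    for _ in range(max_len):
        new_grid = {}
        for c, beam in grid.items():
            for score, toks in beam:
                if toks and toks[-1] == eos:          # finished: carry over unchanged
                    new_grid.setdefault(c, []).append((score, toks))
                    continue
                dist = step_logprobs(toks)
                for tok, lp in dist.items():          # "generate": stay in row c
                    new_grid.setdefault(c, []).append((score + lp, toks + [tok]))
                if c < len(constraints):              # force next constraint: row c+1
                    tok = constraints[c]
                    lp = dist.get(tok, float("-inf"))
                    new_grid.setdefault(c + 1, []).append((score + lp, toks + [tok]))
        # prune every grid cell back to the beam size
        grid = {c: sorted(b, key=lambda h: h[0], reverse=True)[:beam_size]
                for c, b in new_grid.items()}
    top_row = grid.get(len(constraints), [])          # all constraints covered
    finished = [h for h in top_row if h[1] and h[1][-1] == eos]
    return max(finished or top_row, key=lambda h: h[0], default=None)

def toy_scorer(prefix):
    # stand-in for a real model's next-token log-probabilities
    return {"the": -1.0, "cat": -1.5, "sat": -1.2, "</s>": -0.7}

print(grid_beam_search(toy_scorer, constraints=["cat"], beam_size=2, max_len=4))
# e.g. (-2.2, ['cat', '</s>']) -- the output is guaranteed to contain "cat"
```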

    Degrees of Orality in Speech-like Corpora: Comparative Annotation of Chat and E-mail Corpora


    Linking a domain thesaurus to WordNet and conversion to WordNet-LMF

    We present a methodology to link domain thesauri to general-domain lexica. This is applied in the framework of the KYOTO project to link the Species2000 thesaurus to the synsets of the English WordNet. Moreover, we study the formalisation of this thesaurus according to the ISO LMF standard and its dialect WordNet-LMF. This conversion will allow Species2000 to communicate with the other resources available in the KYOTO architecture.
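
    To make the linking step concrete, here is a minimal Python sketch using NLTK's WordNet interface: each thesaurus term is looked up and matched to candidate synsets. The toy term list and the naive first-sense choice are illustrative simplifications, not the KYOTO linking procedure itself.

```python
# Minimal sketch of linking thesaurus terms to WordNet synsets via NLTK.
# Requires a one-time: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Toy stand-ins for Species2000 entries (the real thesaurus uses taxon names).
thesaurus_terms = ["salmon", "trout", "herring"]

for term in thesaurus_terms:
    candidates = wn.synsets(term, pos=wn.NOUN)
    if candidates:
        synset = candidates[0]  # naive first-sense baseline
        print(f"{term} -> {synset.name()}: {synset.definition()}")
    else:
        print(f"{term} -> no WordNet synset found")
```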