16,821 research outputs found
A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In
this paper, we present a new method to collect large-scale sentential
paraphrases from Twitter by linking tweets through shared URLs. The main
advantage of our method is its simplicity, as it gets rid of the classifier or
human in the loop needed to select data before annotation and subsequent
application of paraphrase identification algorithms in the previous work. We
present the largest human-labeled paraphrase corpus to date of 51,524 sentence
pairs and the first cross-domain benchmarking for automatic paraphrase
identification. In addition, we show that more than 30,000 new sentential
paraphrases can be easily and continuously captured every month at ~70%
precision, and demonstrate their utility for downstream NLP tasks through
phrasal paraphrase extraction. We make our code and data freely available.Comment: 11 pages, accepted to EMNLP 201
Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection
Linguistically diverse datasets are critical for training and evaluating
robust machine learning systems, but data collection is a costly process that
often requires experts. Crowdsourcing the process of paraphrase generation is
an effective means of expanding natural language datasets, but there has been
limited analysis of the trade-offs that arise when designing tasks. In this
paper, we present the first systematic study of the key factors in
crowdsourcing paraphrase collection. We consider variations in instructions,
incentives, data domains, and workflows. We manually analyzed paraphrases for
correctness, grammaticality, and linguistic diversity. Our observations provide
new insight into the trade-offs between accuracy and diversity in crowd
responses that arise as a result of task design, providing guidance for future
paraphrase generation procedures.Comment: Published at ACL 201
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
We describe PARANMT-50M, a dataset of more than 50 million English-English
sentential paraphrase pairs. We generated the pairs automatically by using
neural machine translation to translate the non-English side of a large
parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M
can be a valuable resource for paraphrase generation and can provide a rich
source of semantic knowledge to improve downstream natural language
understanding tasks. To show its utility, we use ParaNMT-50M to train
paraphrastic sentence embeddings that outperform all supervised systems on
every SemEval semantic textual similarity competition, in addition to showing
how it can be used for paraphrase generation
Multilingual Unsupervised Sentence Simplification
Progress in Sentence Simplification has been hindered by the lack of
supervised data, particularly in languages other than English. Previous work
has aligned sentences from original and simplified corpora such as English
Wikipedia and Simple English Wikipedia, but this limits corpus size, domain,
and language. In this work, we propose using unsupervised mining techniques to
automatically create training corpora for simplification in multiple languages
from raw Common Crawl web data. When coupled with a controllable generation
mechanism that can flexibly adjust attributes such as length and lexical
complexity, these mined paraphrase corpora can be used to train simplification
systems in any language. We further incorporate multilingual unsupervised
pretraining methods to create even stronger models and show that by training on
mined data rather than supervised corpora, we outperform the previous best
results. We evaluate our approach on English, French, and Spanish
simplification benchmarks and reach state-of-the-art performance with a totally
unsupervised approach. We will release our models and code to mine the data in
any language included in Common Crawl
A Deep Network Model for Paraphrase Detection in Short Text Messages
This paper is concerned with paraphrase detection. The ability to detect
similar sentences written in natural language is crucial for several
applications, such as text mining, text summarization, plagiarism detection,
authorship authentication and question answering. Given two sentences, the
objective is to detect whether they are semantically identical. An important
insight from this work is that existing paraphrase systems perform well when
applied on clean texts, but they do not necessarily deliver good performance
against noisy texts. Challenges with paraphrase detection on user generated
short texts, such as Twitter, include language irregularity and noise. To cope
with these challenges, we propose a novel deep neural network-based approach
that relies on coarse-grained sentence modeling using a convolutional neural
network and a long short-term memory model, combined with a specific
fine-grained word-level similarity matching model. Our experimental results
show that the proposed approach outperforms existing state-of-the-art
approaches on user-generated noisy social media data, such as Twitter texts,
and achieves highly competitive performance on a cleaner corpus
- …