A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In
this paper, we present a new method to collect large-scale sentential
paraphrases from Twitter by linking tweets through shared URLs. The main
advantage of our method is its simplicity, as it gets rid of the classifier or
human in the loop needed to select data before annotation and subsequent
application of paraphrase identification algorithms in the previous work. We
present the largest human-labeled paraphrase corpus to date of 51,524 sentence
pairs and the first cross-domain benchmarking for automatic paraphrase
identification. In addition, we show that more than 30,000 new sentential
paraphrases can be easily and continuously captured every month at ~70%
precision, and demonstrate their utility for downstream NLP tasks through
phrasal paraphrase extraction. We make our code and data freely available.
Comment: 11 pages, accepted to EMNLP 201
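The URL-linking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function name `candidate_pairs`, the `(text, url)` input format, and the sample tweets are assumptions made for illustration. It only shows how tweets that share a URL can be grouped into candidate sentence pairs, which would still need labeling before being treated as paraphrases.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(tweets):
    """Group tweets by shared URL and return candidate sentence pairs.

    `tweets` is an iterable of (text, url) tuples (a hypothetical input format).
    """
    by_url = defaultdict(list)
    for text, url in tweets:
        # Skip exact duplicates (e.g. retweets) under the same URL.
        if text not in by_url[url]:
            by_url[url].append(text)

    pairs = []
    for texts in by_url.values():
        # Every unordered pair of distinct tweets linking to the same URL
        # is a candidate paraphrase pair.
        pairs.extend(combinations(texts, 2))
    return pairs


# Made-up sample data, for illustration only.
sample = [
    ("Storm closes airport", "http://example.com/a"),
    ("Airport shut down by storm", "http://example.com/a"),
    ("New phone released", "http://example.com/b"),
]
pairs = candidate_pairs(sample)
```

Here only the two tweets pointing at the first URL form a pair; the third tweet has no partner and contributes nothing.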
Extracting News Events from Microblogs
The Twitter stream has become a large source of information for many people, but
the magnitude of tweets and the noisy nature of its content have made
harvesting the knowledge from Twitter a challenging task for researchers for a
long time. Aiming at overcoming some of the main challenges of extracting the
hidden information from tweet streams, this work proposes a new approach for
real-time detection of news events from the Twitter stream. We divide our
approach into three steps. The first step is to use a neural network or deep
learning to detect news-relevant tweets from the stream. The second step is to
apply a novel streaming data clustering algorithm to the detected news tweets
to form news events. The third and final step is to rank the detected events
based on the size of the event clusters and growth speed of the tweet
frequencies. We evaluate the proposed system on a large, publicly available
corpus of annotated news events from Twitter. As part of the evaluation, we
compare our approach with a related state-of-the-art solution. Overall, our
experiments and user-based evaluation show that our approach to detecting
current (real) news events delivers state-of-the-art performance.
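The third step, ranking events by cluster size and growth speed, can be sketched as follows. The abstract does not give the scoring formula, so the one used here (size multiplied by one plus the non-negative relative growth between two time windows) is an assumption, as are the function name `rank_events` and the sample counts.

```python
def rank_events(clusters):
    """Rank event clusters by a hypothetical size-and-growth score.

    `clusters` maps an event name to (size, prev_count, recent_count), where
    the two counts are tweet frequencies in consecutive time windows.
    """
    def score(stats):
        size, prev_count, recent_count = stats
        # Relative growth of tweet frequency; guard against division by zero.
        growth = (recent_count - prev_count) / max(prev_count, 1)
        # Assumed combination: larger and faster-growing clusters rank higher.
        return size * (1.0 + max(growth, 0.0))

    return sorted(clusters, key=lambda name: score(clusters[name]), reverse=True)


# Made-up counts, for illustration only.
events = {
    "quake": (120, 10, 40),
    "match": (200, 50, 50),
    "rumor": (30, 5, 6),
}
ranking = rank_events(events)
```

With these numbers the smaller but fast-growing "quake" cluster outranks the larger, flat "match" cluster, which is the behavior a growth term is meant to produce.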
Neural-Driven Multi-criteria Tree Search for Paraphrase Generation
A good paraphrase is semantically similar to the original sentence, but it must also be well formed and syntactically different to ensure diversity. To deal with this tradeoff, we propose to cast the paraphrase generation task as a multi-objective search problem on the lattice of text transformations. We use BERT and GPT2 to measure, respectively, the semantic distance and the correctness of the candidates. We study two search algorithms, Monte-Carlo Tree Search For Paraphrase Generation (MCPG) and Pareto Tree Search (PTS), which we use to explore the huge sets of candidates generated by applying the PPDB-2.0 editing rules. We evaluate this approach on 5 datasets and show that it performs reasonably well and that it outperforms a state-of-the-art edit-based text generation method.
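The multi-objective tradeoff described in this abstract can be illustrated with a small Pareto-dominance filter. This is only a sketch of the selection criterion, not the MCPG or PTS search itself. The candidate names and the score tuples (similarity, fluency, diversity; higher is better) are made-up placeholders rather than BERT or GPT2 outputs.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` on every criterion
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    `candidates` maps a paraphrase (or its label) to a tuple of scores.
    """
    return [
        text
        for text, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_text, other in candidates.items()
            if other_text != text
        )
    ]


# Placeholder scores: (semantic similarity, fluency, syntactic diversity).
cands = {
    "a": (0.9, 0.8, 0.1),
    "b": (0.7, 0.9, 0.6),
    "c": (0.6, 0.7, 0.5),
}
front = pareto_front(cands)
```

Candidate "c" is dropped because "b" is better on all three criteria, while "a" and "b" both survive because each beats the other on at least one criterion.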
Generating Syntactic Paraphrases
We study the automatic generation of syntactic paraphrases using four different models for generation: data-to-text generation, text-to-text generation, text reduction and text expansion. We derive training data for each of these tasks from the WebNLG dataset and we show (i) that conditioning generation on syntactic constraints effectively permits the generation of syntactically distinct paraphrases for the same input and (ii) that exploiting different types of input (data, text or data+text) further increases the number of distinct paraphrases that can be generated for a given input.