
    A Continuously Growing Dataset of Sentential Paraphrases

    A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity: it removes the classifier or human in the loop that previous work needed to select data before annotation and before applying paraphrase identification algorithms. We present the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available. Comment: 11 pages, accepted to EMNLP 201
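
    The URL-linking idea in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' released code: the URL regex, the function name `candidate_pairs`, and the sample tweets are all hypothetical, and the output is a set of unlabeled candidate pairs that would still need annotation.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: a simple regex is enough to spot URLs in tweet text.
URL_PATTERN = re.compile(r"https?://\S+")

def candidate_pairs(tweets):
    """Pair up tweets that share a URL (hypothetical helper, not from the paper)."""
    by_url = defaultdict(list)
    for text in tweets:
        urls = set(URL_PATTERN.findall(text))
        # Remove the URLs and normalize whitespace so only the sentence remains.
        sentence = " ".join(URL_PATTERN.sub(" ", text).split())
        for url in urls:
            # Skip empty sentences and exact duplicates (e.g. retweets).
            if sentence and sentence not in by_url[url]:
                by_url[url].append(sentence)
    pairs = []
    for url, sentences in by_url.items():
        # Every pair of distinct sentences linking to the same URL is a candidate.
        for first, second in combinations(sentences, 2):
            pairs.append((url, first, second))
    return pairs

tweets = [
    "Storm hits coast http://t.co/a",
    "Coast battered by storm http://t.co/a",
    "Unrelated http://t.co/b",
]
pairs = candidate_pairs(tweets)
```

    Only the two tweets that share a URL end up paired; the third has no partner and yields nothing.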

    Sentential Paraphrase Generation for Agglutinative Languages Using SVM with a String Kernel


    Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation


    Comparing Phrase-based and Syntax-based Paraphrase Generation


    Extracting News Events from Microblogs

    The Twitter stream has become a major source of information for many people, but the volume of tweets and the noisy nature of their content have long made harvesting knowledge from Twitter a challenging task for researchers. To overcome some of the main challenges of extracting hidden information from tweet streams, this work proposes a new approach for real-time detection of news events from the Twitter stream. We divide our approach into three steps. The first step uses a neural network (deep learning) to detect news-relevant tweets in the stream. The second step applies a novel streaming data clustering algorithm to the detected news tweets to form news events. The third and final step ranks the detected events based on the size of the event clusters and the growth speed of the tweet frequencies. We evaluate the proposed system on a large, publicly available corpus of annotated news events from Twitter. As part of the evaluation, we compare our approach with a related state-of-the-art solution. Overall, our experiments and user-based evaluation show that our approach to detecting current (real) news events delivers state-of-the-art performance.
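
    The third step above, ranking events by cluster size and growth speed, might look roughly like the sketch below. The abstract does not say how the two signals are combined, so the weighted sum, the `growth_weight` parameter, the window length, and the `EventCluster` type are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EventCluster:
    """Hypothetical container: a label plus tweet timestamps in seconds."""
    label: str
    tweet_times: list = field(default_factory=list)

def growth_rate(cluster, now, window=600.0):
    """Change in tweet count between the two most recent windows, per second."""
    recent = sum(1 for t in cluster.tweet_times if now - window < t <= now)
    previous = sum(
        1 for t in cluster.tweet_times if now - 2 * window < t <= now - window
    )
    return (recent - previous) / window

def rank_events(clusters, now, window=600.0, growth_weight=1000.0):
    """Sort clusters by size plus weighted growth (an assumed scoring rule)."""
    def score(cluster):
        size = len(cluster.tweet_times)
        return size + growth_weight * growth_rate(cluster, now, window)
    return sorted(clusters, key=score, reverse=True)

old_event = EventCluster("old", [0, 10, 20, 30, 40])
new_event = EventCluster("new", [1000, 1100, 1200, 1250])
ranking = [c.label for c in rank_events([old_event, new_event], now=1300)]
```

    Here the smaller but fast-growing cluster outranks the larger, stale one, which is the behavior a growth term is meant to produce.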

    Neural-Driven Multi-criteria Tree Search for Paraphrase Generation

    A good paraphrase is semantically similar to the original sentence, but it must also be well formed and syntactically different to ensure diversity. To deal with this tradeoff, we propose to cast the paraphrase generation task as a multi-objective search problem on the lattice of text transformations. We use BERT and GPT2 to measure, respectively, the semantic distance and the correctness of the candidates. We study two search algorithms, Monte-Carlo Tree Search For Paraphrase Generation (MCPG) and Pareto Tree Search (PTS), which we use to explore the huge sets of candidates generated by applying the PPDB-2.0 edit rules. We evaluate this approach on 5 datasets and show that it performs reasonably well and that it outperforms a state-of-the-art edit-based text generation method.
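
    The multi-objective framing above rests on Pareto dominance: a candidate is kept only if no other candidate is at least as good on every criterion and strictly better on one. The sketch below shows just that filtering step, not the PTS or MCPG algorithms themselves. The three scores per candidate (similarity, fluency, diversity, higher is better) are made-up numbers standing in for the BERT- and GPT2-based measures.

```python
def dominates(a, b):
    """True if score tuple a is no worse than b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates.

    candidates maps a paraphrase string to a tuple of scores,
    where higher is better on every criterion (an assumption).
    """
    front = {}
    for text, scores in candidates.items():
        if not any(dominates(other, scores) for other in candidates.values()):
            front[text] = scores
    return front

# Hypothetical scores: (semantic similarity, fluency, syntactic diversity).
candidates = {
    "near copy of the source": (0.9, 0.8, 0.1),
    "fluent reordering": (0.7, 0.9, 0.6),
    "weaker reordering": (0.6, 0.7, 0.5),
}
front = pareto_front(candidates)
```

    The near copy and the fluent reordering both survive because each wins on a different criterion, while the weaker reordering is dominated and dropped.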

    Generating Syntactic Paraphrases

    We study the automatic generation of syntactic paraphrases using four different models for generation: data-to-text generation, text-to-text generation, text reduction and text expansion. We derive training data for each of these tasks from the WebNLG dataset and we show (i) that conditioning generation on syntactic constraints effectively permits the generation of syntactically distinct paraphrases for the same input and (ii) that exploiting different types of input (data, text or data+text) further increases the number of distinct paraphrases that can be generated for a given input.