238 research outputs found

    Neural Paraphrase Identification of Questions with Noisy Pretraining

    We present a solution to the problem of paraphrase identification of questions. We focus on a recent dataset of question pairs annotated with binary paraphrase labels and show that a variant of the decomposable attention model (Parikh et al., 2016) achieves accurate performance on this task while being far simpler than many competing neural architectures. Furthermore, when the model is pretrained on a noisy dataset of automatically collected question paraphrases, it obtains the best reported performance on the dataset.
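    The core of the decomposable attention model referenced above is a soft alignment ("attend") step between the two sentences. The following is a minimal NumPy sketch of that step only; the model's subsequent compare and aggregate feed-forward layers are omitted, and all variable names are illustrative rather than taken from the paper's code.

    ```python
    import numpy as np

    def soft_align(a, b):
        """Soft-align two sentences given word vectors a (m, d) and b (n, d).

        Unnormalized alignment scores are dot products; each word in one
        sentence is then represented as an attention-weighted sum of the
        other sentence's word vectors.
        """
        scores = a @ b.T  # (m, n) pairwise alignment scores

        # Softmax over the words of b, for each word of a.
        alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)

        # Softmax over the words of a, for each word of b.
        beta = np.exp(scores.T - scores.T.max(axis=1, keepdims=True))
        beta /= beta.sum(axis=1, keepdims=True)

        a_aligned = alpha @ b  # (m, d): b-phrases aligned to each word of a
        b_aligned = beta @ a   # (n, d): a-phrases aligned to each word of b
        return a_aligned, b_aligned
    ```

    In the full model, each word vector is concatenated with its aligned counterpart and passed through small feed-forward networks before a final classification over the aggregated representations.
    
    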

    Multilingual Unsupervised Sentence Simplification

    Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that, by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a fully unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.
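    The "controllable generation mechanism" described above is typically implemented by prefixing each training source with discretised attribute tokens computed from the (source, target) pair, which the model learns to condition on; at inference time, the desired values are supplied instead. The sketch below illustrates that idea in pure Python under assumed details: `word_ranks` (a word-frequency rank table as a proxy for lexical complexity) and the token format are hypothetical, not taken from the paper.

    ```python
    def add_control_tokens(src, tgt, word_ranks):
        """Prefix a source sentence with control tokens for length and
        lexical complexity, computed from a (source, target) training pair.

        word_ranks maps a word to its frequency rank (lower = more common);
        unknown words get the worst rank. All names and the token format
        are illustrative.
        """
        len_ratio = round(len(tgt.split()) / len(src.split()), 1)

        def complexity(sentence):
            ranks = [word_ranks.get(w.lower(), len(word_ranks))
                     for w in sentence.split()]
            return sum(ranks) / len(ranks)

        lex_ratio = round(complexity(tgt) / complexity(src), 1)
        return f"<LEN_{len_ratio}> <LEX_{lex_ratio}> {src}"
    ```

    At inference time one would prepend, say, `<LEN_0.8> <LEX_0.5>` to request a shorter output using more common vocabulary.
    
    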

    A Deep Network Model for Paraphrase Detection in Short Text Messages

    This paper is concerned with paraphrase detection. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication, and question answering. Given two sentences, the objective is to detect whether they are semantically identical. An important insight from this work is that existing paraphrase systems perform well when applied to clean text but do not necessarily deliver good performance on noisy text. Challenges with paraphrase detection on user-generated short texts, such as tweets, include language irregularity and noise. To cope with these challenges, we propose a novel deep neural network-based approach that relies on coarse-grained sentence modeling using a convolutional neural network and a long short-term memory model, combined with a specific fine-grained word-level similarity matching model. Our experimental results show that the proposed approach outperforms existing state-of-the-art approaches on user-generated noisy social media data, such as Twitter texts, and achieves highly competitive performance on a cleaner corpus.
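    The "fine-grained word-level similarity matching" component mentioned above can be understood as computing pairwise similarities between the word embeddings of the two sentences. The snippet below is a standalone NumPy sketch of such a matching matrix using cosine similarity; in the actual architecture this signal feeds the network alongside the CNN/LSTM sentence encoders, and the exact formulation here is assumed, not taken from the paper.

    ```python
    import numpy as np

    def word_match_matrix(emb_a, emb_b):
        """Pairwise cosine similarities between the word embeddings of two
        sentences: emb_a is (len_a, d), emb_b is (len_b, d).

        Returns a (len_a, len_b) matrix with entries in [-1, 1]; high
        values flag word-level matches even when sentence-level encoders
        are confused by noisy text.
        """
        a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
        b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
        return a @ b.T
    ```
    
    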

    IndicXTREME: A Multi-Task Benchmark For Evaluating Indic Languages

    In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages of the Indian subcontinent belonging to four different language families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we use only human annotators to curate or translate our datasets. To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.

    Reformulation and Decomposition: Multitask learning approaches to Long Document Problems

    Recent advances in Natural Language Processing (NLP) have led to success across a wide range of tasks, including machine translation, summarization, and classification. Yet the field still faces major challenges. This thesis addresses two key under-researched areas: the absence of general multitask learning capabilities, and the inability to scale to long, complex documents. Firstly, this thesis explores a form of multitasking in which NLP tasks are reformulated as question answering problems. I examine existing models and measure their robustness to paraphrasing of their input. I contribute an annotated dataset that enables detailed analysis of model failures, as well as an evaluation of methods for improving model robustness. Secondly, MuLD, a set of long-document tasks, is introduced, forming a benchmark for evaluating the performance of models on large inputs with long-range dependencies. I show that these tasks are challenging for baseline models, and I then design an approach using task decomposition to provide an interpretable solution that easily allows for multitask learning. I then explore how these themes of task reformulation for multitask learning and task decomposition for long inputs can be applied to other modalities. I show how visual modelling, a visual analogue of language modelling, can be used to predict missing frames from videos of simple physics simulations, and probe what knowledge about the physical world this induces in such models. Finally, I demonstrate how this task can be used to unite vision and NLP under the same framework, describing how task reformulation and task decomposition can be used for this purpose.
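    The task-decomposition idea for long documents can be sketched in its simplest form: slide a short-context task function over overlapping chunks of the input and collect per-chunk outputs. This is an illustrative skeleton only; the thesis's actual approach also covers how the per-chunk results are merged, and the function names and parameters here are invented for the example.

    ```python
    def decompose(document, task_fn, chunk_size=512, stride=384):
        """Apply a short-context task function over overlapping chunks of
        a long document and collect the per-chunk outputs.

        chunk_size and stride are in whitespace tokens; overlap
        (chunk_size - stride) preserves some cross-chunk context.
        """
        tokens = document.split()
        outputs = []
        for start in range(0, max(len(tokens) - chunk_size, 0) + 1, stride):
            chunk = " ".join(tokens[start:start + chunk_size])
            outputs.append(task_fn(chunk))
        return outputs
    ```

    A merge step (e.g. concatenating chunk summaries and re-summarizing) would then turn the per-chunk outputs into a single answer.
    
    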