
    Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

    Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that naturally combine words into phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank). Comment: accepted for publication at ACL 2015.
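    In the Child-Sum variant the paper describes, a node's gates are computed from the sum of its children's hidden states, with a separate forget gate per child, so the cell can selectively forget individual subtrees; with a single child per node it reduces to a standard LSTM. Below is a minimal numpy sketch of that cell; the weight names and initialization are illustrative, not the authors' implementation.

```python
# Minimal sketch of a Child-Sum Tree-LSTM cell (after Tai et al.).
# Weight shapes/names are illustrative, not the reference code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChildSumTreeLSTMCell:
    def __init__(self, x_dim, h_dim, seed=0):
        rng = np.random.default_rng(seed)
        def W(rows, cols):
            return rng.normal(0.0, 0.1, (rows, cols))
        # Input, forget, output, and update transforms; the forget
        # transform is reused once per child.
        self.W_i, self.U_i, self.b_i = W(h_dim, x_dim), W(h_dim, h_dim), np.zeros(h_dim)
        self.W_f, self.U_f, self.b_f = W(h_dim, x_dim), W(h_dim, h_dim), np.zeros(h_dim)
        self.W_o, self.U_o, self.b_o = W(h_dim, x_dim), W(h_dim, h_dim), np.zeros(h_dim)
        self.W_u, self.U_u, self.b_u = W(h_dim, x_dim), W(h_dim, h_dim), np.zeros(h_dim)

    def __call__(self, x, child_h, child_c):
        # child_h, child_c: lists of (h_dim,) vectors; empty at leaves.
        h_dim = self.b_i.shape[0]
        h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(h_dim)
        i = sigmoid(self.W_i @ x + self.U_i @ h_sum + self.b_i)
        o = sigmoid(self.W_o @ x + self.U_o @ h_sum + self.b_o)
        u = np.tanh(self.W_u @ x + self.U_u @ h_sum + self.b_u)
        # One forget gate per child, conditioned on that child's state,
        # so each subtree's memory can be kept or dropped independently.
        f = [sigmoid(self.W_f @ x + self.U_f @ hk + self.b_f) for hk in child_h]
        c = i * u + sum(fk * ck for fk, ck in zip(f, child_c))
        h = o * np.tanh(c)
        return h, c
```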

    Semantic relatedness based re-ranker for text spotting

    Applications such as textual entailment, plagiarism detection, and document clustering rely on the notion of semantic similarity, and are usually approached with dimension reduction techniques like LDA or with embedding-based neural approaches. We present a scenario where semantic similarity is not enough, and we devise a neural approach to learn semantic relatedness. The scenario is text spotting in the wild, where a text in an image (e.g. a street sign, advertisement, or bus destination) must be identified and recognized. Our goal is to improve the performance of vision systems by leveraging semantic information. Our rationale is that the text to be spotted is often related to the image context in which it appears (word pairs such as Delta–airplane or quarters–parking are not similar, but they are clearly related). We show how learning a word-to-word or word-to-sentence relatedness score can improve the performance of text spotting systems by up to 2.9 points, outperforming other measures on a benchmark dataset.
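    As a rough illustration of the re-ranking idea, the sketch below rescores a text spotter's candidate transcriptions against image context words. The relatedness scorer and the interpolation scheme here are stand-ins (cosine similarity over vectors from a hypothetical `embed` function); the paper instead learns the relatedness score with a neural model.

```python
# Illustrative re-ranker: combine an OCR confidence with a semantic
# relatedness score against image context words. The cosine scorer
# is a placeholder for the learned relatedness model in the paper.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rerank(candidates, context_words, embed, alpha=0.5):
    """candidates: list of (word, ocr_score); embed: word -> vector."""
    rescored = []
    for word, ocr_score in candidates:
        # Relatedness to the most related context word (e.g. "airplane"
        # for the candidate "Delta").
        rel = max((cosine(embed(word), embed(c)) for c in context_words),
                  default=0.0)
        # Interpolate recognizer confidence with semantic relatedness.
        rescored.append((alpha * ocr_score + (1 - alpha) * rel, word))
    return [w for _, w in sorted(rescored, reverse=True)]
```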

    Semantic relations between sentences: from lexical to linguistically inspired semantic features and beyond

    This thesis is concerned with the identification of semantic equivalence between pairs of natural language sentences, through the study and computation of models for Natural Language Processing tasks in which some form of semantic equivalence is assessed. In such tasks, given two sentences, our models output either a class label, corresponding to the semantic relation between the sentences from a predefined set of semantic relations, or a continuous score, corresponding to their similarity on a predefined scale. The former setup corresponds to the tasks of Paraphrase Identification and Natural Language Inference, while the latter corresponds to the task of Semantic Textual Similarity. We present several models for English and Portuguese that consider various types of features, for instance distances between alternative representations of each sentence, following lexical and semantic frameworks, or embeddings from pre-trained Bidirectional Encoder Representations from Transformers (BERT) models. For English, a new set of semantic features is proposed, derived from the formal semantic representation of Discourse Representation Structures (DRS). For Portuguese, suitable corpora are scarce and formal semantic representations are unavailable, so an evaluation of currently available features and corpora is conducted, following the modelling setup employed for English. Competitive results are achieved on all tasks, for both English and Portuguese, particularly considering that our models are based on generally available tools and technologies, and that all features and models, except those based on embeddings, can be computed on most modern computers. In particular, for English, our DRS-based semantic features improve the performance of other models when integrated into their feature sets, and state-of-the-art results are achieved for Portuguese with models based on fine-tuning embeddings to a specific task.
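    As a hedged sketch of the distance-based feature setup described above, the example below computes two simple lexical features over a sentence pair and feeds them to a standard classifier for Paraphrase Identification. The feature choice is illustrative, not the thesis's actual feature set.

```python
# Illustrative sentence-pair feature extraction plus a standard
# classifier; the features here (Jaccard overlap, length ratio) are
# stand-ins for the thesis's lexical/semantic distance features.
from sklearn.linear_model import LogisticRegression

def pair_features(s1, s2):
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(t1 & t2) / max(len(t1 | t2), 1)      # Jaccard overlap
    len_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2), 1)
    return [overlap, len_ratio]

def train(pairs, labels):
    """pairs: list of (sentence1, sentence2); labels: 1 = paraphrase."""
    X = [pair_features(a, b) for a, b in pairs]
    return LogisticRegression().fit(X, labels)
```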

    A Continuously Growing Dataset of Sentential Paraphrases

    A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity: it removes the classifier or human-in-the-loop data selection that previous work required before annotation and before applying paraphrase identification algorithms. We present the largest human-labeled paraphrase corpus to date, of 51,524 sentence pairs, and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and we demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available. Comment: 11 pages, accepted to EMNLP 2017.
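    The collection idea reduces to a simple pairing rule: tweets that link to the same URL are candidate sentential paraphrases. Below is a minimal sketch of that rule, assuming input records of the form (tweet_text, url); the paper's cleaning, filtering, and human annotation steps are omitted.

```python
# Sketch of the paper's collection rule: group tweets by shared URL
# and pair tweets within each group as paraphrase candidates.
from collections import defaultdict
from itertools import combinations

def candidate_pairs(tweets):
    """tweets: iterable of (text, url) -> list of candidate pairs."""
    by_url = defaultdict(list)
    for text, url in tweets:
        by_url[url].append(text)
    pairs = []
    for texts in by_url.values():
        # Every pair of distinct tweets linking the same URL is a
        # candidate paraphrase, later filtered by annotation.
        pairs.extend(combinations(sorted(set(texts)), 2))
    return pairs
```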