
    Using Embeddings to Improve Text Segmentation

    Textual data is often an unstructured collection of sentences and thus difficult to use for many purposes. Creating structure in the text according to topics or concepts can aid text summarization, neural machine translation, and other applications where a single sentence provides too little context. Existing methods of text segmentation are either unsupervised and based on word co-occurrences, or supervised and based on vector representations of words and sentences. The purpose of this Master's Thesis is to develop a general unsupervised method of text segmentation using word vectors and cosine distance. Implementations of the created approach are compared to a naïve probabilistic baseline to assess the viability of the method. One implemented model is also used as part of an extractive text summarization algorithm to assess the practical benefit of the proposed approach. The results show that while the approach outperforms the baseline, further research can greatly improve its efficacy.
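    The abstract describes detecting topic boundaries from word vectors and cosine distance. A minimal sketch of such an unsupervised segmenter is given below; the mean-pooled sentence vectors, the fixed similarity threshold, and the boundary rule are illustrative assumptions, not the thesis's actual design:

    ```python
    import math

    def cosine(u, v):
        """Cosine similarity between two dense vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def sentence_vector(sentence, word_vectors):
        """Mean-pool the word vectors of a tokenized sentence."""
        vecs = [word_vectors[w] for w in sentence if w in word_vectors]
        if not vecs:
            return None
        dim = len(vecs[0])
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    def segment(sentences, word_vectors, threshold=0.5):
        """Place a segment boundary wherever the cosine similarity
        between adjacent sentence vectors drops below the threshold."""
        boundaries = []
        prev = sentence_vector(sentences[0], word_vectors)
        for i in range(1, len(sentences)):
            cur = sentence_vector(sentences[i], word_vectors)
            if prev is not None and cur is not None and cosine(prev, cur) < threshold:
                boundaries.append(i)
            prev = cur
        return boundaries
    ```

    With toy two-dimensional vectors where pet-related and finance-related words point in different directions, a boundary appears exactly where the topic changes.
    
    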

    Con-S2V: A Generic Framework for Incorporating Extra-Sentential Context into Sen2Vec

    We present a novel approach to learning distributed representations of sentences from unlabeled data by modeling both the content and the context of a sentence. The content model learns a sentence representation by predicting its words. The context model comprises a neighbor-prediction component and a regularizer to model the distributional and proximity hypotheses, respectively. We propose an online algorithm to train the model components jointly. We evaluate the models in a setup where contextual information is available. Experimental results on tasks involving classification, clustering, and ranking of sentences show that our model outperforms the best existing models by a wide margin across multiple datasets.
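    Con-S2V trains its content and context components jointly online; as a rough illustration of the proximity hypothesis alone, one can repeatedly pull each sentence vector toward the mean of its discourse neighbors. The update rule, the mixing weight `lam`, and the neighbor map below are assumptions for illustration, not the paper's formulation:

    ```python
    def smooth_with_context(vectors, neighbors, lam=0.3, steps=5):
        """Crude stand-in for a proximity regularizer: each sentence
        vector is interpolated toward the mean of its neighbors."""
        vecs = [list(v) for v in vectors]
        dim = len(vecs[0])
        for _ in range(steps):
            new = []
            for i, v in enumerate(vecs):
                nbrs = neighbors.get(i, [])
                if not nbrs:
                    new.append(v)
                    continue
                mean = [sum(vecs[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
                new.append([(1 - lam) * v[d] + lam * mean[d] for d in range(dim)])
            vecs = new
        return vecs
    ```

    Adjacent sentences' vectors converge toward each other, which is the qualitative effect the proximity hypothesis postulates.
    
    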

    Sentence Embedding Approach using LSTM Auto-encoder for Discussion Threads Summarization

    Online discussion forums are repositories of valuable information where users interact, articulate their ideas and opinions, and share experiences about numerous topics. These forums are internet-based online communities where users can ask for help and find solutions to problems. A new user can be exhausted by reading the large number of irrelevant replies in a discussion, so an automated discussion thread summarization (DTS) system is needed to create a concise view of the entire discussion of a query. Most previous approaches to automated DTS use the continuous bag-of-words (CBOW) model as a sentence embedding tool, which is poor at capturing the overall meaning of a sentence and unable to grasp word dependencies. To overcome these limitations, we introduce an LSTM auto-encoder as the sentence embedding technique to improve DTS performance. Empirical results, measured by average precision, recall, and F-measure with respect to ROUGE-1 and ROUGE-2 on two standard experimental datasets, demonstrate that the proposed approach is effective and efficient, outperforming the state-of-the-art CBOW model in sentence embedding and boosting the performance of the automated DTS model.
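    The key property an LSTM encoder adds over CBOW averaging is order sensitivity. The toy forward pass below shows why: the final hidden state depends on the sequence order, whereas a mean of word vectors does not. The gate layout, sizes, and weights are illustrative assumptions; the paper's model is a trained auto-encoder, not hand-set weights:

    ```python
    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def lstm_encode(seq, Wx, Wh, b, hidden):
        """Run a toy LSTM over a sequence of input vectors and return
        the final hidden state, used as the sentence embedding."""
        h = [0.0] * hidden
        c = [0.0] * hidden
        for x in seq:
            gates = []
            for g in range(4):  # input, forget, cell-candidate, output
                row = []
                for j in range(hidden):
                    s = b[g][j]
                    s += sum(Wx[g][j][k] * x[k] for k in range(len(x)))
                    s += sum(Wh[g][j][k] * h[k] for k in range(hidden))
                    row.append(s)
                gates.append(row)
            i = [sigmoid(v) for v in gates[0]]
            f = [sigmoid(v) for v in gates[1]]
            g_ = [math.tanh(v) for v in gates[2]]
            o = [sigmoid(v) for v in gates[3]]
            c = [f[j] * c[j] + i[j] * g_[j] for j in range(hidden)]
            h = [o[j] * math.tanh(c[j]) for j in range(hidden)]
        return h
    ```

    Feeding the same two word vectors in opposite orders yields different embeddings, while their CBOW mean would be identical.
    
    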

    Scientific Information Extraction with Semi-supervised Neural Tagging

    This paper addresses the problem of extracting keyphrases from scientific articles and categorizing them as corresponding to a task, process, or material. We cast the problem as sequence tagging and introduce semi-supervised methods into a neural tagging model, which builds on recent advances in named entity recognition. Since annotated training data is scarce in this domain, we introduce a graph-based semi-supervised algorithm together with a data selection scheme to leverage unannotated articles. Both inductive and transductive semi-supervised learning strategies outperform the state of the art on the 2017 SemEval Task 10 ScienceIE benchmark. (Accepted at EMNLP 2017.)
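    The paper's graph-based algorithm and data selection scheme are more involved than can be shown here, but the core idea of graph-based semi-supervision can be sketched generically: scores flow from a few labeled seed nodes to unlabeled nodes over a similarity graph. The graph representation, weights, and clamping scheme below are assumptions, not the paper's construction:

    ```python
    def propagate_labels(adj, labels, iterations=20):
        """Iterative label propagation: each unlabeled node takes the
        weighted average of its neighbors' label distributions, while
        labeled (seed) nodes stay clamped to their known labels."""
        n = len(adj)
        classes = len(next(iter(labels.values())))
        scores = [labels.get(i, [1.0 / classes] * classes) for i in range(n)]
        for _ in range(iterations):
            new = []
            for i in range(n):
                if i in labels:               # clamp seed nodes
                    new.append(labels[i])
                    continue
                total = sum(adj[i].values())
                if total == 0:
                    new.append(scores[i])
                    continue
                avg = [0.0] * classes
                for j, w in adj[i].items():
                    for k in range(classes):
                        avg[k] += w * scores[j][k]
                new.append([v / total for v in avg])
            scores = new
        return scores
    ```

    On a three-node chain with opposing seed labels at both ends, the middle node settles on an even mixture, as expected.
    
    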

    Vertical intent prediction approach based on Doc2vec and convolutional neural networks for improving vertical selection in aggregated search

    Vertical selection is the task of selecting the verticals most relevant to a given query in order to improve the diversity and quality of web search results. This task requires not only predicting relevant verticals but also ensuring that these verticals are the ones the user expects to be relevant to his or her particular information need. Most existing work has focused on using traditional machine learning techniques to combine multiple types of features for selecting several relevant verticals. Although these techniques are very efficient, handling vertical selection with high accuracy remains a challenging research task. In this paper, we propose an approach for improving vertical selection in order to satisfy the user's vertical intent and reduce the user's browsing time and effort. First, it generates query embedding vectors using the doc2vec algorithm, which preserves syntactic and semantic information within each query. Second, this vector is used as input to a convolutional neural network model that enriches the query representation with multiple levels of abstraction, including rich semantic information, and then creates a global summarization of the query features. We demonstrate the effectiveness of our approach through comprehensive experimentation on various datasets. Our experimental findings show that our system achieves high accuracy and makes accurate predictions on new, unseen data.
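    The second step above, convolving over a fixed-length query embedding and summarizing it globally, can be sketched as a 1-D convolution with ReLU and global max pooling. The kernels and the pooling choice below are illustrative assumptions, not the paper's architecture:

    ```python
    def conv1d_maxpool(vec, kernels):
        """Valid 1-D convolution of each kernel over the input vector,
        with ReLU folded into a global max pool (floor at 0.0), giving
        one summary feature per kernel."""
        feats = []
        for k in kernels:
            best = 0.0  # ReLU: negative activations never win
            for i in range(len(vec) - len(k) + 1):
                s = sum(vec[i + j] * k[j] for j in range(len(k)))
                best = max(best, s)
            feats.append(best)
        return feats
    ```

    The resulting feature vector, one maximum activation per filter, is the kind of "global summarization of the query features" that a downstream classifier over verticals would consume.
    
    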