96 research outputs found

    Automatic Keyphrase Extraction

    Get PDF
    Increasing number of documents in the Web caused the growth of needs for tools supporting automatic search and classification of texts. Keywords are one of characteristic features of documents that may be used as criteria in automatic document management. In the paper we describe the technique for automatic keyphrase extraction based on the KEA algorithm [1]. The main modifications consist in changes in the stemming method and simplification of the discretization technique. Besides, in the presented algorithm the keyphrase list may contain proper names, and the candidate phrase list may contain number sequences. We describe experiments, that were done on the set of English language documents available in the Internet and that allow for optimization of extraction parameters. The comparison of the efficiency of the algorithm with the KEA technique is presented

    Proceedings of the 2nd Computer Science Student Workshop: Microsoft Istanbul, Turkey, April 9, 2011

    Get PDF

    Topic Distiller:distilling semantic topics from documents

    Get PDF
    Abstract. This thesis details the design and implementation of a system that can find relevant and latent semantic topics from textual documents. The design of this system, named Topic Distiller, is inspired by research conducted on automatic keyphrase extraction and automatic topic labeling, and it employs entity linking and knowledge bases to reduce text documents to their semantic topics. The Topic Distiller is evaluated using methods and datasets used in information retrieval and automatic keyphrase extraction. On top of the common datasets used in the literature three additional datasets are created to evaluate the system. The evaluation reveals that the Topic Distiller is able to find relevant and latent topics from textual documents, beating the state-of-the-art automatic keyphrase methods in performance when used on news articles and social media posts.Semanttisten aiheiden suodattaminen dokumenteista. Tiivistelmä. Tässä diplomityössä tarkastellaan järjestelmää, joka pystyy löytämään tekstistä relevantteja ja piileviä semanttisia aihealueita, sekä kyseisen järjestelmän suunnittelua ja implementaatiota. Tämän Topic Distiller -järjestelmän suunnittelu ammentaa inspiraatiota automaattisen termintunnistamisen ja automaattisen aiheiden nimeämisen tutkimuksesta sekä hyödyntää automaattista semanttista annotointia ja tietämyskantoja tekstin aihealueiden löytämisessä. Topic Distiller -järjestelmän suorituskykyä mitataan hyödyntämällä kirjallisuudessa paljon käytettyjä automaattisen termintunnistamisen evaluontimenetelmiä ja aineistoja. Näiden yleisten aineistojen lisäksi esittelemme kolme uutta aineistoa, jotka on luotu Topic Distiller -järjestelmän arviointia varten. Evaluointi tuo ilmi, että Topic Distiller kykenee löytämään relevantteja ja piileviä aiheita tekstistä. Se päihittää kirjallisuuden viimeisimmät automaattisen termintunnistamisen menetelmät suorituskyvyssä, kun sitä käytetään uutisartikkelien sekä sosiaalisen median julkaisujen analysointiin

    Visual Summarization of Scholarly Videos using Word Embeddings and Keyphrase Extraction

    Full text link
    Effective learning with audiovisual content depends on many factors. Besides the quality of the learning resource's content, it is essential to discover the most relevant and suitable video in order to support the learning process most effectively. Video summarization techniques facilitate this goal by providing a quick overview over the content. It is especially useful for longer recordings such as conference presentations or lectures. In this paper, we present an approach that generates a visual summary of video content based on semantic word embeddings and keyphrase extraction. For this purpose, we exploit video annotations that are automatically generated by speech recognition and video OCR (optical character recognition).Comment: 12 pages, 5 figure

    Advances in Automatic Keyphrase Extraction

    Get PDF
    The main purpose of this thesis is to analyze and propose new improvements in the field of Automatic Keyphrase Extraction, i.e., the field of automatically detecting the key concepts in a document. We will discuss, in particular, supervised machine learning algorithms for keyphrase extraction, by first identifying their shortcomings and then proposing new techniques which exploit contextual information to overcome them. Keyphrase extraction requires that the key concepts, or \emph{keyphrases}, appear verbatim in the body of the document. We will identify the fact that current algorithms do not use contextual information when detecting keyphrases as one of the main shortcomings of supervised keyphrase extraction. Instead, statistical and positional cues, like the frequency of the candidate keyphrase or its first appearance in the document, are mainly used to determine if a phrase appearing in a document is a keyphrase or not. For this reason, we will prove that a supervised keyphrase extraction algorithm, by using only statistical and positional features, is actually able to extract good keyphrases from documents written in languages that it has never seen. The algorithm will be trained over a common dataset for the English language, a purpose-collected dataset for the Arabic language, and evaluated on the Italian, Romanian and Portuguese languages as well. This result is then used as a starting point to develop new algorithms that use contextual information to increase the performance in automatic keyphrase extraction. The first algorithm that we present uses new linguistics features based on anaphora resolution, which is a field of natural language processing that exploits the relations between elements of the discourse as, e.g., pronouns. We evaluate several supervised AKE pipelines based on these features on the well-known SEMEVAL 2010 dataset, and we show that the performance increases when we add such features to a model that employs statistical and positional knowledge only. Finally, we investigate the possibilities offered by the field of Deep Learning, by proposing six different deep neural networks that perform automatic keyphrase extraction. Such networks are based on bidirectional long-short term memory networks, or on convolutional neural networks, or on a combination of both of them, and on a neural language model which creates a vector representation of each word of the document. These networks are able to learn new features using the the whole document when extracting keyphrases, and they have the advantage of not needing a corpus after being trained to extract keyphrases from new documents. We show that with deep learning based architectures we are able to outperform several other keyphrase extraction algorithms, both supervised and not supervised, used in literature and that the best performances are obtained when we build an additional neural representation of the input document and we append it to the neural language model. Both the anaphora-based and the deep-learning based approaches show that using contextual information, the performance in supervised algorithms for automatic keyphrase extraction improves. In fact, in the methods presented in this thesis, the algorithms which obtained the best performance are the ones receiving more contextual information, both about the relations of the potential keyphrase with other parts of the document, as in the anaphora based approach, and in the shape of a neural representation of the input document, as in the deep learning approach. In contrast, the approach of using statistical and positional knowledge only allows the building of language agnostic keyphrase extraction algorithms, at the cost of decreased precision and recall

    Learning Document-Level Semantic Properties from Free-Text Annotations

    Get PDF
    This paper presents a new method for inferring the semantic properties of documents by leveraging free-text keyphrase annotations. Such annotations are becoming increasingly abundant due to the recent dramatic growth in semi-structured, user-generated online content. One especially relevant domain is product reviews, which are often annotated by their authors with pros/cons keyphrases such as ``a real bargain'' or ``good value.'' These annotations are representative of the underlying semantic properties; however, unlike expert annotations, they are noisy: lay authors may use different labels to denote the same property, and some labels may be missing. To learn using such noisy annotations, we find a hidden paraphrase structure which clusters the keyphrases. The paraphrase structure is linked with a latent topic model of the review texts, enabling the system to predict the properties of unannotated documents and to effectively aggregate the semantic properties of multiple reviews. Our approach is implemented as a hierarchical Bayesian model with joint inference. We find that joint inference increases the robustness of the keyphrase clustering and encourages the latent topics to correlate with semantically meaningful properties. Multiple evaluations demonstrate that our model substantially outperforms alternative approaches for summarizing single and multiple documents into a set of semantically salient keyphrases

    High compression rate text summarization

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.Includes bibliographical references (p. 95-97).This thesis focuses on methods for condensing large documents into highly concise summaries, achieving compression rates on par with human writers. While the need for such summaries in the current age of information overload is increasing, the desired compression rate has thus far been beyond the reach of automatic summarization systems. The potency of our summarization methods is due to their in-depth modelling of document content in a probabilistic framework. We explore two types of document representation that capture orthogonal aspects of text content. The first represents the semantic properties mentioned in a document in a hierarchical Bayesian model. This method is used to summarize thousands of consumer reviews by identifying the product properties mentioned by multiple reviewers. The second representation captures discourse properties, modelling the connections between different segments of a document. This discriminatively trained model is employed to generate tables of contents for books and lecture transcripts. The summarization methods presented here have been incorporated into large-scale practical systems that help users effectively access information online.by Satchuthananthavale Rasiah Kuhan Branavan.S.M

    Learning semantic structures from in-domain documents

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.Cataloged from PDF version of thesis.Includes bibliographical references (p. 175-184).Semantic analysis is a core area of natural language understanding that has typically focused on predicting domain-independent representations. However, such representations are unable to fully realize the rich diversity of technical content prevalent in a variety of specialized domains. Taking the standard supervised approach to domainspecific semantic analysis requires expensive annotation effort for each new domain of interest. In this thesis, we study how multiple granularities of semantic analysis can be learned from unlabeled documents within the same domain. By exploiting in-domain regularities in the expression of text at various layers of linguistic phenomena, including lexicography, syntax, and discourse, the statistical approaches we propose induce multiple kinds of structure: relations at the phrase and sentence level, content models at the paragraph and section level, and semantic properties at the document level. Each of our models is formulated in a hierarchical Bayesian framework with the target structure captured as latent variables, allowing them to seamlessly incorporate linguistically-motivated prior and posterior constraints, as well as multiple kinds of observations. Our empirical results demonstrate that the proposed approaches can successfully extract hidden semantic structure over a variety of domains, outperforming multiple competitive baselines.by Harr Chen.Ph.D
    corecore