225 research outputs found

    DivGraphPointer: A Graph Pointer Network for Extracting Diverse Keyphrases

    Full text link
    Keyphrase extraction from documents is useful to a variety of applications such as information retrieval and document summarization. This paper presents an end-to-end method called DivGraphPointer for extracting a set of diversified keyphrases from a document. DivGraphPointer combines the advantages of traditional graph-based ranking methods and recent neural network-based approaches. Specifically, given a document, a word graph is constructed from the document based on word proximity and is encoded with graph convolutional networks, which effectively capture document-level word salience by modeling long-range dependency between words in the document and aggregating multiple appearances of identical words into one node. Furthermore, we propose a diversified point network to generate a set of diverse keyphrases out of the word graph in the decoding process. Experimental results on five benchmark data sets show that our proposed method significantly outperforms the existing state-of-the-art approaches.Comment: Accepted to SIGIR 201

    Consensus-based Approach for Keyword Extraction from Urban Events Collections

    Get PDF
    Automatic keyword extraction (AKE) from textual sources took a valuable step towards harnessing the problem of efficient scanning of large document collections. Particularly in the context of urban mobility, where the most relevant events in the city are advertised on-line, it becomes difficult to know exactly what is happening in a place./nIn this paper we tackle this problem by extracting a set of keywords from different kinds of textual sources, focusing on the urban events context. We propose an ensemble of automatic keyword extraction systems KEA (Key-phrase Extraction Algorithm) and KUSCO (Knowledge Unsupervised Search for instantiating Concepts on lightweight Ontologies) and Conditional Random Fields (CRF)./nUnlike KEA and KUSCO which are well-known tools for automatic keyword extraction, CRF needs further pre-processing. Therefore, a tool for handling AKE from the documents using CRF is developed. The architecture for the AKE ensemble system is designed and efficient integration of component applications is presented in which a consensus between such classifiers is achieved. Finally, we empirically show that our AKE ensemble system significantly succeeds on baseline sources and urban events collections

    A Survey on Semantic Processing Techniques

    Full text link
    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, e.g., word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missed in the published version due to the publication policies. Please contact Prof. Erik Cambria for detail

    Topic Distiller:distilling semantic topics from documents

    Get PDF
    Abstract. This thesis details the design and implementation of a system that can find relevant and latent semantic topics from textual documents. The design of this system, named Topic Distiller, is inspired by research conducted on automatic keyphrase extraction and automatic topic labeling, and it employs entity linking and knowledge bases to reduce text documents to their semantic topics. The Topic Distiller is evaluated using methods and datasets used in information retrieval and automatic keyphrase extraction. On top of the common datasets used in the literature three additional datasets are created to evaluate the system. The evaluation reveals that the Topic Distiller is able to find relevant and latent topics from textual documents, beating the state-of-the-art automatic keyphrase methods in performance when used on news articles and social media posts.Semanttisten aiheiden suodattaminen dokumenteista. Tiivistelmä. Tässä diplomityössä tarkastellaan järjestelmää, joka pystyy löytämään tekstistä relevantteja ja piileviä semanttisia aihealueita, sekä kyseisen järjestelmän suunnittelua ja implementaatiota. Tämän Topic Distiller -järjestelmän suunnittelu ammentaa inspiraatiota automaattisen termintunnistamisen ja automaattisen aiheiden nimeämisen tutkimuksesta sekä hyödyntää automaattista semanttista annotointia ja tietämyskantoja tekstin aihealueiden löytämisessä. Topic Distiller -järjestelmän suorituskykyä mitataan hyödyntämällä kirjallisuudessa paljon käytettyjä automaattisen termintunnistamisen evaluontimenetelmiä ja aineistoja. Näiden yleisten aineistojen lisäksi esittelemme kolme uutta aineistoa, jotka on luotu Topic Distiller -järjestelmän arviointia varten. Evaluointi tuo ilmi, että Topic Distiller kykenee löytämään relevantteja ja piileviä aiheita tekstistä. Se päihittää kirjallisuuden viimeisimmät automaattisen termintunnistamisen menetelmät suorituskyvyssä, kun sitä käytetään uutisartikkelien sekä sosiaalisen median julkaisujen analysointiin

    Persian Keyphrase Generation Using Sequence-to-Sequence Models

    Full text link
    Keyphrases are a very short summary of an input text and provide the main subjects discussed in the text. Keyphrase extraction is a useful upstream task and can be used in various natural language processing problems, for example, text summarization and information retrieval, to name a few. However, not all the keyphrases are explicitly mentioned in the body of the text. In real-world examples there are always some topics that are discussed implicitly. Extracting such keyphrases requires a generative approach, which is adopted here. In this paper, we try to tackle the problem of keyphrase generation and extraction from news articles using deep sequence-to-sequence models. These models significantly outperform the conventional methods such as Topic Rank, KPMiner, and KEA in the task of keyphrase extraction

    Automatic Keyphrase Extraction: A Survey of the State of the Art

    Full text link

    Concept graphs: Applications to biomedical text categorization and concept extraction

    Get PDF
    As science advances, the underlying literature grows rapidly providing valuable knowledge mines for researchers and practitioners. The text content that makes up these knowledge collections is often unstructured and, thus, extracting relevant or novel information could be nontrivial and costly. In addition, human knowledge and expertise are being transformed into structured digital information in the form of vocabulary databases and ontologies. These knowledge bases hold substantial hierarchical and semantic relationships of common domain concepts. Consequently, automating learning tasks could be reinforced with those knowledge bases through constructing human-like representations of knowledge. This allows developing algorithms that simulate the human reasoning tasks of content perception, concept identification, and classification. This study explores the representation of text documents using concept graphs that are constructed with the help of a domain ontology. In particular, the target data sets are collections of biomedical text documents, and the domain ontology is a collection of predefined biomedical concepts and relationships among them. The proposed representation preserves those relationships and allows using the structural features of graphs in text mining and learning algorithms. Those features emphasize the significance of the underlying relationship information that exists in the text content behind the interrelated topics and concepts of a text document. The experiments presented in this study include text categorization and concept extraction applied on biomedical data sets. The experimental results demonstrate how the relationships extracted from text and captured in graph structures can be used to improve the performance of the aforementioned applications. The discussed techniques can be used in creating and maintaining digital libraries through enhancing indexing, retrieval, and management of documents as well as in a broad range of domain-specific applications such as drug discovery, hypothesis generation, and the analysis of molecular structures in chemoinformatics
    corecore