405 research outputs found

    Exploiting extensible background knowledge for clustering-based automatic keyphrase extraction

    Keyphrases are single- or multi-word phrases that describe the essential content of a document. Keyphrase extraction methods often use an external knowledge source such as WordNet to obtain relation information about terms and thereby improve their results, but a single knowledge source is often limited in coverage. This problem is identified as the coverage limitation problem. In this paper, we introduce SemCluster, a clustering-based unsupervised keyphrase extraction method that addresses the coverage limitation problem by using an extensible approach that integrates an internal ontology (i.e., WordNet) with other knowledge sources to gain wider background knowledge. SemCluster is evaluated against three unsupervised methods, TextRank, ExpandRank, and KeyCluster, under the F1-measure metric. The evaluation results demonstrate that SemCluster has better accuracy and computational efficiency and is more robust when dealing with documents from different domains.
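The F1-measure mentioned in this abstract is commonly computed per document by comparing extracted keyphrases with human-assigned ones. The following is a minimal sketch, assuming exact matching after lowercasing and whitespace trimming; the abstract does not state which matching scheme SemCluster's evaluation used, and the function name `keyphrase_f1` and the sample phrases are hypothetical.

```python
def keyphrase_f1(predicted, gold):
    """Return (precision, recall, F1) for exact-match keyphrase sets.

    Assumption: phrases are compared after lowercasing and trimming
    whitespace; stemming or partial matching is not applied here.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    hits = len(pred & ref)
    # With no overlap (or an empty set) all three scores are zero.
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 2 of 3 predictions match 2 of 4 gold phrases.
scores = keyphrase_f1(
    ["keyphrase extraction", "WordNet", "graph"],
    ["keyphrase extraction", "wordnet", "clustering", "ontology"],
)
```

Scores for a whole collection are usually obtained by averaging the per-document values or by pooling the counts; which of the two a given paper reports has to be checked in that paper.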

    Theme-driven Keyphrase Extraction to Analyze Social Media Discourse

    Social media platforms are vital resources for sharing self-reported health experiences, offering rich data on various health topics. Despite advancements in Natural Language Processing (NLP) enabling large-scale social media data analysis, a gap remains in applying keyphrase extraction to health-related content. Keyphrase extraction is used to identify salient concepts in social media discourse without being constrained by predefined entity classes. This paper introduces a theme-driven keyphrase extraction framework tailored for social media, a pioneering approach designed to capture clinically relevant keyphrases from user-generated health texts. Themes are defined as broad categories determined by the objectives of the extraction task. We formulate this novel task of theme-driven keyphrase extraction and demonstrate its potential for efficiently mining social media text for the use case of treatment for opioid use disorder. This paper uses qualitative and quantitative analysis to demonstrate the feasibility of extracting actionable insights from social media data and of efficiently extracting keyphrases using minimally supervised NLP models. Our contributions include the development of a novel data collection and curation framework for theme-driven keyphrase extraction and the creation of MOUD-Keyphrase, the first dataset of its kind comprising human-annotated keyphrases from a Reddit community. We also identify the scope of minimally supervised NLP models to extract keyphrases from social media data efficiently. Lastly, we found that a large language model (ChatGPT) outperforms unsupervised keyphrase extraction models, and we evaluate its efficacy in this task.
    Comment: 11 pages, 2 figures, submitted to ICWSM. This version represents a substantial expansion and refocus of the previous manuscript, including new experiments, expanded data analysis, and comprehensive discussion.

    Construindo grafos de conhecimento utilizando documentos textuais para análise de literatura científica (Building knowledge graphs from textual documents for scientific literature analysis)

    Advisor: Julio Cesar dos Reis. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: The number of publications a researcher must absorb has been increasing in recent years. Consequently, among so many options, it is hard for researchers to identify relevant documents related to their studies. Researchers usually rely on review articles to understand how a scientific field is organized and to study its state of the art. Such articles can be unavailable or outdated, depending on the area studied, so researchers usually have to carry out this laborious background research manually. Recent research has developed mechanisms to help researchers understand the structure of scientific fields. However, those mechanisms focus exclusively on recommending relevant articles or on showing how a scientific field is organized at the level of its publications. These methods limit the understanding of the field, because they do not allow researchers to study the underlying concepts and relations that compose a scientific field and its sub-areas. This M.Sc. thesis proposes a framework to structure, analyze, and track the evolution of a scientific field at the concept level. Given a set of textual documents such as research papers, it first structures a scientific field as a knowledge graph using its detected concepts as vertices. Then, it automatically identifies the field's main sub-areas, extracts their keyphrases, and studies their relations. Our framework represents the scientific field in distinct time periods, which makes it possible to compare these representations and identify how the field's sub-areas changed over the years. We evaluate each step of our framework by representing and analyzing scientific data from distinct fields of knowledge in case studies. Our findings indicate success in detecting the sub-areas from the graph generated from natural language documents. We observe similar outcomes across the different case studies, indicating that our approach is applicable to distinct scientific domains. This research also contributes a web-based software tool that allows researchers to use the proposed framework graphically. Using our application, researchers can obtain an overview of how a scientific field is structured and how it has evolved.
    Degree: Mestrado (master's), Ciência da Computação (Computer Science), Mestre em Ciência da Computação. Grants: 2013/08293-7; 2017/02325-5. Funding: FAPESP, CAPE

    Topic Distiller: distilling semantic topics from documents

    Abstract. This thesis details the design and implementation of a system that can find relevant and latent semantic topics in textual documents. The design of this system, named Topic Distiller, is inspired by research on automatic keyphrase extraction and automatic topic labeling, and it employs entity linking and knowledge bases to reduce text documents to their semantic topics. The Topic Distiller is evaluated using methods and datasets from information retrieval and automatic keyphrase extraction. In addition to the common datasets used in the literature, three new datasets are created to evaluate the system. The evaluation reveals that the Topic Distiller is able to find relevant and latent topics in textual documents, outperforming state-of-the-art automatic keyphrase extraction methods on news articles and social media posts.
    Finnish title: Semanttisten aiheiden suodattaminen dokumenteista.