14 research outputs found

    Evaluating Similarity Metrics for Latent Twitter Topics

    Topic modelling approaches such as LDA, when applied to a tweet corpus, can often generate a topic model containing redundant topics. To evaluate the quality of a topic model in terms of redundancy, topic similarity metrics can be applied to estimate the similarity among topics in a topic model. There are various topic similarity metrics in the literature, e.g. the Jensen-Shannon (JS) divergence-based metric. In this paper, we evaluate the performance of four distance/divergence-based topic similarity metrics and examine how they align with human judgements, including a newly proposed similarity metric that is based on computing word semantic similarity using word embeddings (WE). To obtain human judgements, we conduct a user study through crowdsourcing. Among various insights, our study shows that in general the cosine similarity (CS) and WE-based metrics perform better and appear to be complementary. However, we also find that the human assessors cannot easily distinguish between the distance/divergence-based and the semantic similarity-based metrics when identifying similar latent Twitter topics.
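
    As a rough illustration of two of the metric families named in this abstract, the sketch below computes the Jensen-Shannon divergence and the cosine similarity between topics represented as word-probability vectors. It is not the paper's implementation: the function names and the three toy topic vectors are hypothetical, and the base-2 logarithm is an assumed convention that bounds the JS divergence between 0 and 1.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    # Mixture distribution halfway between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cosine_similarity(p, q):
    """Cosine of the angle between two topic-word vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical topic-word distributions over a shared 4-word vocabulary.
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.4, 0.4, 0.1, 0.1]  # close to topic_a, so likely redundant
topic_c = [0.0, 0.1, 0.3, 0.6]  # concentrated on different words

for name, other in [("b", topic_b), ("c", topic_c)]:
    print(name, js_divergence(topic_a, other), cosine_similarity(topic_a, other))
```

    A low JS divergence or a high cosine similarity between two topics would flag them as candidates for redundancy. A word-embedding-based metric would instead compare the top words of each topic in an embedding space, which this sketch does not attempt.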

    Modeling technological topic changes in patent claims

    © 2014 Portland International Conference on Management of Engineering and Technology. Patent claims usually embody the most essential terms and the core technological scope that define the protection of an invention, which makes them the ideal resource for patent content and topic change analysis. However, manually conducting content analysis on a massive number of technical terms is very time-consuming and laborious. Even with the help of traditional text mining techniques, it is still difficult to model topic changes over time, because single keywords alone are usually too general or ambiguous to represent a concept. Moreover, term frequency, which is commonly used to define a topic, cannot separate polysemous words that actually describe different themes. To address this issue, this research proposes a topic change identification approach based on Latent Dirichlet Allocation to model and analyze topic changes with minimal human intervention. After textual data cleaning, underlying semantic topics hidden in large archives of patent claims are revealed automatically. Concepts are defined by probability distributions over words instead of term frequency, so that polysemy is allowed. A case study using patents published by the United States Patent and Trademark Office (USPTO) from 2009 to 2013 with Australia as their assignee country is presented to demonstrate the validity of the proposed topic change identification approach. The experimental results show that the proposed approach can be used as an automatic tool to provide machine-identified topic changes for more efficient and effective R&D management assistance.

    The Shifting Attention of Political Leaders: Evidence from Two Centuries of Presidential Speeches

    We use natural-language-processing algorithms on a novel dataset of over 900 presidential speeches from ten Latin American countries spanning two centuries to study the dynamics and determinants of presidential policy priorities. We show that most speech content can be characterized by a compact set of policy issues whose relative composition exhibited slow yet substantial shifts over 1819-2022. Presidential attention initially centered on military interventions and the development of state capacity. Attention gradually evolved towards building physical capital through investments in infrastructure and public services and finally turned towards building human capital through investments in education, health, and social safety nets. We characterize the way in which president-level characteristics, like age and gender, predict the main policy issues. Our findings offer novel insights into the dynamics of presidential attention and the factors that shape it, expanding our understanding of political agenda-setting. Comment: JEL codes: D78, I32, D72, N1

    Cannabis in Danish newspapers

    Using quantitative methods, the Danish cannabis debate in national newspapers is investigated. The investigation shows that the most prevalent topics relate to law enforcement. Legalization has become an increasingly important topic in the Danish cannabis debate, and the investigation shows a reframing of this debate so that it is increasingly related to concerns about organized crime. In this way, the Danish cannabis legalization debate shows the same development as the debates that have led to legalization in certain states of the United States of America.

    Development of research trends evolution model for computer science for Malaysian publication

    Studies that apply text mining to publications in order to examine research trends have become common. However, most of these studies only investigated the gaps in existing research trend models, and did not clarify how the text mining of bibliometric elements was executed or how the timeline windows representing the "trends" were defined. Thus, this study aimed to develop a conceptual model for research trends in Malaysian publications, specifically one that incorporates the text element of bibliometrics and the execution of timeline windows to identify research trends. In the context of research trends, the evolution or growth of a research area from one period to another is important. This includes what has happened, what is currently happening, and which potential research trends may emerge in the near future for others to continue the research development. The elements of the newly developed model were extracted from the literature review and adapted from one of the selected models. The new model consisted of three stages: the first stage was the selection of the document collection; the second stage was the selection of the bibliometric element; and the third stage was the execution of text mining, that is, co-word analysis of the selected textual bibliometric element together with the implementation of two timeline windows (fixed and sliding). The third stage also required a supporting tool, CiteSpace. The newly developed model was tested on data downloaded from two databases, Scopus (10,052 publications) and Web of Science (WoS) (22,088 publications), covering the period between 1995 and 2019. This study identified that the research trend pattern became more active from 2002 onwards. In addition, the research topics became fresher and more unconventional throughout the timelines. Artificial intelligence, network communication, and wireless sensor networks were the hottest and most enduring research topics, while knowledge management, internet banking, online shopping, and eCommerce were alternative options for computer science researchers. The evolution and growth within each timeline show that researchers are investigating each topic thoroughly. Moreover, some small topics do not appear in fixed timeline windows but instead emerge from sliding timeline windows, such as system development, shared banking service, virtual team collaboration, and internet policy. This study also captured highlighted keywords that could give hints or serve as an initial idea for the next research journey. Expert evaluation and validation were carried out because the interpretation of experimental results requires experts' expertise, experience, and views. A semi-structured interview was conducted with thirteen experts who have remarkable expertise in research and development. From the discussion, most experts agreed that the model could help others identify research trends and potential new research topics emerging for future research journeys. The newly developed model could be beneficial to those who need hints for their next exploration and could help those keen to understand how to execute text mining within the bibliometric elements.
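
    The difference between the fixed and sliding timeline windows mentioned in this abstract can be sketched as follows. This is an illustrative assumption rather than the study's procedure (which relied on CiteSpace): the five-year window size, the one-year step, the function names, and the toy keyword records are all hypothetical; only the 1995 to 2019 range comes from the abstract.

```python
from collections import Counter


def fixed_windows(start, end, size):
    """Non-overlapping windows covering start..end (inclusive years)."""
    return [(y, min(y + size - 1, end)) for y in range(start, end + 1, size)]


def sliding_windows(start, end, size, step=1):
    """Overlapping windows of equal length, advanced by `step` years."""
    return [(y, y + size - 1) for y in range(start, end - size + 2, step)]


def keyword_counts(records, window):
    """Count keyword occurrences among (year, keyword) records in a window."""
    lo, hi = window
    return Counter(kw for year, kw in records if lo <= year <= hi)


# Hypothetical (year, keyword) records; not data from the study.
records = [
    (1999, "internet policy"),
    (2000, "internet policy"),
    (2003, "wireless sensor networks"),
]

fixed = fixed_windows(1995, 2019, size=5)
sliding = sliding_windows(1995, 2019, size=5)

# A keyword straddling a fixed-window boundary is split across two windows,
# but at least one sliding window contains both of its occurrences.
print([keyword_counts(records, w)["internet policy"] for w in fixed[:2]])
print(max(keyword_counts(records, w)["internet policy"] for w in sliding))
```

    Under these assumptions, a topic whose occurrences fall on either side of a fixed boundary looks weaker in each fixed window than in a sliding window that spans the boundary, which is one plausible reason small topics could surface only in the sliding view.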

    The statistics of topic modelling.

    This research project aims to provide a clear and concise guide to latent Dirichlet allocation, which is a form of topic modelling. The aim is to help researchers who do not have a strong background in mathematics or statistics to feel comfortable using topic modelling in their work. To achieve this, the thesis provides a step-by-step explanation of how topic modelling works. A range of tools that can be used to perform a topic model analysis is also described. The first chapter explains how topic modelling and, more specifically, latent Dirichlet allocation work; it offers a very basic explanation and then provides an easy-to-follow mathematical explanation. The second chapter explains how to perform a topic model analysis by describing each step involved, from the type of dataset through to the software packages available. The third section provides an example topic model analysis, based on the Philpapers dataset. The final section discusses the highlights of each chapter and areas for further research.