486 research outputs found

    Dirichlet belief networks for topic structure learning

    Full text link
    Recently, considerable research effort has been devoted to developing deep architectures for topic models to learn topic structures. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics of the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on text corpora demonstrate the advantages of the proposed model.Comment: accepted in NIPS 201

    Innovative Heuristics to Improve the Latent Dirichlet Allocation Methodology for Textual Analysis and a New Modernized Topic Modeling Approach

    Get PDF
    Natural Language Processing is a complex method of data mining the vast trove of documents created and made available every day. Topic modeling seeks to identify the topics within textual corpora with limited human input into the process to speed analysis. Current topic modeling techniques used in Natural Language Processing have limitations in the pre-processing steps. This dissertation studies topic modeling techniques, those limitations in the pre-processing, and introduces new algorithms to gain improvements from existing topic modeling techniques while being competitive with computational complexity. This research introduces four contributions to the field of Natural Language Processing and topic modeling. First, this research identifies a requirement for a more robust “stopwords” list and proposes a heuristic for creating a more robust list. Second, a new dimensionality-reduction technique is introduced that exploits the number of words within a document to infer importance to word choice. Third, an algorithm is developed to determine the number of topics within a corpus and demonstrated using a standard topic modeling data set. These techniques produce a higher quality result from the Latent Dirichlet Allocation topic modeling technique. Fourth, a novel heuristic utilizing Principal Component Analysis is introduced that is capable of determining the number of topics within a corpus that produces stable sets of topic words

    Linking Information Technology and Entrepreneurship: A Literature Review

    Get PDF
    This study reviews research carried out in the domain of IT and entrepreneurship. A total of 907 papers, published between 1978 and 2020 were used to uncover the latent topics addressed in the domain. A topic modeling (LDA) algorithm was used to automate the process of extracting the initial research topics from the data. The literature review further enhances the understating of IT-associated entrepreneurship research, providing useful insights for future research and informs practice in this domain of study

    Quantifying shifts in topic popularity over 44 years of austral ecology

    Get PDF
    The Ecological Society of Australia was founded in 1959, and the society’s journal was first published in 1976. To examine how research published in the society’s journal has changed over this time, we used text mining to quantify themes and trends in the body of work published by the Australian Journal of Ecology and Austral Ecology from 1976 to 2019. We used topic models to identify 30 ‘topics’ within 2778 full-text articles in 246 issues of the journal, followed by mixed modelling to identify topics with above-average or below-average popularity in terms of the number of publications or citations that they contain. We found high inter-decadal turnover in research topics, with an early emphasis on highly specific ecosystems or processes giving way to a modern emphasis on community, spatial and fire ecology, invasive species and statistical modelling. Despite an early focus on Australian research, papers discussing South American ecosystems are now among the fastest-growing and most frequently cited topics in the journal. Topics that were growing fastest in publication rates were not always the same as those with high citation rates. Our results provide a systematic breakdown of the topics that Austral Ecology authors and editors have chosen to research, publish and cite through time, providing a valuable window into the historical and emerging foci of the journal. © 2020 Ecological Society of Australi

    Choosing the Number of Topics in LDA Models -- A Monte Carlo Comparison of Selection Criteria

    Full text link
    Selecting the number of topics in LDA models is considered to be a difficult task, for which alternative approaches have been proposed. The performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be implemented to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the DGPs are identified. Practical recommendations for LDA model selection in applications are derived

    TOPIC MODELLING METHODOLOGY: ITS USE IN INFORMATION SYSTEMS AND OTHER MANAGERIAL DISCIPLINES

    Get PDF
    Over the last decade, quantitative text mining approaches to content analysis have gained increasing traction within information systems research, and related fields, such as business administration. Recently, topic models, which are supposed to provide their user with an overview of themes being dis-cussed in documents, have gained popularity. However, while convenient tools for the creation of this model class exist, the evaluation of topic models poses significant challenges to their users. In this research, we investigate how questions of model validity and trustworthiness of presented analyses are addressed across disciplines. We accomplish this by providing a structured review of methodological approaches across the Financial Times 50 journal ranking. We identify 59 methodological research papers, 24 implementations of topic models, as well as 33 research papers using topic models in In-formation Systems (IS) research, and 29 papers using such models in other managerial disciplines. Results indicate a need for model implementations usable by a wider audience, as well as the need for more implementations of model validation techniques, and the need for a discussion about the theoretical foundations of topic modelling based research

    Validation of scientific topic models using graph analysis and corpus metadata

    Get PDF
    Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology an innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which are focused on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to state a new methodology for the selection of hyperparameters which is specifically oriented to optimize the similarity metrics emanating from the topic model. In order to do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second metric measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators to select the number of topics and build persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of techniques in STI policy analysis and design.Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101004870 (IntelComp project), and has also been partially supported by FEDER/ Spanish Ministry of Science, Innovation and Universities, State Agency of Research, project TEC2017-83838-R

    Hierarchical viewpoint discovery from tweets using Bayesian modelling

    Get PDF
    When users express their stances towards a topic in social media, they might elaborate their viewpoints or reasoning. Oftentimes, viewpoints expressed by different users exhibit a hierarchical structure. Therefore, detecting this kind of hierarchical viewpoints offers a better insight to understand the public opinion. In this paper, we propose a novel Bayesian model for hierarchical viewpoint discovery from tweets. Driven by the motivation that a viewpoint expressed in a tweet can be regarded as a path from the root to a leaf of a hierarchical viewpoint tree, the assignment of the relevant viewpoint topics is assumed to follow two nested Chinese restaurant processes. Moreover, opinions in text are often expressed in un-semantically decomposable multi-terms or phrases, such as ‘economic recession’. Hence, a hierarchical Pitman–Yor process is employed as a prior for modelling the generation of phrases with arbitrary length. Experimental results on two Twitter corpora demonstrate the effectiveness of the proposed Bayesian model for hierarchical viewpoint discovery
    • 

    corecore