
31 research outputs found


Mixtures of Hierarchical Topics with Pachinko Allocation

Author: Mimno David
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2007

The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.

ScholarWorks@UMass Amherst

A Novel Document Generation Process for Topic Detection based on Hierarchical Latent Tree Models

Author: DJ Bartholomew
DM Blei
G Lubke
J Paisley
J Pearl
MA Sato
O Cappé
P Chen
T Liu
Publication date: 27/06/2019

We propose a novel document generation process based on hierarchical latent tree models (HLTMs) learned from data. An HLTM has a layer of observed word variables at the bottom and multiple layers of latent variables on top. For each document, we first sample values for the latent variables layer by layer via logic sampling, then draw relative frequencies for the words conditioned on the values of the latent variables, and finally generate words for the document using the relative word frequencies. The motivation for the work is to take word counts into consideration with HLTMs. In comparison with LDA-based hierarchical document generation processes, the new process achieves drastically better model fit with far fewer parameters. It also yields more meaningful topics and topic hierarchies. It is the new state of the art for hierarchical topic detection.
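
The three steps named in this abstract (sample latent variables top-down, draw relative word frequencies, generate words) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-layer toy tree, the binary latent states, the probability tables, and the use of Gamma draws normalized into relative frequencies.

```python
import random

# Hypothetical toy tree: root latent Z0 -> latents Z1, Z2 -> observed words.
# Structure and all numbers are illustrative assumptions.
PRIOR_ROOT = 0.5                        # P(Z0 = 1)
CHILD_GIVEN_PARENT = {0: 0.2, 1: 0.8}   # P(child = 1 | parent state)
WORDS = {"Z1": ["topic", "model"], "Z2": ["tree", "latent"]}
# Gamma shape for a word's weight given its parent's state (assumed values).
SHAPE = {0: 0.5, 1: 4.0}


def sample_document(n_words, rng):
    # Step 1: logic sampling, i.e. sample each latent after its parent,
    # starting from the top layer.
    z0 = int(rng.random() < PRIOR_ROOT)
    latents = {name: int(rng.random() < CHILD_GIVEN_PARENT[z0]) for name in WORDS}

    # Step 2: draw a positive weight per word conditioned on its parent
    # latent, then normalize the weights into relative frequencies.
    weights = {}
    for name, words in WORDS.items():
        for word in words:
            weights[word] = rng.gammavariate(SHAPE[latents[name]], 1.0)
    total = sum(weights.values())
    freqs = {word: value / total for word, value in weights.items()}

    # Step 3: generate the document's tokens from the relative frequencies.
    vocab = list(freqs)
    tokens = rng.choices(vocab, weights=[freqs[w] for w in vocab], k=n_words)
    return latents, freqs, tokens


if __name__ == "__main__":
    latents, freqs, tokens = sample_document(10, random.Random(0))
    print(latents, tokens)
```

Words under an active latent receive a larger Gamma shape and so tend to dominate the document, which is one simple way to let word counts depend on the latent states.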

arXiv.org e-Print Archive

Crossref

Novelty Detection in Sequential Data by Informed Clustering and Modeling

Author: Adilova Linara
Chen Siming
Kamp Michael
Publication date: 05/03/2021

Novelty detection in discrete sequences is a challenging task, since deviations from the process generating the normal data are often small or intentionally hidden. Novelties can be detected by modeling normal sequences and measuring the deviations of a new sequence from the model predictions. However, in many applications data is generated by several distinct processes, so that models trained on all the data tend to over-generalize and novelties remain undetected. We propose to approach this challenge through decomposition: by clustering the data we break down the problem, obtaining a simpler modeling task in each cluster, which can be modeled more accurately. However, this comes with a trade-off, since the amount of training data per cluster is reduced. This is a particular problem for discrete sequences, where state-of-the-art models are data-hungry. The success of this approach thus depends on the quality of the clustering, i.e., whether the individual learning problems are sufficiently simpler than the joint problem. While clustering discrete sequences automatically is a challenging and domain-specific task, it is often easy for human domain experts, given the right tools. In this paper, we adapt a state-of-the-art visual analytics tool for discrete sequence clustering to obtain informed clusters from domain experts and use LSTMs to model each cluster individually. Our extensive empirical evaluation indicates that this informed clustering outperforms automatic ones and that our approach outperforms state-of-the-art novelty detection methods for discrete sequences in three real-world application scenarios. In particular, decomposition outperforms a global model despite less training data on each individual cluster.
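
The decomposition idea in this abstract (one model per cluster, novelty measured as deviation from model predictions) can be sketched in a few lines. This is only an illustration under stated assumptions: a smoothed bigram model stands in for the LSTMs, the clusters are supplied by hand rather than by a visual analytics tool, and scoring a sequence by its best-fitting cluster model is an assumed rule, not one stated in the abstract. The names `BigramModel`, `fit_per_cluster`, and `novelty_score` are hypothetical.

```python
import math


class BigramModel:
    """Add-one-smoothed bigram model; a lightweight stand-in for an LSTM."""

    def __init__(self, alphabet):
        self.alphabet_size = len(set(alphabet))
        self.counts = {}  # counts[prev][nxt] = number of observed transitions

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                row = self.counts.setdefault(prev, {})
                row[nxt] = row.get(nxt, 0) + 1
        return self

    def surprise(self, seq):
        """Mean negative log-probability over the transitions in seq."""
        total = 0.0
        for prev, nxt in zip(seq, seq[1:]):
            row = self.counts.get(prev, {})
            prob = (row.get(nxt, 0) + 1) / (sum(row.values()) + self.alphabet_size)
            total -= math.log(prob)
        return total / max(len(seq) - 1, 1)


def fit_per_cluster(clusters, alphabet):
    # One model per cluster, each trained only on that cluster's sequences.
    return {name: BigramModel(alphabet).fit(seqs) for name, seqs in clusters.items()}


def novelty_score(seq, models):
    # Assumption: a sequence is scored against its best-fitting cluster model.
    return min(model.surprise(seq) for model in models.values())


if __name__ == "__main__":
    clusters = {"a": ["ababab", "abab"], "b": ["cdcdcd"]}  # hand-made clusters
    models = fit_per_cluster(clusters, "abcd")
    print(novelty_score("abab", models), novelty_score("acbd", models))
```

A sequence that follows one cluster's pattern gets a low score, while one that mixes patterns from several clusters scores higher against every per-cluster model.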

arXiv.org e-Print Archive