ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics
This paper presents an algorithmic family of dynamic topic models called
Aligned Neural Topic Models (ANTM), which combine novel data mining algorithms
to provide a modular framework for discovering evolving topics. ANTM maintains
the temporal continuity of evolving topics by extracting time-aware features
from documents using advanced pre-trained Large Language Models (LLMs) and
employing an overlapping sliding window algorithm for sequential document
clustering. This overlapping sliding window algorithm identifies a different
number of topics within each time frame and aligns semantically similar
document clusters across time periods. This process captures emerging and
fading trends across different periods and allows for a more interpretable
representation of evolving topics. Experiments on four distinct datasets show
that ANTM outperforms probabilistic dynamic topic models in terms of topic
coherence and diversity metrics. Moreover, it improves the scalability and
flexibility of dynamic topic models by being accessible and adaptable to
different types of algorithms. Additionally, a Python package is developed for
researchers and scientists who wish to study the trends and evolving patterns
of topics in large-scale textual data.
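The overlapping sliding window idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the ANTM package's API: the function name `overlapping_windows` and its parameters `window_length` and `overlap` are hypothetical, and the sketch covers only the step of assigning documents to overlapping time frames, not the clustering or cross-period alignment the abstract describes.

```python
from typing import List, Sequence, Tuple


def overlapping_windows(
    times: Sequence[float], window_length: float, overlap: float
) -> List[Tuple[float, float, List[int]]]:
    """Assign document indices to overlapping time frames.

    Hypothetical helper for illustration only. Each frame covers
    [start, start + window_length), and consecutive frames share
    `overlap` time units.
    """
    if window_length <= 0 or not 0 <= overlap < window_length:
        raise ValueError("need window_length > 0 and 0 <= overlap < window_length")
    if not times:
        return []

    step = window_length - overlap  # how far each new frame advances
    start, last = min(times), max(times)
    windows = []
    while start <= last:
        end = start + window_length
        # Documents whose timestamp falls inside the current frame.
        members = [i for i, t in enumerate(times) if start <= t < end]
        windows.append((start, end, members))
        start += step
    return windows


# Example: five documents dated by year, 3-year frames overlapping by 1 year.
frames = overlapping_windows([2000, 2001, 2002, 2003, 2004], 3, 1)
for start, end, members in frames:
    print(start, end, members)
```

Because a document near a frame boundary lands in two consecutive frames, clusters found in neighbouring frames share documents, which is one plausible basis for aligning them across time periods.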
Unifying Community Detection Across Scales from Genomes to Landscapes
Biodiversity science encompasses multiple disciplines and biological scales from molecules to landscapes. Nevertheless, biodiversity data are often analyzed separately with discipline-specific methodologies, constraining resulting inferences to a single scale. To overcome this, we present a topic modeling framework to analyze community composition in cross-disciplinary datasets, including those generated from metagenomics, metabolomics, field ecology and remote sensing. Using topic models, we demonstrate how community detection in different datasets can inform the conservation of interacting plants and herbivores. We show how topic models can identify members of molecular, organismal and landscape-level communities that relate to wildlife health, from gut microbes to forage quality. We conclude with a future vision for how topic modeling can be used to design cross-scale studies that promote a holistic approach to detect, monitor and manage biodiversity.
Topic Modeling on Health Journals with Regularized Variational Inference
Topic modeling enables exploration and compact representation of a corpus.
The CaringBridge (CB) dataset is a massive collection of journals written by
patients and caregivers during a health crisis. Topic modeling on the CB
dataset, however, is challenging due to the asynchronous nature of multiple
authors writing about their health journeys. To overcome this challenge we
introduce the Dynamic Author-Persona topic model (DAP), a probabilistic
graphical model designed for temporal corpora with multiple authors. The
novelty of the DAP model lies in its representation of authors by a persona ---
where personas capture the propensity to write about certain topics over time.
Further, we present a regularized variational inference algorithm, which we use
to encourage the DAP model's personas to be distinct. Our results show
significant improvements over competing topic models --- particularly after
regularization, and highlight the DAP model's unique ability to capture common
journeys shared by different authors.
Comment: Published in Thirty-Second AAAI Conference on Artificial
Intelligence, February 2018, New Orleans, Louisiana, US
Bitcoin Volatility Forecasting with a Glimpse into Buy and Sell Orders
In this paper, we study short-term prediction of fluctuations in the Bitcoin
exchange price against the United States dollar. We use realized volatility
data collected in 2016 and 2017 from one of the largest Bitcoin digital
trading offices, as well as order information. Experiments are performed to
evaluate a variety of statistical and machine learning approaches.
Comment: Full version of the paper published at IEEE International Conference
on Data Mining (ICDM), 201
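For readers unfamiliar with the term, realized volatility is commonly computed as the square root of the sum of squared log returns over an interval. The sketch below uses that common textbook definition as an assumption; the abstract does not state which definition the paper uses, and the function name `realized_volatility` is hypothetical.

```python
import math
from typing import Sequence


def realized_volatility(prices: Sequence[float]) -> float:
    """Square root of the sum of squared log returns.

    Assumes the common textbook definition; the paper may define
    realized volatility differently.
    """
    if len(prices) < 2:
        raise ValueError("need at least two prices to form a return")
    # Log return between each pair of consecutive prices.
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return math.sqrt(sum(r * r for r in log_returns))


# Example with made-up prices (not data from the paper).
print(realized_volatility([100.0, 101.0, 99.5, 100.2]))
```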