Temporal and Spatial Data Mining with Second-Order Hidden Models
As part of designing a knowledge discovery system, we have developed
stochastic models based on high-order hidden Markov models. These models are
capable of mapping sequences of data onto a Markov chain in which the
transitions between states depend on the n previous states, where n is the
order of the model. We study how to extract information from spatial and
temporal data by means of unsupervised classification. To this end, we use a
French national land-use database, named Teruti, which describes the land use
of a region in both the spatial and temporal domains. Land-use categories
(wheat, corn, forest, ...) are logged every year at regularly spaced sites in
the region. They constitute a temporal sequence of images in which we look for
spatial and temporal dependencies. The temporal segmentation of the data is
done by means of a second-order hidden Markov model (HMM2), which appears to be
very good at locating stationary segments, as shown in our previous work in
speech recognition. The spatial classification is performed by defining a
fractal scan of the images with the help of a Hilbert-Peano curve, which
introduces a total order on the sites while preserving the neighborhood
relation between them. We show that the HMM2 performs a classification that is
meaningful to agronomists. Spatial and temporal classification may be achieved
simultaneously by means of a two-level HMM2 that measures the a posteriori
probability of mapping a temporal sequence of images onto a set of hidden
classes.
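The fractal scan mentioned in this abstract can be illustrated with a minimal sketch. It assumes a square image whose side is a power of two and uses the standard Hilbert curve index-to-coordinate conversion; the function names `hilbert_d2xy` and `hilbert_scan` are hypothetical, and the authors' actual scanning procedure may differ.

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order square grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so the sub-curves join end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(grid):
    """Flatten a square grid (side a power of two) into a 1-D sequence
    by visiting its cells in Hilbert-curve order."""
    n = len(grid)
    order = n.bit_length() - 1
    if n == 0 or 1 << order != n or any(len(row) != n for row in grid):
        raise ValueError("grid must be square with a power-of-two side")
    return [grid[y][x] for x, y in (hilbert_d2xy(order, d) for d in range(n * n))]


if __name__ == "__main__":
    # Toy 4x4 "image" of land-use labels (illustrative values only).
    image = [
        ["wheat", "wheat", "corn", "corn"],
        ["wheat", "forest", "corn", "corn"],
        ["forest", "forest", "forest", "corn"],
        ["forest", "forest", "wheat", "wheat"],
    ]
    print(hilbert_scan(image))
```

Consecutive positions in the resulting sequence are always adjacent cells in the image, which is what lets a one-dimensional sequence model see some of the spatial neighborhood structure. The converse does not hold in general: some neighboring cells end up far apart in the sequence.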
Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression
Although fully generative models have been successfully used to model the
contents of text documents, they are often awkward to apply to combinations of
text data and document metadata. In this paper we propose a
Dirichlet-multinomial regression (DMR) topic model that includes a log-linear
prior on document-topic distributions that is a function of observed features
of the document, such as author, publication venue, references, and dates. We
show that by selecting appropriate features, DMR topic models can meet or
exceed the performance of several previously published topic models designed
for specific data.
Comment: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty
in Artificial Intelligence (UAI2008)
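A log-linear prior of the kind this abstract describes can be written as follows. This is a sketch under assumed notation, not taken from the listing itself: x_d is the feature vector of document d, lambda_k is a weight vector for topic k, K is the number of topics, and theta_d is the document-topic distribution.

```latex
\alpha_{dk} = \exp\!\left(\mathbf{x}_d^{\top} \boldsymbol{\lambda}_k\right),
\qquad
\boldsymbol{\theta}_d \sim \mathrm{Dirichlet}\!\left(\alpha_{d1}, \dots, \alpha_{dK}\right)
```

Under this form, features such as author or venue shift the Dirichlet parameters, so documents with similar metadata are drawn toward similar topic mixtures.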
Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network
Bibliographic analysis considers the author's research areas, the citation
network and the paper content among other things. In this paper, we combine
these three in a topic model that produces a bibliographic model of authors,
topics and documents, using a nonparametric extension of a combination of the
Poisson mixed-topic link model and the author-topic model. This gives rise to
the Citation Network Topic Model (CNTM). We propose a novel and efficient
inference algorithm for the CNTM to explore subsets of research publications
from CiteSeerX. The publication datasets are organised into three corpora,
totalling about 168k publications with about 62k authors. The queried
datasets are made available online. On three publicly available corpora, in
addition to the queried datasets, our proposed model demonstrates improved
performance in both model fitting and document clustering compared to several
baselines. Moreover, our model allows extraction of additional useful knowledge
from the corpora, such as the visualisation of the author-topics network.
Additionally, we propose a simple method to incorporate supervision into topic
modelling to achieve further improvement on the clustering task.
Comment: Preprint for Journal Machine Learnin