Inference and Evaluation of the Multinomial Mixture Model for Text Clustering
In this article, we investigate the use of a probabilistic model for
unsupervised clustering in text collections. Unsupervised clustering has become
a basic module for many intelligent text processing applications, such as
information retrieval, text classification or information extraction. The model
considered in this contribution consists of a mixture of multinomial
distributions over the word counts, each component corresponding to a different
theme. We present and contrast various estimation procedures, which apply in both
supervised and unsupervised contexts. In supervised learning, this work
suggests a criterion for evaluating the posterior odds of new documents which
is more statistically sound than the "naive Bayes" approach. In an unsupervised
context, we propose measures to set up a systematic evaluation framework and
start by examining the Expectation-Maximization (EM) algorithm as the basic
tool for inference. We discuss the importance of initialization and the
influence of other features such as the smoothing strategy or the size of the
vocabulary, thereby illustrating the difficulties incurred by the high
dimensionality of the parameter space. We also propose a heuristic algorithm
based on iterative EM with vocabulary reduction to solve this problem. Using
the fact that the latent variables can be analytically integrated out, we
finally show that a Gibbs sampling algorithm is tractable and compares favorably
to the basic expectation-maximization approach.
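
As a rough illustration of the kind of model this abstract describes, the sketch below runs EM on a mixture of multinomials over word counts. It is not the authors' implementation: the function name `em_multinomial_mixture`, the additive `smoothing` constant, the random soft-assignment initialization, and the toy count matrix are all assumptions made for this example.

```python
import numpy as np

def em_multinomial_mixture(counts, n_components, n_iter=50, smoothing=0.1, seed=0):
    """EM for a mixture of multinomials over word counts (illustrative sketch).

    counts: (n_docs, n_words) array of word counts per document.
    Returns mixture weights, per-component word probabilities, and the
    posterior responsibilities of each component for each document.
    """
    rng = np.random.default_rng(seed)
    n_docs, _ = counts.shape
    # Assumed initialization: random soft assignments of documents to themes.
    resp = rng.dirichlet(np.ones(n_components), size=n_docs)
    for _ in range(n_iter):
        # M-step: re-estimate weights and smoothed word distributions.
        weights = resp.sum(axis=0) / n_docs
        word_counts = resp.T @ counts + smoothing          # shape (K, V)
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        # E-step in the log domain; the multinomial coefficient is the same
        # for every component, so it cancels in the posterior.
        log_post = np.log(weights + 1e-300) + counts @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)    # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return weights, word_probs, resp

# Hypothetical toy corpus: three documents per theme, six-word vocabulary.
toy_counts = np.array([
    [40, 35, 45, 0, 1, 0],
    [38, 42, 40, 1, 0, 0],
    [41, 39, 40, 0, 0, 2],
    [0, 1, 0, 40, 44, 36],
    [2, 0, 0, 39, 41, 40],
    [0, 0, 1, 42, 38, 40],
], dtype=float)
weights, word_probs, resp = em_multinomial_mixture(toy_counts, n_components=2)
labels = resp.argmax(axis=1)
```

The result depends on the random initialization, which is the sensitivity the abstract points to; in practice one would likely compare several seeds or smoothing values.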
Arriving on time: estimating travel time distributions on large-scale road networks
Most optimal routing problems focus on minimizing travel time or distance
traveled. Oftentimes, a more useful objective is to maximize the probability of
on-time arrival, which requires statistical distributions of travel times,
rather than just mean values. We propose a method to estimate travel time
distributions on large-scale road networks, using probe vehicle data collected
from GPS. We present a framework that handles large volumes of input data and
scales linearly with the size of the network. Leveraging the planar topology of
the graph, the method efficiently computes the time correlations between
neighboring streets. First, raw probe vehicle traces are compressed into pairs
of travel times and number of stops for each traversed road segment using a
'stop-and-go' algorithm developed for this work. The compressed data is then
used as input for training a path travel time model, which couples a Markov
model with a Gaussian Markov random field. Finally, scalable inference
algorithms are developed for obtaining path travel time distributions from the
composite MM-GMRF model. We illustrate the accuracy and scalability of our
model on a 505,000-link road network spanning the San Francisco Bay Area.
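
To see why the on-time-arrival objective needs distributions rather than means, consider the small sketch below. It compares two routes using empirical travel-time samples. The helper `on_time_probability`, the sample values, and the 28-minute deadline are hypothetical and are not taken from the paper or its data.

```python
import numpy as np

def on_time_probability(travel_times, deadline):
    """Empirical probability that a trip finishes within the deadline."""
    times = np.asarray(travel_times, dtype=float)
    return float(np.mean(times <= deadline))

# Hypothetical travel-time samples in minutes for two candidate routes.
route_a = [18, 19, 20, 21, 22, 35, 40]  # usually fast, occasionally very slow
route_b = [24, 25, 25, 26, 26, 27, 27]  # slower on average, but consistent

deadline = 28  # assumed time budget in minutes

mean_a, mean_b = np.mean(route_a), np.mean(route_b)
p_a = on_time_probability(route_a, deadline)
p_b = on_time_probability(route_b, deadline)
print(f"route A: mean {mean_a:.1f} min, P(on time) = {p_a:.2f}")
print(f"route B: mean {mean_b:.1f} min, P(on time) = {p_b:.2f}")
```

With these made-up samples, route A has the lower mean travel time while route B has the higher probability of arriving within the deadline, so the two objectives pick different routes.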