19,864 research outputs found
Recommended from our members
COMPACT REPRESENTATIONS OF UNCERTAINTY IN CLUSTERING
Flat clustering and hierarchical clustering are two fundamental tasks, often used to discover meaningful structures in data, such as subtypes of cancer, phylogenetic relationships, taxonomies of concepts, and cascades of particle decays in particle physics. When multiple clusterings of the data are possible, it is useful to represent uncertainty in clustering through various probabilistic quantities, such as the distribution over partitions or tree structures, and the marginal probabilities of subpartitions or subtrees.
Many compact representations exist for structured prediction problems, enabling the efficient computation of probability distributions, e.g., a trellis structure and corresponding Forward-Backward algorithm for Markov models that model sequences. However, no such representation has been proposed for either flat or hierarchical clustering models. In this thesis, we present our work developing data structures and algorithms for computing probability distributions over flat and hierarchical clusterings, as well as for finding maximum a posteriori (MAP) flat and hierarchical clusterings, and various marginal probabilities, as given by a wide range of energy-based clustering models.
First, we describe a trellis structure that compactly represents distributions over flat or hierarchical clusterings. We also describe related data structures that represent approximate distributions. We then present algorithms that, using these structures, allow us to compute the partition function, MAP clustering, and the marginal proba- bilities of a cluster (and sub-hierarchy, in the case of hierarchical clustering) exactly. We also show how these and related algorithms can be used to approximate these values, and analyze the time and space complexity of our proposed methods. We demonstrate the utility of our approaches using various synthetic data of interest as well as in two real world applications, namely particle physics at the Large Hadron Collider at CERN and in cancer genomics. We conclude with a brief discussion of future work
A Probabilistic Embedding Clustering Method for Urban Structure Detection
Urban structure detection is a basic task in urban geography. Clustering is a
core technology to detect the patterns of urban spatial structure, urban
functional region, and so on. In big data era, diverse urban sensing datasets
recording information like human behaviour and human social activity, suffer
from complexity in high dimension and high noise. And unfortunately, the
state-of-the-art clustering methods does not handle the problem with high
dimension and high noise issues concurrently. In this paper, a probabilistic
embedding clustering method is proposed. Firstly, we come up with a
Probabilistic Embedding Model (PEM) to find latent features from high
dimensional urban sensing data by learning via probabilistic model. By latent
features, we could catch essential features hidden in high dimensional data
known as patterns; with the probabilistic model, we can also reduce uncertainty
caused by high noise. Secondly, through tuning the parameters, our model could
discover two kinds of urban structure, the homophily and structural
equivalence, which means communities with intensive interaction or in the same
roles in urban structure. We evaluated the performance of our model by
conducting experiments on real-world data and experiments with real data in
Shanghai (China) proved that our method could discover two kinds of urban
structure, the homophily and structural equivalence, which means clustering
community with intensive interaction or under the same roles in urban space.Comment: 6 pages, 7 figures, ICSDM201
Optimal Clustering under Uncertainty
Classical clustering algorithms typically either lack an underlying
probability framework to make them predictive or focus on parameter estimation
rather than defining and minimizing a notion of error. Recent work addresses
these issues by developing a probabilistic framework based on the theory of
random labeled point processes and characterizing a Bayes clusterer that
minimizes the number of misclustered points. The Bayes clusterer is analogous
to the Bayes classifier. Whereas determining a Bayes classifier requires full
knowledge of the feature-label distribution, deriving a Bayes clusterer
requires full knowledge of the point process. When uncertain of the point
process, one would like to find a robust clusterer that is optimal over the
uncertainty, just as one may find optimal robust classifiers with uncertain
feature-label distributions. Herein, we derive an optimal robust clusterer by
first finding an effective random point process that incorporates all
randomness within its own probabilistic structure and from which a Bayes
clusterer can be derived that provides an optimal robust clusterer relative to
the uncertainty. This is analogous to the use of effective class-conditional
distributions in robust classification. After evaluating the performance of
robust clusterers in synthetic mixtures of Gaussians models, we apply the
framework to granular imaging, where we make use of the asymptotic
granulometric moment theory for granular images to relate robust clustering
theory to the application.Comment: 19 pages, 5 eps figures, 1 tabl
Belief Hierarchical Clustering
In the data mining field many clustering methods have been proposed, yet
standard versions do not take into account uncertain databases. This paper
deals with a new approach to cluster uncertain data by using a hierarchical
clustering defined within the belief function framework. The main objective of
the belief hierarchical clustering is to allow an object to belong to one or
several clusters. To each belonging, a degree of belief is associated, and
clusters are combined based on the pignistic properties. Experiments with real
uncertain data show that our proposed method can be considered as a propitious
tool
Statistical Traffic State Analysis in Large-scale Transportation Networks Using Locality-Preserving Non-negative Matrix Factorization
Statistical traffic data analysis is a hot topic in traffic management and
control. In this field, current research progresses focus on analyzing traffic
flows of individual links or local regions in a transportation network. Less
attention are paid to the global view of traffic states over the entire
network, which is important for modeling large-scale traffic scenes. Our aim is
precisely to propose a new methodology for extracting spatio-temporal traffic
patterns, ultimately for modeling large-scale traffic dynamics, and long-term
traffic forecasting. We attack this issue by utilizing Locality-Preserving
Non-negative Matrix Factorization (LPNMF) to derive low-dimensional
representation of network-level traffic states. Clustering is performed on the
compact LPNMF projections to unveil typical spatial patterns and temporal
dynamics of network-level traffic states. We have tested the proposed method on
simulated traffic data generated for a large-scale road network, and reported
experimental results validate the ability of our approach for extracting
meaningful large-scale space-time traffic patterns. Furthermore, the derived
clustering results provide an intuitive understanding of spatial-temporal
characteristics of traffic flows in the large-scale network, and a basis for
potential long-term forecasting.Comment: IET Intelligent Transport Systems (2013
- …