On Randomly Projected Hierarchical Clustering with Guarantees
Hierarchical clustering (HC) algorithms are generally limited to small data
instances due to their runtime costs. Here we mitigate this shortcoming and
explore fast HC algorithms based on random projections for single (SLC) and
average (ALC) linkage clustering as well as for the minimum spanning tree
problem (MST). We present a thorough adaptive analysis of our algorithms that
improve on the running time of prior work for a
dataset of points in Euclidean space. The algorithms maintain, with
arbitrarily high probability, the outcome of hierarchical clustering as well as
the worst-case running-time guarantees. We also present parameter-free
instances of our algorithms.
Comment: This version contains the conference paper "On Randomly Projected
Hierarchical Clustering with Guarantees", SIAM International Conference on
Data Mining (SDM), 2014 and, additionally, proofs omitted in the conference
version.
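For orientation, single linkage clustering and the MST problem are closely related: processing all pairwise distances in increasing order with a union-find structure (Kruskal's algorithm) yields the MST edges, and their weights are the SLC merge heights. The sketch below shows only this quadratic baseline, not the paper's random-projection algorithm; the function name `single_linkage_merges` and the toy points are illustrative assumptions.

```python
import math
from itertools import combinations


def single_linkage_merges(points):
    """Return (distance, a, b) for each single-linkage merge, in merge order.

    Quadratic baseline: sort all pairwise distances and add an edge whenever
    it joins two different components (Kruskal's MST algorithm).
    """
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    merges = []
    for dist, a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            merges.append((dist, a, b))
    return merges


# Toy data: four points on a line with gaps 1, 2 and 4.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
merges = single_linkage_merges(points)
heights = [dist for dist, _, _ in merges]
```

On the toy data, `heights` holds the three gap lengths in increasing order, one per MST edge.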
Spatio-Temporal Surrogates for Interaction of a Jet with High Explosives: Part II -- Clustering Extremely High-Dimensional Grid-Based Data
Building an accurate surrogate model for the spatio-temporal outputs of a
computer simulation is a challenging task. A simple approach to improve the
accuracy of the surrogate is to cluster the outputs based on similarity and
build a separate surrogate model for each cluster. This clustering is
relatively straightforward when the output at each time step is of moderate
size. However, when the spatial domain is represented by a large number of grid
points, numbering in the millions, the clustering of the data becomes more
challenging. In this report, we consider output data from simulations of a jet
interacting with high explosives. These data are available on spatial domains
of different sizes, at grid points that vary in their spatial coordinates, and
in a format that distributes the output across multiple files at each time step
of the simulation. We first describe how we bring these data into a consistent
format prior to clustering. Borrowing the idea of random projections from data
mining, we reduce the dimension of our data by a factor of a thousand, making it
possible to use the iterative k-means method for clustering. We show how we can
use the randomness of both the random projections and the choice of initial
centroids in k-means clustering, to determine the number of clusters in our
data set. Our approach makes clustering of extremely high dimensional data
tractable, generating meaningful cluster assignments for our problem, despite
the approximation introduced by the random projections.
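The general idea of projecting first and clustering afterwards can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's pipeline: the Gaussian projection matrix, the toy dimensions (100 reduced to 20), the synthetic two-group data, and the farthest-first choice of the remaining centroids after a random first one are all choices made for this example.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """d x k matrix with i.i.d. N(0, 1/k) entries (Gaussian random projection)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(points, matrix):
    """Map each d-dimensional point to k dimensions via point @ matrix."""
    k = len(matrix[0])
    return [
        [sum(x * row[j] for x, row in zip(p, matrix)) for j in range(k)]
        for p in points
    ]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=50):
    """Lloyd's iterations; first centroid random, the rest farthest-first."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


rng = random.Random(0)
d, k_dims, n_per = 100, 20, 15
# Synthetic data: two well-separated groups in d dimensions.
group_a = [[rng.gauss(0.0, 0.5) for _ in range(d)] for _ in range(n_per)]
group_b = [[rng.gauss(10.0, 0.5) for _ in range(d)] for _ in range(n_per)]
data = group_a + group_b

low = project(data, random_projection_matrix(d, k_dims, rng))
labels = kmeans(low, 2, rng)
```

Repeating the last two lines with different seeds gives different projections and different initial centroids, which is the kind of randomness the abstract refers to; how the report turns that variation into an estimate of the number of clusters is not specified here.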
Efficient Clustering on Riemannian Manifolds: A Kernelised Random Projection Approach
Reformulating computer vision problems over Riemannian manifolds has
demonstrated superior performance in various computer vision applications. This
is because visual data often forms a special structure lying on a lower
dimensional space embedded in a higher dimensional space. However, since these
manifolds belong to non-Euclidean topological spaces, exploiting their
structures is computationally expensive, especially when one considers the
clustering analysis of massive amounts of data. To this end, we propose an
efficient framework to address the clustering problem on Riemannian manifolds.
This framework implements random projections for manifold points via kernel
space, which can preserve the geometric structure of the original space while
remaining computationally efficient. Here, we introduce three methods that follow our
framework. We then validate our framework on several computer vision
applications by comparing against popular clustering methods on Riemannian
manifolds. Experimental results demonstrate that our framework maintains the
performance of the clustering whilst massively reducing computational
complexity by over two orders of magnitude in some cases.
Hidden Variables in Bipartite Networks
We introduce and study random bipartite networks with hidden variables. Nodes
in these networks are characterized by hidden variables which control the
appearance of links between node pairs. We derive analytic expressions for the
degree distribution, degree correlations, the distribution of the number of
common neighbors, and the bipartite clustering coefficient in these networks.
We also establish the relationship between degrees of nodes in the original
bipartite networks and in their unipartite projections. We further demonstrate
how the hidden-variable formalism can be applied to analyze topological properties
of networks in certain bipartite network models, and verify our analytical
results in numerical simulations.
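A small simulation can make the setup concrete. The sketch below is an assumption-laden illustration, not the paper's model: the factorized connection probability `kappa * lam / sum(lambdas)`, the uniform hidden-variable distributions, and all function names are hypothetical choices. It builds a random bipartite network, forms the unipartite projection onto the top nodes (two top nodes are linked if they share a bottom neighbor), and computes a simple bound relating the two kinds of degree.

```python
import random


def bipartite_hidden_variable_graph(kappas, lambdas, rng):
    """Link top node i and bottom node j with probability
    min(1, kappa_i * lambda_j / sum(lambdas)) (assumed form)."""
    norm = sum(lambdas)
    edges = set()
    for i, kappa in enumerate(kappas):
        for j, lam in enumerate(lambdas):
            if rng.random() < min(1.0, kappa * lam / norm):
                edges.add((i, j))
    return edges


def top_projection(edges, n_top):
    """Unipartite projection: neighbor sets of top nodes sharing a bottom node."""
    members = {}
    for i, j in edges:
        members.setdefault(j, set()).add(i)
    neighbors = [set() for _ in range(n_top)]
    for group in members.values():
        for i in group:
            neighbors[i] |= group - {i}
    return neighbors


def projection_degree_bound(edges, n_top):
    """Upper bound on each top node's projected degree: sum over its
    bottom neighbors of (bottom degree - 1)."""
    bottom_degree = {}
    for _, j in edges:
        bottom_degree[j] = bottom_degree.get(j, 0) + 1
    bound = [0] * n_top
    for i, j in edges:
        bound[i] += bottom_degree[j] - 1
    return bound


rng = random.Random(1)
kappas = [rng.uniform(1.0, 5.0) for _ in range(30)]   # top hidden variables
lambdas = [rng.uniform(1.0, 3.0) for _ in range(60)]  # bottom hidden variables

edges = bipartite_hidden_variable_graph(kappas, lambdas, rng)
neighbors = top_projection(edges, len(kappas))
bound = projection_degree_bound(edges, len(kappas))
```

The bound is attained only when a node's bottom neighbors share no other top nodes; overlaps between them, which the common-neighbor distribution describes, make the projected degree smaller.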