Spatial Random Sampling: A Structure-Preserving Data Sketching Tool
Random column sampling is not guaranteed to yield data sketches that preserve
the underlying structures of the data and may not sample sufficiently from
less-populated data clusters. Also, adaptive sampling can often provide
accurate low rank approximations, yet may fall short of producing descriptive
data sketches, especially when the cluster centers are linearly dependent.
Motivated by that, this paper introduces a novel randomized column sampling
tool dubbed Spatial Random Sampling (SRS), in which data points are sampled
based on their proximity to randomly sampled points on the unit sphere. The
most compelling feature of SRS is that the corresponding probability of
sampling from a given data cluster is proportional to the surface area the
cluster occupies on the unit sphere, independent of the size of the cluster
population. Although it is fully randomized, SRS is shown to provide
descriptive and balanced data representations. The proposed idea addresses a
pressing need in data science and holds potential to inspire many novel
approaches for analysis of big data.
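The sampling rule described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `spatial_random_sampling`, the column-wise data layout, and the use of the largest inner product as the proximity measure are all assumptions made here (on the unit sphere, the largest inner product coincides with the smallest Euclidean distance).

```python
import numpy as np

def spatial_random_sampling(X, n_samples, seed=0):
    """Hypothetical SRS sketch. X is a (d, n) matrix whose columns are data points.

    Returns one column index per random direction (indices may repeat).
    """
    rng = np.random.default_rng(seed)
    # Project the data onto the unit sphere by normalizing each column.
    norms = np.linalg.norm(X, axis=0)
    U = X / np.where(norms > 0, norms, 1.0)
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    D = rng.standard_normal((X.shape[0], n_samples))
    D /= np.linalg.norm(D, axis=0)
    # For each random direction, keep the data point closest to it on the sphere.
    return np.argmax(D.T @ U, axis=1)
```

Because each random direction selects whichever point lies nearest to it, a cluster is hit in proportion to the region of the sphere it covers rather than to how many points it contains.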
Fast Color Quantization Using Weighted Sort-Means Clustering
Color quantization is an important operation with numerous applications in
graphics and image processing. Most quantization methods are essentially based
on data clustering algorithms. However, despite its popularity as a general
purpose clustering algorithm, k-means has not received much respect in the
color quantization literature because of its high computational requirements
and sensitivity to initialization. In this paper, a fast color quantization
method based on k-means is presented. The method involves several modifications
to the conventional (batch) k-means algorithm including data reduction, sample
weighting, and the use of triangle inequality to speed up the nearest neighbor
search. Experiments on a diverse set of images demonstrate that, with the
proposed modifications, k-means becomes very competitive with state-of-the-art
color quantization methods in terms of both effectiveness and efficiency.
Comment: 30 pages, 2 figures, 4 table
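Two of the modifications named in this abstract, data reduction and sample weighting, can be illustrated with a short NumPy sketch. It is an assumption-laden example rather than the authors' method: the function name `weighted_kmeans_quantize`, the random initialization, and the fixed iteration count are hypothetical, and the triangle-inequality speed-up is omitted for brevity.

```python
import numpy as np

def weighted_kmeans_quantize(pixels, k, n_iter=10, seed=0):
    """Hypothetical sketch. pixels is an (n, 3) RGB array.

    Returns (palette, labels), where labels[i] indexes the palette color of pixel i.
    """
    rng = np.random.default_rng(seed)
    # Data reduction: collapse duplicate colors and keep their counts as weights.
    colors, inverse, counts = np.unique(
        pixels, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()
    colors = colors.astype(float)
    k = min(k, len(colors))
    centers = colors[rng.choice(len(colors), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every unique color to its nearest center.
        d2 = ((colors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Sample weighting: each center is the count-weighted mean of its colors.
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(colors[mask], axis=0, weights=counts[mask])
    d2 = ((colors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)[inverse]
```

Clustering the unique colors with weights gives the same centers as clustering every pixel, while the distance computations scale with the number of distinct colors instead of the number of pixels.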
An Approximation Ratio for Biclustering
The problem of biclustering consists of the simultaneous clustering of rows
and columns of a matrix such that each of the submatrices induced by a pair of
row and column clusters is as uniform as possible. In this paper we approximate
the optimal biclustering by applying one-way clustering algorithms
independently on the rows and on the columns of the input matrix. We show that
such a solution yields a worst-case approximation ratio of 1+sqrt(2) under
L1-norm for 0-1 valued matrices, and of 2 under L2-norm for real valued
matrices.
Comment: 9 pages, 2 figures; presentation clarified, replaced to match the
version to be published in IP
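The scheme analyzed in this abstract, clustering rows and columns independently and combining the two partitions, can be sketched as follows. The one-way clustering routine here is a small Lloyd-style k-means with farthest-point initialization, chosen purely as a stand-in; the abstract does not say which one-way algorithm is used, and the names `kmeans_labels` and `bicluster` are hypothetical. The cost is the squared (L2) deviation of each induced block from its mean.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=20):
    """Stand-in one-way clustering: Lloyd iterations with farthest-point init."""
    points = np.asarray(points, dtype=float)
    centers = [points[0]]
    for _ in range(1, k):
        # Next center: the point farthest from all centers chosen so far.
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def bicluster(A, k_rows, k_cols):
    """Cluster rows and columns independently, then score the induced blocks."""
    A = np.asarray(A, dtype=float)
    row_labels = kmeans_labels(A, k_rows)
    col_labels = kmeans_labels(A.T, k_cols)
    cost = 0.0
    for r in range(k_rows):
        for c in range(k_cols):
            block = A[np.ix_(row_labels == r, col_labels == c)]
            if block.size:
                # L2 cost: squared deviation of the block from its mean.
                cost += ((block - block.mean()) ** 2).sum()
    return row_labels, col_labels, cost
```

On a matrix made of perfectly uniform blocks, the two independent clusterings recover the block structure and the cost is zero.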
Network Kriging
Network service providers and customers are often concerned with aggregate
performance measures that span multiple network paths. Unfortunately, forming
such network-wide measures can be difficult, due to the issues of scale
involved. In particular, the number of paths grows too rapidly with the number
of endpoints to make exhaustive measurement practical. As a result, it is of
interest to explore the feasibility of methods that dramatically reduce the
number of paths measured in such situations while maintaining acceptable
accuracy.
We cast the problem as one of statistical prediction--in the spirit of the
so-called 'kriging' problem in spatial statistics--and show that end-to-end
network properties may be accurately predicted in many cases using a
surprisingly small set of carefully chosen paths. More precisely, we formulate
a general framework for the prediction problem, propose a class of linear
predictors for standard quantities of interest (e.g., averages, totals,
differences) and show that linear algebraic methods of subset selection may be
used to effectively choose which paths to measure. We characterize the
performance of the resulting methods, both analytically and numerically. The
success of our methods derives from the low effective rank of routing matrices
as encountered in practice, which appears to be a new observation in its own
right with potentially broad implications on network measurement generally.
Comment: 16 pages, 9 figures, single-space
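The idea of measuring only a few paths and predicting the rest can be illustrated under strong simplifying assumptions. In the sketch below, path measurements are modeled as `y = G @ x` for a routing matrix `G` (paths by links) and link-level values `x`; paths are chosen by a greedy pivoting rule equivalent to QR with column pivoting on `G.T`, and prediction uses plain least squares. The function names and the least-squares predictor are assumptions of this example and stand in for the class of linear predictors the abstract refers to.

```python
import numpy as np

def select_paths(G, k, tol=1e-12):
    """Greedy subset selection: repeatedly pick the row with the largest residual norm."""
    R = np.asarray(G, dtype=float).copy()
    chosen = []
    for _ in range(k):
        norms = (R ** 2).sum(axis=1)
        i = int(norms.argmax())
        if norms[i] <= tol:
            break  # remaining rows are (numerically) spanned by the chosen ones
        chosen.append(i)
        q = R[i] / np.sqrt(norms[i])
        # Remove the component along the chosen row from every row.
        R -= np.outer(R @ q, q)
    return chosen

def predict_all_paths(G, chosen, y_measured):
    """Least-squares stand-in predictor: fit link values on measured paths only."""
    x_hat, *_ = np.linalg.lstsq(G[chosen], y_measured, rcond=None)
    return G @ x_hat
```

When the routing matrix has low rank and the chosen rows span its row space, the predictions for unmeasured paths are exact in this noise-free model; a network-wide average would then be `predict_all_paths(G, chosen, y_measured).mean()`.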