99 research outputs found

    Clustering Large Graphs via the Singular Value Decomposition

    Full text link

    Spatial Random Sampling: A Structure-Preserving Data Sketching Tool

    Full text link
    Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from less-populated data clusters. Also, adaptive sampling can often provide accurate low rank approximations, yet may fall short of producing descriptive data sketches, especially when the cluster centers are linearly dependent. Motivated by that, this paper introduces a novel randomized column sampling tool dubbed Spatial Random Sampling (SRS), in which data points are sampled based on their proximity to randomly sampled points on the unit sphere. The most compelling feature of SRS is that the corresponding probability of sampling from a given data cluster is proportional to the surface area the cluster occupies on the unit sphere, independently from the size of the cluster population. Although it is fully randomized, SRS is shown to provide descriptive and balanced data representations. The proposed idea addresses a pressing need in data science and holds potential to inspire many novel approaches for analysis of big data

    Fast Color Quantization Using Weighted Sort-Means Clustering

    Full text link
    Color quantization is an important operation with numerous applications in graphics and image processing. Most quantization methods are essentially based on data clustering algorithms. However, despite its popularity as a general purpose clustering algorithm, k-means has not received much respect in the color quantization literature because of its high computational requirements and sensitivity to initialization. In this paper, a fast color quantization method based on k-means is presented. The method involves several modifications to the conventional (batch) k-means algorithm including data reduction, sample weighting, and the use of triangle inequality to speed up the nearest neighbor search. Experiments on a diverse set of images demonstrate that, with the proposed modifications, k-means becomes very competitive with state-of-the-art color quantization methods in terms of both effectiveness and efficiency.Comment: 30 pages, 2 figures, 4 table

    An Approximation Ratio for Biclustering

    Full text link
    The problem of biclustering consists of the simultaneous clustering of rows and columns of a matrix such that each of the submatrices induced by a pair of row and column clusters is as uniform as possible. In this paper we approximate the optimal biclustering by applying one-way clustering algorithms independently on the rows and on the columns of the input matrix. We show that such a solution yields a worst-case approximation ratio of 1+sqrt(2) under L1-norm for 0-1 valued matrices, and of 2 under L2-norm for real valued matrices.Comment: 9 pages, 2 figures; presentation clarified, replaced to match the version to be published in IP

    Network Kriging

    Full text link
    Network service providers and customers are often concerned with aggregate performance measures that span multiple network paths. Unfortunately, forming such network-wide measures can be difficult, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the number of endpoints to make exhaustive measurement practical. As a result, it is of interest to explore the feasibility of methods that dramatically reduce the number of paths measured in such situations while maintaining acceptable accuracy. We cast the problem as one of statistical prediction--in the spirit of the so-called `kriging' problem in spatial statistics--and show that end-to-end network properties may be accurately predicted in many cases using a surprisingly small set of carefully chosen paths. More precisely, we formulate a general framework for the prediction problem, propose a class of linear predictors for standard quantities of interest (e.g., averages, totals, differences) and show that linear algebraic methods of subset selection may be used to effectively choose which paths to measure. We characterize the performance of the resulting methods, both analytically and numerically. The success of our methods derives from the low effective rank of routing matrices as encountered in practice, which appears to be a new observation in its own right with potentially broad implications on network measurement generally.Comment: 16 pages, 9 figures, single-space
    corecore