    Scalable k-Means Clustering via Lightweight Coresets

    Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally allow only multiplicative approximation errors, we propose a novel notion of lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-means clustering as well as soft and hard Bregman clustering. The algorithm is substantially faster than existing constructions and embarrassingly parallel, and the resulting coresets are smaller. We further show that the proposed approach naturally generalizes to statistical k-means clustering and that, compared to existing results, it can be used to compute smaller summaries for empirical risk minimization. In extensive experiments, we demonstrate that the proposed algorithm outperforms existing data summarization strategies in practice. Comment: To appear in the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD
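The abstract does not spell out the construction, so the following is only a minimal sketch of one importance-sampling scheme in the spirit of lightweight coresets. It assumes a sampling distribution that mixes a uniform term with a term proportional to each point's squared distance from the data mean, and it assumes weights equal to the inverse of the expected sampling frequency. The function name `lightweight_coreset` and its parameters are hypothetical, not taken from the listing.

```python
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Sample m weighted points from X (shape n x d).

    Assumed scheme: q(x) = 1/2 * 1/n + 1/2 * d(x, mean)^2 / sum of d^2,
    with weight 1 / (m * q(x)) for each sampled point.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    mean = X.mean(axis=0)
    sq_dist = ((X - mean) ** 2).sum(axis=1)
    total = sq_dist.sum()
    if total > 0:
        q = 0.5 / n + 0.5 * sq_dist / total
    else:
        # All points coincide: fall back to uniform sampling.
        q = np.full(n, 1.0 / n)
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

The sampled points and weights could then be passed to any weighted k-means solver. Computing `q` needs only two passes over the data (one for the mean, one for the distances), which is consistent with the abstract's claim that the construction parallelizes easily.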

    Motif counting beyond five nodes

    Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural algorithms based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that such algorithms are outperformed by color coding (CC) [2], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC; furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price, however. While MC is very efficient in terms of space, CC's memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that CC can push the limits of the state of the art, both in terms of the size of the input graph and of that of the graphlets.
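To illustrate the basic color-coding idea in its simplest setting, here is a rough sketch that estimates the number of simple paths on k nodes. It is not the graphlet-sampling extension described in the abstract. It assumes the graph is given as an adjacency list of integer node ids, and the function names are hypothetical. Each node receives one of k random colors, a dynamic program over color sets counts the paths whose nodes all have distinct colors, and the count is rescaled by the probability k!/k^k that a fixed path is colorful.

```python
import random
from math import factorial

def count_colorful_paths(adj, colors, k):
    """Count undirected simple paths on k nodes whose colors are all distinct."""
    full = (1 << k) - 1
    # table[v] maps a color-set bitmask to the number of colorful paths ending at v.
    table = [{1 << colors[v]: 1} for v in range(len(adj))]
    for _ in range(k - 1):
        new = [dict() for _ in adj]
        for v, nbrs in enumerate(adj):
            bit = 1 << colors[v]
            for u in nbrs:
                for mask, cnt in table[u].items():
                    if not mask & bit:  # extend only if v's color is unused
                        new[v][mask | bit] = new[v].get(mask | bit, 0) + cnt
        table = new
    directed = sum(t.get(full, 0) for t in table)
    # Each undirected path is found once from each endpoint.
    return directed // 2 if k > 1 else directed

def estimate_k_paths(adj, k, trials=200, seed=0):
    """Average colorful counts over random colorings, then rescale."""
    rng = random.Random(seed)
    p_colorful = factorial(k) / k ** k
    total = 0
    for _ in range(trials):
        colors = [rng.randrange(k) for _ in adj]
        total += count_colorful_paths(adj, colors, k)
    return total / (trials * p_colorful)
```

The table holds up to 2^k color sets per node, which gives a feel for the memory cost the abstract mentions as the graph and the pattern size grow.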

    The Spectrum of Random Inner-product Kernel Matrices

    We consider n-by-n matrices whose (i, j)-th entry is f(X_i^T X_j), where X_1, ..., X_n are i.i.d. standard Gaussian random vectors in R^p, and f is a real-valued function. The eigenvalue distribution of these random kernel matrices is studied in the "large p, large n" regime. It is shown that, when p and n go to infinity with p/n = \gamma held constant, and f is properly scaled so that Var(f(X_i^T X_j)) is O(p^{-1}), the spectral density converges weakly to a limiting density on R. The limiting density is dictated by a cubic equation involving its Stieltjes transform. While for smooth kernel functions the limiting spectral density has been previously shown to be the Marcenko-Pastur distribution, our analysis is applicable to non-smooth kernel functions, resulting in a new family of limiting densities.
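A small numerical experiment can make the setup concrete. The sketch below builds one such kernel matrix and returns its eigenvalues, for example to plot as a histogram. The normalization is an assumption made for illustration: the kernel is applied to X_i^T X_j / sqrt(p) and the result is divided by sqrt(p), so that off-diagonal entries have variance of order 1/p, and the diagonal is set to zero. The function name `kernel_matrix_spectrum` is hypothetical.

```python
import numpy as np

def kernel_matrix_spectrum(n, p, f, seed=0):
    """Eigenvalues of an n-by-n random inner-product kernel matrix.

    Assumed scaling: entry (i, j) is f(X_i^T X_j / sqrt(p)) / sqrt(p)
    for i != j, with zeros on the diagonal.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # rows are i.i.d. standard Gaussian vectors
    G = X @ X.T                       # Gram matrix of inner products
    A = f(G / np.sqrt(p)) / np.sqrt(p)
    np.fill_diagonal(A, 0.0)
    return np.linalg.eigvalsh(A)      # real eigenvalues in ascending order

# Example with a non-smooth kernel: the sign function.
eigs = kernel_matrix_spectrum(n=400, p=200, f=np.sign)
```

Comparing histograms of `eigs` for a smooth choice such as `np.tanh` and a non-smooth one such as `np.sign` is one way to see the different limiting shapes the abstract refers to.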

    Consumption Risk-sharing in Social Networks

    We build a model of informal risk-sharing among agents organized in a social network. A connection between individuals serves as collateral that can be used to enforce insurance payments. We characterize incentive-compatible risk-sharing arrangements for any network structure, and develop two main results. (1) Expansive networks, where every group of agents has a large number of links with the rest of the community relative to the size of the group, facilitate better risk-sharing. In particular, “two-dimensional” village networks organized by geography are sufficiently expansive to allow very good risk-sharing. (2) In second-best arrangements, agents organize in endogenous “risk-sharing islands” in the network, where shocks are shared fully within but imperfectly across islands. As a result, risk-sharing in second-best arrangements is local: socially closer agents insure each other more. In an application of the model, we explore the spillover effect of development aid on the consumption of non-treated individuals.