
    Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

    The kernel k-means is an effective method for data clustering which extends the commonly-used k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is however computationally expensive as it requires the complete kernel matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we define a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.
    Comment: Appears in Proceedings of the SIAM International Conference on Data Mining (SDM), 2014
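
    The paper's own embedding constructions are not reproduced here, but a minimal single-machine sketch of the underlying idea follows: map the data through a low-dimensional, kernel-based feature map so that ordinary k-means (which parallelizes straightforwardly, e.g. on MapReduce) can stand in for kernel k-means. A Nystrom feature map serves as an illustrative stand-in embedding; the RBF kernel, the landmark count m, and all parameter values are assumptions, not the authors' construction.

    import numpy as np
    from sklearn.cluster import KMeans

    def rbf_kernel(A, B, gamma=1.0):
        # Pairwise RBF kernel matrix between rows of A and rows of B.
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def nystrom_embed(X, m=100, gamma=1.0, seed=0):
        # Rank-m Nystrom feature map: Phi = K(X, L) @ W^(-1/2) with W = K(L, L),
        # where L holds m landmark rows sampled from X.
        rng = np.random.default_rng(seed)
        L = X[rng.choice(len(X), size=m, replace=False)]
        vals, vecs = np.linalg.eigh(rbf_kernel(L, L, gamma))
        vals = np.maximum(vals, 1e-12)                    # numerical safety
        W_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T
        return rbf_kernel(X, L, gamma) @ W_inv_sqrt       # n x m embedding

    X = np.random.randn(5000, 20)
    Phi = nystrom_embed(X, m=100, gamma=0.1)
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(Phi)

    Because the embedded points live in a low-dimensional space, each round of Lloyd's algorithm only needs to broadcast k small centroids, which is what makes a unified MapReduce parallelization feasible.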

    Critical Networks Exhibit Maximal Information Diversity in Structure-Dynamics Relationships

    Network structure strongly constrains the range of dynamic behaviors available to a complex system. These system dynamics can be classified based on their response to perturbations over time into two distinct regimes, ordered or chaotic, separated by a critical phase transition. Numerous studies have shown that the most complex dynamics arise near the critical regime. Here we use an information theoretic approach to study structure-dynamics relationships within a unified framework and show that these relationships are most diverse in the critical regime.
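
    As a hedged illustration of the ordered/chaotic distinction, the sketch below classifies a random Boolean network, a standard model in this literature, by whether a one-bit perturbation dies out or spreads; the paper's information-theoretic measures are not reproduced, and the model choice is an assumption. For unbiased random update functions, the critical transition sits near average in-degree K = 2.

    import numpy as np

    def random_boolean_network(n, k, rng):
        # Each node reads k random inputs through a random Boolean function
        # stored as a truth table of length 2**k.
        inputs = np.array([rng.choice(n, size=k, replace=False) for _ in range(n)])
        tables = rng.integers(0, 2, size=(n, 2**k))
        return inputs, tables

    def step(state, inputs, tables):
        # Synchronous update: index each node's truth table by its input bits.
        idx = (state[inputs] * 2 ** np.arange(inputs.shape[1])).sum(axis=1)
        return tables[np.arange(len(state)), idx]

    def damage(n=200, k=2, steps=50, seed=0):
        # Normalized Hamming distance between a trajectory and a copy started
        # with one flipped bit: ~0 suggests ordered, growth suggests chaotic.
        rng = np.random.default_rng(seed)
        inputs, tables = random_boolean_network(n, k, rng)
        a = rng.integers(0, 2, size=n)
        b = a.copy(); b[0] ^= 1
        for _ in range(steps):
            a, b = step(a, inputs, tables), step(b, inputs, tables)
        return (a != b).mean()

    for k in (1, 2, 4):   # typically ordered, near-critical, chaotic
        print(k, damage(k=k))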

    Training Gaussian Mixture Models at Scale via Coresets

    How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world datasets suggests that our coreset-based approach enables significant reduction in training time with negligible approximation error.
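
    A hedged sketch of the coreset idea follows: importance-sample points with probability proportional to a cheap sensitivity proxy (here, squared distance to a k-means++ seed set, mixed with a uniform term so every point can be drawn), and weight each sample by the inverse of its expected selection count. The paper's sensitivity bounds and weighted-EM training are not reproduced; the closing weighted bootstrap is a crude stand-in for an EM variant that accepts per-point weights.

    import numpy as np
    from sklearn.cluster import kmeans_plusplus
    from sklearn.mixture import GaussianMixture

    def gmm_coreset(X, k, m, seed=0):
        # Importance-sampling coreset: m weighted points drawn from X.
        rng = np.random.default_rng(seed)
        centers, _ = kmeans_plusplus(X, n_clusters=k, random_state=seed)
        d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1).min(1)
        p = 0.5 * d2 / d2.sum() + 0.5 / len(X)   # sums to 1
        idx = rng.choice(len(X), size=m, replace=True, p=p)
        return X[idx], 1.0 / (m * p[idx])        # points and importance weights

    X = np.vstack([np.random.randn(3000, 5) + mu for mu in (0.0, 4.0, 8.0)])
    C, w = gmm_coreset(X, k=3, m=300)
    # Stand-in for weighted EM: resample the coreset according to its weights.
    rng = np.random.default_rng(1)
    boot = C[rng.choice(len(C), size=2000, p=w / w.sum())]
    gmm = GaussianMixture(n_components=3, random_state=0).fit(boot)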