17,373 research outputs found

    Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

    Full text link
    The kernel kk-means is an effective method for data clustering which extends the commonly-used kk-means algorithm to work on a similarity matrix over complex data structures. The kernel kk-means algorithm is however computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel kk-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we are defining a family of kernel-based low-dimensional embeddings that allows for scaling kernel kk-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel kk-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.Comment: Appears in Proceedings of the SIAM International Conference on Data Mining (SDM), 201

    Efficient Semidefinite Spectral Clustering via Lagrange Duality

    Full text link
    We propose an efficient approach to semidefinite spectral clustering (SSC), which addresses the Frobenius normalization with the positive semidefinite (p.s.d.) constraint for spectral clustering. Compared with the original Frobenius norm approximation based algorithm, the proposed algorithm can more accurately find the closest doubly stochastic approximation to the affinity matrix by considering the p.s.d. constraint. In this paper, SSC is formulated as a semidefinite programming (SDP) problem. In order to solve the high computational complexity of SDP, we present a dual algorithm based on the Lagrange dual formalization. Two versions of the proposed algorithm are proffered: one with less memory usage and the other with faster convergence rate. The proposed algorithm has much lower time complexity than that of the standard interior-point based SDP solvers. Experimental results on both UCI data sets and real-world image data sets demonstrate that 1) compared with the state-of-the-art spectral clustering methods, the proposed algorithm achieves better clustering performance; and 2) our algorithm is much more efficient and can solve larger-scale SSC problems than those standard interior-point SDP solvers.Comment: 13 page

    Deep Divergence-Based Approach to Clustering

    Get PDF
    A promising direction in deep learning research consists in learning representations and simultaneously discovering cluster structure in unlabeled data by optimizing a discriminative loss function. As opposed to supervised deep learning, this line of research is in its infancy, and how to design and optimize suitable loss functions to train deep neural networks for clustering is still an open question. Our contribution to this emerging field is a new deep clustering network that leverages the discriminative power of information-theoretic divergence measures, which have been shown to be effective in traditional clustering. We propose a novel loss function that incorporates geometric regularization constraints, thus avoiding degenerate structures of the resulting clustering partition. Experiments on synthetic benchmarks and real datasets show that the proposed network achieves competitive performance with respect to other state-of-the-art methods, scales well to large datasets, and does not require pre-training steps

    Kernel Spectral Clustering and applications

    Full text link
    In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and L0L_0, L1,L0+L1L_1, L_0 + L_1, Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering and big data learning are considered.Comment: chapter contribution to the book "Unsupervised Learning Algorithms

    Clustering via kernel decomposition

    Get PDF
    Spectral clustering methods were proposed recently which rely on the eigenvalue decomposition of an affinity matrix. In this letter, the affinity matrix is created from the elements of a nonparametric density estimator and then decomposed to obtain posterior probabilities of class membership. Hyperparameters are selected using standard cross-validation methods
    • …
    corecore