45 research outputs found

    Coresets for Fuzzy K-Means with Applications

    The fuzzy K-means problem is a popular generalization of the well-known K-means problem to soft clusterings. We present the first coresets for fuzzy K-means with size linear in the dimension, polynomial in the number of clusters, and poly-logarithmic in the number of points. We show that these coresets can be employed in the computation of a (1+ε)-approximation for fuzzy K-means, improving previously presented results. We further show that our coresets can be maintained in an insertion-only streaming setting, where data points arrive one by one.
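    To make the objective concrete: fuzzy K-means scores a set of centers by summing, over all points, the soft-membership-weighted squared distances to every center, and a coreset is a small weighted point set on which this cost can be evaluated in place of the full data. The sketch below (Python with NumPy) computes this weighted cost using the standard closed-form memberships for a fuzzifier m; it only illustrates the objective being approximated, not the coreset construction of the paper.

        import numpy as np

        def fuzzy_kmeans_cost(points, weights, centers, m=2.0, eps=1e-12):
            """Weighted fuzzy K-means cost: sum_x w(x) * sum_j u_xj^m * ||x - c_j||^2,
            with memberships u_xj set to their closed-form optimum for fixed centers."""
            # squared distances, shape (n_points, k)
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
            # optimal memberships: u_xj proportional to d2_xj^(-1/(m-1)), normalized per point
            inv = d2 ** (-1.0 / (m - 1.0))
            u = inv / inv.sum(axis=1, keepdims=True)
            return float((weights[:, None] * (u ** m) * d2).sum())

    A (1+ε)-coreset guarantees that, for every choice of centers, the cost computed this way on the weighted coreset is within a (1±ε) factor of the cost on the full, unit-weight data set.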

    Coresets for Time Series Clustering

    We study the problem of constructing coresets for clustering problems with time series data. This problem has gained importance across many fields including biology, medicine, and economics due to the proliferation of sensors for real-time measurement and rapid drop in storage costs. In particular, we consider the setting where the time series data on N entities is generated from a Gaussian mixture model with autocorrelations over k clusters in ℝ^d. Our main contribution is an algorithm to construct coresets for the maximum likelihood objective for this mixture model. Our algorithm is efficient, and, under a mild assumption on the covariance matrices of the Gaussians, the size of the coreset is independent of the number of entities N and the number of observations for each entity, and depends only polynomially on k, d and 1/ε, where ε is the error parameter. We empirically assess the performance of our coresets with synthetic data.
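    The coreset here is again a weighted subset on which the maximum likelihood objective is evaluated in place of the full data. As a minimal illustration, the sketch below computes a weighted Gaussian-mixture negative log-likelihood with SciPy; it ignores the autocorrelation structure and the entity/observation hierarchy of the model in the paper, and is only meant to show how coreset weights enter the objective.

        import numpy as np
        from scipy.stats import multivariate_normal

        def weighted_gmm_nll(points, weights, means, covs, mix):
            """Weighted negative log-likelihood of a Gaussian mixture:
            -sum_i w_i * log( sum_j pi_j * N(x_i | mu_j, Sigma_j) )."""
            k = len(mix)
            # per-component densities, shape (n_points, k)
            dens = np.column_stack([
                multivariate_normal.pdf(points, mean=means[j], cov=covs[j])
                for j in range(k)
            ])
            return float(-(weights * np.log(dens @ np.asarray(mix) + 1e-300)).sum())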

    Coresets for minimum enclosing balls over sliding windows

    Coresets are important tools to generate concise summaries of massive datasets for approximate analysis. A coreset is a small subset of points extracted from the original point set such that certain geometric properties are preserved with provable guarantees. This paper investigates the problem of maintaining a coreset to preserve the minimum enclosing ball (MEB) for a sliding window of points that are continuously updated in a data stream. Although the problem has been extensively studied in batch and append-only streaming settings, no efficient sliding-window solution is available yet. In this work, we first introduce an algorithm, called AOMEB, to build a coreset for MEB in an append-only stream. AOMEB improves the practical performance of the state-of-the-art algorithm while having the same approximation ratio. Furthermore, using AOMEB as a building block, we propose two novel algorithms, namely SWMEB and SWMEB+, to maintain coresets for MEB over the sliding window with constant approximation ratios. The proposed algorithms also support coresets for MEB in a reproducing kernel Hilbert space (RKHS). Finally, extensive experiments on real-world and synthetic datasets demonstrate that SWMEB and SWMEB+ achieve speedups of up to four orders of magnitude over the state-of-the-art batch algorithm while providing coresets for MEB with rather small errors compared to the optimal ones. (Comment: 28 pages, 10 figures; to appear in the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '19.)
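    For intuition on why MEB admits small coresets: in the batch setting, the classical Bădoiu–Clarkson construction repeatedly moves the current center a shrinking step toward the farthest remaining point, and the O(1/ε²) selected farthest points form a coreset whose MEB is a (1+ε)-approximation. The sketch below implements that classical batch scheme (not AOMEB, SWMEB, or SWMEB+ from the paper), with the iteration count taken from the standard 1/ε² analysis up to constants.

        import numpy as np

        def meb_coreset(points, eps=0.1):
            """Badoiu-Clarkson style coreset for the minimum enclosing ball.
            Returns (coreset indices, approximate center)."""
            c = points[0].copy()
            coreset = [0]
            for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
                far = int(np.argmax(((points - c) ** 2).sum(axis=1)))  # farthest point from c
                coreset.append(far)
                c = c + (points[far] - c) / (i + 1.0)  # shrinking step toward the farthest point
            return sorted(set(coreset)), c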

    Coresets for Clustering: Foundations and Challenges

    Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering in big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that turn big data into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we aim to solve a fundamental question and two modern challenges in coresets for clustering.
    Beyond Euclidean Space: Coresets for clustering in Euclidean space have been well studied, and coresets of constant size are known to exist, but very few results are known beyond Euclidean space. A fundamental question is which metric spaces admit constant-sized coresets for clustering. We focus on graph metrics, a common ambient space for clustering, and provide positive results asserting that constant-sized coresets exist in various families of graph metrics, including graphs of bounded treewidth, planar graphs, and the more general excluded-minor graphs.
    Missing Value: Missing values are a common phenomenon in real data sets, and clustering in their presence is a very challenging task. In this work, we construct the first coresets for clustering with multiple missing values; previously, such coresets were only known to exist when each data point has at most one missing value [DBLP:conf/nips/MaromF19]. We further design a near-linear time algorithm to construct our coresets. This algorithm implies the first near-linear time approximation scheme for k-means clustering with missing values and improves a recent result of [DBLP:conf/soda/EibenFGLPS21].
    Simultaneous Coresets: Most classical coresets are limited to a specific clustering objective. When there are multiple potential objectives, a stronger notion of "simultaneous coresets" is needed. Simultaneous coresets provide approximations for a family of objectives and can serve as a more flexible data reduction tool. In this work, we design the first simultaneous coresets for a large clustering family that includes both k-median and k-center.
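    Many coreset constructions of this kind instantiate the sensitivity (importance) sampling framework: obtain a rough clustering, use it to upper-bound each point's maximum relative contribution to the cost, sample points proportionally to these bounds, and reweight by the inverse sampling probability. The sketch below shows this generic framework for k-means (using scikit-learn for the rough solution); it is an illustration of the framework under standard sensitivity bounds, not any of the specific coresets constructed in this work.

        import numpy as np
        from sklearn.cluster import KMeans

        def sensitivity_sampling_coreset(points, k, coreset_size, seed=0):
            """Generic sensitivity-sampling coreset for k-means.
            Returns (sampled points, weights)."""
            rng = np.random.default_rng(seed)
            # rough solution used only to bound sensitivities
            rough = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(points)
            d2 = ((points - rough.cluster_centers_[rough.labels_]) ** 2).sum(axis=1)
            cost = d2.sum()
            cluster_sizes = np.bincount(rough.labels_, minlength=k)
            # standard sensitivity upper bound: cost share plus cluster-size term
            s = d2 / (cost + 1e-12) + 1.0 / cluster_sizes[rough.labels_]
            p = s / s.sum()
            idx = rng.choice(len(points), size=coreset_size, replace=True, p=p)
            weights = 1.0 / (coreset_size * p[idx])
            return points[idx], weights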

    K-means for massive data

    The K-means algorithm is undoubtedly one of the most popular cluster analysis techniques, due to its ease of implementation, straightforward parallelizability, and competitive computational complexity when compared to more sophisticated clustering alternatives. Unfortunately, the progressive growth of the amount of data that needs to be analyzed, in a wide variety of scientific fields, represents a significant challenge for the K-means algorithm, since its time complexity is dominated by the number of distance computations, which is linear with respect to both the number of instances and the dimensionality of the problem. This hampers its scalability on such massive data sets. Another major drawback of the K-means algorithm is its high dependency on the initial conditions, which not only may affect the quality of the obtained solution, but may also have a major impact on its computational load: a poor initialization can, for instance, lead to an exponential running time in the worst case. In this dissertation we tackle all these difficulties.
    Initially, we propose an approximation to the K-means problem, the Recursive Partition-based K-means algorithm (RPKM). This approach consists of recursively applying a weighted version of the K-means algorithm over a sequence of spatial partitions of the data set. From one iteration to the next, a more refined partition is constructed and the process is repeated using the optimal set of centroids obtained at the previous iteration as initialization. From a practical standpoint, such a process reduces the computational load of the K-means algorithm, as the number of representatives at each iteration is meant to be much smaller than the number of instances of the data set. Moreover, both phases of the algorithm are embarrassingly parallel. From the theoretical standpoint, and regardless of the selected partition strategy, one can guarantee the non-repetition of the clusterings generated at each RPKM iteration, which ultimately reduces the total number of K-means iterations and, in most cases, leads to a monotone decrease of the overall error function.
    Afterwards, we present an RPKM-type approach, the Boundary Weighted K-means algorithm (BWKM). For this technique the data set partition is based on an adaptive mesh that adjusts the size of each grid cell to maximize the chance that each cell contains only instances of the same cluster. The goal is to focus most of the computational resources on those regions where it is harder to determine the correct cluster assignment of the original instances, which is the main source of error for our approximation. For such a construction, it can be proved that if all the cells of a spatial partition are well assigned (contain instances of the same cluster) at the end of a BWKM step, then the obtained clustering is actually a fixed point of the K-means algorithm over the entire data set, generated after using only a small number of representatives in comparison to the actual size of the data set. Furthermore, if, for a certain step of BWKM, this property can be verified at consecutive weighted Lloyd's iterations, then the error of our approximation also decreases monotonically. From the practical standpoint, BWKM was compared to the state of the art: K-means++, Forgy K-means, Markov Chain Monte Carlo K-means and Minibatch K-means. The obtained results show that BWKM commonly converged to solutions with a relative error under 1% with respect to the considered methods, while using a much smaller number of distance computations (up to 7 orders of magnitude fewer).
    Even though the computational cost of BWKM is linear with respect to the dimensionality, its error guarantees are mainly related to the diagonal length of the grid cells, meaning that, as the dimensionality of the problem increases, it becomes harder for BWKM to maintain such a competitive performance. Taking this into consideration, we developed a fully parallelizable feature selection technique intended for the K-means algorithm, the Bounded Dimensional Distributed K-means algorithm (BDDKM). This approach consists of applying any heuristic for the K-means problem over multiple subsets of dimensions (each of which is bounded by a predefined constant, m << d) and using the obtained clusterings to upper-bound the increase in the K-means error when deleting a given feature. We then select the features with the m largest error increases. Not only can each step of BDDKM be easily parallelized, but its computational cost is dominated by that of the selected heuristic (on m dimensions), which makes it a suitable dimensionality reduction alternative for BWKM on large data sets. Besides providing a theoretical bound for the solution obtained via BDDKM with respect to the optimal K-means clustering, we analyze its performance in comparison to well-known feature selection and feature extraction techniques. This analysis shows that BDDKM consistently obtains results with lower K-means error than all the considered feature selection techniques (Laplacian scores, maximum variance and random selection), while requiring similar or lower computational times. More interestingly, when compared to feature extraction techniques such as Random Projections, BDDKM also shows a noticeable improvement in both error and computational time.
    As a response to the high dependency of the K-means algorithm on its initialization, we finally introduce a cheap Split-Merge step that can be used to re-start the K-means algorithm after reaching a fixed point: Split-Merge K-means (SMKM). Under some settings, one can show that this approach reduces the error of the given fixed point without requiring any further iteration of the K-means algorithm. Moreover, experimental results show that this strategy is able to generate approximations with an associated error that is hard to reach for different multi-start methods, such as multi-start Forgy K-means, K-means++ and Hartigan K-means. In particular, SMKM consistently generated the local minima with the lowest K-means error, reducing the relative error, on average, by over 1 order of magnitude with respect to K-means++ and Hartigan K-means, and by over 2 orders of magnitude with respect to Forgy K-means. Not only does the error of the solution obtained by SMKM tend to be much lower than that of the previously discussed methods, but, in terms of computational resources, SMKM also required a much smaller number of distance computations (about an order of magnitude fewer) to reach the lowest error that those methods achieved.
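    The common mechanism behind RPKM and BWKM is to replace all instances inside a cell of a spatial partition by a single representative weighted by the cell count, and then run weighted Lloyd's iterations on these few representatives. The sketch below shows that mechanism on a fixed uniform grid; the recursive refinement, the adaptive boundary-focused mesh of BWKM, and the error guarantees described above are omitted.

        import numpy as np

        def grid_representatives(points, cells_per_dim):
            """Replace the points in each grid cell by their mean, weighted by the cell count."""
            lo, hi = points.min(axis=0), points.max(axis=0)
            cell = np.minimum(((points - lo) / (hi - lo + 1e-12) * cells_per_dim).astype(int),
                              cells_per_dim - 1)
            buckets = {}
            for i, c in enumerate(map(tuple, cell)):
                buckets.setdefault(c, []).append(i)
            reps = np.array([points[idx].mean(axis=0) for idx in buckets.values()])
            weights = np.array([len(idx) for idx in buckets.values()], dtype=float)
            return reps, weights

        def weighted_lloyd_step(reps, weights, centers):
            """One weighted Lloyd's iteration over the cell representatives."""
            d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            new_centers = centers.copy()
            for j in range(len(centers)):
                mask = assign == j
                if mask.any():
                    new_centers[j] = np.average(reps[mask], axis=0, weights=weights[mask])
            return new_centers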
