
    Adaptive Initialization Method for K-Means Algorithm.

    The K-means algorithm is a widely used clustering algorithm that offers simplicity and efficiency. However, the traditional K-means algorithm chooses its initial cluster centers at random, which makes clustering results prone to local optima and degrades clustering performance. In this research, we propose an adaptive initialization method for the K-means algorithm (AIMK) that adapts to the varying characteristics of different datasets and obtains better clustering performance with stable results. For larger or higher-dimensional datasets, we additionally use random sampling in AIMK (named AIMK-RS) to reduce the time complexity. We used 22 real-world datasets for performance comparisons. The experimental results show that AIMK and AIMK-RS outperform current initialization methods and several well-known clustering algorithms. Specifically, AIMK-RS reduces the time complexity to O(n). Moreover, we use AIMK to initialize K-medoids and spectral clustering, and observe better performance there as well. These results demonstrate the superior performance and good scalability of AIMK and AIMK-RS. In the future, we would like to apply AIMK to more partition-based clustering algorithms to solve real-life practical problems.

    Approximate Clustering with Same-Cluster Queries

    Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with an additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log k + k log n) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable one to get a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps), which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries, showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega(k/poly log k) same-cluster queries. Our algorithm can be extended to the case where the query answers are wrong with some bounded probability. Another result we show for the k-means++ seeding is that a small modification of the k-means++ seeding within the SSAC framework converts it to a constant-factor approximation algorithm instead of the well-known O(log k)-approximation algorithm.
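The abstract above builds on D^2-sampling (k-means++ seeding). As a rough illustration only, the following is a minimal sketch of plain D^2-sampling in NumPy; it does not include same-cluster queries or the modification described in the paper, and the function name `d2_sampling_seeds` and its parameters are hypothetical names chosen for this example.

```python
import numpy as np

def d2_sampling_seeds(points, k, rng=None):
    """Pick k seeds from `points` (shape n x d) by D^2-sampling.

    Illustrative sketch of standard k-means++ seeding; not the
    query-based algorithm from the paper.
    """
    rng = np.random.default_rng(rng)
    points = np.asarray(points, dtype=float)
    n = points.shape[0]

    # The first seed is chosen uniformly at random.
    centers = [points[rng.integers(n)]]
    # Squared distance from every point to its nearest chosen seed.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)

    for _ in range(1, k):
        total = d2.sum()
        if total == 0.0:
            # Every point coincides with a seed; fall back to uniform.
            idx = rng.integers(n)
        else:
            # Sample a point with probability proportional to D^2.
            idx = rng.choice(n, p=d2 / total)
        centers.append(points[idx])
        # Update nearest-seed distances with the new seed.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))

    return np.array(centers)
```

Because an already chosen point has squared distance zero, it cannot be drawn again while other distinct points remain, which is what spreads the seeds out compared with uniform random initialization.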

    Adaptive Sampling Using Fleets of Gliders in the Presence of Fixed Buoys: a Prototype Built Upon the MyOcean Service

    In the last decade, fleets of gliders have proven to be an effective way of sampling the ocean on long-duration missions (on the order of months). A previous study [1] introduced a method for adaptively sampling the ocean with fleets of gliders, based on a clustering algorithm. The key ideas were: i) build a 2D mesh grid over the synoptic uncertainty of the ocean field to be sampled, with “knots” whose density is proportional to the level of uncertainty; ii) group this set of knots using a clustering algorithm, namely Fuzzy C-Means (a fuzzy variant of the well-known K-Means algorithm). The centroids are the next way-points for the gliders. However, that method assumed that all assets are maneuverable. In this study we extend it by exploiting the existence of non-maneuverable assets, i.e. fixed buoys (a situation that frequently occurs in real scenarios), and by considering time-dependent uncertainty, i.e. aiming to reach the way-points at time t such that the uncertainty at future times is minimized. The first essential idea is to treat the positions of the fixed buoys as some of the centroids to be obtained from the clustering algorithm: the remaining centroids to be computed are taken as the next positions to which the gliders are sent. By using the clustering algorithm described in [2], called “Partially Provided Centroids Fuzzy C-Means” (PPC-FCM), we have been able to exploit the presence of fixed buoys by sending the gliders to regions not already covered by the buoys/floats. This gives a better distribution (less overlap) of the sensing assets than direct use of the standard Fuzzy C-Means, which is unaware of the buoys. The second idea is to replace the synoptic uncertainty field with the field of mutual information between the way-points at time t and a selected future time.
We have built a prototype of this novel adaptive sampling scheme for mixed assets (maneuverable and non-maneuverable) that automatically retrieves ocean forecasts (currents, temperature, salinity, etc.) from MyOcean services. In addition, the prototype comes with a graphical user interface that facilitates the selection of the region of interest for data download. Once the data have been downloaded, which takes little effort, the PPC-FCM algorithm is run to obtain the gliders' next way-points. The procedure is then repeated whenever new forecasts become available. Our tool will be even more effective if future releases of MyOcean forecast products contain, in addition to the expected (mean) value of the field of interest obtained from forecasting models, a measure of the associated uncertainty, such as standard deviations. With this uncertainty estimate, glider mission planners would have valuable information on where to send the assets in order to reduce the uncertainty as much as possible.
REFERENCES
[1] Cococcioni et al., «SONGs: Self Organizing Network of Gliders for Adaptive Sampling of the Ocean», Maritime Rapid Environmental Assessment Conference, October 18-22, Lerici, Italy, 2010
[2] Cococcioni, «Clustering in the presence of partially provided centroids: a fuzzy approach», Technical Report, Department of Information Engineering, Pisa, 2014
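The idea of clustering with some centroids provided in advance can be sketched as follows. This is not the PPC-FCM algorithm of [2], whose details are not given in the abstract; it is a minimal illustration that assumes standard Fuzzy C-Means updates in which the buoy centroids are simply held fixed after every update. The function name `fcm_with_fixed_centroids` and all its parameters are hypothetical.

```python
import numpy as np

def fcm_with_fixed_centroids(knots, fixed, n_free, m=2.0,
                             n_iter=100, tol=1e-6, rng=None):
    """Fuzzy C-Means over `knots` where `fixed` centroids never move.

    Assumption: standard FCM membership/centroid updates, with the
    fixed (buoy) centroids reset after each step. Returns the free
    centroids (candidate glider way-points) and the membership matrix.
    """
    rng = np.random.default_rng(rng)
    knots = np.asarray(knots, dtype=float)
    fixed = np.asarray(fixed, dtype=float).reshape(-1, knots.shape[1])
    n_fixed = fixed.shape[0]

    # Initialise the free centroids at randomly chosen knots.
    start = rng.choice(len(knots), size=n_free, replace=False)
    centroids = np.vstack([fixed, knots[start]])

    for _ in range(n_iter):
        # Squared distances, shape (n_knots, n_centroids).
        diff = knots[:, None, :] - centroids[None, :, :]
        d2 = np.maximum((diff ** 2).sum(axis=2), 1e-12)
        # Membership u_ij proportional to d_ij^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Weighted-mean centroid update, then restore the buoys.
        w = u ** m
        new = (w.T @ knots) / w.sum(axis=0)[:, None]
        new[:n_fixed] = fixed
        shift = np.abs(new - centroids).max()
        centroids = new
        if shift < tol:
            break

    return centroids[n_fixed:], u
```

In this sketch, knots close to a buoy receive high membership in the buoy's cluster, so they contribute little to the free centroids, which therefore drift toward regions the buoys do not cover.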