
    Fast k-means based on KNN Graph

    In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can be prohibitively high when the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in the operation of seeking the closest centroid in each iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest-neighbor graph. In each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the cost of this step becomes minor and independent of k, and the processing bottleneck is thereby overcome. Most interestingly, the k-nearest-neighbor graph is constructed by iteratively calling the fast k-means itself. Compared with existing fast k-means variants, the proposed algorithm achieves a speed-up of hundreds to thousands of times while maintaining high clustering quality. When tested on 10 million 512-dimensional samples, it takes only 5.2 hours to produce 1 million clusters; fulfilling the same scale of clustering with traditional k-means would take 3 years.
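    The graph-restricted assignment step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the single-neighbor-hop candidate rule, and the input shapes are all assumptions made for the example.

```python
import numpy as np

def assign_via_knn_graph(X, centroids, labels, knn):
    """Sketch of a KNN-graph-restricted k-means assignment step.

    X: (n, d) data; centroids: (k, d); labels: (n,) current assignment;
    knn: (n, m) indices of each sample's approximate nearest neighbors.
    Each sample is compared only to the centroids of the clusters that it
    and its neighbors currently reside in, so the per-sample cost depends
    on the neighborhood size m, not on k.
    """
    new_labels = labels.copy()
    for i in range(len(X)):
        # candidate clusters: those of the point itself and of its neighbors
        cand = np.unique(np.append(labels[knn[i]], labels[i]))
        d = np.linalg.norm(centroids[cand] - X[i], axis=1)
        new_labels[i] = cand[np.argmin(d)]
    return new_labels
```

    With m fixed and small, the candidate set stays tiny even when k is in the millions, which is the source of the claimed speed-up.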

    Impact of different time series aggregation methods on optimal energy system design

    Modelling renewable energy systems is a computationally demanding task due to the high fluctuation of supply and demand time series. To reduce their scale, this paper discusses different methods for aggregating them into typical periods. To date, each aggregation method has been applied to a different type of energy system model, making the methods difficult to compare. To overcome this, the different aggregation methods are first extended so that they can be applied to all types of multidimensional time series, and then compared by applying them to different energy system configurations and analyzing their impact on the cost-optimal design. It was found that, regardless of the method, time series aggregation significantly reduces the required computational resources. Nevertheless, averaged values lead to an underestimation of the real system cost in comparison to representative periods drawn from the original time series. The aggregation method itself, e.g. k-means clustering, plays a minor role. More significant is the system considered: energy systems utilizing centralized resources require fewer typical periods for a feasible system design than systems with a higher share of renewable feed-in. Furthermore, for energy systems based on seasonal storage, the integration of typical periods into currently existing models is not suitable.
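    The basic idea of aggregating a time series into typical periods with k-means can be sketched as below. This is an illustrative toy, assuming daily slicing and a plain Lloyd-style k-means; the function name, the fixed iteration count, and the cluster-size weights are assumptions, not the paper's method.

```python
import numpy as np

def typical_periods(series, period_len, n_typical, iters=20, seed=0):
    """Aggregate a 1-D time series into n_typical representative periods.

    The series is sliced into consecutive periods of length period_len,
    the periods are clustered with k-means, and each cluster centroid
    serves as a "typical period" weighted by its cluster size.
    """
    rng = np.random.default_rng(seed)
    periods = series[: len(series) // period_len * period_len]
    periods = periods.reshape(-1, period_len)          # one row per period
    # seed centroids with randomly chosen distinct periods
    centroids = periods[rng.choice(len(periods), n_typical,
                                   replace=False)].astype(float)
    for _ in range(iters):
        # assign each period to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(periods[:, None, :] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_typical):
            if np.any(labels == c):                    # skip empty clusters
                centroids[c] = periods[labels == c].mean(axis=0)
    weights = np.bincount(labels, minlength=n_typical)  # periods per cluster
    return centroids, weights
```

    The weights indicate how many original periods each typical period represents, which is what lets the downstream optimization model run on a handful of periods instead of a full year.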

    Ensemble rapid centroid estimation: a semi-stochastic consensus particle swarm approach for large-scale cluster optimization

    University of Technology Sydney, Faculty of Engineering and Information Technology. This thesis details rigorous theoretical and empirical analyses of related work in the clustering literature based on Particle Swarm Optimization (PSO) principles. In particular, we detail the discovery of disadvantages in Van Der Merwe and Engelbrecht's PSO clustering, Cohen and de Castro's Particle Swarm Clustering (PSC), Szabo's modified PSC (mPSC), and Szabo's Fuzzy PSC (FPSC). We validate, both theoretically and empirically, that Van Der Merwe and Engelbrecht's PSO clustering algorithm is not significantly better than conventional k-means, and we argue that under random initialization the performance of their algorithm diminishes exponentially as the number of classes or dimensions increases. We show that the PSC, mPSC, and FPSC algorithms suffer from significant complexity issues that do not translate into performance. Their cognitive and social parameters have a negligible effect on convergence, and the algorithms degenerate to k-means, retaining all of its characteristics including the most severe: the curse of initial position. Furthermore, we observe that the three algorithms, although proposed under varying names and at different times, behave similarly to the original PSC. This thesis then analyzes, both theoretically and empirically, the strengths and limitations of our proposed semi-stochastic particle swarm clustering algorithm, Rapid Centroid Estimation (RCE), the self-evolutionary Ensemble RCE (ERCE), and Consensus Engrams, which are developed mainly to address the fundamental issues in PSO clustering and the PSC family. The algorithms extend the scalability, applicability, and reliability of earlier approaches to handle large-scale non-convex cluster optimization in quasilinear time and space complexity. This thesis establishes the fundamentals, substantially surpassing those outlined in our published manuscripts.

    k-Means Walk: Unveiling Operational Mechanism of a Popular Clustering Approach for Microarray Data

    Since data analysis using a computational model has a profound influence on the interpretation of the final results, a basic understanding of the underlying model is required for optimal experimental design by the target users of such tools. Despite the wide variation of techniques associated with clustering, cluster analysis has become a generic name in bioinformatics and is used to discover the natural grouping(s) of a set of patterns, points, or sequences. The aim of this paper is to analyze k-means by applying a step-by-step "k-means walk" with graphic-guided analysis, to provide a clear understanding of the operational mechanism of the k-means algorithm. A scatter graph was created from theoretical microarray gene expression data, a simplified view of typical microarray experiment data. We designated the first three data points as the initial centroids and applied the Euclidean distance metric in the k-means algorithm, so that these three points served as the reference points for forming each cluster. Before each subsequent iteration, a test is conducted to determine whether any centroid has shifted. We were able to trace which data points ended up in the same cluster after convergence. We observed that, as both the dimensionality of the data and the gene list of the microarray hybridization matrix increase, the computational implementation of the k-means algorithm becomes more demanding. Furthermore, an understanding of this approach should stimulate new ideas for the further development and improvement of the k-means clustering algorithm, especially within the confines of the biology of diseases and beyond. The major advantage will be improved cluster output for the interpretation of microarray experimental results, helping bioinformaticians and algorithm experts to tweak the k-means algorithm for improved clustering run-time.
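    The walk described above — seed the centroids with the first k data points, assign by Euclidean distance, and stop once no centroid shifts — can be sketched in a few lines. This is a minimal sketch of standard Lloyd iteration under those seeding and stopping rules; the function name, tolerance, and iteration cap are assumptions for the example, not the paper's code.

```python
import numpy as np

def kmeans_walk(points, k=3, tol=1e-9, max_iter=100):
    """k-means seeded with the first k points, stopping when no centroid shifts."""
    centroids = points[:k].astype(float).copy()    # first k points as seeds
    for _ in range(max_iter):
        # Euclidean distance from every point to every centroid
        d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)                  # nearest-centroid assignment
        new_c = np.array([points[labels == c].mean(axis=0)
                          if np.any(labels == c) else centroids[c]
                          for c in range(k)])
        # the "shift test": stop once no centroid moved between iterations
        if np.max(np.linalg.norm(new_c - centroids, axis=1)) < tol:
            break
        centroids = new_c
    return labels, centroids
```

    Tracing `labels` across iterations is exactly the "walk": it shows which points settle into the same cluster once the shift test passes.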

    Multi-camera complexity assessment system for assembly line work stations

    In the last couple of years, the market has demanded an increasing number of product variants, leading to an inevitable rise in the complexity of manufacturing systems. A model to quantify the complexity of a workstation has been developed, but part of the analysis is done manually. To that end, this paper presents the results of an industrial proof-of-concept that tested the possibility of automating the complexity analysis using multi-camera video images.