Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost becomes
prohibitively high when the data size and the number of clusters are large. It
is well known that the processing bottleneck of k-means lies in the operation
of seeking the closest centroid in each iteration. In this paper, a novel
solution to the scalability issue of k-means is presented. In the proposal,
k-means is supported by an approximate k-nearest-neighbor graph. In each
k-means iteration, a data sample is compared only to the clusters in which its
nearest neighbors reside. Since the number of nearest neighbors considered is
much smaller than k, the processing cost of this step becomes minor and
independent of k. The processing bottleneck is therefore overcome. Most
interestingly, the k-nearest-neighbor graph is constructed by iteratively
calling the fast k-means itself. Compared with existing fast k-means variants,
the proposed algorithm achieves a speed-up of hundreds to thousands of times
while maintaining high clustering quality. When tested on 10 million
512-dimensional vectors, it takes only 5.2 hours to produce 1 million clusters.
In contrast, traditional k-means would take about 3 years to complete
clustering at the same scale.
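The key idea of the abstract can be sketched as follows: in the assignment step, each sample is compared only to the clusters of its (approximate) nearest neighbors rather than to all k centroids. This is a minimal illustrative sketch, not the paper's implementation; the function name and the dict-based neighbor graph are assumptions for the example.

```python
import numpy as np

def assign_with_knn_graph(X, centroids, labels, knn):
    """One assignment pass in which each sample is compared only to the
    clusters that its nearest neighbors currently belong to, instead of
    to all k centroids (hypothetical sketch of the paper's idea)."""
    new_labels = labels.copy()
    for i, x in enumerate(X):
        # Candidate clusters: the sample's own cluster plus those of its
        # nearest neighbors -- usually far fewer than k candidates.
        candidates = {labels[i]}
        candidates.update(labels[j] for j in knn[i])
        best, best_d = labels[i], np.inf
        for c in candidates:
            d = np.sum((x - centroids[c]) ** 2)  # squared Euclidean distance
            if d < best_d:
                best, best_d = c, d
        new_labels[i] = best
    return new_labels
```

Because the candidate set is bounded by the number of neighbors rather than by k, the per-sample cost of this step no longer grows with the number of clusters, which is the scalability claim of the abstract.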
Impact of different time series aggregation methods on optimal energy system design
Modelling renewable energy systems is a computationally demanding task due to
the high fluctuation of supply and demand time series. To reduce their scale,
this paper discusses different methods for aggregating them into typical
periods. So far, each aggregation method has been applied to a different type
of energy system model, making the methods difficult to compare. To overcome
this, the different aggregation methods are first extended so that they can be
applied to all types of multidimensional time series, and then compared by
applying them to different energy system configurations and analyzing their
impact on the cost-optimal design. It was found that, regardless of the method,
time series aggregation allows for significantly reduced computational
resources. Nevertheless, averaged values lead to an underestimation of the real
system cost in comparison to the use of representative periods drawn from the
original time series. The aggregation method itself, e.g., k-means clustering,
plays a minor role. More significant is the system considered: energy systems
utilizing centralized resources require fewer typical periods for a feasible
system design than systems with a higher share of renewable feed-in.
Furthermore, for energy systems based on seasonal storage, the integration of
typical periods as implemented in currently existing models is not suitable.
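The aggregation step the abstract refers to can be illustrated with a small sketch: a long series is reshaped into periods (e.g., days), the periods are clustered with k-means, and each cluster mean becomes a typical period weighted by how many original periods it represents. This is a hypothetical minimal example, not any of the paper's extended methods; all names are assumptions.

```python
import numpy as np

def typical_periods(series, period_len, n_typical, n_iter=50, seed=0):
    """Aggregate a time series into typical periods via plain k-means
    over the reshaped periods (illustrative sketch only)."""
    periods = series.reshape(-1, period_len)  # one row per period
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen periods.
    centroids = periods[rng.choice(len(periods), n_typical, replace=False)]
    for _ in range(n_iter):
        # Assign each period to its nearest typical period.
        d = ((periods[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute each typical period as its cluster mean.
        for c in range(n_typical):
            if (labels == c).any():
                centroids[c] = periods[labels == c].mean(0)
    # Weight of each typical period = number of original periods it stands for.
    weights = np.bincount(labels, minlength=n_typical)
    return centroids, weights
```

Note that returning cluster *means*, as here, is exactly the averaging the abstract warns about: it can underestimate the real system cost compared to picking a representative (real) period from each cluster.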
Ensemble rapid centroid estimation: a semi-stochastic consensus particle swarm approach for large-scale cluster optimization
University of Technology Sydney. Faculty of Engineering and Information Technology. This thesis details rigorous theoretical and empirical analyses of related work in the clustering literature based on Particle Swarm Optimization (PSO) principles. In particular, we detail the discovery of disadvantages in Van Der Merwe - Engelbrecht’s PSO clustering, Cohen - de Castro Particle Swarm Clustering (PSC), Szabo’s modified PSC (mPSC), and Szabo’s Fuzzy PSC (FPSC).
We validate, both theoretically and empirically, that Van Der Merwe - Engelbrecht’s PSO clustering algorithm is not significantly better than conventional k-means. We propose that, under random initialization, the performance of their algorithm diminishes exponentially as the number of classes or dimensions increases.
We reveal that the PSC, mPSC, and FPSC algorithms suffer from significant complexity issues that do not translate into performance. Their cognitive and social parameters have a negligible effect on convergence, and the algorithms reduce to k-means, retaining all of its characteristics, including the most severe: the curse of initial position. Furthermore, we observe that the three algorithms, although proposed under varying names and time frames, behave similarly to the original PSC.
This thesis analyzes, both theoretically and empirically, the strengths and limitations of our proposed semi-stochastic particle swarm clustering algorithm, Rapid Centroid Estimation (RCE), the self-evolutionary Ensemble RCE (ERCE), and Consensus Engrams, which are developed mainly to address the fundamental issues in PSO clustering and the PSC families. The algorithms extend the scalability, applicability, and reliability of earlier approaches to handle large-scale non-convex cluster optimization in quasilinear complexity in both time and space. This thesis establishes the fundamentals, going well beyond those outlined in our published manuscripts.
Micromobility evolution and expansion: Understanding how docked and dockless bikesharing models complement and compete – A case study of San Francisco
Shared micromobility – the shared use of bicycles, scooters, or other low-speed modes – is an innovative transportation strategy growing across the United States that includes various service models, such as docked, dockless, and e-bike services. This research focuses on understanding how docked bikesharing and dockless e-bikesharing models complement and compete with respect to user travel behaviors. To inform our analysis, we used two datasets from February 2018 of Ford GoBike (docked) and JUMP (dockless electric) bikesharing trips in San Francisco. We employed three methodological approaches: 1) travel behavior analysis, 2) discrete choice analysis with a destination choice model, and 3) geospatial suitability analysis based on the Spatial Temporal Economic Physiological Social (STEPS) to Transportation Equity framework. We found that dockless e-bikesharing trips were longer in distance and duration than docked trips. The average JUMP trip was about a third longer in distance and about twice as long in duration as the average GoBike trip. JUMP users were far less sensitive to estimated total elevation gain than GoBike users, making trips with total elevation gain about three times larger than those of GoBike users, on average. The JUMP system achieved greater usage rates than GoBike, with 0.8 more daily trips per bike and 2.3 more miles traveled per bike per day, on average. The destination choice model results suggest that JUMP users traveled to lower-density destinations, while GoBike users largely traveled to dense employment areas. Bike rack density was a significant positive factor for JUMP users. The location of GoBike docking stations may attract users and/or be well placed relative to users' destination preferences. The STEPS-based bikeability analysis revealed opportunities for the expansion of both bikesharing systems in areas of the city where high job density and bike facility availability converge with older resident populations.
k-Means Walk: Unveiling Operational Mechanism of a Popular Clustering Approach for Microarray Data
Since data analysis using a technical computational model has a profound influence on the interpretation of final results, a basic understanding of the model underlying such computational tools is required for optimal experimental design by their target users. Despite the wide variation of techniques associated with clustering, cluster analysis has become a generic name in bioinformatics and is used to discover the natural grouping(s) of a set of patterns, points, or sequences. The aim of this paper is to analyze k-means by applying a step-by-step "k-means walk" approach using graphic-guided analysis, to provide a clear understanding of the operational mechanism of the k-means algorithm. A scatter graph was created using theoretical microarray gene expression data, a simplified view of typical microarray experiment data. We designated the first three data points as the initial centroids and applied the Euclidean distance metric in the k-means algorithm, so that these three data points served as the reference points for each cluster formation. A test was conducted to determine whether any centroid had shifted before the next iteration was started. We were able to trace the data points belonging to the same cluster after convergence. We observed that, as both the dimension of the data and the gene list grow for the hybridization matrix of microarray data, the computational implementation of the k-means algorithm becomes more demanding. Furthermore, an understanding of this approach will stimulate new ideas for the further development and improvement of the k-means clustering algorithm, especially within the confines of the biology of diseases and beyond. A major advantage will be improved cluster output for the interpretation of microarray experimental results, helping bioinformaticians and algorithm experts to tweak the k-means algorithm for improved clustering run-time.
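The walk described above (first k data points as initial centroids, Euclidean assignment, then a centroid-shift test before the next iteration) can be sketched compactly. This is a minimal illustrative sketch of standard k-means with that seeding and stopping rule, not the paper's graphic-guided tool; the function name is an assumption.

```python
import numpy as np

def kmeans_walk(points, k=3, tol=1e-9):
    """Step-by-step k-means: the first k data points seed the centroids,
    points are assigned by Euclidean distance, and a centroid-shift test
    decides whether another iteration is needed (hypothetical sketch)."""
    centroids = points[:k].astype(float)  # first k points as initial centroids
    while True:
        # Assign every point to its nearest centroid (squared Euclidean).
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute centroids as cluster means (keep a centroid if its
        # cluster is empty), then test whether any centroid shifted.
        new = np.array([points[labels == c].mean(0) if (labels == c).any()
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids, atol=tol):  # no shift: converged
            return labels, new
        centroids = new
```

The shift test is exactly the convergence check the abstract describes: iteration stops once no centroid moves between passes, after which the points sharing a label form the final clusters.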
Multi-camera complexity assessment system for assembly line work stations
In the last couple of years, the market has demanded an increasing number of product variants. This leads to an inevitable rise of complexity in manufacturing systems. A model to quantify the complexity of a workstation has been developed, but part of the analysis is done manually. To that end, this paper presents the results of an industrial proof-of-concept in which the possibility of automating the complexity analysis using multi-camera video images was tested.