15,915 research outputs found

    Fast k-means based on KNN Graph

    Full text link
    In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost could be prohibitively high as the data size and the cluster number are large. It is well known that the processing bottleneck of k-means lies in the operation of seeking closest centroid in each iteration. In this paper, a novel solution towards the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest neighbors graph. In the k-means iteration, each data sample is only compared to clusters that its nearest neighbors reside. Since the number of nearest neighbors we consider is much less than k, the processing cost in this step becomes minor and irrelevant to k. The processing bottleneck is therefore overcome. The most interesting thing is that k-nearest neighbor graph is constructed by iteratively calling the fast kk-means itself. Comparing with existing fast k-means variants, the proposed algorithm achieves hundreds to thousands times speed-up while maintaining high clustering quality. As it is tested on 10 million 512-dimensional data, it takes only 5.2 hours to produce 1 million clusters. In contrast, to fulfill the same scale of clustering, it would take 3 years for traditional k-means

    Approximated and User Steerable tSNE for Progressive Visual Analytics

    Full text link
    Progressive Visual Analytics aims at improving the interactivity in existing analytics techniques by means of visualization as well as interaction with intermediate results. One key method for data analysis is dimensionality reduction, for example, to produce 2D embeddings that can be visualized and analyzed efficiently. t-Distributed Stochastic Neighbor Embedding (tSNE) is a well-suited technique for the visualization of several high-dimensional data. tSNE can create meaningful intermediate results but suffers from a slow initialization that constrains its application in Progressive Visual Analytics. We introduce a controllable tSNE approximation (A-tSNE), which trades off speed and accuracy, to enable interactive data exploration. We offer real-time visualization techniques, including a density-based solution and a Magic Lens to inspect the degree of approximation. With this feedback, the user can decide on local refinements and steer the approximation level during the analysis. We demonstrate our technique with several datasets, in a real-world research scenario and for the real-time analysis of high-dimensional streams to illustrate its effectiveness for interactive data analysis
    • …
    corecore