
    Fast k-means based on KNN Graph

    In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost becomes prohibitively high when the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in the operation of seeking the closest centroid in each iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest-neighbor graph. In each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the cost of this step becomes minor and independent of k. The processing bottleneck is therefore overcome. Most interestingly, the k-nearest-neighbor graph is constructed by iteratively calling the fast k-means itself. Compared with existing fast k-means variants, the proposed algorithm achieves a speed-up of hundreds to thousands of times while maintaining high clustering quality. When tested on 10 million 512-dimensional data points, it takes only 5.2 hours to produce 1 million clusters; in contrast, traditional k-means would take 3 years to complete clustering at the same scale.
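The core trick described in the abstract, restricting each point's centroid search to the clusters of its nearest neighbors, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the precomputed `knn` index array are assumptions.

```python
import numpy as np

def assign_via_knn(X, centroids, labels, knn):
    """One assignment pass of KNN-graph-accelerated k-means (sketch).
    Each point is compared only to the centroids of clusters that its
    nearest neighbors belong to, instead of all k centroids."""
    new_labels = labels.copy()
    for i, x in enumerate(X):
        # candidate clusters: the point's own cluster plus its neighbors' clusters
        cand = np.unique(np.append(labels[knn[i]], labels[i]))
        d = np.linalg.norm(centroids[cand] - x, axis=1)
        new_labels[i] = cand[np.argmin(d)]
    return new_labels
```

Since `cand` is typically far smaller than k, the cost per point no longer scales with the number of clusters, which is the bottleneck the paper targets.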

    A novel double-hybrid learning method for modal frequency-based damage assessment of bridge structures under different environmental variation patterns

    Monitoring of modal frequencies under an unsupervised learning framework is a practical strategy for damage assessment of civil structures, especially bridges. However, the key challenge is the high sensitivity of modal frequencies to environmental and/or operational changes, which may lead to economic and safety losses. The other challenge pertains to the different environmental and/or operational variation patterns in modal frequencies arising from differences in structural types, materials, and applications; measurement periods in terms of short and/or long monitoring programs; geographical locations of structures; weather conditions; and influences of single or multiple environmental and/or operational factors, all of which may hinder the use of state-of-the-art unsupervised learning approaches. To cope with these issues, this paper proposes a novel double-hybrid learning technique in an unsupervised manner. It contains two stages, data partitioning and anomaly detection, both of which comprise two hybrid algorithms. For the first stage, an improved hybrid clustering method based on a coupling of shared-nearest-neighbor searching and density peaks clustering is proposed to prepare local information for anomaly detection, with the focus on mitigating environmental and/or operational effects. For the second stage, this paper proposes an innovative non-parametric hybrid anomaly detector based on the local outlier factor. In both stages, the number of nearest neighbors is the key hyperparameter, which is automatically determined by leveraging a self-adaptive neighbor searching algorithm. Modal frequencies of two full-scale bridges are utilized to validate the proposed technique through several comparisons. Results indicate that this technique is able to successfully eliminate different environmental and/or operational variations and correctly detect damage.
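The stage-two detector builds on the local outlier factor (LOF). A minimal textbook LOF, not the paper's hybrid variant (which adds the self-adaptive neighbor search and further steps), can be sketched like this; scores near 1 indicate inliers, scores clearly above 1 indicate anomalies:

```python
import numpy as np

def lof_scores(X, k):
    """Minimal Local Outlier Factor (simplified sketch).
    Returns one score per row of X; higher means more anomalous."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbors (skip self)
    kdist = D[np.arange(n), idx[:, -1]]            # k-distance of each point
    # reachability distance: reach(p, o) = max(kdist(o), d(p, o))
    reach = np.maximum(kdist[idx], D[np.arange(n)[:, None], idx])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)       # local reachability density
    return lrd[idx].mean(axis=1) / lrd             # LOF: neighbors' mean lrd / own lrd
```

In a monitoring setting, X would hold modal-frequency feature vectors, and a damaged state would show up as points with scores well above 1 relative to the healthy baseline.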

    dbscan: Fast Density-Based Clustering with R

    This article describes the implementation and use of the R package dbscan, which provides complete and fast implementations of the popular density-based clustering algorithm DBSCAN and the augmented ordering algorithm OPTICS. Package dbscan uses advanced open-source spatial indexing data structures implemented in C++ to speed up computation. An important advantage of this implementation is that it is up-to-date with several improvements that have been added since the original algorithms were published (e.g., artifact corrections and dendrogram extraction methods for OPTICS). We provide a consistent presentation of the DBSCAN and OPTICS algorithms, and compare dbscan's implementation with other popular libraries such as the R package fpc, ELKI, WEKA, PyClustering, scikit-learn, and SPMF, in terms of available features and through an experimental comparison.
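For reference, the core DBSCAN algorithm the package implements can be sketched in a few lines. This toy version omits the spatial indexing that gives the package its speed (it computes all pairwise distances, O(n²)); labels of -1 mark noise points:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Toy DBSCAN: expand clusters from core points by BFS over
    density-reachable points. No spatial index; -1 = noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    nbrs = [np.flatnonzero(D[i] <= eps) for i in range(n)]   # eps-neighborhoods (incl. self)
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or len(nbrs[i]) < min_pts:
            continue                          # already assigned, or not a core point
        labels[i] = cid                       # seed a new cluster from core point i
        queue = list(nbrs[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid
                if len(nbrs[j]) >= min_pts:   # j is itself core: keep expanding
                    queue.extend(nbrs[j])
        cid += 1
    return labels
```

The package replaces the O(n²) distance matrix with a C++ k-d tree, which is what makes it practical on large data.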

    Extended fast search clustering algorithm: widely density clusters, no density peaks

    CFSFDP (clustering by fast search and find of density peaks) is a recently developed density-based clustering algorithm. Compared to DBSCAN, it needs fewer parameters and is computationally cheap because it is non-iterative. Rodriguez et al. have demonstrated its power in many applications. However, CFSFDP does not perform well when one cluster contains more than one density peak, a situation we refer to as "no density peaks". In this paper, inspired by the idea of the hierarchical clustering algorithm CHAMELEON, we propose an extension of CFSFDP, E_CFSFDP, to suit more applications. In particular, we use the original CFSFDP to generate initial clusters first, and then merge the sub-clusters in a second phase. We have applied the algorithm to several data sets that exhibit "no density peaks". Experiment results show that our approach outperforms the original one because it breaks through the strict requirements on data sets. Comment: 18 pages, 10 figures, DBDM 201
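The first phase that E_CFSFDP builds on computes two quantities per point: the local density rho and the distance delta to the nearest point of higher density. Cluster centers are points where both are large. A minimal sketch of this step (using the simple cutoff-kernel density; the cutoff `dc` is the method's main parameter):

```python
import numpy as np

def density_peaks(X, dc):
    """CFSFDP decision quantities (sketch): rho = number of neighbors
    within cutoff dc; delta = distance to the nearest higher-density
    point (max distance for the globally densest point)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    rho = (D < dc).sum(axis=1) - 1          # exclude the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.flatnonzero(rho > rho[i])
        delta[i] = D[i, higher].min() if len(higher) else D[i].max()
    return rho, delta
```

A cluster with several density peaks yields several (rho, delta) candidates, which is exactly the failure mode the paper's merging phase addresses.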

    Data clustering using a model granular magnet

    We present a new approach to clustering, based on the physical properties of an inhomogeneous ferromagnet. No assumption is made regarding the underlying distribution of the data. We assign a Potts spin to each data point and introduce an interaction between neighboring points, whose strength is a decreasing function of the distance between the neighbors. This magnetic system exhibits three phases. At very low temperatures it is completely ordered: all spins are aligned. At very high temperatures the system does not exhibit any ordering, while in an intermediate regime clusters of relatively strongly coupled spins become ordered, whereas different clusters remain uncorrelated. This intermediate phase is identified by a jump in the order parameters. The spin-spin correlation function is used to partition the spins, and the corresponding data points, into clusters. We demonstrate how the method works on three synthetic and three real data sets. Detailed comparison to the performance of other techniques clearly indicates the relative success of our method. Comment: 46 pages, postscript, 15 ps figures included
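The interaction setup the abstract describes, coupling each point to its near neighbors with a strength that decays with distance, might be sketched as below. The Gaussian decay and the choice of the mean neighbor distance as the length scale are assumptions for illustration; the subsequent Monte Carlo simulation and correlation-based partitioning are not shown.

```python
import numpy as np

def couplings(X, k):
    """Build the spin-model interaction matrix (sketch): each point is
    coupled to its k nearest neighbors with a strength that decreases
    with distance (assumed Gaussian decay at scale a)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]          # k nearest neighbors of each point
    a = D[np.arange(n)[:, None], idx].mean()         # characteristic length scale
    J = np.zeros((n, n))
    for i in range(n):
        for j in idx[i]:
            J[i, j] = J[j, i] = np.exp(-D[i, j] ** 2 / (2 * a ** 2))
    return J
```

Points within the same dense region get strong couplings and order together in the intermediate phase, while weakly coupled points across regions remain uncorrelated, which is what the spin-spin correlation function then detects.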