232 research outputs found

    A Fast Clustering Algorithm based on pruning unnecessary distance computations in DBSCAN for High-Dimensional Data

    Clustering is an important technique for dealing with the large-scale data being created explosively on the internet. Most such data are high-dimensional and noisy, which poses great challenges for retrieval, classification, and understanding. No existing approach is “optimal” for large-scale data. For example, DBSCAN requires O(n^2) time, Fast-DBSCAN works well only in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time only if the dimension D is a relatively small constant. However, we prove theoretically and show experimentally that ρ-Approximate DBSCAN degenerates to an O(n^2) algorithm in very high dimensions, where 2^D >> n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN; the resulting algorithm, named NQ-DBSCAN, effectively eliminates a large number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log n) on average with the help of an indexing technique, and in O(n) in the best case if proper parameters are used, which makes it suitable for many realtime data
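
    To make the O(n^2) baseline in the abstract above concrete, here is a minimal sketch of textbook DBSCAN with a brute-force neighborhood query. It is not NQ-DBSCAN and does not reproduce that paper's pruning technique; the names `region_query`, `dbscan`, and `NOISE` are illustrative assumptions. Each point triggers at most one neighborhood query, and each query computes n distances, which is where the quadratic cost comes from.

```python
import math

NOISE = -1  # assumed label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (n distance computations)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one label per point (0, 1, ... or NOISE)."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```

    Replacing `region_query` with an index-backed search is the usual way to avoid scanning all n points per query; the sketch keeps the brute-force version so the cost is visible.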

    GriT-DBSCAN: A Spatial Clustering Algorithm for Very Large Databases

    DBSCAN is a fundamental spatial clustering algorithm with numerous practical applications. However, a bottleneck of the algorithm is that its worst-case run time complexity is O(n^2). To address this limitation, we propose a new grid-based algorithm for exact DBSCAN in Euclidean space called GriT-DBSCAN, which is based on the following two techniques. First, we introduce a grid tree to organize the non-empty grids for the purpose of efficient non-empty neighboring grid queries. Second, by utilising the spatial relationships among points, we propose a technique that iteratively prunes unnecessary distance calculations when determining whether the minimum distance between two sets is less than or equal to a certain threshold. We theoretically prove that the complexity of GriT-DBSCAN is linear in the data set size. In addition, we obtain two variants of GriT-DBSCAN by incorporating heuristics, or by combining the second technique with an existing algorithm. Experiments are conducted on both synthetic and real-world data sets to evaluate the efficiency of GriT-DBSCAN and its variants. The results of our analyses show that our algorithms outperform existing algorithms
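
    The two ingredients named in the abstract above, storing only non-empty grid cells and testing whether two point sets come within a threshold of each other, can be sketched as follows. This is an illustration under stated assumptions, not GriT-DBSCAN itself: the cell side eps/sqrt(d) is a common choice in grid-based DBSCAN (it guarantees that any two points in the same cell are within eps) but is not given in the abstract, a plain dictionary stands in for the paper's grid tree, and `sets_within` is a naive early-exit check rather than the paper's pruning technique. The names `build_grid` and `sets_within` are hypothetical.

```python
import math
from collections import defaultdict


def build_grid(points, eps):
    """Bucket points into cells of side eps/sqrt(d); only non-empty cells are stored."""
    d = len(points[0])
    side = eps / math.sqrt(d)  # assumed cell size: cell diagonal equals eps
    grid = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / side) for x in p)
        grid[key].append(p)
    return dict(grid)


def sets_within(a, b, eps):
    """True if some pair (p in a, q in b) is at distance <= eps; stops at the first hit."""
    return any(math.dist(p, q) <= eps for p in a for q in b)


points = [(0.1, 0.1), (0.2, 0.3), (5.0, 5.0)]
grid = build_grid(points, eps=1.0)
print(sorted(grid))  # keys of the non-empty cells
print(sets_within(grid[(0, 0)], grid[(7, 7)], eps=1.0))
```

    The early exit in `sets_within` only helps when a close pair exists; when the sets are far apart it still computes every pairwise distance, which is the kind of wasted work a pruning technique targets.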

    Theoretically-Efficient and Practical Parallel DBSCAN

    The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(n log n) work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups of up to 33x over the best sequential algorithms

    ASM-Clust: classifying functionally diverse protein families using alignment score matrices

    Rapid advances in sequencing technology have resulted in the availability of genomes from organisms across the tree of life. Accurately interpreting the function of proteins in these genomes is a major challenge, as annotation transfer based on homology frequently results in misannotation and error propagation. This challenge is especially pressing for organisms whose genomes are directly obtained from environmental samples, as interpretation of their physiology and ecology is often based solely on the genome sequence. For complex protein (super)families containing a large number of sequences, classification can be used to determine whether annotation transfer is appropriate, or whether experimental evidence for function is lacking. Here we present a novel computational approach for de novo classification of large protein (super)families, based on clustering an alignment score matrix obtained by aligning all sequences in the family to a small subset of the data. We evaluate our approach on the enolase family in the Structure Function Linkage Database

    A New-Fangled FES-k-Means Clustering Algorithm for Disease Discovery and Visual Analytics

    The central purpose of this study is to further evaluate the quality of the performance of a new algorithm. The study provides additional evidence on this algorithm, which was designed to increase the overall efficiency of the original k-means clustering technique: the Fast, Efficient, and Scalable k-means algorithm (FES-k-means). The FES-k-means algorithm uses a hybrid approach that comprises the k-d tree data structure that enhances the nearest neighbor query, the original k-means algorithm, and an adaptation rate proposed by Mashor. This algorithm was tested using two real datasets and one synthetic dataset. It was employed twice on all three datasets: once on data trained by the innovative MIL-SOM method and then on the actual untrained data in order to evaluate its competence. This two-step approach of data training prior to clustering provides a solid foundation for knowledge discovery and data mining, otherwise unclaimed by clustering methods alone. The benefits of this method are that it produces clusters similar to the original k-means method at a much faster rate as shown by runtime comparison data; and it provides efficient analysis of large geospatial data with implications for disease mechanism discovery. From a disease mechanism discovery perspective, it is hypothesized that the linear-like pattern of elevated blood lead levels discovered in the city of Chicago may be spatially linked to the city's water service lines.

    Comparison of DBSCAN and PCA-DBSCAN Algorithm for Grouping Earthquake Area

    Geologically, the territory of Indonesia is where three active tectonic plates meet; they are always moving and colliding with each other, resulting in earthquakes, volcanic pathways, and faults. An earthquake is a natural disaster that cannot be avoided or prevented, but its consequences can be minimized. Based on data obtained from the Meteorology, Climatology and Geophysics Agency (MCGA), earthquakes often occur in Indonesia. Data obtained from earthquakes can be grouped to map the areas of earthquake occurrence, and an analysis will be carried out to determine the characteristics of earthquake clustering areas. The clustering in this study is conducted in two experiments: the first is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) without dimensionality reduction, and the second is DBSCAN clustering with dimensionality reduction using Principal Component Analysis (PCA). The best cluster results can be found by calculating the value of the Silhouette Index (SI) of each cluster. Of the two experiments, the highest SI value, 0.4137, was obtained in the experiment using PCA. The second experiment was therefore taken as the best clustering result, with the highest Depth and Magnitude features in clusters 19 and 17, which showed that the 5 main regions where earthquakes often occur are Sumatra, Banda Sea, Moluccan Sea, Irian Jaya and Sulawesi. Keywords: Climatology and Geophysics Agency, DBSCAN, DBSCAN-PCA, Earthquake Area, PC
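
    The Silhouette Index used above to compare the two experiments can be sketched in a few lines. This follows the standard definition, s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its cluster and b is the smallest mean distance from i to the members of any other cluster; it is not taken from the paper's code. The function name `silhouette_index` is illustrative, and how the study treated DBSCAN noise points is not stated, so the sketch assumes the caller has already removed them.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette coefficient; labels[i] is the cluster of points[i]."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_index(points, [0, 0, 1, 1]))  # well-separated labeling, close to 1
print(silhouette_index(points, [0, 1, 0, 1]))  # poor labeling, negative
```

    Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.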

    Fuzzy-Rough Intrigued Harmonic Discrepancy Clustering
