
232 research outputs found

A Fast Clustering Algorithm based on pruning unnecessary distance computations in DBSCAN for High-Dimensional Data

Author: Bouguila Nizar
Chen Yewang
Du Jixiang
Li HaiLin
Tang Shenyu
Wang Cheng
Publication venue: Elsevier BV
Publication date: 05/06/2018

Clustering is an important technique for dealing with the large-scale data that is being created explosively on the internet. Most of this data is high-dimensional and noisy, which poses great challenges for retrieval, classification and understanding. No existing approach is “optimal” for large-scale data. For example, DBSCAN requires O(n^2) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time only if the dimension D is a relatively small constant. We prove theoretically and show experimentally that ρ-Approximate DBSCAN degenerates to an O(n^2) algorithm in very high dimensions, where 2^D >> n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN; the resulting algorithm, NQ-DBSCAN, effectively avoids a large number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log n) on average with the help of an indexing technique, and in O(n) in the best case if proper parameters are used, which makes it suitable for many realtime data

Crossref

Concordia University Research Repository
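
The O(n^2) cost of plain DBSCAN mentioned in the abstract above comes from computing every pairwise distance when building the eps-neighborhoods. The following is a minimal sketch of that baseline only, not of NQ-DBSCAN; the function name `dbscan` and the parameter names `eps` and `min_pts` are illustrative choices, and a point is assumed to count as its own neighbor.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one label per point: -1 for noise, 0, 1, ... for clusters."""
    n = len(points)
    # Brute-force neighborhoods: n * n distance computations (the quadratic cost).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Replacing the brute-force neighborhood step with a spatial index or a pruning rule is where the faster variants described in these results differ from the baseline.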

GriT-DBSCAN: A Spatial Clustering Algorithm for Very Large Databases

Author: Huang Xiaogang
Liu Conan
Liu Shuangzhe
Ma Tiefeng
Publication date: 06/11/2022

DBSCAN is a fundamental spatial clustering algorithm with numerous practical applications. However, a bottleneck of the algorithm is that, in the worst case, its run-time complexity is O(n^2). To address this limitation, we propose a new grid-based algorithm for exact DBSCAN in Euclidean space called GriT-DBSCAN, which is based on the following two techniques. First, we introduce a grid tree to organize the non-empty grids for the purpose of efficient non-empty neighboring grids queries. Second, by utilising the spatial relationships among points, we propose a technique that iteratively prunes unnecessary distance calculations when determining whether the minimum distance between two sets is less than or equal to a certain threshold. We theoretically prove that the complexity of GriT-DBSCAN is linear in the data set size. In addition, we obtain two variants of GriT-DBSCAN by incorporating heuristics, or by combining the second technique with an existing algorithm. Experiments are conducted on both synthetic and real-world data sets to evaluate the efficiency of GriT-DBSCAN and its variants. The results of our analyses show that our algorithms outperform existing algorithms

arXiv.org e-Print Archive

University of Canberra Research Repository
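
The second technique in the abstract above concerns deciding whether the minimum distance between two point sets is at most a threshold without computing every pairwise distance. The sketch below shows a much simpler stand-in for that idea, not the pruning rule of GriT-DBSCAN itself: a bounding-box lower bound that can reject a pair of sets outright, followed by a pairwise scan that stops at the first qualifying pair. The names `box_gap` and `sets_within` are hypothetical.

```python
from math import dist

def box_gap(a, b):
    """Lower bound on the distance between any point of a and any point of b,
    computed from the axis-aligned bounding boxes of the two sets."""
    gap_sq = 0.0
    for k in range(len(a[0])):
        lo_a, hi_a = min(p[k] for p in a), max(p[k] for p in a)
        lo_b, hi_b = min(p[k] for p in b), max(p[k] for p in b)
        # Separation of the two intervals along dimension k (0 if they overlap).
        gap = max(lo_a - hi_b, lo_b - hi_a, 0.0)
        gap_sq += gap * gap
    return gap_sq ** 0.5

def sets_within(a, b, eps):
    """True if some p in a and q in b satisfy dist(p, q) <= eps."""
    if box_gap(a, b) > eps:
        return False  # pruned: no pairwise distance is computed at all
    # any() short-circuits, so the scan stops at the first pair within eps.
    return any(dist(p, q) <= eps for p in a for q in b)

near = sets_within([(0, 0), (1, 1)], [(2, 1), (5, 5)], eps=1.0)
far = sets_within([(0, 0), (1, 1)], [(2, 1), (5, 5)], eps=0.5)
```

In the worst case this still examines all pairs, for example when the boxes overlap but no two points are close, which is the kind of situation a more careful pruning scheme is meant to handle.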

Theoretically-Efficient and Practical Parallel DBSCAN

The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(n log n) work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups by up to 33x over the best sequential algorithms

arXiv.org e-Print Archive

Crossref

DSpace@MIT

ASM-Clust: classifying functionally diverse protein families using alignment score matrices

Author: Orphan Victoria J.
Speth Daan R.
Publication date: 03/10/2019

Rapid advances in sequencing technology have resulted in the availability of genomes from organisms across the tree of life. Accurately interpreting the function of proteins in these genomes is a major challenge, as annotation transfer based on homology frequently results in misannotation and error propagation. This challenge is especially pressing for organisms whose genomes are directly obtained from environmental samples, as interpretation of their physiology and ecology is often based solely on the genome sequence. For complex protein (super)families containing a large number of sequences, classification can be used to determine whether annotation transfer is appropriate, or whether experimental evidence for function is lacking. Here we present a novel computational approach for de novo classification of large protein (super)families, based on clustering an alignment score matrix obtained by aligning all sequences in the family to a small subset of the data. We evaluate our approach on the enolase family in the Structure Function Linkage Database

Caltech Authors

A New-Fangled FES-k-Means Clustering Algorithm for Disease Discovery and Visual Analytics

Author: Oyana Tonny J
Publication venue: BioMed Central
Publication date: 01/01/2010

The central purpose of this study is to further evaluate the quality of the performance of a new algorithm. The study provides additional evidence on this algorithm that was designed to increase the overall efficiency of the original k-means clustering technique: the Fast, Efficient, and Scalable k-means algorithm (FES-k-means). The FES-k-means algorithm uses a hybrid approach that comprises the k-d tree data structure that enhances the nearest neighbor query, the original k-means algorithm, and an adaptation rate proposed by Mashor. This algorithm was tested using two real datasets and one synthetic dataset. It was employed twice on all three datasets: once on data trained by the innovative MIL-SOM method and then on the actual untrained data in order to evaluate its competence. This two-step approach of data training prior to clustering provides a solid foundation for knowledge discovery and data mining, otherwise unclaimed by clustering methods alone. The benefits of this method are that it produces clusters similar to the original k-means method at a much faster rate as shown by runtime comparison data; and it provides efficient analysis of large geospatial data with implications for disease mechanism discovery. From a disease mechanism discovery perspective, it is hypothesized that the linear-like pattern of elevated blood lead levels discovered in the city of Chicago may be spatially linked to the city's water service lines.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Comparison of DBSCAN and PCA-DBSCAN Algorithm for Grouping Earthquake Area

Author: EMI RAHMI
Idria Maita
Medyantiwi Rahmawita Munzir
Mustakim
Okfalisa
Said Thaufik Rizaldi
Publication date: 01/01/2023

Geologically, the territory of Indonesia is where three active tectonic plates meet; they are constantly moving and colliding with each other, resulting in earthquakes, volcanic pathways, and faults. An earthquake is a natural disaster that cannot be avoided or prevented, but its consequences can be minimized. Based on data obtained from the Meteorology, Climatology and Geophysics Agency (MCGA), earthquakes often occur in Indonesia. Earthquake data can be grouped to map the areas where earthquakes occur, and an analysis is carried out to determine the characteristics of the earthquake clustering areas. The clustering in this study is conducted in two experiments: the first is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) without dimensional reduction, and the second is DBSCAN clustering with dimensional reduction using Principal Component Analysis (PCA). The best cluster results can be found by calculating the Silhouette Index (SI) value of each cluster. Of the two experiments, the highest SI value, 0.4137, was obtained in the experiment using PCA. The second experiment was therefore taken as the best cluster result, with the highest Depth and Magnitude features in clusters 19 and 17, which showed that the 5 main regions where earthquakes often occur are Sumatra, Banda Sea, Moluccan Sea, Irian Jaya and Sulawesi. Keywords: Climatology and Geophysics Agency, DBSCAN, DBSCAN-PCA, Earthquake Area, PC

Analisis Harga Pokok Produksi Rumah Pada

Fuzzy-Rough Intrigued Harmonic Discrepancy Clustering

Author: Chao Fei
Deng Ansheng
Qu Yanpeng
Shang Changjing
Shen Qiang
Yang Longzhi
Yue Guanli
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 22/02/2023

Aberystwyth Research Portal