2,631 research outputs found

    Anytime Hierarchical Clustering

    Get PDF
    We propose a new anytime hierarchical clustering method that iteratively transforms an arbitrary initial hierarchy on the configuration of measurements along a sequence of trees which, we prove, must terminate for a fixed data set in a chain of nested partitions satisfying a natural homogeneity requirement. Each recursive step re-edits the tree so as to improve a local measure of cluster homogeneity that is compatible with a number of commonly used (e.g., single, average, complete) linkage functions. As an alternative to the standard batch algorithms, we present numerical evidence suggesting that appropriate adaptations of this method can yield decentralized, scalable algorithms suitable for distributed/parallel computation of clustering hierarchies and online tracking of clustering trees, applicable to large, dynamically changing databases and anomaly detection. Comment: 13 pages, 6 figures, 5 tables, in preparation for submission to a conference
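
    As a rough, hypothetical illustration of the kind of local, linkage-compatible homogeneity improvement the abstract describes (not the authors' actual algorithm or objective), the following Python sketch moves a single point between two sibling clusters whenever doing so lowers a simple within-cluster linkage score; the names linkage_distance, homogeneity, and try_local_move are invented for this sketch.

    import numpy as np

    def linkage_distance(A, B, kind="average"):
        """Distance between point sets A and B under a standard linkage rule."""
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        if kind == "single":
            return d.min()
        if kind == "complete":
            return d.max()
        return d.mean()  # average linkage

    def homogeneity(cluster, kind="average"):
        """Crude within-cluster cohesion: linkage distance of a cluster to itself."""
        return 0.0 if len(cluster) < 2 else linkage_distance(cluster, cluster, kind)

    def try_local_move(left, right, kind="average"):
        """Greedy local re-edit: move one point from `left` to `right` if it
        lowers the summed within-cluster linkage score of the two siblings."""
        best = (homogeneity(left, kind) + homogeneity(right, kind), left, right)
        for i in range(len(left)):
            new_left = np.delete(left, i, axis=0)
            new_right = np.vstack([right, left[i:i + 1]])
            score = homogeneity(new_left, kind) + homogeneity(new_right, kind)
            if score < best[0]:
                best = (score, new_left, new_right)
        return best

    # Toy data: `left` contains one stray point that really belongs with `right`.
    rng = np.random.default_rng(0)
    left = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (1, 2))])
    right = rng.normal(3.0, 0.1, (5, 2))
    score, new_left, new_right = try_local_move(left, right)
    print(len(new_left), len(new_right))  # the stray point migrates: 5 6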

    μŠ¬λΌμ΄λ”© μœˆλ„μš°μƒμ˜ λΉ λ₯Έ 점진적 밀도 기반 ν΄λŸ¬μŠ€ν„°λ§

    Get PDF
    Ph.D. dissertation -- Seoul National University, College of Engineering, Department of Computer Science and Engineering, August 2022. Advisor: Bongki Moon.
    Given the prevalence of mobile and IoT devices, continuous clustering of streaming data has become an essential tool of increasing importance for data analytics. Among the many clustering approaches, density-based clustering has garnered much attention due to its unique advantage of detecting clusters of arbitrary shape in the presence of noise. However, when the clusters must be updated continuously as the input dataset evolves, a relatively high computational cost is incurred. In particular, deleting data points from the clusters causes severe performance degradation. This dissertation addresses the performance limits of incremental density-based clustering over sliding windows and proposes two algorithms, DISC and DenForest. The first algorithm, DISC, is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on the redundancy issues that arise when updating clusters: when multiple data points are inserted or deleted individually, surrounding data points are explored and retrieved redundantly. DISC addresses these issues and improves performance by updating multiple points in a batch, and it also presents several optimization techniques. The second algorithm, DenForest, is an incremental density-based clustering algorithm that primarily focuses on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which yields very efficient deletion performance. Moreover, it provides a batch-optimized technique to improve insertion performance. To demonstrate the effectiveness of the two algorithms, extensive evaluations were conducted, showing that DISC and DenForest significantly outperform state-of-the-art density-based clustering algorithms.
    Table of contents:
    1 Introduction
      1.1 Overview of Dissertation
    2 Related Works
      2.1 Clustering
      2.2 Density-Based Clustering for Static Datasets
        2.2.1 Extension of DBSCAN
        2.2.2 Approximation of Density-Based Clustering
        2.2.3 Parallelization of Density-Based Clustering
      2.3 Incremental Density-Based Clustering
        2.3.1 Approximated Density-Based Clustering for Dynamic Datasets
      2.4 Density-Based Clustering for Data Streams
        2.4.1 Micro-clusters
        2.4.2 Density-Based Clustering in Damped Window Model
        2.4.3 Density-Based Clustering in Sliding Window Model
      2.5 Non-Density-Based Clustering
        2.5.1 Partitional Clustering and Hierarchical Clustering
        2.5.2 Distribution-Based Clustering
        2.5.3 High-Dimensional Data Clustering
        2.5.4 Spectral Clustering
    3 Background
      3.1 DBSCAN
        3.1.1 Reformulation of Density-Based Clustering
      3.2 Incremental DBSCAN
      3.3 Sliding Windows
        3.3.1 Density-Based Clustering over Sliding Windows
        3.3.2 Slow Deletion Problem
    4 Avoiding Redundant Searches in Updating Clusters
      4.1 The DISC Algorithm
        4.1.1 Overview of DISC
        4.1.2 COLLECT
        4.1.3 CLUSTER
          4.1.3.1 Splitting a Cluster
          4.1.3.2 Merging Clusters
        4.1.4 Horizontal Manner vs. Vertical Manner
      4.2 Checking Reachability
        4.2.1 Multi-Starter BFS
        4.2.2 Epoch-Based Probing of R-tree Index
      4.3 Updating Labels
    5 Avoiding Graph Traversals in Updating Clusters
      5.1 The DenForest Algorithm
        5.1.1 Overview of DenForest
          5.1.1.1 Supported Types of the Sliding Window Model
        5.1.2 Nostalgic Core and Density-based Clusters
          5.1.2.1 Cluster Membership of Border
        5.1.3 DenTree
      5.2 Operations of DenForest
        5.2.1 Insertion
          5.2.1.1 MST based on Link-Cut Tree
          5.2.1.2 Time Complexity of Insert Operation
        5.2.2 Deletion
          5.2.2.1 Time Complexity of Delete Operation
        5.2.3 Insertion/Deletion Examples
        5.2.4 Cluster Membership
        5.2.5 Batch-Optimized Update
      5.3 Clustering Quality of DenForest
        5.3.1 Clustering Quality for Static Data
        5.3.2 Discussion
        5.3.3 Replaceability
          5.3.3.1 Nostalgic Cores and Density
          5.3.3.2 Nostalgic Cores and Quality
        5.3.4 1D Example
    6 Evaluation
      6.1 Real-World Datasets
      6.2 Competing Methods
        6.2.1 Exact Methods
        6.2.2 Non-Exact Methods
      6.3 Experimental Settings
      6.4 Evaluation of DISC
        6.4.1 Parameters
        6.4.2 Baseline Evaluation
        6.4.3 Drilled-Down Evaluation
          6.4.3.1 Effects of Threshold Values
          6.4.3.2 Insertions vs. Deletions
          6.4.3.3 Range Searches
          6.4.3.4 MS-BFS and Epoch-Based Probing
        6.4.4 Comparison with Summarization/Approximation-Based Methods
      6.5 Evaluation of DenForest
        6.5.1 Parameters
        6.5.2 Baseline Evaluation
        6.5.3 Drilled-Down Evaluation
          6.5.3.1 Varying Size of Window/Stride
          6.5.3.2 Effect of Density and Distance Thresholds
          6.5.3.3 Memory Usage
          6.5.3.4 Clustering Quality over Sliding Windows
          6.5.3.5 Clustering Quality under Various Density and Distance Thresholds
          6.5.3.6 Relaxed Parameter Settings
        6.5.4 Comparison with Summarization-Based Methods
    7 Future Work: Extension to Varying/Relative Densities
    8 Conclusion
    Abstract (In Korean)
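
    The dissertation's contribution is easier to appreciate against the naive baseline it improves upon: recomputing a density-based clustering from scratch every time the window slides. The Python sketch below is only that baseline (it is not DISC or DenForest) and assumes scikit-learn's DBSCAN is available; the window, stride, eps, and min_samples values are illustrative.

    from collections import deque

    import numpy as np
    from sklearn.cluster import DBSCAN

    def sliding_window_dbscan(stream, window=1000, stride=100, eps=0.5, min_samples=5):
        """Recompute DBSCAN from scratch once per stride over a sliding window.
        Incremental methods such as DISC/DenForest avoid this full recomputation,
        in particular the costly handling of deletions when old points expire."""
        buf = deque(maxlen=window)   # expired points fall off the window automatically
        pending = 0
        for point in stream:
            buf.append(point)
            pending += 1
            if len(buf) == window and pending >= stride:
                pending = 0
                yield DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.asarray(buf))

    # Hypothetical 2-D stream drawn from a few well-separated Gaussian blobs.
    rng = np.random.default_rng(42)
    stream = (rng.normal(loc=5.0 * rng.integers(0, 3), scale=0.3, size=2) for _ in range(5000))
    for labels in sliding_window_dbscan(stream, window=500, stride=250, eps=1.0, min_samples=5):
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print("clusters in current window:", n_clusters)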

    Iterative Optimization and Simplification of Hierarchical Clusterings

    Full text link
    Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can partition the search so that a system inexpensively constructs a 'tentative' clustering for initial examination, followed by iterative optimization, which continues to search in the background for improved clusterings. Given this motivation, we evaluate an inexpensive strategy for creating initial clusterings, coupled with several control strategies for iterative optimization, each of which repeatedly modifies an initial clustering in search of a better one. One of these methods appears novel as an iterative optimization strategy in clustering contexts. Once a clustering has been constructed, it is judged by analysts -- often according to task-specific criteria. Several authors have abstracted these criteria and posited a generic performance task akin to pattern completion, where the error rate over completed patterns is used to 'externally' judge clustering utility. Given this performance task, we adapt resampling-based pruning strategies used by supervised learning systems to the task of simplifying hierarchical clusterings, thus promising to ease post-clustering analysis. Finally, we propose a number of objective functions, based on attribute-selection measures for decision-tree induction, that might perform well on the error rate and simplicity dimensions. Comment: See http://www.jair.org/ for any accompanying files
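
    The 'pattern completion' performance task mentioned above can be made concrete with a small sketch: mask one attribute of a test instance, assign the instance to a cluster using its remaining attributes, and predict the masked value as the majority value within that cluster; the error rate of these completions then judges the clustering externally. The Python below is a minimal illustration under those assumptions, not the paper's code, and the names complete_attribute and completion_error are invented.

    from collections import Counter

    def complete_attribute(clusters, instance, masked_idx):
        """clusters: list of clusters, each a list of tuples of nominal attribute values."""
        def match(a, b):  # similarity that ignores the masked attribute
            return sum(x == y for i, (x, y) in enumerate(zip(a, b)) if i != masked_idx)
        # Assign the instance to the cluster whose members it matches best on average.
        best = max(clusters, key=lambda c: sum(match(instance, m) for m in c) / len(c))
        # Predict the masked attribute as that cluster's majority value.
        return Counter(m[masked_idx] for m in best).most_common(1)[0][0]

    def completion_error(clusters, test_set):
        errors = total = 0
        for instance in test_set:
            for idx in range(len(instance)):
                total += 1
                errors += complete_attribute(clusters, instance, idx) != instance[idx]
        return errors / total

    # Toy clustering over nominal patterns: a perfect clustering completes perfectly.
    clusters = [[("red", "round", "fruit")] * 3, [("green", "long", "vegetable")] * 3]
    print(completion_error(clusters, [("red", "round", "fruit"), ("green", "long", "vegetable")]))  # 0.0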

    Hierarchical Structures for High Dimensional Data Analysis

    Get PDF
    The volume of data is not the only problem in modern data analysis; data complexity is often more challenging. In many areas such as computational biology, topological data analysis, and machine learning, the data resides in high-dimensional spaces which may not even be Euclidean. Therefore, processing such massive and complex data and extracting useful information is a big challenge. Our methods apply to any data set given as a set of objects and a metric that measures the distance between them. In this dissertation, we first consider the problem of preprocessing and organizing such complex data into a hierarchical data structure that allows efficient nearest neighbor and range queries. There have been many data structures for general metric spaces, but almost all of them have construction time that can be quadratic in the number of points. There are only two data structures with O(n log n) construction time, but both have very complex algorithms and analyses, and neither can be implemented efficiently. Here, we present a simple, randomized incremental algorithm that builds a metric data structure in O(n log n) time in expectation. Thus, we achieve the best of both worlds: a simple implementation with asymptotically optimal performance. Furthermore, we consider the close relationship between our metric data structure and the point orderings used in applications such as k-center clustering. We give linear-time algorithms to go back and forth between these orderings and our metric data structure. In the last part, we use metric data structures to extract topological features of a data set, such as the number of connected components, holes, and voids. We give an efficient algorithm for constructing a (1 + epsilon)-approximation to the so-called nerve filtration of a metric space, a fundamental tool in topological data analysis.
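
    The point ordering alluded to in connection with k-center clustering is presumably the greedy (farthest-point) permutation; the Python sketch below is the classic quadratic-time construction for an arbitrary metric, not the dissertation's O(n log n) randomized incremental data structure, and the names greedy_permutation and dist are illustrative.

    import math

    def greedy_permutation(points, dist):
        """Reorder `points` so that every prefix is a 2-approximate k-center solution.
        Also returns the insertion radii: radii[k] bounds the distance from any
        point to the first k chosen centers."""
        order, radii = [0], [math.inf]                   # start from an arbitrary point
        nearest = [dist(p, points[0]) for p in points]   # distance to the chosen set
        for _ in range(1, len(points)):
            nxt = max(range(len(points)), key=lambda i: nearest[i])
            order.append(nxt)
            radii.append(nearest[nxt])
            for i, p in enumerate(points):               # account for the new center
                nearest[i] = min(nearest[i], dist(p, points[nxt]))
        return [points[i] for i in order], radii

    pts = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
    centers, radii = greedy_permutation(pts, math.dist)
    print(centers[:3], [round(r, 2) for r in radii[:3]])  # prefix of centers and their radii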