1,498 research outputs found

    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    Get PDF
    A streaming data clustering algorithm is presented building upon the density-based selforganizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets

    Density-based algorithms for active and anytime clustering

    Get PDF
    Data intensive applications like biology, medicine, and neuroscience require effective and efficient data mining technologies. Advanced data acquisition methods produce a constantly increasing volume and complexity. As a consequence, the need of new data mining technologies to deal with complex data has emerged during the last decades. In this thesis, we focus on the data mining task of clustering in which objects are separated in different groups (clusters) such that objects inside a cluster are more similar than objects in different clusters. Particularly, we consider density-based clustering algorithms and their applications in biomedicine. The core idea of the density-based clustering algorithm DBSCAN is that each object within a cluster must have a certain number of other objects inside its neighborhood. Compared with other clustering algorithms, DBSCAN has many attractive benefits, e.g., it can detect clusters with arbitrary shape and is robust to outliers, etc. Thus, DBSCAN has attracted a lot of research interest during the last decades with many extensions and applications. In the first part of this thesis, we aim at developing new algorithms based on the DBSCAN paradigm to deal with the new challenges of complex data, particularly expensive distance measures and incomplete availability of the distance matrix. Like many other clustering algorithms, DBSCAN suffers from poor performance when facing expensive distance measures for complex data. To tackle this problem, we propose a new algorithm based on the DBSCAN paradigm, called Anytime Density-based Clustering (A-DBSCAN), that works in an anytime scheme: in contrast to the original batch scheme of DBSCAN, the algorithm A-DBSCAN first produces a quick approximation of the clustering result and then continuously refines the result during the further run. Experts can interrupt the algorithm, examine the results, and choose between (1) stopping the algorithm at any time whenever they are satisfied with the result to save runtime and (2) continuing the algorithm to achieve better results. Such kind of anytime scheme has been proven in the literature as a very useful technique when dealing with time consuming problems. We also introduced an extended version of A-DBSCAN called A-DBSCAN-XS which is more efficient and effective than A-DBSCAN when dealing with expensive distance measures. Since DBSCAN relies on the cardinality of the neighborhood of objects, it requires the full distance matrix to perform. For complex data, these distances are usually expensive, time consuming or even impossible to acquire due to high cost, high time complexity, noisy and missing data, etc. Motivated by these potential difficulties of acquiring the distances among objects, we propose another approach for DBSCAN, called Active Density-based Clustering (Act-DBSCAN). Given a budget limitation B, Act-DBSCAN is only allowed to use up to B pairwise distances ideally to produce the same result as if it has the entire distance matrix at hand. The general idea of Act-DBSCAN is that it actively selects the most promising pairs of objects to calculate the distances between them and tries to approximate as much as possible the desired clustering result with each distance calculation. This scheme provides an efficient way to reduce the total cost needed to perform the clustering. Thus it limits the potential weakness of DBSCAN when dealing with the distance sparseness problem of complex data. As a fundamental data clustering algorithm, density-based clustering has many applications in diverse fields. In the second part of this thesis, we focus on an application of density-based clustering in neuroscience: the segmentation of the white matter fiber tracts in human brain acquired from Diffusion Tensor Imaging (DTI). We propose a model to evaluate the similarity between two fibers as a combination of structural similarity and connectivity-related similarity of fiber tracts. Various distance measure techniques from fields like time-sequence mining are adapted to calculate the structural similarity of fibers. Density-based clustering is used as the segmentation algorithm. We show how A-DBSCAN and A-DBSCAN-XS are used as novel solutions for the segmentation of massive fiber datasets and provide unique features to assist experts during the fiber segmentation process.Datenintensive Anwendungen wie Biologie, Medizin und Neurowissenschaften erfordern effektive und effiziente Data-Mining-Technologien. Erweiterte Methoden der Datenerfassung erzeugen stetig wachsende Datenmengen und Komplexit\"at. In den letzten Jahrzehnten hat sich daher ein Bedarf an neuen Data-Mining-Technologien f\"ur komplexe Daten ergeben. In dieser Arbeit konzentrieren wir uns auf die Data-Mining-Aufgabe des Clusterings, in der Objekte in verschiedenen Gruppen (Cluster) getrennt werden, so dass Objekte in einem Cluster untereinander viel \"ahnlicher sind als Objekte in verschiedenen Clustern. Insbesondere betrachten wir dichtebasierte Clustering-Algorithmen und ihre Anwendungen in der Biomedizin. Der Kerngedanke des dichtebasierten Clustering-Algorithmus DBSCAN ist, dass jedes Objekt in einem Cluster eine bestimmte Anzahl von anderen Objekten in seiner Nachbarschaft haben muss. Im Vergleich mit anderen Clustering-Algorithmen hat DBSCAN viele attraktive Vorteile, zum Beispiel kann es Cluster mit beliebiger Form erkennen und ist robust gegen\"uber Ausrei{\ss}ern. So hat DBSCAN in den letzten Jahrzehnten gro{\ss}es Forschungsinteresse mit vielen Erweiterungen und Anwendungen auf sich gezogen. Im ersten Teil dieser Arbeit wollen wir auf die Entwicklung neuer Algorithmen eingehen, die auf dem DBSCAN Paradigma basieren, um mit den neuen Herausforderungen der komplexen Daten, insbesondere teurer Abstandsma{\ss}e und unvollst\"andiger Verf\"ugbarkeit der Distanzmatrix umzugehen. Wie viele andere Clustering-Algorithmen leidet DBSCAN an schlechter Per- formanz, wenn es teuren Abstandsma{\ss}en f\"ur komplexe Daten gegen\"uber steht. Um dieses Problem zu l\"osen, schlagen wir einen neuen Algorithmus vor, der auf dem DBSCAN Paradigma basiert, genannt Anytime Density-based Clustering (A-DBSCAN), der mit einem Anytime Schema funktioniert. Im Gegensatz zu dem urspr\"unglichen Schema DBSCAN, erzeugt der Algorithmus A-DBSCAN zuerst eine schnelle Ann\"aherung des Clusterings-Ergebnisses und verfeinert dann kontinuierlich das Ergebnis im weiteren Verlauf. Experten k\"onnen den Algorithmus unterbrechen, die Ergebnisse pr\"ufen und w\"ahlen zwischen (1) Anhalten des Algorithmus zu jeder Zeit, wann immer sie mit dem Ergebnis zufrieden sind, um Laufzeit sparen und (2) Fortsetzen des Algorithmus, um bessere Ergebnisse zu erzielen. Eine solche Art eines "Anytime Schemas" ist in der Literatur als eine sehr n\"utzliche Technik erprobt, wenn zeitaufwendige Problemen anfallen. Wir stellen auch eine erweiterte Version von A-DBSCAN als A-DBSCAN-XS vor, die effizienter und effektiver als A-DBSCAN beim Umgang mit teuren Abstandsma{\ss}en ist. Da DBSCAN auf der Kardinalit\"at der Nachbarschaftsobjekte beruht, ist es notwendig, die volle Distanzmatrix auszurechen. F\"ur komplexe Daten sind diese Distanzen in der Regel teuer, zeitaufwendig oder sogar unm\"oglich zu errechnen, aufgrund der hohen Kosten, einer hohen Zeitkomplexit\"at oder verrauschten und fehlende Daten. Motiviert durch diese m\"oglichen Schwierigkeiten der Berechnung von Entfernungen zwischen Objekten, schlagen wir einen anderen Ansatz f\"ur DBSCAN vor, namentlich Active Density-based Clustering (Act-DBSCAN). Bei einer Budgetbegrenzung B, darf Act-DBSCAN nur bis zu B ideale paarweise Distanzen verwenden, um das gleiche Ergebnis zu produzieren, wie wenn es die gesamte Distanzmatrix zur Hand h\"atte. Die allgemeine Idee von Act-DBSCAN ist, dass es aktiv die erfolgversprechendsten Paare von Objekten w\"ahlt, um die Abst\"ande zwischen ihnen zu berechnen, und versucht, sich so viel wie m\"oglich dem gew\"unschten Clustering mit jeder Abstandsberechnung zu n\"ahern. Dieses Schema bietet eine effiziente M\"oglichkeit, die Gesamtkosten der Durchf\"uhrung des Clusterings zu reduzieren. So schr\"ankt sie die potenzielle Schw\"ache des DBSCAN beim Umgang mit dem Distance Sparseness Problem von komplexen Daten ein. Als fundamentaler Clustering-Algorithmus, hat dichte-basiertes Clustering viele Anwendungen in den unterschiedlichen Bereichen. Im zweiten Teil dieser Arbeit konzentrieren wir uns auf eine Anwendung des dichte-basierten Clusterings in den Neurowissenschaften: Die Segmentierung der wei{\ss}en Substanz bei Faserbahnen im menschlichen Gehirn, die vom Diffusion Tensor Imaging (DTI) erfasst werden. Wir schlagen ein Modell vor, um die \"Ahnlichkeit zwischen zwei Fasern als einer Kombination von struktureller und konnektivit\"atsbezogener \"Ahnlichkeit von Faserbahnen zu beurteilen. Verschiedene Abstandsma{\ss}e aus Bereichen wie dem Time-Sequence Mining werden angepasst, um die strukturelle \"Ahnlichkeit von Fasern zu berechnen. Dichte-basiertes Clustering wird als Segmentierungsalgorithmus verwendet. Wir zeigen, wie A-DBSCAN und A-DBSCAN-XS als neuartige L\"osungen f\"ur die Segmentierung von sehr gro{\ss}en Faserdatens\"atzen verwendet werden, und bieten innovative Funktionen, um Experten w\"ahrend des Fasersegmentierungsprozesses zu unterst\"utzen

    Coping With New Challengens for Density-Based Clustering

    Get PDF
    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a Data Mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized, and the similarity of objects from different clusters is minimized. Beside many others, the density-based clustering notion underlying the algorithm DBSCAN and its hierarchical extension OPTICS has been proposed recently, being one of the most successful approaches to clustering. In this thesis, our aim is to advance the state-of-the-art clustering, especially density-based clustering by identifying novel challenges for density-based clustering and proposing innovative and solid solutions for these challenges. We describe the development of the industrial prototype BOSS (Browsing OPTICS plots for Similarity Search) which is a first step towards developing a comprehensive, scalable and distributed computing solution designed to make the efficiency and analytical capabilities of OPTICS available to a broader audience. For the development of BOSS, several key enhancements of OPTICS are required which are addressed in this thesis. We develop incremental algorithms of OPTICS to efficiently reconstruct the hierarchical clustering structure in frequently updated databases, in particular, when a set of objects is inserted in or deleted from the database. We empirically show that these incremental algorithms yield significant speed-up factors over the original OPTICS algorithm. Furthermore, we propose a novel algorithm for automatic extraction of clusters from hierarchical clustering representations that outperforms comparative methods, and introduce two novel approaches for selecting meaningful representatives, using the density-based concepts of OPTICS and producing better results than the related medoid approach. Another major challenge for density-based clustering is to cope with high dimensional data. Many today's real-world data sets contain a large number of measurements (or features) for a single data object. Usually, global feature reduction techniques cannot be applied to these data sets. Thus, the task of feature selection must be combined with and incooperated into the clustering process. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high dimensional data. In particular, we propose an algorithm called SUBCLU (density based SUBspace CLUstering) that extends DBSCAN to the problem of subspace clustering. SUBCLU efficiently computes all clusters that would have been found if DBSCAN is applied to all possible subspaces of the feature space. An experimental evaluation on real-world data sets illustrates that SUBCLU is more effective than existing subspace clustering algorithms because it is able to find clusters of arbitrary size and shape, and produces determine results. A semi-hierarchical extension of SUBCLU called RIS (Ranking Interesting Subspaces) is proposed that does not compute the subspace clusters directly, but generates a list of subspaces ranked by their clustering characteristics. A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. A comparative evaluation of RIS and SUBCLU shows that RIS in combination with OPTICS can achieve an information gain over SUBCLU. In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute density-based correlation clusters. 4C benefits from an innovative, well-defined and effective clustering model, outperforming related approaches in terms of clustering quality on real-world data sets.Knowledge Discovery in Databases (KDD) ist der Prozess der (semi-)automatischen Extraktion von Wissen aus Datenbanken, das gültig, bisher unbekannt und potentiell nützlich für eine gegebene Anwendung ist. Der zentrale Schritt des KDD-Prozesses ist das Data Mining. Eine der wichtigsten Aufgaben des Data Mining ist Clustering. Dabei sollen die Objekte einer Datenbank in Gruppen (Cluster) partitioniert werden, so dass Objekte eines Clusters möglichst ähnlich und Objekte verschiedener Cluster möglichst unähnlich zu einander sind. Das dichtebasierte Clustermodell und die darauf aufbauenden Algorithmen DBSCAN und OPTICS sind unter einer Vielzahl anderer Clustering-Ansätze eine der erfolgreichsten Methoden zum Clustering. Im Rahmen dieser Dissertation wollen wir den aktuellen Stand der Technik im Bereich Clustering und speziell im Bereich dichtebasiertes Clustering voranbringen. Dazu erarbeiten wir neue Herausforderungen für das dichtebasierte Clustermodell und schlagen dazu innovative Lösungen vor. Zunächst steht die Entwicklung des industriellen Prototyps BOSS (Browsing OPTICS plots for Similarity Search) im Mittelpunkt dieser Arbeit. BOSS ist ein erster Beitrag zu einer umfassenden, skalierbaren und verteilten Softwarelösung, die eine Nutzung der Effizienzvorteile und die analytischen Möglichkeiten des dichtebasierten, hierarchischen Clustering-Algorithmus OPTICS für ein breites Publikum ermöglichen. Zur Entwicklung von BOSS werden drei entscheidende Erweiterungen von OPTICS benötigt: Wir entwickeln eine inkrementelle Version von OPTICS um nach einem Update der Datenbank (Einfügen/Löschen einer Menge von Objekten) die hierarchische Clustering Struktur effizient zu reorganisieren. Anhand von Experimenten mit synthetischen und realen Daten zeigen wir, dass die vorgeschlagenen, inkrementellen Algorithmen deutliche Beschleunigungsfaktoren gegenüber dem originalen OPTICS-Algorithmus erzielen. Desweiteren schlagen wir einen neuen Algorithmus zur automatischen Clusterextraktion aus hierarchischen Repräsentationen und zwei innovative Methoden zur automatischen Auswahl geeigneter Clusterrepräsentaten vor. Unsere neuen Techniken erzielen bei Tests auf mehreren realen Datenbanken im Vergleich zu den konkurrierenden Verfahren bessere Ergebnisse. Eine weitere Herausforderung für Clustering-Verfahren stellen hochdimensionale Featureräume dar. Reale Datensätze beinhalten dank moderner Verfahren zur Datenerhebung häufig sehr viele Merkmale. Teile dieser Merkmale unterliegen oft Rauschen oder Abhängigkeiten und können meist nicht im Vorfeld ausgesiebt werden, da diese Effekte jeweils in Teilen der Datenbank unterschiedlich ausgeprägt sind. Daher muss die Wahl der Features mit dem Data-Mining-Verfahren verknüpft werden. Im Rahmen dieser Arbeit stellen wir innovative Erweiterungen des dichtebasierten Clustermodells für hochdimensionale Daten vor. Wir entwickeln SUBCLU (dichtebasiertes SUBspace CLUstering), ein auf DBSCAN basierender Subspace Clustering Algorithmus. SUBCLU erzeugt effizient alle Cluster, die gefunden werden, wenn man DBSCAN auf alle möglichen Teilräume des Datensatzes anwendet. Experimente auf realen Daten zeigen, dass SUBCLU effektiver als vergleichbare Algorithmen ist. RIS (Ranking Interesting Subspaces), eine semi-hierarchische Erweiterung von SUBCLU, wird vorgeschlagen, das nicht mehr direkt die Teilraumcluster berechnet, sondern eine Liste von Teilräumen geordnet anhand ihrer Clustering-Qualität erzeugt. Dadurch können hierarchische Partitionierungen auf ausgewählten Teilräumen erzeugt werden. Experimente belegen, dass RIS in Kombination mit OPTICS ein Informationsgewinn gegenüber SUBCLU erreicht. Außerdem stellen wir den neuartigen Korrelationscluster Algorithmus 4C (Computing Correlation Connected Clusters) vor. 4C basiert auf einem innovativen und wohldefinierten Clustermodell und erzielt in unseren Experimenten mit realen Daten bessere Ergebnisse als vergleichbare Clustering-Ansätze

    슬라이딩 윈도우상의 빠른 점진적 밀도 기반 클러스터링

    Get PDF
    학위논문(박사) -- 서울대학교대학원 : 공과대학 컴퓨터공학부, 2022. 8. 문봉기.Given the prevalence of mobile and IoT devices, continuous clustering against streaming data has become an essential tool of increasing importance for data analytics. Among many clustering approaches, density-based clustering has garnered much attention due to its unique advantage that it can detect clusters of an arbitrary shape when noise exists. However, when the clusters need to be updated continuously along with an evolving input dataset, a relatively high computational cost is required. Particularly, deleting data points from the clusters causes severe performance degradation. In this dissertation, the performance limits of the incremental density-based clustering over sliding windows are addressed. Ultimately, two algorithms, DISC and DenForest, are proposed. The first algorithm DISC is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on redundancy issues that occur when updating clusters. When multiple data points are inserted or deleted individually, surrounding data points are explored and retrieved redundantly. DISC addresses these issues and improves the performance by updating multiple points in a batch. It also presents several optimization techniques. The second algorithm DenForest is an incremental density-based clustering algorithm that primarily focuses on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which contributes to very efficient deletion performance. Moreover, it provides a batch-optimized technique to improve the insertion performance. To prove the effectiveness of the two algorithms, extensive evaluations were conducted, and it is demonstrated that DISC and DenForest outperform the state-of-the-art density-based clustering algorithms significantly.모바일 및 IoT 장치가 널리 보급됨에 따라 스트리밍 데이터상에서 지속적으로 클러스터링 작업을 수행하는 것은 데이터 분석에서 점점 더 중요해지는 필수 도구가 되었습니다. 많은 클러스터링 방법 중에서 밀도 기반 클러스터링은 노이즈가 존재할 때 임의의 모양의 클러스터를 감지할 수 있다는 고유한 장점을 가지고 있으며 이에 따라 많은 관심을 받았습니다. 그러나 밀도 기반 클러스터링은 변화하는 입력 데이터 셋에 따라 지속적으로 클러스터를 업데이트해야 하는 경우 비교적 높은 계산 비용이 필요합니다. 특히, 클러스터에서의 데이터 점들의 삭제는 심각한 성능 저하를 초래합니다. 본 박사 학위 논문에서는 슬라이딩 윈도우상의 밀도 기반 클러스터링의 성능 한계를 다루며 궁극적으로 두 가지 알고리즘을 제안합니다. 첫 번째 알고리즘인 DISC는 슬라이딩 윈도우상에서 DBSCAN과 동일한 클러스터링 결과를 찾는 점진적 밀도 기반 클러스터링 알고리즘입니다. 해당 알고리즘은 클러스터 업데이트 시에 발생하는 중복 문제들에 초점을 둡니다. 밀도 기반 클러스터링에서는 여러 데이터 점들을 개별적으로 삽입 혹은 삭제할 때 주변 점들을 불필요하게 중복적으로 탐색하고 회수합니다. DISC 는 배치 업데이트로 이 문제를 해결하여 성능을 향상시키며 여러 최적화 방법들을 제안합니다. 두 번째 알고리즘인 DenForest 는 삭제 과정에 초점을 둔 점진적 밀도 기반 클러스터링 알고리즘입니다. 클러스터를 그래프로 관리하는 이전 방법들과 달리 DenForest 는 클러스터를 신장 트리의 그룹으로 관리함으로써 효율적인 삭제 성능에 기여합니다. 나아가 배치 최적화 기법을 통해 삽입 성능 향상에도 기여합니다. 두 알고리즘의 효율성을 입증하기 위해 광범위한 평가를 수행하였으며 DISC 및 DenForest 는 최신의 밀도 기반 클러스터링 알고리즘들보다 뛰어난 성능을 보여주었습니다.1 Introduction 1 1.1 Overview of Dissertation 3 2 Related Works 7 2.1 Clustering 7 2.2 Density-Based Clustering for Static Datasets 8 2.2.1 Extension of DBSCAN 8 2.2.2 Approximation of Density-Based Clustering 9 2.2.3 Parallelization of Density-Based Clustering 10 2.3 Incremental Density-Based Clustering 10 2.3.1 Approximated Density-Based Clustering for Dynamic Datasets 11 2.4 Density-Based Clustering for Data Streams 11 2.4.1 Micro-clusters 12 2.4.2 Density-Based Clustering in Damped Window Model 12 2.4.3 Density-Based Clustering in Sliding Window Model 13 2.5 Non-Density-Based Clustering 14 2.5.1 Partitional Clustering and Hierarchical Clustering 14 2.5.2 Distribution-Based Clustering 15 2.5.3 High-Dimensional Data Clustering 15 2.5.4 Spectral Clustering 16 3 Background 17 3.1 DBSCAN 17 3.1.1 Reformulation of Density-Based Clustering 19 3.2 Incremental DBSCAN 20 3.3 Sliding Windows 22 3.3.1 Density-Based Clustering over Sliding Windows 23 3.3.2 Slow Deletion Problem 24 4 Avoiding Redundant Searches in Updating Clusters 26 4.1 The DISC Algorithm 27 4.1.1 Overview of DISC 27 4.1.2 COLLECT 29 4.1.3 CLUSTER 30 4.1.3.1 Splitting a Cluster 32 4.1.3.2 Merging Clusters 37 4.1.4 Horizontal Manner vs. Vertical Manner 38 4.2 Checking Reachability 39 4.2.1 Multi-Starter BFS 40 4.2.2 Epoch-Based Probing of R-tree Index 41 4.3 Updating Labels 43 5 Avoiding Graph Traversals in Updating Clusters 45 5.1 The DenForest Algorithm 46 5.1.1 Overview of DenForest 47 5.1.1.1 Supported Types of the Sliding Window Model 48 5.1.2 Nostalgic Core and Density-based Clusters 49 5.1.2.1 Cluster Membership of Border 51 5.1.3 DenTree 51 5.2 Operations of DenForest 54 5.2.1 Insertion 54 5.2.1.1 MST based on Link-Cut Tree 57 5.2.1.2 Time Complexity of Insert Operation 58 5.2.2 Deletion 59 5.2.2.1 Time Complexity of Delete Operation 61 5.2.3 Insertion/Deletion Examples 64 5.2.4 Cluster Membership 65 5.2.5 Batch-Optimized Update 65 5.3 Clustering Quality of DenForest 68 5.3.1 Clustering Quality for Static Data 68 5.3.2 Discussion 70 5.3.3 Replaceability 70 5.3.3.1 Nostalgic Cores and Density 71 5.3.3.2 Nostalgic Cores and Quality 72 5.3.4 1D Example 74 6 Evaluation 76 6.1 Real-World Datasets 76 6.2 Competing Methods 77 6.2.1 Exact Methods 77 6.2.2 Non-Exact Methods 77 6.3 Experimental Settings 78 6.4 Evaluation of DISC 78 6.4.1 Parameters 79 6.4.2 Baseline Evaluation 79 6.4.3 Drilled-Down Evaluation 82 6.4.3.1 Effects of Threshold Values 82 6.4.3.2 Insertions vs. Deletions 83 6.4.3.3 Range Searches 84 6.4.3.4 MS-BFS and Epoch-Based Probing 85 6.4.4 Comparison with Summarization/Approximation-Based Methods 86 6.5 Evaluation of DenForest 90 6.5.1 Parameters 90 6.5.2 Baseline Evaluation 91 6.5.3 Drilled-Down Evaluation 94 6.5.3.1 Varying Size of Window/Stride 94 6.5.3.2 Effect of Density and Distance Thresholds 95 6.5.3.3 Memory Usage 98 6.5.3.4 Clustering Quality over Sliding Windows 98 6.5.3.5 Clustering Quality under Various Density and Distance Thresholds 101 6.5.3.6 Relaxed Parameter Settings 102 6.5.4 Comparison with Summarization-Based Methods 102 7 Future Work: Extension to Varying/Relative Densities 105 8 Conclusion 107 Abstract (In Korean) 120박

    Concept Matching: Clustering-based Federated Continual Learning

    Full text link
    Federated Continual Learning (FCL) has emerged as a promising paradigm that combines Federated Learning (FL) and Continual Learning (CL). To achieve good model accuracy, FCL needs to tackle catastrophic forgetting due to concept drift over time in CL, and to overcome the potential interference among clients in FL. We propose Concept Matching (CM), a clustering-based framework for FCL to address these challenges. The CM framework groups the client models into concept model clusters, and then builds different global models to capture different concepts in FL over time. In each round, the server sends the global concept models to the clients. To avoid catastrophic forgetting, each client selects the concept model best-matching the concept of the current data for further fine-tuning. To avoid interference among client models with different concepts, the server clusters the models representing the same concept, aggregates the model weights in each cluster, and updates the global concept model with the cluster model of the same concept. Since the server does not know the concepts captured by the aggregated cluster models, we propose a novel server concept matching algorithm that effectively updates a global concept model with a matching cluster model. The CM framework provides flexibility to use different clustering, aggregation, and concept matching algorithms. The evaluation demonstrates that CM outperforms state-of-the-art systems and scales well with the number of clients and the model size

    Forest Fire Clustering: Cluster-oriented Label Propagation Clustering and Monte Carlo Verification Inspired by Forest Fire Dynamics

    Full text link
    Clustering methods group data points together and assign them group-level labels. However, it has been difficult to evaluate the confidence of the clustering results. Here, we introduce a novel method that could not only find robust clusters but also provide a confidence score for the labels of each data point. Specifically, we reformulated label-propagation clustering to model after forest fire dynamics. The method has only one parameter - a fire temperature term describing how easily one label propagates from one node to the next. Through iteratively starting label propagations through a graph, we can discover the number of clusters in a dataset with minimum prior assumptions. Further, we can validate our predictions and uncover the posterior probability distribution of the labels using Monte Carlo simulations. Lastly, our iterative method is inductive and does not need to be retrained with the arrival of new data. Here, we describe the method and provide a summary of how the method performs against common clustering benchmarks.Comment: 9 pages, 8 figure

    Strategies and algorithms for clustering large datasets: a review

    Get PDF
    The exploratory nature of data analysis and data mining makes clustering one of the most usual tasks in these kind of projects. More frequently these projects come from many different application areas like biology, text analysis, signal analysis, etc that involve larger and larger datasets in the number of examples and the number of attributes. Classical methods for clustering data like K-means or hierarchical clustering are beginning to reach its maximum capability to cope with this increase of dataset size. The limitation for these algorithms come either from the need of storing all the data in memory or because of their computational time complexity. These problems have opened an area for the search of algorithms able to reduce this data overload. Some solutions come from the side of data preprocessing by transforming the data to a lower dimensionality manifold that represents the structure of the data or by summarizing the dataset by obtaining a smaller subset of examples that represent an equivalent information. A different perspective is to modify the classical clustering algorithms or to derive other ones able to cluster larger datasets. This perspective relies on many different strategies. Techniques such as sampling, on-line processing, summarization, data distribution and efficient datastructures have being applied to the problem of scaling clustering algorithms. This paper presents a review of different strategies and clustering algorithms that apply these techniques. The aim is to cover the different range of methodologies applied for clustering data and how they can be scaled.Preprin

    Feature space curvature map: A method to homogenize cluster densities

    Get PDF
    The majority of density-based clustering algorithms can not perform properly when data expose very different density through the feature space. These algorithms implicitly presume that all clusters almost have the same density, therefore, they normally use global parameters. Consequently, they are often biased towards finding dense clusters in front of sparse ones. In this paper, we propose a parametric multilinear transformation method to homogenize cluster densities while preserving the topological structure of the dataset. The transformed clusters have approximately the same density while all inter-cluster regions become globally low-density. In our method, the feature space is locally bent by dense data point concentrations the same way as stars bend the space-time dimensions in Theory of Relativity. We present a new Gravitational Self-organization Map to model the feature space curvature by plugging the concepts of gravity and fabric of space into the Self-organization Map algorithm to mathematically describe the density structure of the data. To homogenize the cluster density, we introduce a novel mapping mechanism to project the data from a non-Euclidean curved space to a new Euclidean flat space. Specifically, this mechanism transfers the basis vectors instead of the feature vectors to guarantee the continuity of the mapping function and optimize the computation cost of the algorithm. As a result, our method can efficiently and explicitly homogenize the density of any dataset globally to then apply existing clustering algorithms without modification. Our experimental results over both real-world and synthetic datasets show that our approach outperforms the current statistical-based methods.Peer ReviewedPostprint (author's final draft
    corecore