13,101 research outputs found

    MOA: Massive Online Analysis, a framework for stream classification and clustering.

    Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problem of scaling up the implementation of state-of-the-art algorithms to real-world dataset sizes. It contains a collection of offline and online algorithms for both classification and clustering, as well as tools for evaluation. In particular, for classification it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. For clustering, it implements StreamKM++, CluStream, ClusTree, DenStream, D-Stream and CobWeb. Researchers benefit from MOA by gaining insights into the workings and problems of different approaches, while practitioners can easily apply and compare several algorithms on real-world datasets and settings. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.
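    As a rough illustration of the test-then-train (prequential) evaluation loop that frameworks of this kind run for online learning, here is a minimal, self-contained Python sketch; the majority-class learner and synthetic stream are hypothetical stand-ins, not MOA's actual Java API:

```python
import random

class MajorityClassLearner:
    """Trivial online classifier: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def synthetic_stream(n, seed=42):
    """Hypothetical two-class stream: the label depends on the sign of a noisy feature."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = int(x + rng.gauss(0.0, 0.3) > 0)
        yield [x], y

# Prequential (test-then-train) evaluation: each instance is first used to
# test the current model, then immediately used to update it.
learner, correct, n = MajorityClassLearner(), 0, 10_000
for x, y in synthetic_stream(n):
    if learner.predict(x) == y:
        correct += 1
    learner.learn(x, y)
print(f"prequential accuracy: {correct / n:.3f}")
```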

    Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

    Analyzing data streams has received considerable attention over the past decades due to the widespread usage of sensors, social media and other streaming data sources. A core research area in this field is stream clustering, which aims to recognize patterns in an unordered, infinite and evolving stream of observations. Clustering can be a crucial support in decision making, since it aims for an optimized aggregated representation of a continuous data stream over time and makes it possible to identify patterns in large and high-dimensional data. A multitude of algorithms and approaches have been developed that are able to find and maintain clusters over time in the challenging streaming scenario. This survey explores, summarizes and categorizes a total of 51 stream clustering algorithms and identifies core research threads over the past decades. In particular, it identifies categories of algorithms based on distance thresholds, density grids and statistical models, as well as algorithms for high-dimensional data. Furthermore, it discusses application scenarios, available software and how to configure stream clustering algorithms. This survey is considerably more extensive than comparable studies, more up-to-date and highlights how concepts are interrelated and have been developed over time.
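    As a minimal illustration of the distance-threshold category the survey identifies (a generic sketch, not any specific surveyed algorithm), the following Python code absorbs each arriving point into the nearest micro-cluster if it lies within a radius threshold and otherwise opens a new micro-cluster:

```python
import math

class MicroCluster:
    """Running summary of a group of nearby points (centroid + count)."""
    def __init__(self, point):
        self.centroid = list(point)
        self.n = 1

    def distance(self, point):
        return math.dist(self.centroid, point)

    def absorb(self, point):
        self.n += 1
        for i, v in enumerate(point):
            self.centroid[i] += (v - self.centroid[i]) / self.n  # incremental mean

def threshold_clustering(stream, radius):
    """Distance-threshold stream clustering: absorb into the nearest micro-cluster
    within `radius`, otherwise create a new micro-cluster."""
    clusters = []
    for point in stream:
        nearest = min(clusters, key=lambda c: c.distance(point), default=None)
        if nearest is not None and nearest.distance(point) <= radius:
            nearest.absorb(point)
        else:
            clusters.append(MicroCluster(point))
    return clusters

points = [(0.1, 0.2), (0.15, 0.25), (5.0, 5.1), (5.2, 4.9), (0.05, 0.18)]
summary = threshold_clustering(points, radius=1.0)
print([(round(c.centroid[0], 2), round(c.centroid[1], 2), c.n) for c in summary])
```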

    Automatically Selecting Parameters for Graph-Based Clustering

    Data streams present a number of challenges caused by changes in stream concepts over time. In this thesis we present a novel method for detecting concept drift within data streams by analysing geometric features of the clustering algorithm RepStream. Further, we present novel methods for automatically adjusting critical input parameters over time and for generating self-organising nearest-neighbour graphs, improving robustness and decreasing the need for domain-specific knowledge in the face of stream evolution.
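    The abstract does not spell out which geometric features the drift detector analyses, so the sketch below is only a generic illustration of window-based drift detection, not RepStream's method: it compares a simple geometric statistic, the mean nearest-neighbour distance, between the older and newer halves of a window and flags drift when the two halves diverge strongly:

```python
import math
import random

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbour (O(n^2) sketch)."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def drifted(window, ratio=1.5):
    """Flag drift when the newer half's geometry differs strongly from the older half's."""
    half = len(window) // 2
    old, new = window[:half], window[half:]
    a, b = mean_nn_distance(old), mean_nn_distance(new)
    return max(a, b) / max(min(a, b), 1e-12) > ratio

rng = random.Random(0)
dense = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
sparse = [(rng.gauss(0, 1.0), rng.gauss(0, 1.0)) for _ in range(50)]
print(drifted(dense + sparse))  # expected: True (the generating geometry changed)
print(drifted(dense + dense))   # expected: False (both halves come from the same process)
```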

    Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R

    In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly, and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce stream, a research tool that includes modeling and simulating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it seamlessly integrates with the large existing infrastructure provided by R. In addition to data handling, plotting and easy scripting capabilities, R also provides numerous existing algorithms and enables users to interface code written in many programming languages popular among data mining researchers (e.g., C/C++, Java and Python). In this paper we describe the architecture of stream and focus on its use for data stream clustering research. stream was implemented with extensibility in mind and will be extended in the future to cover additional data stream mining tasks like classification and frequent pattern mining.
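    stream itself is an R package, so the following Python sketch is only a conceptual analogue of its architecture rather than its actual API (the class names are hypothetical): a data-stream generator and a clustering task meet through a small interface, so either side can be swapped independently:

```python
import random

class GaussianStream:
    """Data-stream generator: draws points from a mixture of Gaussian components."""
    def __init__(self, centers, sigma=0.1, seed=1):
        self.centers, self.sigma, self.rng = centers, sigma, random.Random(seed)

    def get_points(self, n):
        pts = []
        for _ in range(n):
            cx, cy = self.rng.choice(self.centers)
            pts.append((self.rng.gauss(cx, self.sigma), self.rng.gauss(cy, self.sigma)))
        return pts

class ReservoirClusterer:
    """Clustering task: keeps a fixed-size random sample (reservoir) as its summary."""
    def __init__(self, size=100, seed=2):
        self.size, self.seen, self.sample = size, 0, []
        self.rng = random.Random(seed)

    def update(self, points):
        for p in points:
            self.seen += 1
            if len(self.sample) < self.size:
                self.sample.append(p)
            else:
                j = self.rng.randrange(self.seen)   # classic reservoir sampling step
                if j < self.size:
                    self.sample[j] = p

# The generator and the task only meet through get_points()/update(),
# so either side can be replaced without touching the other.
stream = GaussianStream(centers=[(0, 0), (4, 4)])
task = ReservoirClusterer(size=50)
for _ in range(20):                 # process the stream in blocks of 100 points
    task.update(stream.get_points(100))
print(len(task.sample), "summary points retained")
```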

    Clustering Process for Mixed Dataset Using Shortest Path Non Parameterised Technique

    Clustering of mixed datasets is an active research focus in data mining. Conventional clustering algorithms tend to support only one kind of attribute rather than mixed data types. Hence, traditional clustering techniques handle mixed attributes either by converting the numerical data type to a categorical type or the categorical type to a numerical type, and most of them proceed by converting to numerical attributes. This way of grouping suffers from two limitations: the first is that assigning numerical values to all types of categorical data is simply difficult; the second lies in parameterised clustering, which requires the number of clusters as input for grouping the dataset. To overcome these limitations, the clustering technique is organised by incorporating shortest paths and non-parameterised clustering. In the proposed shortest-path non-parameterised clustering technique, the input parameter (the number of clusters) is discovered automatically, and data objects that lie at the shortest distance from each other are grouped into the same cluster.
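    The abstract gives no algorithmic details, so the following Python sketch only illustrates the general idea of parameter-free graph-based grouping, not the paper's procedure: points closer than a data-derived cutoff are linked, and connected components become clusters, so the number of clusters emerges from the data instead of being supplied as input:

```python
import math
from statistics import mean

def components_by_distance(points):
    """Parameter-free grouping sketch: link points closer than a data-derived cutoff
    (twice the mean nearest-neighbour distance) and return connected components."""
    n = len(points)
    nn = [min(math.dist(points[i], points[j]) for j in range(n) if j != i) for i in range(n)]
    cutoff = 2.0 * mean(nn)

    parent = list(range(n))                 # union-find over the points
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= cutoff:
                union(i, j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

pts = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 5.2)]
print(components_by_distance(pts))   # two groups discovered without a k parameter
```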

    Exploratory Cluster Analysis from Ubiquitous Data Streams using Self-Organizing Maps

    This thesis addresses the use of Self-Organizing Maps (SOM) for exploratory cluster analysis over ubiquitous data streams, where two complementary problems arise: first, to generate (local) SOM models over potentially unbounded multi-dimensional non-stationary data streams; second, to extrapolate these capabilities to ubiquitous environments. To address these problems, original contributions are made in terms of algorithms and methodologies. Two different methods are proposed for the first problem. By focusing on visual knowledge discovery, these methods fill an existing gap in the panorama of current methods for cluster analysis over data streams. Moreover, the original SOM capabilities of clustering both observations and features are transposed to data streams, making these contributions versatile compared to existing methods, which target an individual clustering problem. For the second problem, additional methodologies that tackle the ubiquitous aspect of data streams are proposed, allowing distributed and collaborative learning strategies. Experimental evaluations attest to the effectiveness of the proposed methods, and real-world applications are exemplified, namely regarding electric consumption data, air quality monitoring networks and financial data, motivating their practical use. This research study is the first to clearly address the use of the SOM for ubiquitous data streams and opens several other research opportunities for the future.
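    At the heart of any SOM-based approach is the online update that pulls the best-matching unit and its grid neighbours toward each incoming observation; the following Python sketch shows only that standard rule, not the thesis's stream-specific or distributed variants:

```python
import math
import random

def train_som(stream, rows=4, cols=4, alpha=0.3, radius=1.0, seed=0):
    """Standard online SOM update: move the best-matching unit (BMU) and its
    grid neighbours toward each incoming observation."""
    rng = random.Random(seed)
    # rows x cols grid of 2-D weight vectors, randomly initialised
    weights = {(r, c): [rng.random(), rng.random()] for r in range(rows) for c in range(cols)}
    for x in stream:
        bmu = min(weights, key=lambda u: math.dist(weights[u], x))
        for unit, w in weights.items():
            grid_dist = math.dist(unit, bmu)                      # distance on the map grid
            h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))   # neighbourhood kernel
            for d in range(len(x)):
                w[d] += alpha * h * (x[d] - w[d])
    return weights

rng = random.Random(1)
data = [(rng.gauss(0.2, 0.05), rng.gauss(0.2, 0.05)) for _ in range(200)] \
     + [(rng.gauss(0.8, 0.05), rng.gauss(0.8, 0.05)) for _ in range(200)]
rng.shuffle(data)
som = train_som(data)
print(sorted(tuple(round(v, 2) for v in w) for w in som.values())[:4])  # a few learned prototypes
```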

    ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ์ƒ์˜ ๋น ๋ฅธ ์ ์ง„์  ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง

    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Computer Science and Engineering, August 2022. Bongki Moon.
    Given the prevalence of mobile and IoT devices, continuous clustering of streaming data has become an essential and increasingly important tool for data analytics. Among many clustering approaches, density-based clustering has garnered much attention due to its unique advantage that it can detect clusters of arbitrary shape even when noise exists. However, when the clusters need to be updated continuously along with an evolving input dataset, a relatively high computational cost is required. In particular, deleting data points from the clusters causes severe performance degradation. In this dissertation, the performance limits of incremental density-based clustering over sliding windows are addressed, and two algorithms, DISC and DenForest, are proposed. The first algorithm, DISC, is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on the redundancy issues that occur when updating clusters: when multiple data points are inserted or deleted individually, surrounding data points are explored and retrieved redundantly. DISC addresses these issues and improves performance by updating multiple points in a batch, and it also presents several optimization techniques. The second algorithm, DenForest, is an incremental density-based clustering algorithm that primarily focuses on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which contributes to very efficient deletion performance. Moreover, it provides a batch-optimized technique to improve insertion performance. To prove the effectiveness of the two algorithms, extensive evaluations were conducted, and it is demonstrated that DISC and DenForest significantly outperform the state-of-the-art density-based clustering algorithms.
    Contents: 1 Introduction; 2 Related Works; 3 Background; 4 Avoiding Redundant Searches in Updating Clusters; 5 Avoiding Graph Traversals in Updating Clusters; 6 Evaluation; 7 Future Work: Extension to Varying/Relative Densities; 8 Conclusion; Abstract (In Korean).
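    To make the cost the dissertation targets concrete, the following Python sketch shows a naive sliding-window baseline (not DISC or DenForest): after every stride it discards expired points and re-runs a textbook DBSCAN over the whole window, which is exactly the full recomputation the incremental algorithms are designed to avoid:

```python
import math
import random
from collections import deque

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN (quadratic sketch): returns one cluster label per point, -1 = noise."""
    labels = [-1] * len(points)
    cluster = 0
    def neighbours(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            continue                      # not a core point; stays noise unless reached later
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_pts:        # j is also a core point, so expand through it
                queue.extend(nj)
    return labels

def sliding_window_clustering(stream, window, stride, eps=0.5, min_pts=3):
    """Naive baseline: after every stride, expired points are dropped, new points are
    added, and the whole window is re-clustered from scratch."""
    buffer = deque(maxlen=window)         # the oldest points expire automatically
    snapshots = []
    for t, point in enumerate(stream, start=1):
        buffer.append(point)
        if t % stride == 0 and len(buffer) == window:
            snapshots.append(dbscan(list(buffer), eps, min_pts))
    return snapshots

rng = random.Random(3)
stream = [(rng.gauss(3 * (i % 2), 0.2), rng.gauss(0.0, 0.2)) for i in range(400)]
snapshots = sliding_window_clustering(stream, window=100, stride=50)
print(len(snapshots), "window snapshots re-clustered from scratch")
```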