8,445 research outputs found

    Building an IT Taxonomy with Co-occurrence Analysis, Hierarchical Clustering, and Multidimensional Scaling

    Different information technologies (ITs) are related in complex ways. How can the relationships among a large number of ITs be described and analyzed in a representative, dynamic, and scalable way? In this study, we employed co-occurrence analysis to explore the relationships among 50 information technologies discussed in six magazines over ten years (1998-2007). Using hierarchical clustering and multidimensional scaling, we found that the similarities among the technologies can be depicted in hierarchies and two-dimensional plots, and that similar technologies can be classified into meaningful categories. The results suggest that our approach is a valid way to understand technology relationships and build an IT taxonomy. The methodology we offer not only helps IT practitioners and researchers make sense of the numerous technologies in the iField but also bridges two related but thus far largely separate research streams in iSchools: information management and IT management.
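    As a rough illustration of the methodology described above, the sketch below builds a small technology co-occurrence matrix from a handful of hypothetical article keyword sets, converts it into dissimilarities, and applies hierarchical clustering and two-dimensional MDS. The technology list, corpus, and normalization are placeholders and assumptions, not the paper's data or its exact similarity measure.

```python
# Minimal sketch (not the paper's exact pipeline): derive IT groupings from
# article-level co-occurrence counts via hierarchical clustering and MDS.
# The tiny corpus and technology list below are hypothetical placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

technologies = ["RFID", "Wi-Fi", "ERP", "CRM", "Data Warehouse"]
articles = [
    {"RFID", "Wi-Fi"},
    {"ERP", "CRM"},
    {"ERP", "CRM", "Data Warehouse"},
    {"RFID", "ERP"},
    {"Wi-Fi", "RFID"},
]

# Count how often each pair of technologies appears in the same article.
n = len(technologies)
cooc = np.zeros((n, n))
for doc in articles:
    for i, ti in enumerate(technologies):
        for j, tj in enumerate(technologies):
            if i != j and ti in doc and tj in doc:
                cooc[i, j] += 1

# Convert counts to a dissimilarity matrix (simple max-normalization here;
# the paper may use a different similarity measure).
dist = 1.0 - cooc / cooc.max()
np.fill_diagonal(dist, 0.0)

# Hierarchical clustering on the condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

# Two-dimensional MDS coordinates from the same dissimilarities.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

for tech, lab, (x, y) in zip(technologies, labels, coords):
    print(f"{tech:>14s}  cluster={lab}  ({x:+.2f}, {y:+.2f})")
```

    Dendrograms cut at different heights would give the hierarchical view of the taxonomy, while the MDS coordinates give the two-dimensional plot mentioned in the abstract.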

    Evidential Evolving Gustafson-Kessel Algorithm For Online Data Streams Partitioning Using Belief Function Theory.

    A new online clustering method called E2GK (Evidential Evolving Gustafson-Kessel) is introduced. This partitional clustering algorithm is based on the concept of a credal partition defined in the theoretical framework of belief functions. A credal partition is derived online by applying an algorithm adapted from the Evolving Gustafson-Kessel (EGK) algorithm. Online partitioning of data streams is then possible with a meaningful interpretation of the data structure. A comparative study with the original online procedure shows that E2GK outperforms EGK on several input data sets. To demonstrate the performance of E2GK, experiments were conducted on synthetic data sets as well as on data collected from a real application problem. A study of parameter sensitivity is also carried out, and solutions are proposed to limit complexity issues.
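    For intuition about the "evolving" part of the method, the sketch below implements a simplified online Gustafson-Kessel-style update: each arriving point either adapts the nearest ellipsoidal cluster or spawns a new one. It deliberately omits the evidential layer (credal partitions and belief-function masses) that distinguishes E2GK, and its thresholds and update rules are illustrative assumptions rather than the published equations.

```python
# Rough illustrative sketch of an evolving Gustafson-Kessel-style online update.
# It omits the evidential (belief-function / credal partition) layer of E2GK and
# uses simplified update rules; the threshold and learning rate are hypothetical.
import numpy as np

class EvolvingGKSketch:
    def __init__(self, dim, new_cluster_threshold=4.0, lr=0.1):
        self.dim = dim
        self.thr = new_cluster_threshold   # squared-distance threshold for spawning a cluster
        self.lr = lr                       # learning rate for prototype/covariance updates
        self.centers = []                  # cluster prototypes
        self.covs = []                     # per-cluster covariance matrices

    def _gk_distance2(self, x, center, cov):
        # Gustafson-Kessel distance: Mahalanobis-like distance with the covariance
        # rescaled to unit determinant, so clusters may be ellipsoidal.
        A = (np.linalg.det(cov) ** (1.0 / self.dim)) * np.linalg.inv(cov)
        d = x - center
        return float(d @ A @ d)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if not self.centers:
            self.centers.append(x.copy())
            self.covs.append(np.eye(self.dim))
            return 0
        d2 = [self._gk_distance2(x, c, S) for c, S in zip(self.centers, self.covs)]
        k = int(np.argmin(d2))
        if d2[k] > self.thr:               # point is far from all clusters: evolve the structure
            self.centers.append(x.copy())
            self.covs.append(np.eye(self.dim))
            return len(self.centers) - 1
        # Otherwise adapt the winning cluster's prototype and covariance.
        diff = x - self.centers[k]
        self.centers[k] = self.centers[k] + self.lr * diff
        self.covs[k] = (1 - self.lr) * self.covs[k] + self.lr * np.outer(diff, diff)
        return k

# Example: feed a small synthetic stream point by point.
rng = np.random.default_rng(0)
stream = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
model = EvolvingGKSketch(dim=2)
labels = [model.update(x) for x in stream]
print("clusters discovered:", len(model.centers))
```

    In E2GK proper, each point would additionally receive belief masses over subsets of clusters (including the empty set for outliers) rather than the single hard label returned here.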

    ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ์ƒ์˜ ๋น ๋ฅธ ์ ์ง„์  ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2022. 8. ๋ฌธ๋ด‰๊ธฐ.Given the prevalence of mobile and IoT devices, continuous clustering against streaming data has become an essential tool of increasing importance for data analytics. Among many clustering approaches, density-based clustering has garnered much attention due to its unique advantage that it can detect clusters of an arbitrary shape when noise exists. However, when the clusters need to be updated continuously along with an evolving input dataset, a relatively high computational cost is required. Particularly, deleting data points from the clusters causes severe performance degradation. In this dissertation, the performance limits of the incremental density-based clustering over sliding windows are addressed. Ultimately, two algorithms, DISC and DenForest, are proposed. The first algorithm DISC is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on redundancy issues that occur when updating clusters. When multiple data points are inserted or deleted individually, surrounding data points are explored and retrieved redundantly. DISC addresses these issues and improves the performance by updating multiple points in a batch. It also presents several optimization techniques. The second algorithm DenForest is an incremental density-based clustering algorithm that primarily focuses on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which contributes to very efficient deletion performance. Moreover, it provides a batch-optimized technique to improve the insertion performance. To prove the effectiveness of the two algorithms, extensive evaluations were conducted, and it is demonstrated that DISC and DenForest outperform the state-of-the-art density-based clustering algorithms significantly.๋ชจ๋ฐ”์ผ ๋ฐ IoT ์žฅ์น˜๊ฐ€ ๋„๋ฆฌ ๋ณด๊ธ‰๋จ์— ๋”ฐ๋ผ ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ์ƒ์—์„œ ์ง€์†์ ์œผ๋กœ ํด๋Ÿฌ์Šคํ„ฐ๋ง ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์€ ๋ฐ์ดํ„ฐ ๋ถ„์„์—์„œ ์ ์  ๋” ์ค‘์š”ํ•ด์ง€๋Š” ํ•„์ˆ˜ ๋„๊ตฌ๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋งŽ์€ ํด๋Ÿฌ์Šคํ„ฐ๋ง ๋ฐฉ๋ฒ• ์ค‘์—์„œ ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง์€ ๋…ธ์ด์ฆˆ๊ฐ€ ์กด์žฌํ•  ๋•Œ ์ž„์˜์˜ ๋ชจ์–‘์˜ ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๊ฐ์ง€ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ณ ์œ ํ•œ ์žฅ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ ์ด์— ๋”ฐ๋ผ ๋งŽ์€ ๊ด€์‹ฌ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง์€ ๋ณ€ํ™”ํ•˜๋Š” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ์…‹์— ๋”ฐ๋ผ ์ง€์†์ ์œผ๋กœ ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ ๋น„๊ต์  ๋†’์€ ๊ณ„์‚ฐ ๋น„์šฉ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, ํด๋Ÿฌ์Šคํ„ฐ์—์„œ์˜ ๋ฐ์ดํ„ฐ ์ ๋“ค์˜ ์‚ญ์ œ๋Š” ์‹ฌ๊ฐํ•œ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ๋ฐ•์‚ฌ ํ•™์œ„ ๋…ผ๋ฌธ์—์„œ๋Š” ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ์ƒ์˜ ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง์˜ ์„ฑ๋Šฅ ํ•œ๊ณ„๋ฅผ ๋‹ค๋ฃจ๋ฉฐ ๊ถ๊ทน์ ์œผ๋กœ ๋‘ ๊ฐ€์ง€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ DISC๋Š” ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ์ƒ์—์„œ DBSCAN๊ณผ ๋™์ผํ•œ ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ฒฐ๊ณผ๋ฅผ ์ฐพ๋Š” ์ ์ง„์  ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ํ•ด๋‹น ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํด๋Ÿฌ์Šคํ„ฐ ์—…๋ฐ์ดํŠธ ์‹œ์— ๋ฐœ์ƒํ•˜๋Š” ์ค‘๋ณต ๋ฌธ์ œ๋“ค์— ์ดˆ์ ์„ ๋‘ก๋‹ˆ๋‹ค. ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง์—์„œ๋Š” ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ์ ๋“ค์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ์‚ฝ์ž… ํ˜น์€ ์‚ญ์ œํ•  ๋•Œ ์ฃผ๋ณ€ ์ ๋“ค์„ ๋ถˆํ•„์š”ํ•˜๊ฒŒ ์ค‘๋ณต์ ์œผ๋กœ ํƒ์ƒ‰ํ•˜๊ณ  ํšŒ์ˆ˜ํ•ฉ๋‹ˆ๋‹ค. 
DISC ๋Š” ๋ฐฐ์น˜ ์—…๋ฐ์ดํŠธ๋กœ ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋ฉฐ ์—ฌ๋Ÿฌ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋“ค์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ DenForest ๋Š” ์‚ญ์ œ ๊ณผ์ •์— ์ดˆ์ ์„ ๋‘” ์ ์ง„์  ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๊ทธ๋ž˜ํ”„๋กœ ๊ด€๋ฆฌํ•˜๋Š” ์ด์ „ ๋ฐฉ๋ฒ•๋“ค๊ณผ ๋‹ฌ๋ฆฌ DenForest ๋Š” ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์‹ ์žฅ ํŠธ๋ฆฌ์˜ ๊ทธ๋ฃน์œผ๋กœ ๊ด€๋ฆฌํ•จ์œผ๋กœ์จ ํšจ์œจ์ ์ธ ์‚ญ์ œ ์„ฑ๋Šฅ์— ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค. ๋‚˜์•„๊ฐ€ ๋ฐฐ์น˜ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ํ†ตํ•ด ์‚ฝ์ž… ์„ฑ๋Šฅ ํ–ฅ์ƒ์—๋„ ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค. ๋‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํšจ์œจ์„ฑ์„ ์ž…์ฆํ•˜๊ธฐ ์œ„ํ•ด ๊ด‘๋ฒ”์œ„ํ•œ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•˜์˜€์œผ๋ฉฐ DISC ๋ฐ DenForest ๋Š” ์ตœ์‹ ์˜ ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค๋ณด๋‹ค ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.1 Introduction 1 1.1 Overview of Dissertation 3 2 Related Works 7 2.1 Clustering 7 2.2 Density-Based Clustering for Static Datasets 8 2.2.1 Extension of DBSCAN 8 2.2.2 Approximation of Density-Based Clustering 9 2.2.3 Parallelization of Density-Based Clustering 10 2.3 Incremental Density-Based Clustering 10 2.3.1 Approximated Density-Based Clustering for Dynamic Datasets 11 2.4 Density-Based Clustering for Data Streams 11 2.4.1 Micro-clusters 12 2.4.2 Density-Based Clustering in Damped Window Model 12 2.4.3 Density-Based Clustering in Sliding Window Model 13 2.5 Non-Density-Based Clustering 14 2.5.1 Partitional Clustering and Hierarchical Clustering 14 2.5.2 Distribution-Based Clustering 15 2.5.3 High-Dimensional Data Clustering 15 2.5.4 Spectral Clustering 16 3 Background 17 3.1 DBSCAN 17 3.1.1 Reformulation of Density-Based Clustering 19 3.2 Incremental DBSCAN 20 3.3 Sliding Windows 22 3.3.1 Density-Based Clustering over Sliding Windows 23 3.3.2 Slow Deletion Problem 24 4 Avoiding Redundant Searches in Updating Clusters 26 4.1 The DISC Algorithm 27 4.1.1 Overview of DISC 27 4.1.2 COLLECT 29 4.1.3 CLUSTER 30 4.1.3.1 Splitting a Cluster 32 4.1.3.2 Merging Clusters 37 4.1.4 Horizontal Manner vs. Vertical Manner 38 4.2 Checking Reachability 39 4.2.1 Multi-Starter BFS 40 4.2.2 Epoch-Based Probing of R-tree Index 41 4.3 Updating Labels 43 5 Avoiding Graph Traversals in Updating Clusters 45 5.1 The DenForest Algorithm 46 5.1.1 Overview of DenForest 47 5.1.1.1 Supported Types of the Sliding Window Model 48 5.1.2 Nostalgic Core and Density-based Clusters 49 5.1.2.1 Cluster Membership of Border 51 5.1.3 DenTree 51 5.2 Operations of DenForest 54 5.2.1 Insertion 54 5.2.1.1 MST based on Link-Cut Tree 57 5.2.1.2 Time Complexity of Insert Operation 58 5.2.2 Deletion 59 5.2.2.1 Time Complexity of Delete Operation 61 5.2.3 Insertion/Deletion Examples 64 5.2.4 Cluster Membership 65 5.2.5 Batch-Optimized Update 65 5.3 Clustering Quality of DenForest 68 5.3.1 Clustering Quality for Static Data 68 5.3.2 Discussion 70 5.3.3 Replaceability 70 5.3.3.1 Nostalgic Cores and Density 71 5.3.3.2 Nostalgic Cores and Quality 72 5.3.4 1D Example 74 6 Evaluation 76 6.1 Real-World Datasets 76 6.2 Competing Methods 77 6.2.1 Exact Methods 77 6.2.2 Non-Exact Methods 77 6.3 Experimental Settings 78 6.4 Evaluation of DISC 78 6.4.1 Parameters 79 6.4.2 Baseline Evaluation 79 6.4.3 Drilled-Down Evaluation 82 6.4.3.1 Effects of Threshold Values 82 6.4.3.2 Insertions vs. 
Deletions 83 6.4.3.3 Range Searches 84 6.4.3.4 MS-BFS and Epoch-Based Probing 85 6.4.4 Comparison with Summarization/Approximation-Based Methods 86 6.5 Evaluation of DenForest 90 6.5.1 Parameters 90 6.5.2 Baseline Evaluation 91 6.5.3 Drilled-Down Evaluation 94 6.5.3.1 Varying Size of Window/Stride 94 6.5.3.2 Effect of Density and Distance Thresholds 95 6.5.3.3 Memory Usage 98 6.5.3.4 Clustering Quality over Sliding Windows 98 6.5.3.5 Clustering Quality under Various Density and Distance Thresholds 101 6.5.3.6 Relaxed Parameter Settings 102 6.5.4 Comparison with Summarization-Based Methods 102 7 Future Work: Extension to Varying/Relative Densities 105 8 Conclusion 107 Abstract (In Korean) 120๋ฐ•
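    For orientation, the sketch below illustrates the sliding-window setting that the dissertation abstract above targets: a naive baseline that re-runs DBSCAN over the entire window at every stride. It is not DISC or DenForest; the window size, stride, and DBSCAN parameters are arbitrary placeholders, and the point is only to show where the repeated insertion/deletion cost attacked by those algorithms comes from.

```python
# Minimal sketch of sliding-window density-based clustering: a naive baseline
# that reclusters the whole window on every stride. DISC and DenForest avoid
# exactly this repeated work (this is NOT their algorithm). The window/stride
# sizes and DBSCAN parameters below are arbitrary placeholders.
from collections import deque
import numpy as np
from sklearn.cluster import DBSCAN

WINDOW, STRIDE, EPS, MIN_PTS = 1000, 100, 0.3, 5

rng = np.random.default_rng(1)
stream = rng.normal(0, 1, (5000, 2))             # stand-in for a real data stream

window = deque(maxlen=WINDOW)                    # expired points fall out automatically
for start in range(0, len(stream), STRIDE):
    window.extend(stream[start:start + STRIDE])  # insert new points, evict old ones
    if len(window) < MIN_PTS:
        continue
    pts = np.array(window)
    labels = DBSCAN(eps=EPS, min_samples=MIN_PTS).fit_predict(pts)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    # Every slide pays the full clustering cost again; incremental algorithms
    # instead update only the parts of the clustering affected by the stride.
    print(f"slide at {start:5d}: {n_clusters} clusters over {len(pts)} points")
```

    The expensive part of making this incremental is deletion: removing an expired point can split a cluster, which graph-based maintenance detects only by re-traversing neighborhoods, whereas DenForest's spanning-tree representation is designed to localize that work.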