797 research outputs found

    Clustering Algorithms: Their Application to Gene Expression Data

    Get PDF
    Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a significant opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the volume of genes present increase the challenge of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. The use of clustering techniques is therefore a first step toward addressing these challenges, and it is essential in the data mining process for revealing natural structures and identifying interesting patterns in the underlying data. Clustering gene expression data has proven useful for revealing the natural structure inherent in the data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. A further benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to identify the appropriate clustering technique that will guarantee stability and a high degree of accuracy in the analysis procedure.
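    As an illustration of the clustering step the review describes, the short sketch below applies average-linkage hierarchical clustering to a small synthetic expression matrix; the data, gene counts, and parameters are invented for the example and are not taken from the paper.

```python
# A minimal sketch (not from the review): average-linkage hierarchical clustering
# of a toy gene-expression matrix. Genes are rows, samples are columns.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
pattern_a = rng.normal(loc=2.0, scale=0.3, size=(10, 6))    # 10 genes, 6 samples
pattern_b = rng.normal(loc=-1.0, scale=0.3, size=(10, 6))   # a second expression pattern
expression = np.vstack([pattern_a, pattern_b])

# Average linkage with the default Euclidean distance; cut the tree into two clusters.
Z = linkage(expression, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # genes 0-9 and 10-19 should land in separate clusters
```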

    Some Clustering Methods, Algorithms and their Applications

    Get PDF
    Clustering is a type of unsupervised learning [15]. In an unsupervised learning task, where no target values or "supervisors" are known, the aim is to learn from the inputs themselves. Data mining and machine learning would be far less useful without clustering: used to categorize datasets according to their similarities, it makes predictions such as user behavior more accurate. The purpose of this research is to compare and contrast three widely used data-clustering methods. Clustering techniques include partitioning, hierarchical, density-based, grid-based, and fuzzy clustering. Machine learning, data mining, pattern recognition, image analysis, and bioinformatics are just a few of the many fields where clustering is used as an analytical technique. In addition to defining the various algorithms, specialized forms of cluster analysis, and linkage methods, a review of the clustering techniques used in the big data setting is offered.
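    As a concrete companion to the comparison described above, the sketch below runs one representative algorithm from three of the families named (partitioning, hierarchical, density-based) on a toy dataset; the dataset, parameters, and the choice of adjusted Rand index are illustrative assumptions, not taken from the paper.

```python
# A minimal comparison sketch: one representative from three clustering families.
# Data and parameters are illustrative only.
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

models = {
    "k-means (partitioning)": KMeans(n_clusters=3, n_init=10, random_state=42),
    "agglomerative (hierarchical)": AgglomerativeClustering(n_clusters=3),
    "DBSCAN (density-based)": DBSCAN(eps=0.6, min_samples=5),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    # Agreement with the generating labels, just to make the comparison concrete.
    print(f"{name}: ARI = {adjusted_rand_score(y_true, labels):.2f}")
```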

    A review of clustering techniques and developments

    Full text link
    © 2017 Elsevier B.V. This paper presents a comprehensive study of clustering: existing methods and the developments made at various times. Clustering is defined as unsupervised learning where objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering objects, such as hierarchical, partitional, grid-based, density-based, and model-based methods. The approaches used in these methods are discussed with their respective states of the art and applicability. The measures of similarity as well as the evaluation criteria, which are the central components of clustering, are also presented in the paper. The applications of clustering in fields such as image segmentation, object and character recognition, and data mining are highlighted.
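    The evaluation criteria mentioned as central components of clustering can be illustrated with two common internal indices; the sketch below is a minimal example, assuming scikit-learn, with an invented dataset and a simple sweep over the number of clusters.

```python
# A minimal sketch of two internal evaluation criteria: the silhouette coefficient
# and the Davies-Bouldin index. Data and the candidate values of k are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher silhouette and lower Davies-Bouldin both indicate better-separated clusters.
    print(f"k={k}: silhouette={silhouette_score(X, labels):.2f}, "
          f"Davies-Bouldin={davies_bouldin_score(X, labels):.2f}")
```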

    Neurocognitive Informatics Manifesto.

    Get PDF
    Informatics studies all aspects of the structure of natural and artificial information systems. Theoretical and abstract approaches to information have made great advances, but human information processing is still unmatched in many areas, including information management, representation, and understanding. Neurocognitive informatics is a new, emerging field that should help to improve the matching of artificial and natural systems, and inspire better computational algorithms to solve problems that are still beyond the reach of machines. In this position paper, examples of neurocognitive inspirations and promising directions in this area are given.

    ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ์ƒ์˜ ๋น ๋ฅธ ์ ์ง„์  ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2022. Advisor: Bongki Moon. Given the prevalence of mobile and IoT devices, continuous clustering of streaming data has become an essential tool of increasing importance for data analytics. Among many clustering approaches, density-based clustering has garnered much attention due to its unique advantage that it can detect clusters of arbitrary shape when noise exists. However, when the clusters need to be updated continuously along with an evolving input dataset, a relatively high computational cost is required. In particular, deleting data points from the clusters causes severe performance degradation. In this dissertation, the performance limits of incremental density-based clustering over sliding windows are addressed, and two algorithms, DISC and DenForest, are proposed. The first algorithm, DISC, is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on the redundancy issues that occur when updating clusters: when multiple data points are inserted or deleted individually, surrounding data points are explored and retrieved redundantly. DISC addresses these issues and improves performance by updating multiple points in a batch, and it also presents several optimization techniques. The second algorithm, DenForest, is an incremental density-based clustering algorithm that primarily focuses on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which contributes to very efficient deletion performance. Moreover, it provides a batch-optimized technique to improve insertion performance. To prove the effectiveness of the two algorithms, extensive evaluations were conducted, demonstrating that DISC and DenForest significantly outperform state-of-the-art density-based clustering algorithms.
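    DISC and DenForest themselves are not reproduced here; for orientation, the sketch below shows the naive sliding-window baseline that incremental algorithms of this kind improve on, re-running DBSCAN over the whole window on every stride. The window size, stride, and synthetic stream are illustrative assumptions.

```python
# A naive sliding-window baseline (not DISC or DenForest): the whole window is
# re-clustered with DBSCAN on every stride, which is the recomputation cost that
# incremental density-based algorithms aim to avoid. Parameters are illustrative.
from collections import deque
import numpy as np
from sklearn.cluster import DBSCAN

WINDOW, STRIDE = 1000, 100
window = deque(maxlen=WINDOW)   # expired points fall out of the deque automatically

def consume(stream, eps=0.5, min_samples=5):
    for i, point in enumerate(stream, start=1):   # stream yields 2-D points
        window.append(point)
        if len(window) == WINDOW and i % STRIDE == 0:
            # One full re-clustering per stride over the entire current window.
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.asarray(window))
            yield labels

# Example: a synthetic stream of noisy 2-D points.
rng = np.random.default_rng(1)
stream = (rng.normal(size=2) for _ in range(5000))
for labels in consume(stream):
    pass  # inspect per-window labels here
```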

    A Case Study of Clustering Algorithms for Categorical Data sets

    Get PDF
    Data clustering, an unsupervised pattern recognition process, is the task of assigning a set of objects into groups called clusters so that the objects in the same cluster are more similar to each other than to those in other clusters. Most traditional clustering algorithms are limited to handling numerical data and cannot be directly applied to the clustering of nominal data, where domain values are discrete and have no ordering. In this paper, various categorical data clustering algorithms are addressed in detail. A detailed survey of existing algorithms is made and the scalability of some of them is examined.
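    As one representative of the categorical clustering algorithms such a survey covers, the sketch below implements a minimal k-modes loop (simple-matching dissimilarity, attribute-wise mode update); the toy records and the choice of k are illustrative only, and this is not a specific algorithm from the paper.

```python
# A minimal k-modes sketch for nominal data: simple-matching dissimilarity and
# attribute-wise mode updates. Toy records and k are illustrative only.
import random
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, iterations=10, seed=0):
    random.seed(seed)
    modes = random.sample(records, k)                 # initial modes
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for rec in records:                           # assign each record to the nearest mode
            idx = min(range(k), key=lambda j: matching_dissimilarity(rec, modes[j]))
            clusters[idx].append(rec)
        for j, cluster in enumerate(clusters):        # update each mode attribute by attribute
            if cluster:
                modes[j] = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))
    return modes, clusters

data = [("red", "small", "round"), ("red", "small", "oval"),
        ("blue", "large", "square"), ("blue", "large", "round")]
modes, clusters = k_modes(data, k=2)
print(modes)
```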

    The 7th Conference of PhD Students in Computer Science

    Get PDF

    Mining Aircraft Telemetry Data With Evolutionary Algorithms

    Get PDF
    The Ganged Phased Array Radar - Risk Mitigation System (GPAR-RMS) was a mobile ground-based sense-and-avoid system for Unmanned Aircraft System (UAS) operations developed by the University of North Dakota. GPAR-RMS detected proximate aircraft with various sensor systems, including a 2D radar and an Automatic Dependent Surveillance - Broadcast (ADS-B) receiver. Information about those aircraft was then displayed to UAS operators via visualization software developed by the University of North Dakota. The Risk Mitigation (RM) subsystem for GPAR-RMS was designed to estimate the current risk of midair collision between the Unmanned Aircraft (UA) and a General Aviation (GA) aircraft flying under Visual Flight Rules (VFR) in the surrounding airspace, for UAS operations in Class E airspace (i.e., below 18,000 feet MSL). However, accurate probabilistic models of the behavior of pilots of GA aircraft flying under VFR in Class E airspace were needed before the RM subsystem could be implemented. In this dissertation the author presents the results of data mining an aircraft telemetry data set from a consecutive nine-month period in 2011. This data set consisted of Flight Data Monitoring (FDM) data obtained from Garmin G1000 devices onboard every Cessna 172 in the University of North Dakota's training fleet. Data from aircraft which were potentially within the controlled airspace surrounding controlled airports were excluded, and GA aircraft in the FDM data flying in Class E airspace were assumed to be flying under VFR, which is usually a valid assumption. Complex subpaths were discovered from the aircraft telemetry data set using a novel application of an ant colony algorithm. Probabilistic models were then data mined from those subpaths using extensions of the Genetic K-Means (GKA) and Expectation-Maximization (EM) algorithms. The results of the subpath discovery and data mining suggest that a pilot flying a GA aircraft near an uncontrolled airport will perform different maneuvers than a pilot flying a GA aircraft far from an uncontrolled airport, irrespective of the altitude of the GA aircraft. However, since only aircraft telemetry data from the University of North Dakota's training fleet were data mined, these results are not likely to be applicable to GA aircraft operating in a non-training environment.
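    The FDM data set and the GKA/EM extensions developed in the dissertation are not available here; the sketch below only illustrates the standard EM-based mixture fitting that such extensions build on, applied to invented telemetry-like features (altitude, ground speed).

```python
# A standard EM fit of a Gaussian mixture to invented telemetry-like features
# (altitude in feet, ground speed in knots). This illustrates only the base EM
# algorithm the dissertation extends; it is not the author's method or data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
pattern_practice = np.column_stack([rng.normal(3500, 200, 300), rng.normal(95, 8, 300)])
cruise = np.column_stack([rng.normal(6500, 300, 300), rng.normal(110, 6, 300)])
X = np.vstack([pattern_practice, cruise])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=7).fit(X)
print(gmm.means_)          # component means (altitude, speed)
print(gmm.predict(X[:5]))  # mixture-component assignments for the first few samples
```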

    Meta-optimizations for Cluster Analysis

    Get PDF
    This dissertation thesis deals with advances in the automation of cluster analysis.
    • โ€ฆ