336 research outputs found

    Fast Incremental Density-Based Clustering over Sliding Windows

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2022. Bongki Moon.

    Given the prevalence of mobile and IoT devices, continuous clustering of streaming data has become an increasingly important tool for data analytics. Among the many clustering approaches, density-based clustering has attracted much attention because it can detect clusters of arbitrary shape even in the presence of noise. However, when the clusters must be updated continuously as the input dataset evolves, the computational cost is relatively high. In particular, deleting data points from the clusters causes severe performance degradation. This dissertation addresses the performance limits of incremental density-based clustering over sliding windows and proposes two algorithms, DISC and DenForest. The first algorithm, DISC, is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on the redundancy that arises when clusters are updated: when multiple data points are inserted or deleted individually, the surrounding data points are explored and retrieved redundantly. DISC addresses this issue and improves performance by updating multiple points in a batch. It also presents several optimization techniques. The second algorithm, DenForest, is an incremental density-based clustering algorithm that focuses primarily on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which makes deletion very efficient. It also provides a batch-optimized technique to improve insertion performance. To demonstrate the effectiveness of the two algorithms, extensive evaluations were conducted, and they show that DISC and DenForest significantly outperform state-of-the-art density-based clustering algorithms.

    Contents:
    1 Introduction
      1.1 Overview of Dissertation
    2 Related Works
      2.1 Clustering
      2.2 Density-Based Clustering for Static Datasets
        2.2.1 Extension of DBSCAN
        2.2.2 Approximation of Density-Based Clustering
        2.2.3 Parallelization of Density-Based Clustering
      2.3 Incremental Density-Based Clustering
        2.3.1 Approximated Density-Based Clustering for Dynamic Datasets
      2.4 Density-Based Clustering for Data Streams
        2.4.1 Micro-clusters
        2.4.2 Density-Based Clustering in Damped Window Model
        2.4.3 Density-Based Clustering in Sliding Window Model
      2.5 Non-Density-Based Clustering
        2.5.1 Partitional Clustering and Hierarchical Clustering
        2.5.2 Distribution-Based Clustering
        2.5.3 High-Dimensional Data Clustering
        2.5.4 Spectral Clustering
    3 Background
      3.1 DBSCAN
        3.1.1 Reformulation of Density-Based Clustering
      3.2 Incremental DBSCAN
      3.3 Sliding Windows
        3.3.1 Density-Based Clustering over Sliding Windows
        3.3.2 Slow Deletion Problem
    4 Avoiding Redundant Searches in Updating Clusters
      4.1 The DISC Algorithm
        4.1.1 Overview of DISC
        4.1.2 COLLECT
        4.1.3 CLUSTER
          4.1.3.1 Splitting a Cluster
          4.1.3.2 Merging Clusters
        4.1.4 Horizontal Manner vs. Vertical Manner
      4.2 Checking Reachability
        4.2.1 Multi-Starter BFS
        4.2.2 Epoch-Based Probing of R-tree Index
      4.3 Updating Labels
    5 Avoiding Graph Traversals in Updating Clusters
      5.1 The DenForest Algorithm
        5.1.1 Overview of DenForest
          5.1.1.1 Supported Types of the Sliding Window Model
        5.1.2 Nostalgic Core and Density-based Clusters
          5.1.2.1 Cluster Membership of Border
        5.1.3 DenTree
      5.2 Operations of DenForest
        5.2.1 Insertion
          5.2.1.1 MST based on Link-Cut Tree
          5.2.1.2 Time Complexity of Insert Operation
        5.2.2 Deletion
          5.2.2.1 Time Complexity of Delete Operation
        5.2.3 Insertion/Deletion Examples
        5.2.4 Cluster Membership
        5.2.5 Batch-Optimized Update
      5.3 Clustering Quality of DenForest
        5.3.1 Clustering Quality for Static Data
        5.3.2 Discussion
        5.3.3 Replaceability
          5.3.3.1 Nostalgic Cores and Density
          5.3.3.2 Nostalgic Cores and Quality
        5.3.4 1D Example
    6 Evaluation
      6.1 Real-World Datasets
      6.2 Competing Methods
        6.2.1 Exact Methods
        6.2.2 Non-Exact Methods
      6.3 Experimental Settings
      6.4 Evaluation of DISC
        6.4.1 Parameters
        6.4.2 Baseline Evaluation
        6.4.3 Drilled-Down Evaluation
          6.4.3.1 Effects of Threshold Values
          6.4.3.2 Insertions vs. Deletions
          6.4.3.3 Range Searches
          6.4.3.4 MS-BFS and Epoch-Based Probing
        6.4.4 Comparison with Summarization/Approximation-Based Methods
      6.5 Evaluation of DenForest
        6.5.1 Parameters
        6.5.2 Baseline Evaluation
        6.5.3 Drilled-Down Evaluation
          6.5.3.1 Varying Size of Window/Stride
          6.5.3.2 Effect of Density and Distance Thresholds
          6.5.3.3 Memory Usage
          6.5.3.4 Clustering Quality over Sliding Windows
          6.5.3.5 Clustering Quality under Various Density and Distance Thresholds
          6.5.3.6 Relaxed Parameter Settings
        6.5.4 Comparison with Summarization-Based Methods
    7 Future Work: Extension to Varying/Relative Densities
    8 Conclusion
    Abstract (In Korean)
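
    As a point of reference for the update cost the abstract describes, the sketch below re-runs plain DBSCAN from scratch every time the window slides. It is a naive baseline written for illustration only, not DISC or DenForest; the function names, the brute-force neighbor search, and the parameter names (`window`, `stride`, `eps`, `min_pts`) are assumptions, not taken from the dissertation.

```python
from collections import deque
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point (NOISE for noise points)."""
    n = len(points)
    # Brute-force eps-neighborhoods; every point counts as its own neighbor.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


def sliding_window_clusters(stream, window, stride, eps, min_pts):
    """Naive baseline: re-cluster the whole window after every slide."""
    buf = deque(maxlen=window)  # the oldest point is evicted automatically
    results = []
    for k, point in enumerate(stream, start=1):
        buf.append(point)
        if k >= window and (k - window) % stride == 0:
            results.append(dbscan(list(buf), eps, min_pts))
    return results


if __name__ == "__main__":
    stream = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    for labels in sliding_window_clusters(stream, window=6, stride=1,
                                          eps=1.5, min_pts=3):
        print(labels)
```

    In this toy stream, evicting a single point is enough to dissolve a cluster, and the baseline pays for a full re-clustering to find that out. That is the kind of cost an incremental method aims to avoid.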

    Deep Cellular Recurrent Neural Architecture for Efficient Multidimensional Time-Series Data Processing

    Efficient processing of time series data is a fundamental yet challenging problem in pattern recognition. Though recent developments in machine learning and deep learning have enabled remarkable improvements in processing large-scale datasets in many application domains, most models are designed and regulated to handle inputs that are static in time. Real-world data in biomedical, surveillance and security, financial, manufacturing and engineering applications are rarely static in time and demand models able to recognize patterns in both space and time. Current machine learning (ML) and deep learning (DL) models adapted for time series processing tend to grow in complexity and size to accommodate the additional dimensionality of time. Specifically, the biologically inspired learning-based models known as artificial neural networks, which have shown extraordinary success in pattern recognition, tend to grow prohibitively large and cumbersome in the presence of large-scale multi-dimensional time series biomedical data such as EEG. Consequently, this work aims to develop representative ML and DL models for robust and efficient large-scale time series processing. First, we design a novel ML pipeline with efficient feature engineering to process a large-scale multi-channel scalp EEG dataset for automated detection of epileptic seizures. Using a sophisticated yet computationally efficient time-frequency analysis technique known as the harmonic wavelet packet transform, together with an efficient self-similarity computation based on fractal dimension, we achieve state-of-the-art performance for automated seizure detection in EEG data. Subsequently, we investigate the development of a novel, efficient deep recurrent learning model for large-scale time series processing. For this, we first study the functionality and training of a biologically inspired neural network architecture known as the cellular simultaneous recurrent neural network (CSRN). We obtain a generalization of this network for multiple topological image processing tasks and investigate the learning efficacy of the complex cellular architecture using several state-of-the-art training methods. Finally, we develop a novel deep cellular recurrent neural network (DCRNN) architecture, based on the biologically inspired distributed processing used in CSRN, for processing time series data. The proposed DCRNN leverages the cellular recurrent architecture to promote extensive weight sharing and efficient, individualized, synchronous processing of multi-source time series data. Experiments on a large-scale multi-channel scalp EEG dataset and a machine fault detection dataset show that the proposed DCRNN offers state-of-the-art recognition performance while using substantially fewer trainable recurrent units.

    Robot training using system identification

    This paper focuses on developing a formal, theory-based design methodology to generate transparent robot control programs using mathematical functions. The research finds its theoretical roots in robot training and system identification techniques such as Armax (Auto-Regressive Moving Average models with eXogenous inputs) and Narmax (Non-linear Armax). These techniques produce linear and non-linear polynomial functions that model the relationship between a robot’s sensor perception and motor response. The main benefits of the proposed design methodology, compared to traditional robot programming techniques, are: (i) it is a fast and efficient way of generating robot control code; (ii) the generated robot control programs are transparent mathematical functions that can be used to form hypotheses and theoretical analyses of robot behaviour; and (iii) it requires very little explicit knowledge of robot programming, so end-users without specialised robot programming skills can nevertheless generate task-achieving sensor-motor couplings. This research is concerned with obtaining sensor-motor couplings, be it through human demonstration via the robot, direct human demonstration, or other means. The viability of the methodology has been demonstrated by teaching various mobile robots different sensor-motor tasks such as wall following, corridor passing, door traversal and route learning.
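
    To make the idea of a transparent sensor-motor function concrete, the sketch below fits a first-order linear model `motor[t] = a * motor[t-1] + b * sensor[t-1]` by ordinary least squares. This is a deliberately reduced ARX-style model, without the moving-average noise terms of Armax or the non-linear polynomial terms of Narmax, and the names `fit_arx1`, `sensor` and `motor` are hypothetical, chosen only for illustration.

```python
def fit_arx1(sensor, motor):
    """Least-squares fit of motor[t] = a * motor[t-1] + b * sensor[t-1].

    Solves the 2x2 normal equations directly and returns (a, b).
    """
    s_mm = s_ms = s_ss = s_m_out = s_s_out = 0.0
    for t in range(1, len(motor)):
        m_prev, s_prev, out = motor[t - 1], sensor[t - 1], motor[t]
        s_mm += m_prev * m_prev       # sum of motor[t-1]^2
        s_ms += m_prev * s_prev       # sum of motor[t-1] * sensor[t-1]
        s_ss += s_prev * s_prev       # sum of sensor[t-1]^2
        s_m_out += m_prev * out       # sum of motor[t-1] * motor[t]
        s_s_out += s_prev * out       # sum of sensor[t-1] * motor[t]
    det = s_mm * s_ss - s_ms * s_ms
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; cannot identify a and b")
    a = (s_m_out * s_ss - s_s_out * s_ms) / det
    b = (s_s_out * s_mm - s_m_out * s_ms) / det
    return a, b


if __name__ == "__main__":
    # Synthetic, noise-free demonstration data generated with a=0.5, b=2.0.
    sensor = [1.0, 0.0, 2.0, -1.0, 3.0, 0.5, -2.0, 1.0]
    motor = [0.0]
    for t in range(1, len(sensor)):
        motor.append(0.5 * motor[-1] + 2.0 * sensor[t - 1])
    print(fit_arx1(sensor, motor))
```

    The fitted coefficients are the whole controller, which is what makes a model of this kind open to inspection and analysis.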

    HPTA: High-Performance Text Analytics


    Recurrent Neural Networks and Matrix Methods for Cognitive Radio Spectrum Prediction and Security

    In this work, machine learning tools, including recurrent neural networks (RNNs), matrix completion, and non-negative matrix factorization (NMF), are applied to cognitive radio problems. Two problems are addressed specifically: a missing data problem and a blind signal separation problem. A specialized RNN called the Cellular Simultaneous Recurrent Network (CSRN), typically used in image processing applications, has been modified. The CSRN performs well for spatial spectrum prediction of radio signals with missing data. A matrix completion algorithm called soft-impute, used together with an RNN, performs well for missing data problems in the radio spectrum time-frequency domain. Estimating missing spectrum data can improve cognitive radio efficiency. An NMF method called tuning pruning is used for blind source separation of radio signals in simulation. An NMF optimization technique using a geometric constraint is proposed to limit the solution space of blind signal separation. Both NMF methods are promising in addressing a security problem known as the spectrum sensing data falsification attack.
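
    For readers unfamiliar with NMF, the sketch below shows the basic factorization V ≈ WH using the standard Lee-Seung multiplicative updates for the Frobenius error. It is plain NMF only: it does not implement the tuning pruning method or the geometric constraint mentioned in the abstract, and the helper names, the random initialization, and the small `eps` guard against division by zero are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank), H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the updates keep entries non-negative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Two non-negative "sources" mixed into three observed rows.
    sources = [[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0]]
    mixing = [[1.0, 0.5], [0.2, 1.0], [0.7, 0.7]]
    v = matmul(mixing, sources)
    w, h = nmf(v, rank=2)
    print(frob_error(v, w, h))
```

    In a blind source separation setting, the rows of H play the role of the recovered sources and W the mixing weights, up to scaling and ordering.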

    Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience.

    Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here, we describe a software toolbox, called seqNMF, with new methods for extracting informative, non-redundant sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral datasets. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs.