DRSP : Dimension Reduction For Similarity Matching And Pruning Of Time Series Data Streams
Similarity matching and join of time series data streams have gained considerable relevance in today's world of large-scale streaming data. This process finds wide application in areas such as location tracking, sensor networks, and object positioning and monitoring. However, as the size of the data stream increases, so does the cost of retaining all the data needed to support similarity matching. We develop a novel framework that addresses the following objectives. First, dimension reduction is performed in the preprocessing stage, where large stream data is segmented and reduced into a compact representation that retains all the crucial information, using a technique called Multi-level Segment Means (MSM). This reduces the space complexity associated with storing large time-series data streams. Second, the framework incorporates an effective similarity matching technique to analyze whether new data objects are similar to the existing data stream. Finally, a pruning technique filters out the pseudo data-object pairs and joins only the relevant pairs. The computational cost of MSM is O(l*ni) and the cost of pruning is O(DRF*wsize*d), where DRF is the Dimension Reduction Factor. We have performed exhaustive experimental trials to show that the proposed framework is both efficient and competitive in comparison with earlier works.
Comment: 20 pages, 8 figures, 6 tables
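The abstract does not spell out how Multi-level Segment Means works, but a plausible sketch, assuming MSM behaves like a hierarchical piecewise-mean summary, is the following (the function names and the segment-halving schedule are illustrative assumptions, not the paper's definitions):

```python
import numpy as np

def segment_means(series, num_segments):
    """Reduce a 1-D series to per-segment means (one level of reduction)."""
    chunks = np.array_split(np.asarray(series, dtype=float), num_segments)
    return np.array([chunk.mean() for chunk in chunks])

def multi_level_segment_means(series, num_segments, levels):
    """Apply segment-mean reduction repeatedly, keeping each level's summary."""
    summaries = []
    current = np.asarray(series, dtype=float)
    for _ in range(levels):
        current = segment_means(current, num_segments)
        summaries.append(current)
        num_segments = max(1, num_segments // 2)  # coarser summary at each level
    return summaries
```

Each level stores a shorter summary of the previous one, so the space needed per window shrinks geometrically while coarse trend information is preserved for similarity matching.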
When Things Matter: A Data-Centric View of the Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost
wireless sensor devices, and Web technologies, the Internet of Things (IoT)
approach has gained momentum in connecting everyday objects to the Internet and
facilitating machine-to-human and machine-to-machine communication with the
physical world. While IoT offers the capability to connect and integrate both
digital and physical entities, enabling a whole new class of applications and
services, several significant challenges need to be addressed before these
applications and services can be fully realized. A fundamental challenge
centers around managing IoT data, typically produced in dynamic and volatile
environments, which is not only extremely large in scale and volume, but also
noisy, and continuous. This article surveys the main techniques and
state-of-the-art research efforts in IoT from data-centric perspectives,
including data stream processing, data storage models, complex event
processing, and searching in IoT. Open research issues for IoT data management
are also discussed.
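As a concrete illustration of the stream-processing challenge the survey describes (large, continuous, noisy sensor data), a minimal sliding-window average over incoming readings might look like this (the class and its API are an illustrative sketch, not taken from the survey):

```python
from collections import deque

class SlidingWindowAverage:
    """Maintain the mean of the most recent `size` sensor readings in O(1) per update."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def push(self, reading):
        """Add one reading, expire the oldest if needed, return the current mean."""
        self.window.append(reading)
        self.total += reading
        if len(self.window) > self.size:
            self.total -= self.window.popleft()  # drop the oldest reading
        return self.total / len(self.window)
```

Keeping a running total instead of re-summing the window on every reading is what makes per-event cost constant, a basic requirement once IoT streams reach high rates.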
Fast Incremental Density-Based Clustering over Sliding Windows
Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2022. Advisor: Bongki Moon.
Given the prevalence of mobile and IoT devices, continuous clustering of streaming data has become an essential tool of increasing importance for data analytics. Among the many clustering approaches, density-based clustering has garnered much attention due to its unique advantage that it can detect clusters of arbitrary shape in the presence of noise. However, when the clusters must be updated continuously along with an evolving input dataset, a relatively high computational cost is required. In particular, deleting data points from the clusters causes severe performance degradation.
In this dissertation, the performance limits of incremental density-based clustering over sliding windows are addressed. Ultimately, two algorithms, DISC and DenForest, are proposed. The first algorithm, DISC, is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on the redundancy issues that occur when updating clusters: when multiple data points are inserted or deleted individually, surrounding data points are explored and retrieved redundantly. DISC addresses these issues and improves performance by updating multiple points in a batch, and it also presents several optimization techniques. The second algorithm, DenForest, is an incremental density-based clustering algorithm that primarily focuses on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which contributes to very efficient deletion performance. Moreover, it provides a batch-optimized technique to improve insertion performance. To prove the effectiveness of the two algorithms, extensive evaluations were conducted, demonstrating that DISC and DenForest significantly outperform the state-of-the-art density-based clustering algorithms.
1 Introduction
1.1 Overview of Dissertation
2 Related Works
2.1 Clustering
2.2 Density-Based Clustering for Static Datasets
2.2.1 Extension of DBSCAN
2.2.2 Approximation of Density-Based Clustering
2.2.3 Parallelization of Density-Based Clustering
2.3 Incremental Density-Based Clustering
2.3.1 Approximated Density-Based Clustering for Dynamic Datasets
2.4 Density-Based Clustering for Data Streams
2.4.1 Micro-clusters
2.4.2 Density-Based Clustering in Damped Window Model
2.4.3 Density-Based Clustering in Sliding Window Model
2.5 Non-Density-Based Clustering
2.5.1 Partitional Clustering and Hierarchical Clustering
2.5.2 Distribution-Based Clustering
2.5.3 High-Dimensional Data Clustering
2.5.4 Spectral Clustering
3 Background
3.1 DBSCAN
3.1.1 Reformulation of Density-Based Clustering
3.2 Incremental DBSCAN
3.3 Sliding Windows
3.3.1 Density-Based Clustering over Sliding Windows
3.3.2 Slow Deletion Problem
4 Avoiding Redundant Searches in Updating Clusters
4.1 The DISC Algorithm
4.1.1 Overview of DISC
4.1.2 COLLECT
4.1.3 CLUSTER
4.1.3.1 Splitting a Cluster
4.1.3.2 Merging Clusters
4.1.4 Horizontal Manner vs. Vertical Manner
4.2 Checking Reachability
4.2.1 Multi-Starter BFS
4.2.2 Epoch-Based Probing of R-tree Index
4.3 Updating Labels
5 Avoiding Graph Traversals in Updating Clusters
5.1 The DenForest Algorithm
5.1.1 Overview of DenForest
5.1.1.1 Supported Types of the Sliding Window Model
5.1.2 Nostalgic Core and Density-based Clusters
5.1.2.1 Cluster Membership of Border
5.1.3 DenTree
5.2 Operations of DenForest
5.2.1 Insertion
5.2.1.1 MST based on Link-Cut Tree
5.2.1.2 Time Complexity of Insert Operation
5.2.2 Deletion
5.2.2.1 Time Complexity of Delete Operation
5.2.3 Insertion/Deletion Examples
5.2.4 Cluster Membership
5.2.5 Batch-Optimized Update
5.3 Clustering Quality of DenForest
5.3.1 Clustering Quality for Static Data
5.3.2 Discussion
5.3.3 Replaceability
5.3.3.1 Nostalgic Cores and Density
5.3.3.2 Nostalgic Cores and Quality
5.3.4 1D Example
6 Evaluation
6.1 Real-World Datasets
6.2 Competing Methods
6.2.1 Exact Methods
6.2.2 Non-Exact Methods
6.3 Experimental Settings
6.4 Evaluation of DISC
6.4.1 Parameters
6.4.2 Baseline Evaluation
6.4.3 Drilled-Down Evaluation
6.4.3.1 Effects of Threshold Values
6.4.3.2 Insertions vs. Deletions
6.4.3.3 Range Searches
6.4.3.4 MS-BFS and Epoch-Based Probing
6.4.4 Comparison with Summarization/Approximation-Based Methods
6.5 Evaluation of DenForest
6.5.1 Parameters
6.5.2 Baseline Evaluation
6.5.3 Drilled-Down Evaluation
6.5.3.1 Varying Size of Window/Stride
6.5.3.2 Effect of Density and Distance Thresholds
6.5.3.3 Memory Usage
6.5.3.4 Clustering Quality over Sliding Windows
6.5.3.5 Clustering Quality under Various Density and Distance Thresholds
6.5.3.6 Relaxed Parameter Settings
6.5.4 Comparison with Summarization-Based Methods
7 Future Work: Extension to Varying/Relative Densities
8 Conclusion
Abstract (In Korean)
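As a point of reference for the problem these algorithms attack, the naive baseline, re-running DBSCAN from scratch on every window slide, can be sketched in simplified 2-D form as follows (an illustrative textbook implementation; DISC's batch updates and DenForest's spanning-tree maintenance are not reproduced here):

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (qx - px) ** 2 + (qy - py) ** 2 <= eps * eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:                      # expand the cluster from core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reclassified as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors) # j is a core point: keep expanding
        cluster += 1
    return labels

def sliding_window_clusters(stream, window, stride, eps, min_pts):
    """Baseline: recluster the full window contents on every slide."""
    for start in range(0, len(stream) - window + 1, stride):
        yield dbscan(stream[start:start + window], eps, min_pts)
```

Every slide pays the full cost of reclustering the window, which is exactly the redundant work that incremental approaches such as DISC and DenForest are designed to avoid, especially on deletions.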
A Data-driven Methodology Towards Mobility- and Traffic-related Big Spatiotemporal Data Frameworks
The human population is increasing at unprecedented rates, particularly in urban areas. This increase, along with the rise of a more economically empowered middle class, brings new and complex challenges to the mobility of people within urban areas. To tackle such challenges, transportation and mobility authorities and operators are trying to adopt innovative Big Data-driven mobility- and traffic-related solutions. Such solutions will support decision-making processes that aim to ease the load on an already overloaded transport infrastructure. The information collected from day-to-day mobility and traffic can help mitigate some of these mobility challenges in urban areas.
Road infrastructure and traffic management operators (RITMOs) face several limitations in effectively extracting value from the exponentially growing volumes of mobility- and traffic-related Big Spatiotemporal Data (MobiTrafficBD) being acquired and gathered. Research on the topics of Big Data, Spatiotemporal Data and especially MobiTrafficBD is scattered, and the existing literature does not offer a concrete, common methodological approach to set up, configure, deploy and use a complete Big Data-based framework to manage the lifecycle of mobility-related spatiotemporal data, mainly focused on geo-referenced time series (GRTS) and spatiotemporal events (ST Events), extract value from it and support the decision-making processes of RITMOs.
This doctoral thesis proposes a data-driven, prescriptive methodological approach towards the design, development and deployment of MobiTrafficBD Frameworks focused on GRTS and ST Events. Besides a thorough literature review on Spatiotemporal Data, Big Data and the merging of these two fields through MobiTrafficBD, the methodological approach comprises a set of general characteristics, technical requirements, logical components, data flows and technological infrastructure models, as well as guidelines and best practices that aim to guide researchers, practitioners and stakeholders, such as RITMOs, throughout the design, development and deployment phases of any MobiTrafficBD Framework.
This work is intended to be a supporting methodological guide, based on widely used
Reference Architectures and guidelines for Big Data, but enriched with inherent characteristics
and concerns brought about by Big Spatiotemporal Data, such as in the case of GRTS and ST
Events. The proposed methodology was evaluated and demonstrated in various real-world
use cases that deployed MobiTrafficBD-based Data Management, Processing, Analytics and
Visualisation methods, tools and technologies, under the umbrella of several research projects
funded by the European Commission and the Portuguese Government.
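Although the thesis abstract does not fix a concrete schema, the two data types it centers on, geo-referenced time series (GRTS) and spatiotemporal events (ST Events), could be modeled minimally as follows (all field names here are illustrative assumptions, not the thesis's definitions):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class GRTSPoint:
    """One observation of a geo-referenced time series, e.g. a road-sensor reading."""
    sensor_id: str
    timestamp: datetime
    latitude: float
    longitude: float
    value: float  # e.g. vehicle count or average speed

@dataclass(frozen=True)
class STEvent:
    """A spatiotemporal event, e.g. an incident detected at a place and time."""
    event_type: str
    timestamp: datetime
    latitude: float
    longitude: float
```

The distinction matters for framework design: GRTS arrive as regular, per-sensor streams suited to time-series storage, while ST Events are irregular and suited to event-processing pipelines.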
A study of two problems in data mining: anomaly monitoring and privacy preservation.
Bu, Yingyi. Thesis (M.Phil.)--Chinese University of Hong Kong, 2008. Includes bibliographical references (leaves 89-94). Abstracts in English and Chinese.
Abstract
Acknowledgement
1 Introduction
1.1 Anomaly Monitoring
1.2 Privacy Preservation
1.2.1 Motivation
1.2.2 Contribution
2 Anomaly Monitoring
2.1 Problem Statement
2.2 A Preliminary Solution: Simple Pruning
2.3 Efficient Monitoring by Local Clusters
2.3.1 Incremental Local Clustering
2.3.2 Batch Monitoring by Cluster Join
2.3.3 Cost Analysis and Optimization
2.4 Piecewise Index and Query Reschedule
2.4.1 Piecewise VP-trees
2.4.2 Candidate Rescheduling
2.4.3 Cost Analysis
2.5 Upper Bound Lemma: For Dynamic Time Warping Distance
2.6 Experimental Evaluations
2.6.1 Effectiveness
2.6.2 Efficiency
2.7 Related Work
3 Privacy Preservation
3.1 Problem Definition
3.2 HD-Composition
3.2.1 Role-based Partition
3.2.2 Cohort-based Partition
3.2.3 Privacy Guarantee
3.2.4 Refinement of HD-composition
3.2.5 Anonymization Algorithm
3.3 Experiments
3.3.1 Failures of Conventional Generalizations
3.3.2 Evaluations of HD-Composition
3.4 Related Work
4 Conclusions
Bibliography
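Section 2.5 of this thesis concerns an upper-bound lemma for the Dynamic Time Warping distance. The classic quadratic-time DTW computation that such bounds accelerate can be written as follows (a textbook dynamic-programming implementation, not the thesis's pruning method):

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance.

    cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Because the full computation is quadratic per candidate pair, cheap upper and lower bounds are the standard way to prune candidates before paying for an exact DTW evaluation, which is presumably the role of the lemma in Chapter 2.5.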
Real-time detection of moving crowds using spatio-temporal data streams
Over the last decade we have seen a tremendous change in Location Based Services. From primitive reactive applications, explicitly invoked by users, they have evolved into modern, complex, proactive systems that are able to automatically provide information based on context and user location. This was driven by the rapid development of outdoor and indoor positioning technologies. GPS modules, now included in almost every device, together with indoor technologies based on WiFi fingerprinting or Bluetooth beacons, make it possible to determine the user's location almost everywhere and at any time. This has also led to an enormous growth of spatio-temporal data.
While very efficient at applying a user-centric approach to a single target, current Location Based Services remain quite primitive in the area of multi-target knowledge extraction. This is rather surprising, given the availability of data and of current processing technologies. Discovering useful information from the locations of multiple objects is limited, on the one hand, by legal issues related to privacy and data ownership. On the other hand, mining group location data over time is not a trivial task and requires special algorithms and technologies in order to be effective.
Recent developments in the data processing area have led to a huge shift from offline batch-processing engines, like MapReduce, to real-time distributed streaming frameworks, like Apache Flink or Apache Spark, which are able to process huge amounts of data, including spatio-temporal data streams.
This thesis presents a system for detecting and analyzing crowds in a continuous spatio-temporal data stream. The aim of the system is to provide relevant knowledge in terms of proactive LBS. The motivation comes from the constant growth of spatio-temporal data and the recent rapid technological development to process such data.
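The abstract does not specify the detection algorithm, but a minimal grid-based crowd detector over one time window of a spatio-temporal stream, with illustrative parameters, might look like this (the function, cell scheme, and thresholds are assumptions for the sketch, not the thesis's method):

```python
def detect_crowds(positions, cell_size, min_people):
    """Flag grid cells containing at least `min_people` distinct users.

    positions: iterable of (user_id, x, y) location updates seen in one window.
    Returns the set of (cell_x, cell_y) cells considered crowded.
    """
    cells = {}
    for user_id, x, y in positions:
        cell = (int(x // cell_size), int(y // cell_size))       # spatial bucket
        cells.setdefault(cell, set()).add(user_id)              # distinct users only
    return {cell for cell, users in cells.items() if len(users) >= min_people}
```

Grid bucketing keeps the per-update cost constant, which is what makes such a check feasible inside a streaming framework like Flink or Spark, where it would run once per window.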