
    Parallel Processing Algorithms for Efficient Skyline Query Processing over Big Data (빅데이터의 효율적인 스카이라인 질의 처리를 위한 병렬처리 알고리즘)

    Doctoral dissertation, Seoul National University, Department of Electrical and Computer Engineering, February 2017. Advisor: 심규석.
    The skyline operator and its variants, such as the dynamic skyline, reverse skyline, and probabilistic skyline operators, have recently attracted considerable attention due to their broad applications. However, computing a skyline is challenging today because we have to deal with big data. For such data-intensive applications, the MapReduce framework has been widely used. In this dissertation, we propose efficient parallel algorithms for processing skyline, dynamic skyline, reverse skyline, and probabilistic skyline queries using MapReduce. For the skyline, dynamic skyline, and reverse skyline queries, we first build quadtree-based histograms to prune out non-skyline points. We next partition the data based on the regions of the histograms and compute candidate skyline points for each partition using MapReduce. Finally, again using MapReduce, we check in parallel whether each candidate is actually a skyline point. For the probabilistic skyline query, we first introduce three filtering techniques to prune out points that cannot be probabilistic skyline points. We then build a quadtree-based histogram, split the data into partitions according to the regions of the quadtree, and compute the probabilistic skyline points for each partition using MapReduce. We also develop workload-balancing methods that keep the estimated execution times of all available machines similar. Our experiments compare the proposed algorithms with state-of-the-art MapReduce algorithms and confirm their effectiveness as well as the scalability of the proposed skyline algorithms.
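    To make the partition-then-merge idea above concrete, here is a minimal plain-Python sketch (not the SKY-MR implementation, and with made-up helper names): a "map" step computes the local skyline of each partition as candidates, and a "reduce" step keeps only candidates that no candidate from any partition dominates.

```python
# Minimal sketch of the local/global skyline idea (not the SKY-MR implementation).
# Points are tuples of numbers where smaller is better in every dimension.

def dominates(p, q):
    """p dominates q if p is <= q in all dimensions and < in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def local_skyline(points):
    """'Map' phase: compute the skyline of one partition (candidate set)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def global_skyline(partitions):
    """'Reduce' phase: merge per-partition candidates and drop dominated ones."""
    candidates = [p for part in partitions for p in local_skyline(part)]
    return [p for p in candidates
            if not any(dominates(q, p) for q in candidates if q != p)]

if __name__ == "__main__":
    partitions = [[(1, 9), (4, 4), (7, 7)], [(2, 6), (6, 2), (9, 9)]]
    print(global_skyline(partitions))  # [(1, 9), (4, 4), (2, 6), (6, 2)]
```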

    Large-Scale Spatial Data Management on Modern Parallel and Distributed Platforms

    The rapidly growing volume of spatial data has made it desirable to develop efficient techniques for managing large-scale spatial data. Traditional spatial data management techniques cannot meet the efficiency and scalability requirements of large-scale spatial data processing. In this dissertation, we develop new data-parallel designs for large-scale spatial data management that better utilize modern, inexpensive commodity parallel and distributed platforms, including multi-core CPUs, many-core GPUs, and computer clusters, to achieve both efficiency and scalability. After introducing background on spatial data management and modern parallel and distributed systems, we present our parallel designs for spatial indexing and spatial join query processing on both multi-core CPUs and GPUs for high efficiency, as well as their integration with Big Data systems for better scalability. Experimental results using real-world datasets demonstrate the effectiveness and efficiency of the proposed techniques for managing large-scale spatial data.
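    As a rough illustration of the data-parallel pattern such designs build on (partition the data spatially, then join within each partition on separate cores), the following sketch uses a uniform grid and Python's multiprocessing pool; it is not the dissertation's CPU/GPU design, and the cell size, replication handling, and names are all made up for the example.

```python
# Sketch of the "partition, then join within partitions" pattern on a multi-core CPU.
# Rectangles are (xmin, ymin, xmax, ymax); names and cell size are illustrative only.
from collections import defaultdict
from multiprocessing import Pool

CELL = 100.0  # hypothetical grid cell size

def overlaps(r, s):
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

def home_cell(r):
    # Assign each rectangle to the cell of its lower-left corner (simplification:
    # a real system replicates rectangles into every cell they overlap).
    return (int(r[0] // CELL), int(r[1] // CELL))

def partition(rects):
    grid = defaultdict(list)
    for r in rects:
        grid[home_cell(r)].append(r)
    return grid

def join_cell(args):
    left, right = args
    return [(a, b) for a in left for b in right if overlaps(a, b)]

def parallel_spatial_join(rs, ss, workers=4):
    gr, gs = partition(rs), partition(ss)
    tasks = [(gr[c], gs[c]) for c in gr.keys() & gs.keys()]
    with Pool(workers) as pool:
        return [pair for chunk in pool.map(join_cell, tasks) for pair in chunk]
```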

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often together with a timestamp and further payload data, to a server for various kinds of analyses. While the location information of the devices and the reported events is represented as points and polygons, raster data is another type of spatial data, produced for example by cameras and sensors. Such big spatio-temporal data needs to be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of, e.g., spatial neighborhood, which makes executing certain queries practically impossible. The repeated executions of analysis programs during development and by different users result in long execution times and potentially high costs in rented clusters, which can be reduced by reusing commonly computed intermediate results. Within this thesis, we tackle these two challenges. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack.
For its operators, we identify several possible algorithms and study how they can benefit from the properties of the underlying platform. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines, and we compare data partitioning methods, which differ in how well they handle data skew and data set size. Furthermore, we present an approach to reduce the amount of data to be processed at the operator level. To shorten execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs based on operator costs. To compute these costs, we instrument the programs with profiling code that gathers the execution time and result size of each operator. In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios in which partitioning and indexing should be applied. We further compare STARK to related systems and show that we achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that the transparent cost-based materialization and recycling of intermediate results can significantly reduce program execution times.
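    A minimal PySpark sketch of the spatial partitioning idea discussed above (STARK itself is a Scala library on top of Spark; the grid, cell size, and partition count here are arbitrary assumptions for illustration): keying records by a grid cell and partitioning by that key co-locates nearby points, so that spatial operators only need to touch the relevant partitions.

```python
# Sketch: grid-based spatial partitioning of points in PySpark.
from pyspark import SparkContext

CELL = 1.0           # hypothetical cell size in degrees
CELLS_PER_ROW = 360  # hypothetical number of cells per grid row

def cell_id(lon, lat):
    return int(lon // CELL) + CELLS_PER_ROW * int(lat // CELL)

if __name__ == "__main__":
    sc = SparkContext(appName="grid-partitioning-sketch")
    points = sc.parallelize([(13.4, 52.5), (2.35, 48.85), (13.5, 52.4)])

    # Key each point by its grid cell and co-locate points of the same cell,
    # so that e.g. a range query only has to touch the matching partitions.
    by_cell = (points
               .keyBy(lambda p: cell_id(*p))
               .partitionBy(8, lambda cid: cid % 8))

    print(by_cell.glom().map(len).collect())  # number of points per partition
    sc.stop()
```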

    Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

    Aiming at the problem of spatial query processing in distributed computing systems, the design and implementation of new distributed spatial query algorithms is a current challenge. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism or fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of Training with the smallest sum of distances to every point of Query. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, while a distributed algorithm in Apache Hadoop was recently proposed by our team. Since Apache Hadoop generally exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark, as well as applicable performance-improving techniques. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and a clear winner in comparison to processing this query in Apache Hadoop.
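    The GKNN definition above can be stated as a few lines of brute-force Python, which is useful for keeping the semantics in mind; the paper's contribution is the distributed Spark algorithm with pruning, which this sketch does not attempt to reproduce.

```python
# Sketch of the GKNN query semantics (centralized, brute force): return the K
# Training points with the smallest total distance to every Query point.
from math import dist
from heapq import nsmallest

def gknn(query_pts, training_pts, k):
    def total_distance(t):
        return sum(dist(t, q) for q in query_pts)
    return nsmallest(k, training_pts, key=total_distance)

if __name__ == "__main__":
    Q = [(0.0, 0.0), (2.0, 0.0)]
    T = [(1.0, 0.0), (5.0, 5.0), (1.0, 1.0), (-3.0, 0.0)]
    print(gknn(Q, T, k=2))  # [(1.0, 0.0), (1.0, 1.0)]
```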

    SwiftSpatial: Spatial Joins on Modern Hardware

    Spatial joins are among the most time-consuming queries in spatial data management systems. In this paper, we propose SwiftSpatial, a specialized accelerator architecture tailored for spatial joins. SwiftSpatial contains multiple high-performance join units with innovative hybrid parallelism, several efficient memory management units, and an integrated on-chip join scheduler. We prototype SwiftSpatial on an FPGA and incorporate the R-tree synchronous traversal algorithm as the control flow. Benchmarked against various CPU- and GPU-based spatial data processing systems, SwiftSpatial demonstrates a latency reduction of up to 5.36x relative to the best-performing baseline, while requiring 6.16x less power. The remarkable performance and energy efficiency of SwiftSpatial lay a solid foundation for its future integration into spatial data management systems, both in data centers and at the edge.
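    The R-tree synchronous traversal used as the control flow can be sketched in software as a recursive descent over two trees that only continues where the node MBRs intersect; SwiftSpatial implements this in hardware, and the node representation below is a hypothetical in-memory one chosen only to show the algorithm.

```python
# Software sketch of R-tree synchronous traversal for a spatial (intersection) join.

class Node:
    def __init__(self, mbr, children=None, entries=None):
        self.mbr = mbr                   # (xmin, ymin, xmax, ymax)
        self.children = children or []   # child Nodes (internal node)
        self.entries = entries or []     # data MBRs (leaf node)

    @property
    def is_leaf(self):
        return not self.children

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def sync_traverse(r, s, out):
    """Descend both trees together, only where the node MBRs intersect."""
    if not intersects(r.mbr, s.mbr):
        return
    if r.is_leaf and s.is_leaf:
        out.extend((a, b) for a in r.entries for b in s.entries if intersects(a, b))
    elif r.is_leaf:
        for sc in s.children:
            sync_traverse(r, sc, out)
    elif s.is_leaf:
        for rc in r.children:
            sync_traverse(rc, s, out)
    else:
        for rc in r.children:
            for sc in s.children:
                sync_traverse(rc, sc, out)

if __name__ == "__main__":
    r = Node((0, 0, 10, 10), entries=[(1, 1, 2, 2), (8, 8, 9, 9)])
    s = Node((0, 0, 10, 10), entries=[(1.5, 1.5, 3, 3), (5, 5, 6, 6)])
    pairs = []
    sync_traverse(r, s, pairs)
    print(pairs)  # [((1, 1, 2, 2), (1.5, 1.5, 3, 3))]
```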

    AT-GIS: highly parallel spatial query processing with associative transducers

    Users in many domains, including urban planning, transportation, and environmental science, want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing either rely on extensions to RDBMSs, which entail expensive loading and indexing phases when the data changes, or on distributed map/reduce frameworks running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multicore CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute spatial query operators in parallel on the raw input data in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIS provides 3× the performance of an 8-node Hadoop cluster with 192 cores for containment queries, and 10× for aggregation queries.
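    For reference, the sketch below shows the kind of containment query AT-GIS targets (count the points of a raw CSV input that fall inside a query polygon), written in the conventional split-then-parse style with a multiprocessing pool; the point of associative transducers is precisely to avoid having to pre-split the raw input into logically independent blocks, which this toy version does not capture.

```python
# Sketch of a containment query over raw CSV lines, parallelized per chunk.
from multiprocessing import Pool

POLYGON = [(0, 0), (10, 0), (10, 10), (0, 10)]  # hypothetical query region

def contains(poly, x, y):
    """Standard ray-casting point-in-polygon test."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def process_chunk(lines):
    hits = 0
    for line in lines:
        x, y = map(float, line.split(",")[:2])
        if contains(POLYGON, x, y):
            hits += 1
    return hits

def containment_count(lines, workers=4, chunk=10_000):
    chunks = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    with Pool(workers) as pool:
        return sum(pool.map(process_chunk, chunks))
```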

    Load-balanced Range Query Workload Partitioning for Compressed Spatial Hierarchical Bitmap (cSHB) Indexes

    Spatial databases are used to store geometric objects such as points, lines, and polygons. Querying such complex spatial objects is a challenging task. Index structures are used to improve the lookup performance of the stored objects, but traditional index structures do not perform well for spatial databases. A significant amount of research has been devoted to ingesting, indexing, and querying spatial objects for different types of spatial queries, such as range, nearest-neighbor, and join queries. The compressed Spatial Hierarchical Bitmap (cSHB) index is one such indexing and querying approach that supports spatial range query workloads (sets of queries). cSHB indexes, like many other approaches, lack parallel computation. The massive amount of spatial data requires substantial computation, and traditional methods are insufficient to address these issues. Other existing parallel processing approaches lack load balancing of parallel tasks, which leads to resource-overload bottlenecks. In this thesis, I propose novel spatial partitioning techniques, Max Containment Clustering and Max Containment Clustering with Separation, to create load-balanced partitions of a range query workload. Each partition takes a similar amount of time to process its spatial queries and reduces response latency by minimizing disk access cost and optimizing the bitmap operations. The resulting partitions are processed in parallel using cSHB indexes. The proposed techniques utilize the block-based organization of bitmaps in the cSHB index and improve its performance for processing a range query workload.
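    The balancing goal can be illustrated with a simple greedy longest-processing-time sketch that assigns each range query, largest estimated cost first, to the currently least-loaded partition; the cost model and names are hypothetical, and Max Containment Clustering additionally groups queries that share bitmap blocks, which this sketch omits.

```python
# Minimal sketch of load-balanced partitioning of a range query workload (greedy LPT).
import heapq

def estimated_cost(query):
    # Hypothetical cost model: area of the query rectangle (xmin, ymin, xmax, ymax).
    xmin, ymin, xmax, ymax = query
    return (xmax - xmin) * (ymax - ymin)

def balance(queries, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    heap = [(0.0, i) for i in range(num_partitions)]  # (current load, partition id)
    heapq.heapify(heap)
    for q in sorted(queries, key=estimated_cost, reverse=True):
        load, i = heapq.heappop(heap)           # least-loaded partition so far
        partitions[i].append(q)
        heapq.heappush(heap, (load + estimated_cost(q), i))
    return partitions
```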

    Efficient Distance Join Query Processing in Distributed Spatial Data Management Systems

    Due to the ubiquitous use of spatial data applications and the large amounts of data they handle, the processing of large-scale distance joins in distributed systems is becoming increasingly popular. Distance Join Queries (DJQs) are important and frequently used operations in numerous applications, including data mining, multimedia, and spatial databases. DJQs (e.g., the k nearest neighbor join query, the k closest pairs query, the ε distance join query, etc.) are costly operations, since they involve both a join and a distance-based search, and performing DJQs efficiently is a challenging task. Recent Big Data developments have motivated the emergence of novel technologies for distributed processing of large-scale spatial data in clusters of computers, leading to Distributed Spatial Data Management Systems (DSDMSs). Distributed cluster-based computing systems can be classified as Hadoop-based or Spark-based. Based on this classification, in this paper we compare two of the most recent and leading DSDMSs, SpatialHadoop and LocationSpark, by evaluating the performance of several existing and newly proposed parallel and distributed DJQ algorithms under various settings with large real-world spatial datasets. A general conclusion arising from the execution of the distributed DJQ algorithms studied is that, while SpatialHadoop is a robust and efficient system when large spatial datasets are joined (since it is built on top of the mature Hadoop platform), LocationSpark is the clear winner in total execution time when medium-sized spatial datasets are combined (due to the in-memory processing provided by Spark). However, LocationSpark requires higher memory allocation when large spatial datasets are involved in DJQs (even more so when k and ε are large). Finally, this detailed performance study demonstrates that the new distributed DJQ algorithms we propose are efficient, robust, and scalable with respect to different parameters, such as dataset sizes, k, ε, and the number of computing nodes.
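    As a reminder of what two of these DJQs compute, the following brute-force sketch gives the semantics of the ε distance join and the k closest pairs query on tiny in-memory datasets; the paper's algorithms are distributed and use pruning, which is not reproduced here.

```python
# Sketch of two DJQ definitions, just to make the query semantics concrete.
from math import dist
from heapq import nsmallest
from itertools import product

def epsilon_distance_join(P, Q, eps):
    """All pairs (p, q) with distance at most eps."""
    return [(p, q) for p, q in product(P, Q) if dist(p, q) <= eps]

def k_closest_pairs(P, Q, k):
    """The k pairs with the smallest distances."""
    return nsmallest(k, product(P, Q), key=lambda pq: dist(*pq))

if __name__ == "__main__":
    P = [(0.0, 0.0), (3.0, 4.0)]
    Q = [(1.0, 0.0), (3.0, 3.0)]
    print(epsilon_distance_join(P, Q, eps=1.5))  # two qualifying pairs
    print(k_closest_pairs(P, Q, k=1))            # [((0.0, 0.0), (1.0, 0.0))], ties by input order
```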

    Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications

    Huge amounts of georeferenced data streams arrive daily at data stream management systems that serve highly scalable and dynamic applications. There are innumerable ways in which those loads can be exploited to gain deep insights in various domains. Decision makers require interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates and skewness; these are the two predominant factors that greatly impact the overall quality of service. Data stream management systems therefore need to be attuned to those factors, in addition to the spatial shape of the data, which may exaggerate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, that comprises several subsystems covering those workloads and any mixed load that results from intermixing them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard, best-in-class representatives, relieving users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with embedded quality guarantees, leaving the logistics to the underlying layers. We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.

    Doctor of Philosophy

    Dataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs, such as an SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models provide support only for data-parallelism while ignoring task-parallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, several libraries exist that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences between executing a task graph and executing a pipeline, along with their considerably low-level APIs, it remains a challenge to integrate these run-time libraries into current visualization systems. Thus, there is a need for a redesigned dataflow architecture that fully supports and exploits the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes support for platforms ranging from multicore processors to clusters consisting of thousands of CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules, along with a highly customizable scheduler that controls the assignment of tasks to elements dynamically. This creates an intuitive way to maintain multiple CPU/GPU kernels while still providing coherency and synchronization across module executions. We have implemented these techniques in HyperFlow, which consists of an API with all the basic dataflow constructs described in the dissertation, and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU, and cluster-based platforms.
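    A toy sketch of the scheduling idea (module tasks drawn from a queue and assigned dynamically to a pool of processing elements) is given below; worker threads stand in for CPU/GPU execution resources, and all names are hypothetical rather than HyperFlow's actual API.

```python
# Toy sketch of a scheduler assigning dataflow tasks to virtual processing elements.
import queue
import threading

class VirtualProcessingElement(threading.Thread):
    def __init__(self, name, tasks, results):
        super().__init__(daemon=True)
        self.name, self.tasks, self.results = name, tasks, results

    def run(self):
        while True:
            task = self.tasks.get()
            if task is None:          # sentinel: shut down this element
                break
            fn, data = task
            self.results.put((self.name, fn(data)))

def run_pipeline(tasks, num_elements=3):
    task_q, result_q = queue.Queue(), queue.Queue()
    elements = [VirtualProcessingElement(f"vpe-{i}", task_q, result_q)
                for i in range(num_elements)]
    for e in elements:
        e.start()
    for t in tasks:
        task_q.put(t)
    for _ in elements:
        task_q.put(None)
    for e in elements:
        e.join()
    return [result_q.get() for _ in range(len(tasks))]

if __name__ == "__main__":
    work = [(lambda x: x * x, i) for i in range(5)]
    print(run_pipeline(work))  # e.g. [('vpe-0', 0), ('vpe-1', 1), ...]
```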