9 research outputs found

    Scalable Probabilistic Similarity Ranking in Uncertain Databases (Technical Report)

    This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes, for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution with a dynamic programming approach of quadratic complexity. In this paper we show, both theoretically and experimentally, that our framework reduces this to linear time complexity with the same memory requirements; this is made possible by accessing the uncertain vector instances incrementally in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking to the objects according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach.
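    The quadratic dynamic program that the abstract attributes to existing approaches can be sketched as follows. This is a minimal illustration, not the paper's linear-time framework: the function name `rank_distribution` and the input `closer_probs` are hypothetical, and the sketch assumes that the other objects are mutually independent and that, for one fixed instance, the probability of each other object being closer to the reference object is already known.

```python
from typing import List


def rank_distribution(closer_probs: List[float]) -> List[float]:
    """Return dist, where dist[i] is the probability that exactly i of the
    other objects are closer to the reference object than the instance
    under study (so the instance falls at ranking position i + 1).

    closer_probs[j] is the (assumed known) probability that object j is
    closer; objects are assumed to be mutually independent.
    """
    dist = [1.0]  # no objects considered yet: zero are closer with certainty
    for p in closer_probs:
        nxt = [0.0] * (len(dist) + 1)
        for i, mass in enumerate(dist):
            nxt[i] += mass * (1.0 - p)  # object is farther: count unchanged
            nxt[i + 1] += mass * p      # object is closer: count grows by one
        dist = nxt
    return dist
```

    For n other objects the nested loops perform on the order of n^2 updates for a single instance, which is the quadratic cost the abstract refers to. For example, `rank_distribution([0.5, 0.5])` assigns probabilities 0.25, 0.5 and 0.25 to ranking positions 1, 2 and 3.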

    K-nearest neighbor search for fuzzy objects

    The K-Nearest Neighbor search (kNN) problem has been investigated extensively in the past due to its broad range of applications. In this paper we study this problem in the context of fuzzy objects that have indeterminate boundaries. Fuzzy objects play an important role in many areas, such as biomedical image databases and GIS. Existing research on fuzzy objects mainly focuses on modelling basic fuzzy object types and operations, leaving the processing of more advanced queries such as the kNN query unaddressed. In this paper, we propose two new kinds of kNN queries for fuzzy objects, the Ad-hoc kNN query (AKNN) and the Range kNN query (RKNN), to find the k nearest objects qualifying at a probability threshold or within a probability range. For efficient AKNN query processing, we optimize the basic best-first search algorithm by deriving more accurate approximations for the distance function between fuzzy objects and the query object. To improve the performance of RKNN search, effective pruning rules are developed to significantly reduce the search space and further speed up the candidate refinement process. The efficiency of our proposed algorithms and optimization techniques is verified with an extensive set of experiments using both synthetic and real datasets.

    UV-Diagram: A Voronoi Diagram for Uncertain Spatial Databases


    Doctor of Philosophy

    We are living in an age where data are being generated faster than anyone has previously imagined across a broad range of application domains, including customer studies, social media, sensor networks, and the sciences, among many others. In some cases, data are generated in massive quantities, on the scale of terabytes or petabytes. Dealing with massive data raises numerous challenges: (1) the explosion in the size of data; (2) data have increasingly complex structures and rich semantics, such as temporal data represented in piecewise linear form; (3) uncertain data are becoming common in numerous applications, e.g., scientific measurements or observations such as meteorological measurements; and (4) data are becoming increasingly distributed, e.g., data collected and integrated from distributed locations as well as data stored in a distributed file system within a cluster. Due to the massive nature of modern data, it is often infeasible for computers to manage and query them exactly and efficiently. An attractive alternative is to use data summarization techniques to construct data summaries, although even constructing data summaries efficiently is a challenging task given the enormous size of the data. The data summaries we focus on in this thesis are the histogram and the ranking operator. Both summarize a massive dataset into a more succinct representation, which can then be used to make queries orders of magnitude more efficient while still allowing approximation guarantees on query answers. Our study has focused on the critical task of designing efficient algorithms to summarize, query, and manage massive data.

    ๋น…๋ฐ์ดํ„ฐ์˜ ํšจ์œจ์ ์ธ ์Šค์นด์ด๋ผ์ธ ์งˆ์˜ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜

    Doctoral thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Advisor: Kyuseok Shim.
    The skyline operator and its variants, such as the dynamic skyline, reverse skyline and probabilistic skyline operators, have attracted considerable attention recently due to their broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this dissertation, we propose efficient parallel algorithms for processing skyline, dynamic skyline, reverse skyline and probabilistic skyline queries using MapReduce. For the skyline, dynamic skyline and reverse skyline queries, we first build quadtree-based histograms to prune out non-skyline points. We next partition the data based on the regions divided by the histograms and compute candidate skyline points for each partition using MapReduce. Finally, in every partition, we check whether each skyline candidate point is actually a skyline point using MapReduce. For the probabilistic skyline query, we first introduce three filtering techniques to prune out points that are not probabilistic skyline points. Then, we build a quadtree-based histogram and split the data into partitions according to the regions divided by the quadtree. We finally compute the probabilistic skyline points for each partition using MapReduce. We also develop workload balancing methods so that the estimated execution times of all available machines are similar. We conducted experiments to compare our algorithms with the state-of-the-art algorithms using MapReduce and confirmed the effectiveness as well as the scalability of our proposed skyline algorithms.
    Contents: 1 Introduction (1.1 Motivation; 1.2 Contributions of This Dissertation; 1.3 Dissertation Overview). 2 Related Work (2.1 Skyline Queries; 2.2 Reverse Skyline Queries; 2.3 Probabilistic Skyline Queries). 3 Background (3.1 Skyline and Its Variants; 3.2 MapReduce Framework). 4 Parallel Skyline Query Processing (4.1 SKY-MR: Our Skyline Computation Algorithm; 4.1.1 SKY-QTREE: The Sky-Quadtree Building Algorithm; 4.1.2 L-SKY-MR: The Local Skyline Computation Algorithm; 4.1.3 G-SKY-MR: The Global Skyline Computation Algorithm; 4.2 Experiment; 4.2.1 Performance Results for Skylines; 4.2.2 Performance Results in Other Environments). 5 Parallel Reverse Skyline Query Processing (5.1 RSKY-MR: Our Reverse Skyline Computation Algorithm; 5.1.1 RSKY-QTREE: The Rsky-Quadtree Building Algorithm; 5.1.2 Computations of Reverse Skylines using Rsky-Quadtrees; 5.1.3 L-RSKY-MR: The Local Reverse Skyline Computation Algorithm; 5.1.4 G-RSKY-MR: The Global Reverse Skyline Computation Algorithm; 5.2 Experiment; 5.2.1 Performance Results for Reverse Skylines). 6 Parallel Probabilistic Skyline Query Processing (6.1 Early Pruning Techniques; 6.1.1 Upper-bound Filtering; 6.1.2 Zero-probability Filtering; 6.1.3 Dominance-Power Filtering; 6.2 Utilization of a PS-QTREE for Pruning; 6.2.1 Generating a PS-QTREE; 6.2.2 Exploiting a PS-QTREE for Filtering; 6.2.3 Partitioning Objects by a PS-QTREE; 6.3 PS-QPF-MR: Our Algorithm with Quadtree Partitioning and Filtering; 6.3.1 Optimizations of PS-QPF-MR; 6.3.2 Sample Size and Split Threshold of a PSQtree; 6.4 PS-BRF-MR: Our Algorithm with Random Partitioning and Filtering; 6.5 Experiments; 6.5.1 Performance Results for Probabilistic Skylines). 7 Conclusion. Bibliography. Abstract (in Korean).
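
    The two-phase idea described in the abstract (compute candidate skyline points per partition, then check the candidates globally) can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the names `dominates`, `skyline` and `partitioned_skyline` are hypothetical, smaller values are assumed to be better in every dimension, and the round-robin partitioning stands in for the quadtree-based histograms and MapReduce jobs of the thesis, which are not reproduced here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive quadratic skyline: keep the points dominated by no other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points: Sequence[Point],
                        num_partitions: int) -> List[Point]:
    """Local phase: skyline of each partition. Global phase: skyline of the
    union of the local candidates. Round-robin partitioning is a
    placeholder for the quadtree-based partitioning in the thesis."""
    partitions = [points[i::num_partitions] for i in range(num_partitions)]
    candidates = [p for part in partitions for p in skyline(part)]
    return skyline(candidates)
```

    The result equals the skyline of the whole dataset because every global skyline point is also a skyline point of its own partition, so the candidate set always contains the full answer. In this sketch, a good partitioning only reduces how many non-skyline candidates reach the global phase.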

    Top-k spatial joins of probabilistic objects

    Probabilistic data have recently become popular in applications such as scientific and geospatial databases. For images and other spatial datasets, probabilistic values can capture the uncertainty in extent and class of the objects in the images. Relating one such dataset to another by spatial joins is an important operation for data management systems. We consider probabilistic spatial join (PSJ) queries, which rank the results according to a score that incorporates both the uncertainties associated with the objects and the distances between them. We present algorithms for two kinds of PSJ queries: threshold PSJ queries, which return all pairs that score above a given threshold, and top-k PSJ queries, which return the k top-scoring pairs. For threshold PSJ queries, we propose a plane sweep algorithm that, because it exploits the special structure of the problem, runs in O(n (log n + k)) time, where n is the number of points and k is the number of results. We extend the algorithms to 2-D data and to top-k PSJ queries. To further speed up top-k PSJ queries, we develop a scheduling technique that estimates the scores at the level of blocks, then hands the blocks to the plane sweep algorithm. By finding high-scoring pairs early, the scheduling allows a large portion of the datasets to be pruned. Experiments demonstrate speed-ups of two orders of magnitude.