Search CORE

9 research outputs found

Scalable Probabilistic Similarity Ranking in Uncertain Databases (Technical Report)

Author: Bernecker Thomas
Kriegel Hans-Peter
Mamoulis Nikos
Renz Matthias
Zuefle Andreas
Publication venue
Publication date: 01/01/2009
Field of study

This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually-exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution by applying a dynamic programming approach of quadratic complexity. In this paper we theoretically as well as experimentally show that our framework reduces this to a linear-time complexity while having the same memory requirements, facilitated by incremental accessing of the uncertain vector instances in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking for the objects, according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach

arXiv.org e-Print Archive

CiteSeerX

HKU Scholars Hub

K-nearest neighbor search for fuzzy objects

Author: Fung Pui Cheong
Zheng Kai
Zhou Xiaofang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2010
Field of study

The K-Nearest Neighbor search (kNN) problem has been investigated extensively in the past due to its broad range of applications. In this paper we study this problem in the context of fuzzy objects that have indeterministic boundaries. Fuzzy objects play an important role in many areas, such as biomedical image databases and GIS. Existing research on fuzzy objects mainly focuses on modelling basic fuzzy object types and operations, leaving the processing of more advanced queries such as kNN query untouched. In this paper, we propose two new kinds of kNN queries for fuzzy objects, Ad-hoc kNN query (AKNN) and Range kNN query (RKNN), to find the k nearest objects qualifying at a probability threshold or within a probability range. For efficient AKNN query processing, we optimize the basic best-first search algorithm by deriving more accurate approximations for the distance function between fuzzy objects and the query object. To improve the performance of RKNN search, effective pruning rules are developed to significantly reduce the search space and further speed up the candidate refinement process. The efficiency of our proposed algorithms as well as the optimization techniques are verified with an extensive set of experiments using both synthetic and real datasets

Crossref

University of Queensland eSpace

UV-Diagram: A Voronoi Diagram for Uncertain Spatial Databases

Author: Chen J
Cheng CK
Sun L
Xie X
Yiu ML
Publication venue
Publication date: 01/01/2013
Field of study

published_or_final_versio

HKU Scholars Hub

Recommended from our members

Complex Query Operators on Modern Parallel Architectures

Author: Zois Vasileios
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Identifying interesting objects from a large data collection is a fundamental problem for multi-criteria decision making applications.In Relational Database Management Systems (RDBMS), the most popular complex query operators used to solve this type of problem are the Top-K selection operator and the Skyline operator.Top-K selection is tasked with retrieving the k-highest ranking tuples from a given relation, as determined by a user-defined aggregation function.Skyline selection retrieves those tuples with attributes offering (pareto) optimal trade-offs in a given relation.Efficient Top-K query processing entails minimizing tuple evaluations by utilizing elaborate processing schemes combined with sophisticated data structures that enable early termination.Skyline query evaluation involves supporting processing strategies which are geared towards early termination and incomparable tuple pruning.The rapid increase in memory capacity and decreasing costs have been the main drivers behind the development of main-memory database systems.Although the act of migrating query processing in-memory has created many opportunities to improve the associated query latency, attaining such improvements has been very challenging due to the growing gap between processor and main memory speeds.Addressing this limitation has been made easier by the rapid proliferation of multi-core and many-core architectures.However, their utilization in real systems has been hindered by the lack of suitable parallel algorithms that focus on algorithmic efficiency.In this thesis, we study in depth the Top-K and Skyline selection operators, in the context of emerging parallel architectures.Our ultimate goal is to provide practical guidelines for developing work-efficient algorithms suitable for parallel main memory processing.We concentrate on multi-core (CPU), many-core (GPU), and processing-in-memory architectures (PIM), developing solutions optimized for high throughout and low latency.The first part of this thesis focuses on Top-K selection, presenting the specific details of early termination algorithms that we developed specifically for parallel architectures and various types of accelerators (i.e. GPU, PIM).The second part of this thesis, concentrates on Skyline selection and the development of a massively parallel load balanced algorithm for PIM architectures.Our work consolidates performance results across different parallel architectures using synthetic and real data on variable query parameters and distributions for both of the aforementioned problems.The experimental results demonstrate several orders of magnitude better throughput and query latency, thus validating the effectiveness of our proposed solutions for the Top-K and Skyline selection operators

eScholarship - University of California

Doctor of Philosophy

Author: Jestes Jeffrey
Publication venue: University of Utah
Publication date: 01/12/2013
Field of study

dissertationWe are living in an age where data are being generated faster than anyone has previously imagined across a broad application domain, including customer studies, social media, sensor networks, and the sciences, among many others. In some cases, data are generated in massive quantities as terabytes or petabytes. There have been numerous emerging challenges when dealing with massive data, including: (1) the explosion in size of data; (2) data have increasingly more complex structures and rich semantics, such as representing temporal data as a piecewise linear representation; (3) uncertain data are becoming a common occurrence for numerous applications, e.g., scientific measurements or observations such as meteorological measurements; (4) and data are becoming increasingly distributed, e.g., distributed data collected and integrated from distributed locations as well as data stored in a distributed file system within a cluster. Due to the massive nature of modern data, it is oftentimes infeasible for computers to efficiently manage and query them exactly. An attractive alternative is to use data summarization techniques to construct data summaries, where even efficiently constructing data summaries is a challenging task given the enormous size of data. The data summaries we focus on in this thesis include the histogram and ranking operator. Both data summaries enable us to summarize a massive dataset to a more succinct representation which can then be used to make queries orders of magnitude more efficient while still allowing approximation guarantees on query answers. Our study has focused on the critical task of designing efficient algorithms to summarize, query, and manage massive data

The University of Utah: J. Willard Marriott Digital Library

빅데이터의 효율적인 스카이라인 질의 처리를 위한 병렬처리 알고리즘

Author: 박윤재
Publication venue: 서울대학교 대학원
Publication date: 01/02/2017
Field of study

학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2017. 2. 심규석.스카이라인 질의와 스카이라인에서 파생된 동적 스카이라인, 역 스카이라인 그리고 확률적 스카이라인 질의들은 다양한 응용이 가능하기 때문에 최근에 많은 연구가 진행되어 왔다. 스카이라인 질의들은 큰 데이터를 처리해야하는 경우가 많기 때문에 효율적인 스카이라인 질의 처리는 중요한 문제이다. 큰 데이터를 처리해야하는 경우를 위해 맵리듀스 프레임워크가 제안되었고, 따라서 본 논문에서는 스카이라인, 동적 스카이라인, 역 스카이라인, 확률적 스카이라인 질의 처리를 위한 효율적인 맵리듀스 알고리즘을 개발한다. 스카이라인, 동적 스카이라인, 역 스카이라인에 대해서는 질의 결과에 포함될 수 없는 데이터를 빠르게 제거하기 위해서 쿼드트리에 기반한 히스토그램을 생성한다. 그리고 히스토그램에 따라 데이터를 여러 파티션으로 나누고 각 파티션에 있는 데이터만을 이용하여 스카이라인이 될 수 있는 후보 데이터를 맵리듀스를 이용하여 병렬적으로 뽑아낸다. 그 후에 다시 맵리듀스를 사용하여 병렬적으로 후보 데이터중 실제 스카이라인을 찾아낸다. 확률적 스카이라인의 효율적인 처리를 위해 먼저 세가지 필터링 기법을 제안하였다. 이 필터링 기법을 활용할 수 있도록 쿼드트리에 기반한 히스토그램을 생성한다. 쿼드트리의 영역에 따라 데이터를 파티션하고 각 파티션마다 확률적 스카이라인 점들을 찾아낸다. 각 컴퓨터의 수행시간을 비슷하게 맞추기 위해서 부하균형 기법도 제안하였다. 다양한 실험을 통해 제안한 알고리즘의 성능들이 최신 관련 연구 보다 좋음을 확인하였고, 사용하는 컴퓨터의 수를 늘림에 따라 성능이 확장성을 갖고 있음을 확인하였다.The skyline operator and its variants such as dynamic skyline, reverse skyline and probabilistic skyline operators have attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this dissertation, we propose the efficient parallel algorithms for processing skyline, dynamic skyline, reverse skyline and probabilistic skyline queries using MapReduce. For the skyline, dynamic skyline and reverse skyline queries, we first build quadtree-based histograms to prune out non-skyline points. We next partition data based on the regions divided by the histograms and compute candidate skyline points for each partition using MapReduce. Finally, in every partition, we check whether each skyline candidate point is actually a skyline point or not using MapReduce. For the probabilistic skyline query, we first introduce three filtering techniques to prune out points that are not probabilistic skyline points. Then, we build a quadtree-based histogram and split data into partitions according to the regions divided by the quadtree. We finally compute the probabilistic skyline points for each partition using MapReduce. We also develop the workload balancing methods to make the estimated execution times of all available machines to be similar. We did experiments to compare our algorithms with the state-of-the-art algorithms using MapReduce and confirmed the effectiveness as well as the scalability of our proposed skyline algorithms.1 INTRODUCTION 1 1.1 Motivation 1 1.2 Contributions of This Dissertation 6 1.3 Dissertation Overview 8 2 Related Work 10 2.1 Skyline Queries 10 2.2 Reverse Skyline Queries 13 2.3 Probabilistic Skyline Queries 14 3 Background 17 3.1 Skyline and Its Variants 17 3.2 MapReduce Framework 22 4 Parallel Skyline Query Processing 24 4.1 SKY-MR: Our Skyline Computation Algorithm 24 4.1.1 SKY-QTREE: The Sky-Quadtree Building Algorithm 25 4.1.2 L-SKY-MR: The Local Skyline Computation Algorithm 29 4.1.3 G-SKY-MR: The Global Skyline Computation Algorithm 32 4.2 Experiment 34 4.2.1 Performance Results for Skylines 36 4.2.2 Performance Results in Other Environments 41 5 Parallel Reverse Skyline Query Processing 45 5.1 RSKY-MR: Our Reverse Skyline Computation Algorithm 45 5.1.1 RSKY-QTREE: The Rsky-Quadtree Building Algorithm 47 5.1.2 Computations of Reverse Skylines using Rsky-Quadtrees 50 5.1.3 L-RSKY-MR: The Local Reverse Skyline Computation Algorithm 53 5.1.4 G-RSKY-MR: The Global Reverse Skyline Computation Algorithm 57 5.2 Experiment 59 5.2.1 Performance Results for Reverse Skylines 59 6 Parallel Probabilistic Skyline Query Processing 63 6.1 Early Pruning Techniques 63 6.1.1 Upper-bound Filtering 63 6.1.2 Zero-probability Filtering 67 6.1.3 Dominance-Power Filtering 68 6.2 Utilization of a PS-QTREE for Pruning 69 6.2.1 Generating a PS-QTREE 70 6.2.2 Exploiting a PS-QTREE for Filtering 70 6.2.3 Partitioning Objects by a PS-QTREE 71 6.3 PS-QPF-MR: Our Algorithm with Quadtree Partitiong and Filtering 73 6.3.1 Optimizations of PS-QPF-MR 79 6.3.2 Sample Size and Split Threshold of a PSQtree 83 6.4 PS-BRF-MR: Our Algorithm with Random Partitioning and Filtering 84 6.5 Experiments 87 6.5.1 Performance Results for Probabilistic Skylines 89 7 Conclusion 97 Bibliography 99 Abstract (In Korean) 105Docto

SNU Open Repository and Archive

Top-k spatial joins of probabilistic objects

Author: Ambuj K. Singh
Vebjorn Ljosa
Publication venue
Publication date: 01/01/2008
Field of study

Abstract — Probabilistic data have recently become popular in applications such as scientific and geospatial databases. For images and other spatial datasets, probabilistic values can capture the uncertainty in extent and class of the objects in the images. Relating one such dataset to another by spatial joins is an important operation for data management systems. We consider probabilistic spatial join (PSJ) queries, which rank the results according to a score that incorporates both the uncertainties associated with the objects and the distances between them. We present algorithms for two kinds of PSJ queries: Threshold PSJ queries, which return all pairs that score above a given threshold, and top-k PSJ queries, which return the k top-scoring pairs. For threshold PSJ queries, we propose a plane sweep algorithm that, because it exploits the special structure of the problem, runs in O(n (log n + k)) time, where n is the number of points and k is the number of results. We extend the algorithms to 2-D data and to top-k PSJ queries. To further speed up top-k PSJ queries, we develop a scheduling technique that estimates the scores at the level of blocks, then hands the blocks to the plane sweep algorithm. By finding high-scoring pairs early, the scheduling allows a large portion of the datasets to be pruned. Experiments demonstrate speed-ups of two orders of magnitude. I

CiteSeerX

Crossref