1,525 research outputs found
Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework
Incomplete data is one major kind of multi-dimensional dataset that has random-distributed missing nodes in its dimensions. It is very difficult to retrieve information from this type of dataset when it becomes huge. Finding top-k dominant values in this type of dataset is a challenging procedure. Some algorithms are present to enhance this process but are mostly efficient only when dealing with a small-size incomplete data. One of the algorithms that make the application of TKD query possible is the Bitmap Index Guided (BIG) algorithm. This algorithm strongly improves the performance for incomplete data, but it is not originally capable of finding top-k dominant values in incomplete big data, nor is it designed to do so. Several other algorithms have been proposed to find the TKD query, such as Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. Algorithms developed previously were among the first attempts to apply TKD query on incomplete data; however, all these had weak performances or were not compatible with the incomplete data. This thesis proposes MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) for dealing with the aforementioned issues. MRBIG uses the MapReduce framework to enhance the performance of applying top-k dominance queries on huge incomplete datasets. The proposed approach uses the MapReduce parallel computing approach using multiple computing nodes. The framework separates the tasks between several computing nodes that independently and simultaneously work to find the result. This method has achieved up to two times faster processing time in finding the TKD query result in comparison to previously presented algorithms
Recommended from our members
Complex Query Operators on Modern Parallel Architectures
Identifying interesting objects from a large data collection is a fundamental problem for multi-criteria decision making applications.In Relational Database Management Systems (RDBMS), the most popular complex query operators used to solve this type of problem are the Top-K selection operator and the Skyline operator.Top-K selection is tasked with retrieving the k-highest ranking tuples from a given relation, as determined by a user-defined aggregation function.Skyline selection retrieves those tuples with attributes offering (pareto) optimal trade-offs in a given relation.Efficient Top-K query processing entails minimizing tuple evaluations by utilizing elaborate processing schemes combined with sophisticated data structures that enable early termination.Skyline query evaluation involves supporting processing strategies which are geared towards early termination and incomparable tuple pruning.The rapid increase in memory capacity and decreasing costs have been the main drivers behind the development of main-memory database systems.Although the act of migrating query processing in-memory has created many opportunities to improve the associated query latency, attaining such improvements has been very challenging due to the growing gap between processor and main memory speeds.Addressing this limitation has been made easier by the rapid proliferation of multi-core and many-core architectures.However, their utilization in real systems has been hindered by the lack of suitable parallel algorithms that focus on algorithmic efficiency.In this thesis, we study in depth the Top-K and Skyline selection operators, in the context of emerging parallel architectures.Our ultimate goal is to provide practical guidelines for developing work-efficient algorithms suitable for parallel main memory processing.We concentrate on multi-core (CPU), many-core (GPU), and processing-in-memory architectures (PIM), developing solutions optimized for high throughout and low latency.The first part of this thesis focuses on Top-K selection, presenting the specific details of early termination algorithms that we developed specifically for parallel architectures and various types of accelerators (i.e. GPU, PIM).The second part of this thesis, concentrates on Skyline selection and the development of a massively parallel load balanced algorithm for PIM architectures.Our work consolidates performance results across different parallel architectures using synthetic and real data on variable query parameters and distributions for both of the aforementioned problems.The experimental results demonstrate several orders of magnitude better throughput and query latency, thus validating the effectiveness of our proposed solutions for the Top-K and Skyline selection operators
Query Profiler Versus Cache for Skyline Computation
A skyline query is multi preference user query which generates the best objects from a multi attributed dataset. Skyline computation in an optimum time becomes a real challenge when the number of user preference are large and size of the dataset is also huge. When such a big data gets queried at large, response time optimization is possible through maintenance of the metadata about the pre-executed skyline queries. We have earlier proposed, a novel structure namely �Query Profiler� which preserves such metadata about the historical queries, raised against a dataset. Also as the dataset gets queried at large, the dimensions of user queries often overlap and queries get correlated. Such correlations in user queries and the availability of metadata about the earlier queries, combined together speed up the computation time and the optimization of the response time of the further skyline computation becomes possible. In this paper, we assert the efficacy of the Query Profiler by comparing its performance with the parallel techniques which utilize cache mechanism for optimization of the response time. We also present the experimental results which assert the efficacy of the proposed technique
SkyFlow: heterogeneous streaming for skyline computation using FlowGraph and SYCL
The skyline is an optimization operator widely used for multi-criteria decision making. It allows minimizing an n-dimensional dataset into its smallest subset. In this work we present SkyFlow, the first heterogeneous CPU+GPU graph-based engine for skyline computation on a stream of data queries. Two data flow approaches, Coarse-grained and Fine-grained, have been proposed for different streaming scenarios. Coarse-grained aims to keep in parallel the computation of two queries using a hybrid solution with two state-of-the-art skyline algorithms: one optimized for CPU and another for GPU. We also propose a model to estimate at runtime the computation time of any arriving data query. This estimation is used by a heuristic to schedule the data query on the device queue in which it will finish earlier. On the other hand, Fine-grained splits one query computation between CPU and GPU. An experimental evaluation using as target architecture a heterogeneous system comprised of a multicore CPU and an integrated GPU for different streaming scenarios and datasets, reveals that our heterogeneous CPU+GPU approaches always outperform previous only-CPU and only-GPU state-of-the-art implementations up to 6.86×and 5.19×, respectively, and they fall below 6% of ideal peak performance at most. We also evaluate Coarse-grained vs Fine-Grained finding that each approach is better suited to different streaming scenarios.This work was partially supported by the Spanish projects PID2019-105396RB-I00, UMA18-FEDERJA-108 and P20-00395-R. // Funding for open access charge: Universidad de Málaga / CBUA
Parallel Continuous Preference Queries over Out-of-Order and Bursty Data Streams
Techniques to handle traffic bursts and out-of-order arrivals are of paramount importance to provide real-time sensor data analytics in domains like traffic surveillance, transportation management, healthcare and security applications. In these systems the amount of raw data coming from sensors must be analyzed by continuous queries that extract value-added information used to make informed decisions in real-time. To perform this task with timing constraints, parallelism must be exploited in the query execution in order to enable the real-time processing on parallel architectures. In this paper we focus on continuous preference queries, a representative class of continuous queries for decision making, and we propose a parallel query model targeting the efficient processing over out-of-order and bursty data streams. We study how to integrate punctuation mechanisms in order to enable out-of-order processing. Then, we present advanced scheduling strategies targeting scenarios with different burstiness levels, parameterized using the index of dispersion quantity. Extensive experiments have been performed using synthetic datasets and real-world data streams obtained from an existing real-time locating system. The experimental evaluation demonstrates the efficiency of our parallel solution and its effectiveness in handling the out-of-orderness degrees and burstiness levels of real-world applications
I/O-Efficient Planar Range Skyline and Attrition Priority Queues
In the planar range skyline reporting problem, we store a set P of n 2D
points in a structure such that, given a query rectangle Q = [a_1, a_2] x [b_1,
b_2], the maxima (a.k.a. skyline) of P \cap Q can be reported efficiently. The
query is 3-sided if an edge of Q is grounded, giving rise to two variants:
top-open (b_2 = \infty) and left-open (a_1 = -\infty) queries.
All our results are in external memory under the O(n/B) space budget, for
both the static and dynamic settings:
* For static P, we give structures that answer top-open queries in O(log_B n
+ k/B), O(loglog_B U + k/B), and O(1 + k/B) I/Os when the universe is R^2, a U
x U grid, and a rank space grid [O(n)]^2, respectively (where k is the number
of reported points). The query complexity is optimal in all cases.
* We show that the left-open case is harder, such that any linear-size
structure must incur \Omega((n/B)^e + k/B) I/Os for a query. We show that this
case is as difficult as the general 4-sided queries, for which we give a static
structure with the optimal query cost O((n/B)^e + k/B).
* We give a dynamic structure that supports top-open queries in O(log_2B^e
(n/B) + k/B^1-e) I/Os, and updates in O(log_2B^e (n/B)) I/Os, for any e
satisfying 0 \le e \le 1. This leads to a dynamic structure for 4-sided queries
with optimal query cost O((n/B)^e + k/B), and amortized update cost O(log
(n/B)).
As a contribution of independent interest, we propose an I/O-efficient
version of the fundamental structure priority queue with attrition (PQA). Our
PQA supports FindMin, DeleteMin, and InsertAndAttrite all in O(1) worst case
I/Os, and O(1/B) amortized I/Os per operation.
We also add the new CatenateAndAttrite operation that catenates two PQAs in
O(1) worst case and O(1/B) amortized I/Os. This operation is a non-trivial
extension to the classic PQA of Sundar, even in internal memory.Comment: Appeared at PODS 2013, New York, 19 pages, 10 figures. arXiv admin
note: text overlap with arXiv:1208.4511, arXiv:1207.234
- …