2,071 research outputs found
Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework
Incomplete data is one major kind of multi-dimensional dataset that has randomly distributed missing nodes in its dimensions. Retrieving information from this type of dataset becomes very difficult as it grows large, and finding top-k dominant values in it is a challenging procedure. Some algorithms exist to improve this process, but most are efficient only on small incomplete datasets. One of the algorithms that makes the top-k dominance (TKD) query applicable is the Bitmap Index Guided (BIG) algorithm. This algorithm strongly improves performance on incomplete data, but it was not designed for, and is not capable of, finding top-k dominant values in incomplete big data. Several other algorithms have been proposed to answer the TKD query, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply the TKD query to incomplete data; however, all of them either performed weakly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) to deal with the aforementioned issues. MRBIG uses the MapReduce framework to enhance the performance of top-k dominance queries on huge incomplete datasets. The framework distributes the tasks across multiple computing nodes that work independently and simultaneously to find the result. This method achieved up to two times faster processing in finding the TKD query result in comparison to previously presented algorithms.
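To make the top-k dominance query concrete, the following is a minimal sketch, not the MRBIG or BIG algorithm. It assumes a dominance definition the abstract does not spell out: missing values are encoded as None, smaller values are better, and one object dominates another if it is no worse on every dimension observed in both and strictly better on at least one. The function names and the map/reduce split are hypothetical and run in a single process purely for illustration.

```python
from collections import Counter

def dominates(a, b):
    """True if a dominates b on the dimensions observed in both objects.

    Assumption: None marks a missing value and smaller values are better.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return (bool(shared)
            and all(x <= y for x, y in shared)
            and any(x < y for x, y in shared))

def map_partition(partition, dataset):
    """'Map' step: count, for each object in one partition, how many objects it dominates."""
    scores = Counter()
    for oid, obj in partition:
        scores[oid] = sum(
            dominates(obj, other)
            for other_id, other in dataset.items()
            if other_id != oid
        )
    return scores

def top_k_dominating(dataset, k, n_partitions=2):
    """'Reduce' step: merge the partial scores and keep the k highest-scoring objects."""
    items = list(dataset.items())
    partitions = [items[i::n_partitions] for i in range(n_partitions)]
    scores = Counter()
    for partition in partitions:
        scores.update(map_partition(partition, dataset))
    # Sort by descending score, breaking ties by object id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

In a real MapReduce deployment each partition would be scored on a separate node; here the partitions are processed in a loop only to show how the work could be divided.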
Computing All Restricted Skyline Probabilities on Uncertain Datasets
The restricted skyline (rskyline) query is widely used in multi-criteria decision
making. It generalizes the skyline query by additionally considering a set of
personalized scoring functions F. Since uncertainty is inherent in datasets for
multi-criteria decision making, we study rskyline queries on uncertain datasets
from both complexity and algorithmic perspectives. We formalize the problem of
computing rskyline probabilities of all data items and show that no algorithm
can solve this problem in truly subquadratic time, unless the orthogonal
vectors conjecture fails. Considering that linear scoring functions are widely
used in practical applications, we propose two efficient algorithms for the
case where F is a set of linear scoring functions whose weights are
described by linear constraints, one with near-optimal time complexity and the
other with better expected time complexity. For special linear constraints
involving a series of weight ratios, we further devise an algorithm with
sublinear query time and polynomial preprocessing time. Extensive experiments
demonstrate the effectiveness, efficiency, scalability, and usefulness of our
proposed algorithms. Comment: Full version, a shorter version to appear in ICDE 202
Probabilistic Skyline Queries over Uncertain Moving Objects
Data uncertainty inherently exists in a large number of applications due to factors such as limitations of measuring equipment, update delay, and network bandwidth. Recently, modeling and querying uncertain data have attracted considerable attention from the database community. However, how to perform advanced analysis on uncertain data remains an interesting question. In this paper, we focus on the execution of skyline computation over uncertain moving objects. We propose a novel probabilistic skyline model in which an uncertain object has some probability of being in the skyline at a given time point; a p-t-skyline therefore contains those moving objects whose skyline probabilities are at least p at time point t. Computing the probabilistic skyline over a large number of uncertain moving objects is a daunting task in practice. In order to compute the probabilistic skyline query efficiently, we propose a discrete-and-conquer strategy, which follows the sampling-bounding-pruning-refining procedure. To further reduce the skyline computation cost, we propose an enhanced framework that is based on a multi-dimensional indexing structure combined with the discrete-and-conquer strategy. Through extensive experiments with synthetic datasets, we show that the framework can efficiently support skyline queries over uncertain moving objects and is scalable on large data sets.
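The p-t-skyline definition can be illustrated with a small sketch. It is not the paper's discrete-and-conquer strategy: it covers only a naive, sample-based probability estimate at a single time point and omits the bounding, pruning, and refining steps as well as the index. It assumes that each object is represented by equally likely sampled positions at time t, that objects are independent, and that smaller values are better; all function names are hypothetical.

```python
def dominates(a, b):
    """True if point a dominates point b (assumption: smaller values are better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline_probability(oid, samples):
    """Estimate the probability that object `oid` is in the skyline.

    `samples` maps each object id to its equally likely sampled positions
    at one time point t; objects are assumed to be independent.
    """
    own = samples[oid]
    total = 0.0
    for point in own:
        not_dominated = 1.0
        for other_id, other_points in samples.items():
            if other_id == oid:
                continue
            # Fraction of the other object's samples that dominate this point.
            frac = sum(dominates(o, point) for o in other_points) / len(other_points)
            not_dominated *= 1.0 - frac
        total += not_dominated
    return total / len(own)

def p_t_skyline(samples, p):
    """Objects whose estimated skyline probability at time t is at least p."""
    return {oid for oid in samples if skyline_probability(oid, samples) >= p}
```

This brute-force estimate compares every sample against every other object, which is the cost that bounding and pruning are meant to avoid.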
Discovering Attractive Products based on Influence Sets
Skyline queries have been widely used as a practical tool for multi-criteria
decision analysis and for applications involving preference queries. For
example, in a typical online retail application, skyline queries can help
customers select the most interesting products from a pool of available ones.
Recently, reverse skyline queries have been proposed, highlighting the
manufacturer's perspective, i.e. how to determine the expected buyers of a
given product. In this work we develop novel algorithms for two important
classes of queries involving customer preferences. We first propose a novel
algorithm, termed RSA, for answering reverse skyline queries. We then
introduce a new type of query, namely the k-Most Attractive Candidates (k-MAC)
query. In this type of query, given a set of existing product specifications
P, a set of customer preferences C and a set of new candidate products Q, the
k-MAC query returns the set of k candidate products from Q that jointly
maximizes the total number of expected buyers, measured as the cardinality of
the union of individual reverse skyline sets (i.e., influence sets). Applying
existing approaches to solve this problem would require calculating the reverse
skyline set for each candidate, which is prohibitively expensive for large data
sets. We, thus, propose a batched algorithm for this problem and compare its
performance against a branch-and-bound variant that we devise. Both of these
algorithms use variants of our RSA algorithm at their core. Our experimental
study using both synthetic and real data sets demonstrates that our proposed
algorithms outperform existing or naive solutions to our studied classes of
queries.
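The k-MAC objective described above can be sketched as an exhaustive search. This is neither the batched nor the branch-and-bound algorithm from the abstract, and it does not compute reverse skylines: it assumes the influence set of every candidate in Q is already available as a set of customer ids. The function name `k_mac` and the input layout are hypothetical.

```python
from itertools import combinations

def k_mac(influence_sets, k):
    """Return the k candidates whose influence sets have the largest union.

    `influence_sets` maps each candidate product id to the set of customers
    in its (precomputed) reverse skyline. Exhaustive search, for illustration only.
    """
    best, best_size = None, -1
    for combo in combinations(sorted(influence_sets), k):
        covered = set().union(*(influence_sets[q] for q in combo))
        if len(covered) > best_size:
            best, best_size = combo, len(covered)
    if best is None:  # fewer than k candidates available
        return set(), 0
    return set(best), best_size
```

The search enumerates every k-subset of candidates, so it is only practical for very small inputs; it is meant to pin down the objective, not to be efficient.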