
    Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework

    Incomplete data is a major kind of multi-dimensional dataset that has randomly distributed missing nodes in its dimensions. Retrieving information from this type of dataset becomes very difficult as it grows, and finding top-k dominant values in it is a challenging procedure. Some algorithms exist to improve this process, but most are efficient only on small incomplete datasets. One algorithm that makes top-k dominance (TKD) queries possible is the Bitmap Index Guided (BIG) algorithm. It strongly improves performance on incomplete data, but it was not designed to find top-k dominant values in incomplete big data and cannot do so. Several other algorithms have been proposed to answer TKD queries, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply TKD queries to incomplete data; however, all of them performed weakly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) to address these issues. MRBIG uses the MapReduce framework to speed up top-k dominance queries on huge incomplete datasets. The framework divides the work among several computing nodes, which operate independently and simultaneously to find the result. The method achieves up to two times faster processing in finding the TKD query result than previously presented algorithms.
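To make the TKD query in the abstract above concrete, here is a minimal sketch of a naive baseline with a map step and a reduce step. It is not BIG or MRBIG. It assumes the dominance definition commonly used for incomplete data (compare only the dimensions both objects have observed), that smaller values are better, and that missing values are encoded as `None`. The names `dominates`, `map_scores`, and `top_k_dominating` are hypothetical, and the "map" phase runs sequentially here rather than on separate nodes.

```python
from heapq import nlargest
from typing import List, Optional, Sequence, Tuple

Point = Sequence[Optional[float]]  # None marks a missing value (assumption)


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q on every dimension both have observed
    and strictly better on at least one (smaller is better)."""
    strictly_better = False
    for a, b in zip(p, q):
        if a is None or b is None:
            continue  # skip dimensions that are missing in either object
        if a > b:
            return False
        if a < b:
            strictly_better = True
    return strictly_better


def map_scores(chunk: Sequence[int], data: List[Point]) -> List[Tuple[int, int]]:
    """Map step: for each index in the chunk, count the objects it dominates."""
    return [
        (i, sum(dominates(data[i], q) for j, q in enumerate(data) if j != i))
        for i in chunk
    ]


def top_k_dominating(data: List[Point], k: int, n_chunks: int = 2) -> List[Tuple[int, int]]:
    """Reduce step: merge the per-chunk scores and keep the k highest."""
    chunks = [range(start, len(data), n_chunks) for start in range(n_chunks)]
    partial = [map_scores(chunk, data) for chunk in chunks]  # would run in parallel
    merged = [pair for part in partial for pair in part]
    return nlargest(k, merged, key=lambda pair: pair[1])


data = [(1, None, 2), (2, 3, None), (None, 1, 3), (3, 4, 4)]
print(top_k_dominating(data, k=2))  # [(0, 3), (2, 2)]
```

Each result pair is (object index, number of objects dominated). Every map task still scans the whole dataset, so this baseline does not reduce the quadratic number of comparisons; it only spreads them across workers.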

    A model for computing skyline data items in cloud incomplete databases

    Skyline queries aim to retrieve the most superior data items in the database that best fit the user’s given preferences. However, processing skyline queries is expensive and difficult when applied to large distributed databases such as cloud databases. It becomes even more complicated if these distributed databases have missing values in certain dimensions. The effect of data incompleteness on the skyline process is severe because missing values cause the transitivity property of the skyline technique to no longer hold and lead to the problem of cyclic dominance. This paper proposes an efficient model for computing skyline data items in cloud incomplete databases. The model aims to reduce the number of domination tests between data items, the processing time, and the amount of data transferred among the involved datacenters. Various sets of experiments are conducted over two different types of datasets, and the results demonstrate that the proposed solution outperforms previous approaches in terms of domination tests, processing time, and amount of data transferred.
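The loss of transitivity and the resulting cyclic dominance mentioned above can be illustrated with a small sketch. It assumes that dominance is tested only on dimensions observed in both items, that smaller values are better, and that missing values are encoded as `None`. The helper names and the three example items are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence

Point = Sequence[Optional[float]]  # None marks a missing value (assumption)


def dominates(p: Point, q: Point) -> bool:
    """Dominance restricted to the dimensions observed in both items."""
    strictly_better = False
    for a, b in zip(p, q):
        if a is None or b is None:
            continue
        if a > b:
            return False
        if a < b:
            strictly_better = True
    return strictly_better


def skyline(data: List[Point]) -> List[Point]:
    """Items that no other item dominates (naive pairwise test)."""
    return [p for p in data if not any(dominates(q, p) for q in data if q is not p)]


# Each pair of items shares exactly one observed dimension.
a = (1, None, 5)
b = (2, 1, None)
c = (None, 2, 4)

print(dominates(a, b), dominates(b, c), dominates(c, a))  # True True True
print(dominates(a, c))  # False: transitivity does not hold
print(skyline([a, b, c]))  # []
```

Because a dominates b, b dominates c, and c dominates a, every item is dominated by some other item and the naive skyline is empty. This is also why pruning rules that rely on transitivity cannot be applied directly to incomplete data.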

    Skyline queries over incomplete multidimensional database

    In recent years, there has been much focus on skyline queries, which provide more flexible query operators that return the data items that dominate other data items in all attributes (dimensions). Several skyline techniques have been proposed in the literature. Most of them compute the skyline query results by assuming that the dimension values are always present for every data item. In this paper we aim to evaluate skyline preference queries in which some dimension values are missing. We propose an approach for answering preference queries in a database by utilizing the concept of the skyline technique. The skyline set selected for a given query operation is then optimized so that the missing values are replaced with approximate values that provide a skyline answer with complete data. This significantly reduces the number of comparisons between data items. In addition, the number of retrieved skyline data items is reduced, which guides users in selecting the most appropriate data items from the several alternative complete skyline data items.

    Missing values estimation for skylines in incomplete database

    Incompleteness of data is a common problem in many databases, including web heterogeneous databases, multi-relational databases, spatial and temporal databases, and data integration. It introduces challenges in query processing, as providing accurate results that best meet the query conditions over an incomplete database is not a trivial task. Several techniques have been proposed to process queries in incomplete databases. Some of these techniques retrieve the query results based on the existing values rather than estimating the missing values. Such techniques are undesirable in many cases, as the dimensions with missing values might be the important dimensions of the user’s query. Besides, the output is incomplete and might not satisfy the user’s preferences. In this paper we propose an approach that estimates missing values in skylines to guide users in selecting the most appropriate skylines from the several candidate skylines. The approach mines attribute correlations to generate Approximate Functional Dependencies (AFDs) that capture the relationships between the dimensions. It also identifies the strength of probability correlations to estimate the values. The skylines with estimated values are then ranked. By doing so, we ensure that the retrieved skylines are in the order of their estimated precision.

    Skyline queries computation on crowdsourced-enabled incomplete database

    Data incompleteness has become a frequent phenomenon in a large number of contemporary database applications such as web autonomous databases, big data, and crowd-sourced databases. Processing skyline queries over incomplete databases imposes a number of challenges. Most importantly, the skylines derived from incomplete databases are also incomplete, in that some values are missing. Retrieving skylines with missing values is undesirable, particularly for recommendation and decision-making systems. Furthermore, running skyline queries on a database with incomplete data raises issues such as losing the transitivity property of the skyline technique and cyclic dominance between the tuples. The issue of estimating the missing values of skylines has been discussed and examined in the database literature. Most recently, several studies have suggested exploiting crowd-sourced databases to estimate the missing values by generating plausible values using the crowd. Crowd-sourced databases have proved to be a powerful way to perform user-given tasks by integrating human intelligence and experience. However, task processing using crowd-sourcing incurs additional monetary cost and increases latency. Also, it is not always possible to produce a satisfactory result that meets the user's preferences. This paper proposes an approach for estimating the missing values of the skylines by first exploiting the available data and utilizing the implicit relationships between the attributes to impute the missing values. This process aims at reducing the number of values to be estimated using the crowd when local estimation is inappropriate. Extensive experiments on both synthetic and real datasets have been conducted. The experimental results show that the proposed approach for estimating the missing values of the skylines over crowd-sourced enabled incomplete databases is scalable and outperforms the other existing approaches.

    Policy-Aware Unbiased Learning to Rank for Top-k Rankings

    Counterfactual Learning to Rank (LTR) methods optimize ranking systems using logged user interactions that contain interaction biases. Existing methods are only unbiased if users are presented with all relevant items in every ranking. There is currently no existing counterfactual unbiased LTR method for top-k rankings. We introduce a novel policy-aware counterfactual estimator for LTR metrics that can account for the effect of a stochastic logging policy. We prove that the policy-aware estimator is unbiased if every relevant item has a non-zero probability to appear in the top-k ranking. Our experimental results show that the performance of our estimator is not affected by the size of k: for any k, the policy-aware estimator reaches the same retrieval performance while learning from top-k feedback as when learning from feedback on the full ranking. Lastly, we introduce novel extensions of traditional LTR methods to perform counterfactual LTR and to optimize top-k metrics. Together, our contributions introduce the first policy-aware unbiased LTR approach that learns from top-k feedback and optimizes top-k metrics. As a result, counterfactual LTR is now applicable to the very prevalent top-k ranking setting in search and recommendation. Comment: SIGIR 2020 full conference paper
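The idea of accounting for a stochastic logging policy in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes a position-based examination model in which items ranked below the cutoff k are never examined, and it takes an item's propensity to be its expected examination probability over the rankings the logging policy may show. The function names, the example policy, and the examination probabilities are all hypothetical.

```python
from typing import Dict, Iterable, Sequence, Tuple

Ranking = Tuple[str, ...]


def policy_aware_propensities(
    policy: Dict[Ranking, float], exam_at_rank: Sequence[float]
) -> Dict[str, float]:
    """Expected examination probability per item under a stochastic policy.

    policy maps each ranking to its display probability; exam_at_rank[i] is the
    assumed examination probability at rank i, and its length is the cutoff k.
    """
    k = len(exam_at_rank)
    propensity: Dict[str, float] = {}
    for ranking, prob in policy.items():
        for rank, doc in enumerate(ranking):
            seen = exam_at_rank[rank] if rank < k else 0.0  # cut off below top-k
            propensity[doc] = propensity.get(doc, 0.0) + prob * seen
    return propensity


def ips_estimate(
    clicked: Iterable[str], weights: Dict[str, float], propensity: Dict[str, float]
) -> float:
    """Inverse-propensity-weighted sum of metric weights over clicked items."""
    return sum(weights[doc] / propensity[doc] for doc in clicked)


# Two rankings shown with equal probability, cutoff k = 2.
policy = {("a", "b", "c"): 0.5, ("c", "a", "b"): 0.5}
props = policy_aware_propensities(policy, exam_at_rank=[1.0, 0.5])
print(props)  # {'a': 0.75, 'b': 0.25, 'c': 0.5}
print(ips_estimate(["b"], {"b": 1.0}, props))  # 4.0
```

Under a deterministic policy that always shows ("a", "b", "c"), item "c" would never appear in the top 2 and its propensity would be zero, so no reweighting could correct for it. Mixing in the second ranking gives every item a non-zero propensity, which is the condition the abstract states for unbiasedness.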

    Deriving skyline points over dynamic and incomplete databases

    The rapid growth of data is inevitable, and retrieving the best results that meet the user’s preferences is essential. To achieve this, skylines were introduced, in which data items that are not dominated by other data items in the database are retrieved as results (skylines). Most existing skyline approaches assume that databases are static and complete. However, in real-world scenarios, databases are not complete, especially multidimensional databases in which some dimensions may have missing values. Databases might also be dynamic, with new data items inserted while existing data items are deleted or updated. Blindly performing pairwise comparisons on all data items after the changes are made is inappropriate, as not all data items need to be compared to identify the skylines. Thus, a novel skyline algorithm, DInSkyline, is proposed in this study, which finds the most relevant data items in dynamic and incomplete databases. Several experiments have been conducted, and the results show that DInSkyline outperforms previous works by reducing the number of pairwise comparisons by 52% to 73%.