47,576 research outputs found
Crowdsourcing for Top-K Query Processing over Uncertain Data
Querying uncertain data has become a prominent application due to the proliferation of user-generated content from social media and of data streams from sensors. When data ambiguity cannot be reduced algorithmically, crowdsourcing proves a viable approach, which consists of posting tasks to humans and harnessing their judgment for improving the confidence about data values or relationships. This paper tackles the problem of processing top- K queries over uncertain data with the help of crowdsourcing for quickly converging to the realordering of relevant results. Several offline and online approaches for addressing questions to a crowd are defined and contrasted on both synthetic and real data sets, with the aim of minimizing the crowd interactions necessary to find the realordering of the result set
Efficient SUM Query Processing over Uncertain Data
Selected as one of the best papersNational audienceSUM queries are crucial for many applications that need to deal with probabilistic data. In this paper, we are interested in the queries, called ALL_SUM, that return all possible sum values and their probabilities. In general, there is no efficient solution for the problem of evaluating ALL_SUM queries. But, for many practical applications, where aggregate values are small integers or real numbers with small precision, it is possible to develop efficient solutions. In this paper, based on a recursive approach, we propose a new solution for this problem. We implemented our solution and conducted an extensive experimental evaluation over synthetic and real-world data sets; the results show its effectiveness
Query Join Processing over Uncertain Data for Decision Tree Classifiers
Traditional decision tree classifiers work with the data whose values are known and precise. We can also extend those classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty measurement/quantization errors, data staleness, and multiple repeated measurements. Rather than abstracting uncertain data by statistical derivatives, such as mean and median, the accuracy of a decision tree classifier can be improved much if the complete information of a data item is used by utilizing the Probability Density Function (PDF). In particular, an attribute value can be modelled as a range of possible values, associated with a PDF. The PDF function has only addressed simple queries such as range and nearestneighbour queries. Queries that join multiple relations have not been addressed with PDF. Despite the significance of joins in databases, we address join queries over uncertain data. We propose semantics for the join operation, define probabilistic operators over uncertain data, and propose join algorithms that provide efficient execution of probabilistic joins especially threshold. In which we avoid the semantic complexities that deals with uncertain data. For this class of joins we develop three sets of optimization techniques: item-level, page-level, and index-level pruning. We will compare the performance of these techniques experimentally
Probabilistic threshold range aggregate query processing over uncertain data
Uncertainty is inherent in many novel and important applications such as market surveillance, information extraction sensor data analysis, etc. In the recent a few decades, uncertain data has attracted considerable research attention. There are various factors that cause the uncertainty, for instance randomness or incompleteness of data, limitations of equipment and delay or loss in data transfer.
A probabilistic threshold range aggregate (PRTA) query retrieves summarized information about the uncertain objects in the database satisfying a range query, with respect to a given probability threshold. This thesis is trying to address and handle this important type of query which there is no previous work studying on.
We formulate the problem in both discrete and continuous uncertain data model and develop a novel index structure, asU-tree (aggregate-based sampling-auxiliary U-tree) which not only supports exact query answering but also provides approximate results with accuracy guarantee if efficiency is more concerned. The new asU-tree structure is totally dynamic. Query processing algorithms for both exact answer and approximate answer based on this new index structure are also proposed. An extensive experimental study shows that asU-tree is very efficient and effective over real and synthetic datasets
Query Join Processing over Uncertain Data for Decision Tree Classifiers
Traditional decision tree classifiers work with the data whose values are known and precise. We can also extend those classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty measurement/quantization errors, data staleness, and multiple repeated measurements. Rather than abstracting uncertain data by statistical derivatives, such as mean and median, the accuracy of a decision tree classifier can be improved much if the complete information of a data item is used by utilizing the Probability Density Function (PDF). In particular, an attribute value can be modelled as a range of possible values, associated with a PDF. The PDF function has only addressed simple queries such as range and nearestneighbour queries. Queries that join multiple relations have not been addressed with PDF. Despite the significance of joins in databases, we address join queries over uncertain data. We propose semantics for the join operation, define probabilistic operators over uncertain data, and propose join algorithms that provide efficient execution of probabilistic joins especially threshold. In which we avoid the semantic complexities that deals with uncertain data. For this class of joins we develop three sets of optimization techniques: item-level, page-level, and index-level pruning. We will compare the performance of these techniques experimentally
Efficient Processing of Continuous Skyline
The analyzing and processing of multisource real-time transportation data stream lay a foundation for the smart transportation's sensibility, interconnection, integration, and real-time decision making. Strong computing ability and valid mass data management mode provided by the cloud computing, is feasible for handling Skyline continuous query in the mass distributed uncertain transportation data stream. In this paper, we gave architecture of layered smart transportation about data processing, and we formalized the description about continuous query over smart transportation data Skyline. Besides, we proposed mMR-SUDS algorithm (Skyline query algorithm of uncertain transportation stream data based on micro-batchinMap Reduce) based on sliding window division and architecture
Recommended from our members
Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty
Data management is becoming increasingly important in many applications, in particular, in large scientific databases where (1) data can be naturally modeled by continuous random variables, and (2) queries can involve complex predicates and/or be difficult for users to express explicitly. My thesis work aims to provide efficient support to both the data uncertainty and the query uncertainty .
When data is uncertain, an important class of queries requires query answers to be returned if their existence probabilities pass a threshold. I start with optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting selections by reducing dimensionality of integration and using faster filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing a query plan using a dynamic, per-tuple based approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of my techniques and the dynamic query planning has over 50% performance gains in most cases over a state-of-the-art threshold query optimizer and is very close to the optimal planning in all cases.
Next I address uncertain data management in the array model, which has gained popu- larity for scientific data processing recently due to performance benefits. I define the formal semantics of array operations on uncertain data involving both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes, and propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds the overhead of querying by strategically placing a few replicas of the tuples with large variances. Evaluation results show that for common workloads, my best-performing techniques outperform baselines up to 1 to 2 orders of magnitude while incurring only small storage overhead.
Finally, to bridge the increasing gap between the fast growth of data and the limited human ability to comprehend data and help the user retrieve high-value content from data more effectively, I propose to build interactive data exploration as a new database service, using an approach called “explore-by-example”. To build an effective system, my work is grounded in a rigorous SVM-based active learning framework and focuses on the following three problems: (i) accuracy-based and convergence-based stopping criteria, (ii) expediting example acquisition in each iteration, and (iii) expediting the final result retrieval. Evaluation results using real-world data and query patterns show that my system significantly outperforms state-of-the-art systems in accuracy (18x accuracy improvement for 4-dimensional workloads) while achieving desired efficiency for interactive exploration (2 to 5 seconds per iteration)
- …