19 research outputs found
Indexing Uncertain Categorical Data over Distributed Environment
Today, a large amount of uncertain data is produced by many applications, and traditional database management systems, including their indexing methods, are not suited to handling such data. In this paper, we propose an inverted-index-based method for efficiently searching uncertain categorical data over distributed environments. We address two kinds of queries over distributed uncertain databases: distributed probabilistic threshold queries, which return all results satisfying the query with probabilities that meet a probabilistic threshold requirement, and distributed top-k queries, which return results while optimizing tuple transfer and processing time.
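To make the two query types concrete, here is a minimal single-node sketch of an inverted index over uncertain categorical values. It is not the paper's distributed method: the data layout (each tuple carrying a distribution over categories) and the function names `build_inverted_index`, `threshold_query`, and `top_k_query` are assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(tuples):
    """Map each category to a posting list of (probability, tuple_id).

    `tuples` is assumed to be a dict: tuple_id -> {category: probability}.
    Posting lists are sorted by descending probability so that queries
    can stop scanning early.
    """
    index = defaultdict(list)
    for tid, dist in tuples.items():
        for category, prob in dist.items():
            index[category].append((prob, tid))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[0], p[1]))
    return dict(index)


def threshold_query(index, category, tau):
    """Return (tuple_id, prob) pairs with P(value == category) >= tau."""
    result = []
    for prob, tid in index.get(category, []):
        if prob < tau:
            break  # list is sorted, so no later entry can qualify
        result.append((tid, prob))
    return result


def top_k_query(index, category, k):
    """Return the k tuples most likely to take the given category."""
    return [(tid, prob) for prob, tid in index.get(category, [])[:k]]


# Hypothetical example data: three tuples with an uncertain colour attribute.
data = {
    "t1": {"red": 0.7, "blue": 0.3},
    "t2": {"red": 0.2, "green": 0.8},
    "t3": {"red": 0.5, "blue": 0.5},
}
index = build_inverted_index(data)
```

In a distributed setting each site would presumably hold such posting lists for its own partition, and the coordinator would merge partial answers; how the paper limits tuple transfer is not stated in the abstract.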
Infinite Probabilistic Databases
Probabilistic databases (PDBs) are used to model uncertainty in data in a quantitative way. In the standard formal framework, PDBs are finite probability spaces over relational database instances. It has been argued convincingly that this is not compatible with an open-world semantics (Ceylan et al., KR 2016) and with application scenarios that are modeled by continuous probability distributions (Dalvi et al., CACM 2009).
We recently introduced a model of PDBs as infinite probability spaces that addresses these issues (Grohe and Lindner, PODS 2019). While that work was mainly concerned with countably infinite probability spaces, our focus here is on uncountable spaces. Such an extension is necessary to model typical continuous probability distributions that appear in many applications. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerned with the measurability of events and queries and ultimately with the question whether queries have a well-defined semantics.
It turns out that so-called finite point processes are the appropriate model from probability theory for dealing with probabilistic databases. This model allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and Datalog queries.
Probabilistic Data with Continuous Distributions
Statistical models of real world data typically involve continuous
probability distributions such as normal, Laplace, or exponential
distributions. Such distributions are supported by many probabilistic modelling
formalisms, including probabilistic database systems. Yet, the traditional
theoretical framework of probabilistic databases focusses entirely on finite
probabilistic databases.
Only recently, we set out to develop the mathematical theory of infinite
probabilistic databases. The present paper is an exposition of two recent
papers which are cornerstones of this theory. In (Grohe, Lindner; ICDT 2020) we
propose a very general framework for probabilistic databases, possibly
involving continuous probability distributions, and show that queries have a
well-defined semantics in this framework. In (Grohe, Kaminski, Katoen, Lindner;
PODS 2020) we extend the declarative probabilistic programming language
Generative Datalog, proposed by (Bárány et al. 2017) for discrete
probability distributions, to continuous probability distributions and show
that such programs yield generative models of continuous probabilistic
databases.
Querying Incomplete Numerical Data: Between Certain and Possible Answers
Capturing Data Uncertainty in High-Volume Stream Processing
We present the design and development of a data stream system that captures
data uncertainty from data collection to query processing to final result
generation. Our system focuses on data that is naturally modeled as continuous
random variables. For such data, our system employs an approach grounded in
probability and statistical theory to capture data uncertainty and integrates
this approach into high-volume stream processing. The first component of our
system captures uncertainty of raw data streams from sensing devices. Since
such raw streams can be highly noisy and may not carry sufficient information
for query processing, our system employs probabilistic models of the data
generation process and stream-speed inference to transform raw data into a
desired format with an uncertainty metric. The second component captures
uncertainty as data propagates through query operators. To efficiently quantify
result uncertainty of a query operator, we explore a variety of techniques
based on probability and statistical theory to compute the result distribution
at stream speed. We are currently working with a group of scientists to
evaluate our system using traces collected from the domains of (and eventually
in the real systems for) hazardous weather monitoring and object tracking and
monitoring.
Comment: CIDR 200
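As a rough illustration of how uncertainty can propagate through a query operator, the sketch below pushes Gaussian-distributed readings through windowed SUM and AVG. It assumes independent, normally distributed inputs, in which case means and variances simply add; the abstract does not say which techniques the system actually uses, and the names `Gaussian`, `sum_window`, and `avg_window` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """An uncertain reading modelled as a normal random variable."""
    mean: float
    var: float


def sum_window(values):
    """SUM over independent Gaussians: means add and variances add."""
    return Gaussian(
        mean=sum(v.mean for v in values),
        var=sum(v.var for v in values),
    )


def avg_window(values):
    """AVG is SUM scaled by 1/n, so the variance scales by 1/n^2."""
    n = len(values)
    total = sum_window(values)
    return Gaussian(mean=total.mean / n, var=total.var / (n * n))


# Hypothetical window of three sensor readings.
window = [Gaussian(20.0, 1.0), Gaussian(22.0, 4.0), Gaussian(21.0, 4.0)]
window_sum = sum_window(window)
window_avg = avg_window(window)
```

The closed form only holds under the independence and normality assumptions; correlated or non-Gaussian inputs would need other techniques such as sampling or approximation.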
Probabilistic Shortest Time Queries Over Uncertain Road Networks
In many real applications such as location-based services (LBS), map utilities, trip planning, and transportation systems, it is useful and important to provide query services over spatial road networks. Nowadays we can easily obtain rich traffic information such as the speeds of vehicles on roads. However, due to device inaccuracy or integration inconsistencies, the traffic data (i.e., speeds) are often imprecise and uncertain. In this paper, we model road networks as uncertain graphs, whose edges are associated with probabilistic velocities. We formalize the problem of the probabilistic shortest time query (PSTQ), and we propose time bound pruning and probabilistic bound pruning to filter out false alarms. Moreover, we design offline pre-computation to facilitate PSTQ processing.
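The idea of bounding travel time before doing any probabilistic work can be sketched as follows. This is one assumed reading of the setting, not the paper's algorithm: each edge has a length and a discrete distribution over speeds, edges are independent, and the question is the probability that a fixed path is traversed within a time budget. The names `path_time_bounds` and `prob_within` are hypothetical.

```python
from itertools import product


def path_time_bounds(path):
    """Best-case and worst-case travel time of a path.

    `path` is a list of (length, [(speed, probability), ...]) edges,
    with all speeds assumed strictly positive.
    """
    lo = sum(length / max(v for v, _ in speeds) for length, speeds in path)
    hi = sum(length / min(v for v, _ in speeds) for length, speeds in path)
    return lo, hi


def prob_within(path, budget):
    """P(travel time <= budget), assuming independent edge speeds."""
    lo, hi = path_time_bounds(path)
    if lo > budget:
        return 0.0  # pruned: even the fastest case misses the budget
    if hi <= budget:
        return 1.0  # pruned: even the slowest case meets the budget
    total = 0.0
    # Otherwise enumerate every combination of per-edge speeds.
    for combo in product(*(speeds for _, speeds in path)):
        time = sum(length / v for (length, _), (v, _) in zip(path, combo))
        if time <= budget:
            p = 1.0
            for _, pr in combo:
                p *= pr
            total += p
    return total


# Hypothetical two-edge path: lengths in km, speeds in km/h.
path = [
    (10.0, [(50.0, 0.5), (25.0, 0.5)]),
    (20.0, [(40.0, 0.8), (20.0, 0.2)]),
]
```

Exhaustive enumeration is exponential in the number of edges, which is why cheap bounds that settle the answer without enumeration are attractive.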
Tuple-Independent Representations of Infinite Probabilistic Databases
Probabilistic databases (PDBs) are probability spaces over database
instances. They provide a framework for handling uncertainty in databases, as
occurs due to data integration, noisy data, data from unreliable sources or
randomized processes. Most of the existing theory literature investigated
finite, tuple-independent PDBs (TI-PDBs) where the occurrences of tuples are
independent events. Only recently, Grohe and Lindner (PODS '19) introduced
independence assumptions for PDBs beyond the finite domain assumption. In the
finite setting, a major argument for discussing the theoretical properties of TI-PDBs
is that they can be used to represent any finite PDB via views. This is no
longer the case once the number of tuples is countably infinite. In this paper,
we systematically study the representability of infinite PDBs in terms of
TI-PDBs and the related block-independent disjoint PDBs.
The central question is which infinite PDBs are representable as first-order
views over tuple-independent PDBs. We give a necessary condition for the
representability of PDBs and provide a sufficient criterion for
representability in terms of the probability distribution of a PDB. With
various examples, we explore the limits of our criteria. We show that
conditioning on first-order properties yields no additional power in terms of
expressivity. Finally, we discuss the relation between purely logical and
arithmetic reasons for (non-)representability.
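For readers unfamiliar with the finite case that this abstract starts from, the following sketch shows a finite tuple-independent PDB: each fact has its own marginal probability, a possible world is a set of facts, and its probability is the product over all facts of "present" or "absent". The helper names and the example facts are hypothetical, and the brute-force enumeration is for illustration only; it says nothing about the infinite setting studied in the paper.

```python
from itertools import combinations


def world_probability(marginals, world):
    """Probability of one possible world under tuple independence."""
    p = 1.0
    for fact, pf in marginals.items():
        p *= pf if fact in world else 1.0 - pf
    return p


def query_probability(marginals, query):
    """Sum the probabilities of all worlds in which `query` holds.

    `query` is any function from a frozenset of facts to bool.
    Enumerates all 2^n worlds, so it is only usable for tiny examples.
    """
    facts = list(marginals)
    total = 0.0
    for r in range(len(facts) + 1):
        for subset in combinations(facts, r):
            world = frozenset(subset)
            if query(world):
                total += world_probability(marginals, world)
    return total


# Hypothetical TI-PDB with three facts over relations R and S.
marginals = {("R", "a"): 0.5, ("R", "b"): 0.4, ("S", "a"): 0.9}


def join_query(world):
    """Is there an x with both R(x) and S(x)? Only x = 'a' can qualify."""
    return ("R", "a") in world and ("S", "a") in world


def r_nonempty(world):
    """Does relation R contain at least one fact?"""
    return any(rel == "R" for rel, _ in world)
```

Here `query_probability(marginals, join_query)` agrees with the direct product 0.5 × 0.9, and `query_probability(marginals, r_nonempty)` agrees with 1 − (1 − 0.5)(1 − 0.4), which is what independence of the tuple events predicts.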
Accuracy-Aware Uncertain Stream Databases
Previous work has introduced probability distributions as first-class components in uncertain stream database systems. What is missing is a notion of how accurate these probability distributions are, which has a profound impact on the accuracy of the query results presented to end users. While some previous work studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge we are the first to consider an uncertain stream database in which accuracy is taken into account all the way from the distributions learned from raw data samples to the query results. We perform an initial study of the components of an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain the accuracy of query results. In addition, we propose novel predicates based on hypothesis testing for decision-making using data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations.
I. INTRODUCTION
Recent research has extended stream databases to handle uncertain data in order to meet the requirements of ever-increasing applications in sensor networks and ubiquitous computing. Where do we obtain the probabilities in the first place? In many applications, probability distributions are learned from observations and measurements, a.k.a. samples. Such applications include sensor networks, ubiquitous computing, and scientific databases. Let us look at an example.
Example 1 (accuracy of learned probability distributions). A few projects in both academia and industry (e.g., the CarTel project at MIT [24
Infinite Probabilistic Databases
Probabilistic databases (PDBs) model uncertainty in data in a quantitative
way. In the established formal framework, probabilistic (relational) databases
are finite probability spaces over relational database instances. This
finiteness can clash with intuitive query behavior (Ceylan et al., KR 2016),
and with application scenarios that are better modeled by continuous
probability distributions (Dalvi et al., CACM 2009).
We formally introduced infinite PDBs in (Grohe and Lindner, PODS 2019) with a
primary focus on countably infinite spaces. However, an extension beyond
countable probability spaces raises nontrivial foundational issues concerned
with the measurability of events and queries and ultimately with the question
whether queries have a well-defined semantics.
We argue that finite point processes are an appropriate model from
probability theory for dealing with general probabilistic databases. This
allows us to construct suitable (uncountable) probability spaces of database
instances in a systematic way. Our main technical results are measurability
statements for relational algebra queries as well as aggregate queries and
Datalog queries.
Comment: This is the full version of the paper "Infinite Probabilistic
Databases" presented at ICDT 2020 (arXiv:1904.06766)