A General Framework for Anytime Approximation in Probabilistic Databases
Anytime approximation algorithms that compute the probabilities of queries
over probabilistic databases can be of great use to statistical learning tasks.
Those approaches have been based so far on either (i) sampling or (ii)
branch-and-bound with model-based bounds. We present here a more general
branch-and-bound framework that extends the possible bounds by using
'dissociation', which yields tighter bounds.
Comment: 3 pages, 2 figures, submitted to StarAI 2018 Workshop
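As a small illustration of the idea of bounding by dissociation (a sketch under stated assumptions, not the framework from the paper): if a tuple variable occurs in several conjuncts of a lineage formula in disjunctive normal form, replacing each occurrence with an independent copy that has the same probability cannot decrease the probability of the formula, so the dissociated formula gives an upper bound. The formula, the variable names `x`, `y`, `z`, the copies `x1`, `x2`, and the helper `exact_probability` are all hypothetical and chosen only for this example; probabilities are computed by brute-force enumeration of possible worlds.

```python
from itertools import product


def exact_probability(formula, probs):
    """Probability that `formula` holds when every variable is an
    independent Bernoulli event; computed by enumerating all worlds."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


# Hypothetical lineage: (x AND y) OR (x AND z); x occurs twice.
def original(w):
    return (w["x"] and w["y"]) or (w["x"] and w["z"])


# Dissociated lineage: the two occurrences of x become independent
# copies x1 and x2, each keeping the probability of x.
def dissociated(w):
    return (w["x1"] and w["y"]) or (w["x2"] and w["z"])


probs = {"x": 0.5, "y": 0.5, "z": 0.5}
dissociated_probs = {"x1": 0.5, "x2": 0.5, "y": 0.5, "z": 0.5}

exact = exact_probability(original, probs)
upper = exact_probability(dissociated, dissociated_probs)
print(exact, upper)  # the dissociated value is never below the exact one
```

The dissociated formula has no repeated variables, so in practice its probability can be computed directly from the tuple probabilities instead of by enumeration; enumeration is used here only to keep the sketch short.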
Scalable Statistical Modeling and Query Processing over Large Scale Uncertain Databases
The past decade has witnessed a large number of novel applications that generate imprecise, uncertain and incomplete data. Examples include monitoring infrastructures such as RFIDs, sensor networks and web-based applications such as information extraction, data integration, social networking and so on. In my dissertation, I addressed several challenges in managing such data and developed algorithms for efficiently executing queries over large volumes of such data. Specifically, I focused on the following challenges.
First, for meaningful analysis of such data, we need the ability to remove noise and infer useful information from uncertain data. To address this challenge, I first developed a declarative system for applying dynamic probabilistic models to databases and data streams. The output of such probabilistic modeling is probabilistic data, i.e., data annotated with probabilities of correctness/existence. Often, the data also exhibits strong correlations. Although there is prior work in managing and querying such probabilistic data using probabilistic databases, those approaches largely assume independence and cannot handle probabilistic data with rich correlation structures. Hence, I built a probabilistic database system that can manage large-scale correlations and developed algorithms for efficient query evaluation. Our system allows users to provide uncertain data as input and to specify arbitrary correlations among the entries in the database. In the back end, we represent correlations as a forest of junction trees, an alternative representation for probabilistic graphical models (PGM). We execute queries over the probabilistic database by transforming them into message passing algorithms (inference) over the junction tree. However, traditional algorithms over junction trees typically require accessing the entire tree, even for small queries. Hence, I developed an index data structure over the junction tree called INDSEP that allows us to circumvent this process and thereby scalably evaluate inference queries, aggregation queries and SQL queries over the probabilistic database.
Finally, query evaluation in probabilistic databases typically returns output tuples along with their probability values. However, the existing query evaluation model provides very little intuition to users: for instance, a user might want to know "Why is this tuple in my result?", "Why does this output tuple have such high probability?", or "Which are the most influential input tuples for my query?" Hence, I designed a query evaluation model, and a suite of algorithms, that provide users with explanations for query results and enable users to perform sensitivity analysis to better understand the query results.
Complaint-driven Training Data Debugging for Query 2.0
As the need for machine learning (ML) increases rapidly across all industry
sectors, there is significant interest among commercial database providers in
supporting "Query 2.0", which integrates model inference into SQL queries.
Debugging Query 2.0 is very challenging since an unexpected query result may be
caused by bugs in the training data (e.g., wrong labels, corrupted features).
In response, we propose Rain, a complaint-driven training data debugging
system. Rain allows users to specify complaints over the query's intermediate
or final output, and aims to return a minimum set of training examples so that
if they were removed, the complaints would be resolved. To the best of our
knowledge, we are the first to study this problem. A naive solution requires
retraining an exponential number of ML models. We propose two novel heuristic
approaches based on influence functions which both require linear retraining
steps. We provide an in-depth analytical and empirical analysis of the two
approaches and conduct extensive experiments to evaluate their effectiveness
using four real-world datasets. Results show that Rain achieves the highest
recall@k among all the baselines while still returning results interactively.
Comment: Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data
Learning Tuple Probabilities
Learning the parameters of complex probabilistic-relational models from
labeled training data is a standard technique in machine learning, which has
been intensively studied in the subfield of Statistical Relational Learning
(SRL), but---so far---this is still an under-investigated topic in the context
of Probabilistic Databases (PDBs). In this paper, we focus on learning the
probability values of base tuples in a PDB from labeled lineage formulas. The
resulting learning problem can be viewed as the inverse problem to confidence
computations in PDBs: given a set of labeled query answers, learn the
probability values of the base tuples, such that the marginal probabilities of
the query answers again yield the assigned probability labels. We analyze
the learning problem from a theoretical perspective, cast it into an
optimization problem, and provide an algorithm based on stochastic gradient
descent. Finally, we conclude with an experimental evaluation on three real-world
datasets and one synthetic dataset, comparing our approach to various techniques
from SRL, reasoning in information extraction, and optimization.
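The learning problem described in this abstract can be sketched on a toy scale as follows. This is a minimal illustration, not the paper's algorithm: it assumes independent base tuples, a squared-error objective, brute-force computation of marginals, and hypothetical names (`marginal`, `learn_tuple_probabilities`, tuples `a` and `b`). It relies on the fact that, for independent tuples, the marginal of a lineage formula is linear in each tuple probability, so its partial derivative with respect to tuple t equals P(formula | t true) minus P(formula | t false).

```python
import random
from itertools import product


def marginal(formula, probs, fixed=None):
    """Marginal probability of `formula` over independent tuples,
    optionally conditioning on some tuples having fixed truth values."""
    fixed = fixed or {}
    free = [t for t in sorted(probs) if t not in fixed]
    total = 0.0
    for values in product([False, True], repeat=len(free)):
        world = dict(fixed)
        world.update(zip(free, values))
        if formula(world):
            weight = 1.0
            for t, v in zip(free, values):
                weight *= probs[t] if v else 1.0 - probs[t]
            total += weight
    return total


def learn_tuple_probabilities(examples, tuples, steps=5000, lr=0.1, seed=0):
    """Stochastic gradient descent on the squared error between the
    marginal of each labeled lineage formula and its label."""
    rng = random.Random(seed)
    probs = {t: 0.5 for t in tuples}
    for _ in range(steps):
        formula, label = rng.choice(examples)
        error = marginal(formula, probs) - label
        # Partial derivative of the marginal with respect to each tuple.
        grads = {
            t: marginal(formula, probs, {t: True})
            - marginal(formula, probs, {t: False})
            for t in tuples
        }
        for t in tuples:
            updated = probs[t] - lr * 2.0 * error * grads[t]
            probs[t] = min(1.0, max(0.0, updated))  # keep a valid probability
    return probs


# Hypothetical labeled lineage formulas over base tuples a and b.
examples = [
    (lambda w: w["a"], 0.8),
    (lambda w: w["a"] and w["b"], 0.24),
]
learned = learn_tuple_probabilities(examples, ["a", "b"])
print(learned)
```

In this toy instance the labels are consistent with a = 0.8 and b = 0.3, so the learned values should approach those numbers; real lineage formulas would need a scalable way to compute marginals instead of enumeration.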
Probabilistic techniques for obtaining accurate patient counts in Clinical Data Warehouses
Proposal and execution of clinical trials, computation of quality measures and discovery of correlation between medical phenomena are all applications where an accurate count of patients is needed. However, existing sources of this type of patient information, including Clinical Data Warehouses (CDWs), may be incomplete or inaccurate. This research explores applying probabilistic techniques, supported by the MayBMS probabilistic database, to obtain accurate patient counts from a Clinical Data Warehouse containing synthetic patient data.
We present a synthetic Clinical Data Warehouse, and populate it with simulated data using a custom patient data generation engine. We then implement, evaluate and compare different techniques for obtaining patient counts.
We model billing as a test for the presence of a condition. We compute billing’s sensitivity and specificity both by conducting a “Simulated Expert Review” where a representative sample of records is reviewed and labeled by experts, and by obtaining the ground truth for every record.
We compute the posterior probability of a patient having a condition through a “Bayesian Chain”, using Bayes’ Theorem to calculate the probability of a patient having a condition after each visit. The second method is a “one-shot” approach that computes the probability of a patient having a condition based on whether the patient is ever billed for the condition.
Our results demonstrate the utility of probabilistic approaches, which improve on the accuracy of raw counts. In particular, the simulated review paired with a single application of Bayes’ Theorem produces the best results, with an average error rate of 2.1% compared to 43.7% for the straightforward billing counts.
Overall, this research demonstrates that Bayesian probabilistic approaches improve patient counts on simulated patient populations. We believe that total patient counts based on billing data are one of the many possible applications of our Bayesian framework. Use of these probabilistic techniques will enable more accurate patient counts and better results for applications requiring this metric.
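The “Bayesian Chain” described above can be sketched as repeated applications of Bayes’ Theorem, treating each visit's billing outcome as a diagnostic test with a given sensitivity and specificity. This is a minimal sketch under assumptions: the function names, the prior of 0.1, and the sensitivity and specificity values are hypothetical and not taken from the study, and visits are assumed conditionally independent given the patient's true condition.

```python
def bayes_update(prior, billed, sensitivity, specificity):
    """Posterior probability of the condition after one visit.

    `billed` is True if the visit was billed for the condition.
    sensitivity = P(billed | condition), specificity = P(not billed | no condition).
    """
    if billed:
        likelihood_condition = sensitivity
        likelihood_no_condition = 1.0 - specificity
    else:
        likelihood_condition = 1.0 - sensitivity
        likelihood_no_condition = specificity
    numerator = likelihood_condition * prior
    return numerator / (numerator + likelihood_no_condition * (1.0 - prior))


def bayesian_chain(prior, visits, sensitivity, specificity):
    """Apply Bayes' Theorem once per visit, feeding each posterior
    back in as the prior for the next visit."""
    probability = prior
    for billed in visits:
        probability = bayes_update(probability, billed, sensitivity, specificity)
    return probability


# Hypothetical values for illustration only.
after_one = bayesian_chain(0.1, [True], sensitivity=0.8, specificity=0.9)
after_two = bayesian_chain(0.1, [True, True], sensitivity=0.8, specificity=0.9)
print(after_one, after_two)
```

Under these assumptions each billed visit raises the posterior and each unbilled visit lowers it, and the final value does not depend on the order of the visits.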
Computing Local Sensitivities of Counting Queries with Joins
Local sensitivity of a query Q given a database instance D, i.e. how much the
output Q(D) changes when a tuple is added to D or deleted from D, has many
applications including query analysis, outlier detection, and differential
privacy. However, it is NP-hard to find local sensitivity of a conjunctive
query in terms of the size of the query, even for the class of acyclic queries.
Although the complexity is polynomial when the query size is fixed, the naive
algorithms are not efficient for large databases and queries involving multiple
joins. In this paper, we present a novel approach to compute local sensitivity
of counting queries involving join operations by tracking and summarizing tuple
sensitivities -- the maximum change a tuple can cause in the query result when
it is added or removed. We give algorithms for the sensitivity problem for full
acyclic join queries using join trees, that run in polynomial time in both the
size of the database and query for an interesting sub-class of queries, which
we call 'doubly acyclic queries' that include path queries, and in polynomial
time in combined complexity when the maximum degree in the join tree is
bounded. Our algorithms can be extended to certain non-acyclic queries using
generalized hypertree decompositions. We evaluate our approach experimentally,
and show applications of our algorithms to obtain better results for
differential privacy by orders of magnitude.
Comment: To be published in Proceedings of the 2020 ACM SIGMOD International
Conference on Management of Data
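For the simplest case, a counting query over a two-relation join, the notion of tuple sensitivity described above can be illustrated directly. This is a sketch under assumptions, not the paper's join-tree algorithm: the relations R(a, b) and S(b, c), the function names, and the sample data are hypothetical, and an inserted tuple is assumed to be free to use any join-key value that already occurs in the other relation.

```python
from collections import Counter


def join_count(r, s):
    """COUNT(*) of R(a, b) JOIN S(b, c) on the shared attribute b."""
    s_freq = Counter(b for b, _ in s)
    return sum(s_freq[b] for _, b in r)


def local_sensitivity(r, s):
    """Largest change in the join count caused by adding or removing
    a single tuple in either relation."""
    r_freq = Counter(b for _, b in r)
    s_freq = Counter(b for b, _ in s)
    # A tuple of R with join key b matches s_freq[b] tuples of S, so
    # adding or removing it changes the count by exactly that much.
    worst_r_tuple = max(s_freq.values(), default=0)
    # Symmetric argument for a tuple of S.
    worst_s_tuple = max(r_freq.values(), default=0)
    return max(worst_r_tuple, worst_s_tuple)


# Hypothetical instance.
R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("y", 20), ("y", 30)]

count = join_count(R, S)
sensitivity = local_sensitivity(R, S)
print(count, sensitivity)
```

With more relations the per-tuple change depends on products of frequencies along the join, which is where tracking and summarizing tuple sensitivities becomes necessary.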
Counterfactuals and Causability in Explainable Artificial Intelligence: Theory, Algorithms, and Applications
There has been a growing interest in model-agnostic methods that can make
deep learning models more transparent and explainable to a user. Some
researchers recently argued that for a machine to achieve a certain degree of
human-level explainability, this machine needs to provide explanations that
humans can understand causally, a property known as causability. A specific class of
algorithms that have the potential to provide causability are counterfactuals.
This paper presents an in-depth systematic review of the diverse existing body
of literature on counterfactuals and causability for explainable artificial
intelligence. We performed an LDA topic modelling analysis under a PRISMA
framework to find the most relevant literature articles. This analysis resulted
in a novel taxonomy that considers the grounding theories of the surveyed
algorithms, together with their underlying properties and applications in
real-world data. This research suggests that current model-agnostic
counterfactual algorithms for explainable AI are not grounded on a causal
theoretical formalism and, consequently, cannot promote causability to a human
decision-maker. Our findings suggest that the explanations derived from major
algorithms in the literature provide spurious correlations rather than
cause/effect relationships, leading to sub-optimal, erroneous or even biased
explanations. This paper also advances the literature with new directions and
challenges on promoting causability in model-agnostic approaches for
explainable artificial intelligence