Explain3D: Explaining Disagreements in Disjoint Datasets
Data plays an important role in applications, analytic processes, and many
aspects of human activity. As data grows in size and complexity, we are met
with an imperative need for tools that promote understanding and explanations
over data-related operations. Data management research on explanations has
focused on the assumption that data resides in a single dataset, under one
common schema. But the reality of today's data is that it is frequently
un-integrated, coming from different sources with different schemas. When
different datasets provide different answers to semantically similar questions,
understanding the reasons for the discrepancies is challenging and cannot be
handled by the existing single-dataset solutions.
In this paper, we propose Explain3D, a framework for explaining the
disagreements across disjoint datasets (3D). Explain3D focuses on identifying
the reasons for the differences in the results of two semantically similar
queries operating on two datasets with potentially different schemas. Our
framework leverages the queries to perform a semantic mapping across the
relevant parts of their provenance; discrepancies in this mapping point to
causes of the queries' differences. Exploiting the queries gives Explain3D an
edge over traditional schema matching and record linkage techniques, which are
query-agnostic. Our work makes the following contributions: (1) We formalize
the problem of deriving optimal explanations for the differences of the results
of semantically similar queries over disjoint datasets. (2) We design a 3-stage
framework for solving the optimal explanation problem. (3) We develop a
smart-partitioning optimizer that improves the efficiency of the framework by
orders of magnitude. (4) We experiment with real-world and synthetic data to
demonstrate that Explain3D can derive precise explanations efficiently.
Infinite Probabilistic Databases
Probabilistic databases (PDBs) model uncertainty in data in a quantitative
way. In the established formal framework, probabilistic (relational) databases
are finite probability spaces over relational database instances. This
finiteness can clash with intuitive query behavior (Ceylan et al., KR 2016),
and with application scenarios that are better modeled by continuous
probability distributions (Dalvi et al., CACM 2009).
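As an illustration of the finite framework described above, the following minimal Python sketch represents a PDB as a finite set of possible worlds with probabilities and evaluates a Boolean query by summing over the worlds that satisfy it. The relation name `Likes`, the facts, the probabilities, and the helper names are hypothetical and chosen only for this example; they do not come from the paper.

```python
from fractions import Fraction

# A possible world is a frozenset of facts; a fact is (relation, tuple).
# Illustrative data only: three worlds whose probabilities sum to 1.
worlds = {
    frozenset({("Likes", ("ann", "tea"))}): Fraction(1, 2),
    frozenset({("Likes", ("ann", "tea")),
               ("Likes", ("bob", "tea"))}): Fraction(1, 3),
    frozenset(): Fraction(1, 6),
}

def query_probability(worlds, query):
    """Probability that a Boolean query holds: the sum of the
    probabilities of all worlds in which the query is true."""
    return sum((p for w, p in worlds.items() if query(w)), Fraction(0))

def someone_likes_tea(world):
    """Boolean query: is there a fact Likes(x, 'tea') in the world?"""
    return any(rel == "Likes" and args[1] == "tea" for rel, args in world)

prob = query_probability(worlds, someone_likes_tea)
```

Because the space is finite, every set of worlds is trivially an event, so no measurability question arises; that question only appears once the space of instances becomes uncountable.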
We formally introduced infinite PDBs in (Grohe and Lindner, PODS 2019) with a
primary focus on countably infinite spaces. However, an extension beyond
countable probability spaces raises nontrivial foundational issues concerned
with the measurability of events and queries and ultimately with the question
of whether queries have a well-defined semantics.
We argue that finite point processes are an appropriate model from
probability theory for dealing with general probabilistic databases. This
allows us to construct suitable (uncountable) probability spaces of database
instances in a systematic way. Our main technical results are measurability
statements for relational algebra queries as well as aggregate queries and
Datalog queries.
Comment: This is the full version of the paper "Infinite Probabilistic
Databases" presented at ICDT 2020 (arXiv:1904.06766)
Knowledge Spaces and Learning Spaces
How can one design automated procedures that (i) accurately assess the knowledge
of a student, and (ii) efficiently provide advice for further study? To
produce well-founded answers, Knowledge Space Theory relies on a combinatorial
viewpoint on the assessment of knowledge, and thus departs from common,
numerical evaluation. Its assessment procedures fundamentally differ from other
current ones (such as those of S.A.T. and A.C.T.). They are adaptive (taking
into account the possible correctness of previous answers from the student) and
they produce an outcome which is far more informative than a crude numerical
mark. This chapter recapitulates the main concepts underlying Knowledge Space
Theory and its special case, Learning Space Theory. We begin by describing the
combinatorial core of the theory, in the form of two basic axioms and the main
ensuing results (most of which we give without proofs). In practical
applications, learning spaces are huge combinatorial structures which may be
difficult to manage. We outline methods providing efficient and comprehensive
summaries of such large structures. We then describe the probabilistic part of
the theory, especially the Markovian type processes which are instrumental in
uncovering the knowledge states of individuals. In the guise of the ALEKS
system, which includes a teaching component, these methods have been used by
millions of students in schools and colleges, and by homeschooled students. We
summarize some of the results of these applications.
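The abstract does not state the two axioms, so the sketch below relies on the definitions usually given in the Knowledge Space Theory literature, which is an assumption here: a knowledge space is a family of knowledge states that contains the empty set and the full item set and is closed under union, and a learning space additionally requires that every nonempty state can be reached from another state by adding a single item. The item names and the function names are hypothetical.

```python
from itertools import combinations

def is_knowledge_space(items, states):
    """Check the assumed definition: the family contains the empty set
    and the full domain, and is closed under union."""
    family = {frozenset(s) for s in states}
    if frozenset() not in family or frozenset(items) not in family:
        return False
    return all(a | b in family for a, b in combinations(family, 2))

def is_accessible(states):
    """Check that every nonempty state K has an item q such that
    K without q is again a state (one item learned at a time)."""
    family = {frozenset(s) for s in states}
    return all(any(k - {q} in family for q in k) for k in family if k)

items = {"a", "b", "c"}

# Union-closed and accessible: a small learning space under the assumed definition.
good = [set(), {"a"}, {"b"}, {"a", "b"}, {"a", "b", "c"}]

# Not union-closed: {"a"} | {"b"} = {"a", "b"} is missing.
bad = [set(), {"a"}, {"b"}, {"a", "b", "c"}]

good_is_space = is_knowledge_space(items, good)
good_is_accessible = is_accessible(good)
bad_is_space = is_knowledge_space(items, bad)
```

The pairwise union check is quadratic in the number of states, which is fine for toy examples but hints at why the large structures met in practice call for the compact summaries mentioned above.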