Set-related restrictions for semantic groupings
Semantic database models utilize several fundamental forms of groupings to increase their expressive power. In this paper we consider four of the most common of these constructs: basic set groupings, is-a related groupings, power set groupings, and Cartesian aggregation groupings. For each, we define a number of useful restrictions that control its structure and composition. This permits each grouping to capture more subtle distinctions of the concepts or situations in the application environment. The resulting set of restrictions forms a framework which increases the expressive power of semantic models and specifies various set-related integrity constraints.
Explain3D: Explaining Disagreements in Disjoint Datasets
Data plays an important role in applications, analytic processes, and many
aspects of human activity. As data grows in size and complexity, we are met
with an imperative need for tools that promote understanding and explanations
over data-related operations. Data management research on explanations has
focused on the assumption that data resides in a single dataset, under one
common schema. But the reality of today's data is that it is frequently
un-integrated, coming from different sources with different schemas. When
different datasets provide different answers to semantically similar questions,
understanding the reasons for the discrepancies is challenging and cannot be
handled by the existing single-dataset solutions.
In this paper, we propose Explain3D, a framework for explaining the
disagreements across disjoint datasets (3D). Explain3D focuses on identifying
the reasons for the differences in the results of two semantically similar
queries operating on two datasets with potentially different schemas. Our
framework leverages the queries to perform a semantic mapping across the
relevant parts of their provenance; discrepancies in this mapping point to
causes of the queries' differences. Exploiting the queries gives Explain3D an
edge over traditional schema matching and record linkage techniques, which are
query-agnostic. Our work makes the following contributions: (1) We formalize
the problem of deriving optimal explanations for the differences of the results
of semantically similar queries over disjoint datasets. (2) We design a 3-stage
framework for solving the optimal explanation problem. (3) We develop a
smart-partitioning optimizer that improves the efficiency of the framework by
orders of magnitude. (4) We experiment with real-world and synthetic data to
demonstrate that Explain3D can derive precise explanations efficiently.
Database Learning: Toward a Database that Becomes Smarter Every Time
In today's databases, previous query answers rarely benefit answering future
queries. For the first time, to the best of our knowledge, we change this
paradigm in an approximate query processing (AQP) context. We make the
following observation: the answer to each query reveals some degree of
knowledge about the answer to another query because their answers stem from the
same underlying distribution that has produced the entire dataset. Exploiting
and refining this knowledge should allow us to answer queries more
analytically, rather than by reading enormous amounts of raw data. Also,
processing more queries should continuously enhance our knowledge of the
underlying distribution, and hence lead to increasingly faster response times
for future queries.
We call this novel idea, learning from past query answers, Database
Learning. We exploit the principle of maximum entropy to produce answers, which
are in expectation guaranteed to be more accurate than existing sample-based
approximations. Empowered by this idea, we build a query engine on top of Spark
SQL, called Verdict. We conduct extensive experiments on real-world query
traces from a large customer of a major database vendor. Our results
demonstrate that Verdict supports 73.7% of these queries, speeding them up by
up to 23.0x for the same accuracy level compared to existing AQP systems.
Comment: This manuscript is an extended report of the work published in ACM
SIGMOD conference 201
From Sparse Signals to Sparse Residuals for Robust Sensing
One of the key challenges in sensor networks is the extraction of information
by fusing data from a multitude of distinct, but possibly unreliable sensors.
Recovering information from the maximum number of dependable sensors while
specifying the unreliable ones is critical for robust sensing. This sensing
task is formulated here as that of finding the maximum number of feasible
subsystems of linear equations, and proved to be NP-hard. Useful links are
established with compressive sampling, which aims at recovering vectors that
are sparse. In contrast, the signals here are not sparse, but give rise to
sparse residuals. Capitalizing on this form of sparsity, four sensing schemes
with complementary strengths are developed. The first scheme is a convex
relaxation of the original problem expressed as a second-order cone program
(SOCP). It is shown that when the involved sensing matrices are Gaussian and
the reliable measurements are sufficiently many, the SOCP can recover the
optimal solution with overwhelming probability. The second scheme is obtained
by replacing the initial objective function with a concave one. The third and
fourth schemes are tailored for noisy sensor data. The noisy case is cast as a
combinatorial problem that is subsequently surrogated by a (weighted) SOCP.
Interestingly, the derived cost functions fall into the framework of robust
multivariate linear regression, while an efficient block-coordinate descent
algorithm is developed for their minimization. The robust sensing capabilities
of all schemes are verified by simulated tests.
Comment: Under review for publication in the IEEE Transactions on Signal
Processing (revised version
Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints
Today, data analysts largely rely on intuition to determine whether missing
or withheld rows of a dataset significantly affect their analyses. We propose a
framework that can produce automatic contingency analysis, i.e., the range of
values an aggregate SQL query could take, under formal constraints describing
the variation and frequency of missing data tuples. We describe how to process
SUM, COUNT, AVG, MIN, and MAX queries under these conditions, yielding hard
error bounds with testable constraints. We propose an optimization algorithm
based on an integer program that reconciles a set of such constraints, even if
they are overlapping, conflicting, or unsatisfiable, into such bounds. Our
experiments on real-world datasets against several statistical imputation and
inference baselines show that statistical techniques can have a deceptively
high error rate that is often unpredictable. In contrast, our framework offers
hard bounds that are guaranteed to hold if the constraints are not violated. In
spite of these hard bounds, we show competitive accuracy to statistical
baselines.
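The idea of hard bounds on an aggregate can be illustrated with a minimal sketch. It is not the integer-program algorithm from the abstract: it assumes the constraints describe disjoint groups of missing rows, so each group can be bounded independently. The names `MissingRowConstraint`, `sum_bounds`, and `count_bounds` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissingRowConstraint:
    """Hypothetical constraint on one disjoint group of missing rows."""
    min_rows: int      # fewest rows that may be missing in this group
    max_rows: int      # most rows that may be missing in this group
    min_value: float   # smallest value a missing row may carry
    max_value: float   # largest value a missing row may carry


def sum_bounds(observed_sum, constraints):
    """Range that SUM could take once the missing rows are accounted for.

    Assumes the constraints do not overlap (a simplification). For a fixed
    per-row value the contribution k * value is linear in the row count k,
    so the extremes are reached at min_rows or max_rows.
    """
    low = high = observed_sum
    for c in constraints:
        low += min(c.min_rows * c.min_value, c.max_rows * c.min_value)
        high += max(c.min_rows * c.max_value, c.max_rows * c.max_value)
    return low, high


def count_bounds(observed_count, constraints):
    """Range that COUNT could take under the same assumptions."""
    low = observed_count + sum(c.min_rows for c in constraints)
    high = observed_count + sum(c.max_rows for c in constraints)
    return low, high


constraints = [
    MissingRowConstraint(min_rows=0, max_rows=10, min_value=-5.0, max_value=20.0),
    MissingRowConstraint(min_rows=2, max_rows=3, min_value=1.0, max_value=4.0),
]
print(sum_bounds(100.0, constraints))   # (52.0, 312.0)
print(count_bounds(40, constraints))    # (42, 53)
```

The bounds hold for any completion of the data that respects the constraints, which is the sense in which they are "hard". Overlapping or conflicting constraints need the reconciliation step the abstract describes and are outside this sketch.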
CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees
We report on a new, efficient encoding for the data cube, which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We focus on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called a statistics tree (ST) during a one-time scan of the detailed data. To optimize queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our query evaluation algorithm selects the smallest tree in the family that can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems.
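A minimal sketch of how a tree can hold every aggregate view in its leaves after one scan is given below. It is a simplification under stated assumptions, not the CubiST++ implementation: one tree level per dimension, a hypothetical `STAR` key standing for "all values", COUNT and SUM as the only statistics, and no hierarchy levels or tree families. The class and method names are illustrative.

```python
from itertools import product

STAR = "*"  # hypothetical marker meaning "all values of this dimension"


class StatisticsTree:
    """Nested dicts, one level per dimension; leaves hold [count, sum]."""

    def __init__(self, num_dims):
        self.num_dims = num_dims
        self.root = {}

    def insert(self, keys, measure):
        """Add one detail record to every aggregate view it belongs to."""
        if len(keys) != self.num_dims:
            raise ValueError("one key per dimension is required")
        # At each level the record follows its own value and the STAR branch,
        # so it reaches 2 ** num_dims leaves in total.
        for path in product(*[(key, STAR) for key in keys]):
            node = self.root
            for key in path[:-1]:
                node = node.setdefault(key, {})
            leaf = node.setdefault(path[-1], [0, 0.0])
            leaf[0] += 1
            leaf[1] += measure

    def query(self, keys):
        """Answer a cube query (value or STAR per dimension) by one lookup."""
        if len(keys) != self.num_dims:
            raise ValueError("one key per dimension is required")
        node = self.root
        for key in keys:
            node = node.get(key)
            if node is None:
                return (0, 0.0)
        return (node[0], node[1])


tree = StatisticsTree(num_dims=3)
tree.insert(("toyota", "2020", "fl"), 10.0)
tree.insert(("toyota", "2021", "ny"), 5.0)
tree.insert(("ford", "2020", "fl"), 7.0)
print(tree.query(("toyota", STAR, STAR)))   # (2, 15.0)
print(tree.query((STAR, "2020", "fl")))     # (2, 17.0)
```

The trade-off in this sketch is that insertion cost grows exponentially with the number of dimensions, while a query touches a single root-to-leaf path and never rescans the detail data.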
Kaskade: Graph Views for Efficient Graph Analytics
Graphs are an increasingly popular way to model real-world entities and
relationships between them, ranging from social networks to data lineage graphs
and biological datasets. Queries over these large graphs often involve
expensive subgraph traversals and complex analytical computations. These
real-world graphs are often substantially more structured than a generic
vertex-and-edge model would suggest, but this insight has remained mostly
unexplored by existing graph engines for graph query optimization purposes.
Therefore, in this work, we focus on leveraging structural properties of graphs
and queries to automatically derive materialized graph views that can
dramatically speed up query evaluation. We present KASKADE, the first graph
query optimization framework to exploit materialized graph views for query
optimization purposes. KASKADE employs a novel constraint-based view
enumeration technique that mines constraints from query workloads and graph
schemas, and injects them during view enumeration to significantly reduce the
search space of views to be considered. Moreover, it introduces a graph view
size estimator to pick the most beneficial views to materialize given a query
set and to select the best query evaluation plan given a set of materialized
views. We evaluate its performance over real-world graphs, including the
provenance graph that we maintain at Microsoft to enable auditing, service
analytics, and advanced system optimizations. Our results show that KASKADE
substantially reduces the effective graph size and yields significant
performance speedups (up to 50X), in some cases making otherwise intractable
queries possible.
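To make the notion of a materialized graph view concrete, the sketch below contracts typed two-hop paths into direct edges, so later traversals run over a smaller graph. It is an illustration only and assumes a simple typed edge list; the function name `two_hop_connector`, the node types, and the sample lineage data are hypothetical, and the constraint mining, view enumeration, and size estimation described in the abstract are not modeled.

```python
from collections import defaultdict


def two_hop_connector(edges, node_type, src_type, via_type, dst_type):
    """Materialize src -> dst edges for every src -> via -> dst path.

    edges:     iterable of (source, target) pairs
    node_type: dict mapping each node to its type label
    """
    out = defaultdict(set)
    for u, v in edges:
        out[u].add(v)

    view = set()
    for u, mids in out.items():
        if node_type[u] != src_type:
            continue
        for m in mids:
            if node_type[m] != via_type:
                continue
            # .get avoids inserting new keys while iterating over `out`
            for w in out.get(m, ()):
                if node_type[w] == dst_type:
                    view.add((u, w))
    return view


# Hypothetical lineage graph: jobs write files, files are read by jobs.
node_type = {"j1": "job", "j2": "job", "j3": "job", "f1": "file", "f2": "file"}
edges = [("j1", "f1"), ("f1", "j2"), ("f1", "j3"), ("j2", "f2")]
print(sorted(two_hop_connector(edges, node_type, "job", "file", "job")))
# [('j1', 'j2'), ('j1', 'j3')]
```

A query that only asks which jobs depend on which other jobs can then run on the two view edges instead of the four original edges and five nodes.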