A Theory of Pricing Private Data
Personal data has value to both its owner and to institutions who would like
to analyze it. Privacy mechanisms protect the owner's data while releasing to
analysts noisy versions of aggregate query results. But such strict protections
of individuals' data have not yet found wide use in practice. Instead, Internet
companies, for example, commonly provide free services in return for valuable
sensitive information from users, which they exploit and sometimes sell to
third parties.
As awareness of the value of personal data has increased, so has the
drive to compensate the end user for her private information. The idea of
monetizing private data can improve over the narrower view of hiding private
data, since it empowers individuals to control their data through financial
means.
In this paper we propose a theoretical framework for assigning prices to
noisy query answers, as a function of their accuracy, and for dividing the
price amongst data owners who deserve compensation for their loss of privacy.
Our framework adopts and extends key principles from both differential privacy
and query pricing in data markets. We identify essential properties of the
price function and micro-payments, and characterize valid solutions.
Comment: 25 pages, 2 figures. Best Paper Award, to appear in the 16th
International Conference on Database Theory (ICDT), 201
When Can We Answer Queries Using Result-Bounded Data Interfaces?
We consider answering queries on data available through access methods that
provide lookup access to the tuples matching a given binding. Such interfaces
are common on the Web; further, they often have bounds on how many results they
can return, e.g., because of pagination or rate limits. We thus study
result-bounded methods, which may return only a limited number of tuples. We
study how to decide if a query is answerable using result-bounded methods,
i.e., how to compute a plan that returns all answers to the query using the
methods, assuming that the underlying data satisfies some integrity
constraints. We first show how to reduce answerability to a query containment
problem with constraints. Second, we show "schema simplification" theorems
describing when and how result bounded services can be used. Finally, we use
these theorems to give decidability and complexity results about answerability
for common constraint classes.
Comment: 65 pages; journal version of the PODS'18 paper arXiv:1706.0793
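The notion of a result-bounded access method can be illustrated with a small sketch. This is not code from the paper: the class `ResultBoundedMethod`, the `(professor, course)` relation, and the choice to return a prefix of the matches are all assumptions made for illustration. It shows why a bound of 1 is harmless for an existence check but can lose answers when all matching tuples are needed, unless an integrity constraint limits the number of matches.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str]  # (professor, course): hypothetical relation for illustration


class ResultBoundedMethod:
    """Lookup access on the first attribute, returning at most `bound` tuples."""

    def __init__(self, rows: List[Row], bound: Optional[int]) -> None:
        self._index: Dict[str, List[Row]] = {}
        for row in rows:
            self._index.setdefault(row[0], []).append(row)
        self.bound = bound  # None means the method is not result-bounded

    def lookup(self, key: str) -> List[Row]:
        matches = self._index.get(key, [])
        # Assumption: a real interface may return any `bound` matching tuples;
        # this sketch simply returns the first ones.
        return matches if self.bound is None else matches[: self.bound]


rows = [("ann", "db"), ("ann", "logic"), ("bob", "ai")]
method = ResultBoundedMethod(rows, bound=1)

# "Does ann teach anything?" needs only one witness, so a bound of 1 suffices.
teaches_something = len(method.lookup("ann")) > 0

# "List all of ann's courses" is incomplete under bound 1, unless a constraint
# (e.g. a functional dependency professor -> course) guarantees a single match.
partial = method.lookup("ann")
```

In this toy setting `partial` holds only one of ann's two courses, which is the kind of incompleteness that the answerability analysis has to rule out using the integrity constraints.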
When Can We Answer Queries Using Result-Bounded Data Interfaces?
We consider answering queries where the underlying data is available only
over limited interfaces which provide lookup access to the tuples matching a
given binding, but possibly restricting the number of output tuples returned.
Interfaces imposing such "result bounds" are common in accessing data via the
web. Given a query over a set of relations as well as some integrity
constraints that relate the queried relations to the data sources, we examine
the problem of deciding if the query is answerable over the interfaces; that
is, whether there exists a plan that returns all answers to the query, assuming
the source data satisfies the integrity constraints.
The first component of our analysis of answerability is a reduction to a
query containment problem with constraints. The second component is a set of
"schema simplification" theorems capturing limitations on how interfaces with
result bounds can be useful to obtain complete answers to queries. These
results also help to show decidability for the containment problem that
captures answerability, for many classes of constraints. The final component in
our analysis of answerability is a "linearization" method, showing that query
containment with certain guarded dependencies -- including those that emerge
from answerability problems -- can be reduced to query containment for a
well-behaved class of linear dependencies. Putting these components together,
we get a detailed picture of how to check answerability over result-bounded
services.
Comment: 45 pages, 2 tables, 43 references. Complete version with proofs of
the PODS'18 paper. The main text of this paper is almost identical to the
PODS'18 except that we have fixed some small mistakes. Relative to the
earlier arXiv version, many errors were corrected, and some terminology has
change
How Many and What Types of SPARQL Queries can be Answered through Zero-Knowledge Link Traversal?
The current de-facto way to query the Web of Data is through the SPARQL
protocol, where a client sends queries to a server through a SPARQL endpoint.
Contrary to an HTTP server, providing and maintaining a robust and reliable
endpoint requires a significant effort that not all publishers are willing or
able to make. An alternative query evaluation method is through link traversal,
where a query is answered by dereferencing online web resources (URIs) in real
time. While several approaches for such a lookup-based query evaluation method
have been proposed, there exists no analysis of the types (patterns) of queries
that can be directly answered on the live Web, without accessing local or
remote endpoints and without a priori knowledge of available data sources. In
this paper, we first provide a method for checking if a SPARQL query (to be
evaluated on a SPARQL endpoint) can be answered through zero-knowledge link
traversal (without accessing the endpoint), and analyse a large corpus of real
SPARQL query logs for finding the frequency and distribution of answerable and
non-answerable query patterns. Subsequently, we provide an algorithm for
transforming answerable queries to SPARQL-LD queries that bypass the endpoints.
We report experimental results about the efficiency of the transformed queries
and discuss the benefits and the limitations of this query evaluation method.
Comment: Preprint of paper accepted for publication in the 34th ACM/SIGAPP
Symposium On Applied Computing (SAC 2019
Materialized View Selection in XML Databases
Materialized views, an RDBMS silver bullet, have demonstrated their efficacy
in many applications, especially as a data warehousing/decision support system
tool. The key to using materialized views efficiently is view selection. Though
studied for over thirty years in RDBMSs, the selection is hard to make in the
context of XML databases, where both the semi-structured data and the
expressiveness of XML query languages add challenges to the view selection
problem. We start our discussion by producing minimal XML views (in terms of
size) as candidates for a given workload (a query set). To facilitate intuitive
view selection, we present a view graph (called vcube) to structurally maintain
all generated views. Basing our selection for materialization on vcube, we
propose two view selection strategies, targeting space optimization and a
space-time tradeoff, respectively. We built our implementation on top of
Berkeley DB XML, demonstrating that significant performance improvement can be
obtained using our proposed approaches.
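To make the view selection problem concrete, here is a minimal sketch of a generic greedy heuristic that picks candidate views by estimated benefit per unit of storage under a space budget. It is not the vcube-based strategy of the paper, which the abstract does not spell out; the `View` record, its `size` and `benefit` fields, and the sample numbers are hypothetical.

```python
from typing import List, NamedTuple


class View(NamedTuple):
    name: str
    size: int       # storage cost if materialized (hypothetical units, must be > 0)
    benefit: float  # estimated query time saved over the workload (hypothetical)


def select_views(candidates: List[View], budget: int) -> List[View]:
    """Greedily pick views with the best benefit/size ratio that fit the budget."""
    chosen: List[View] = []
    remaining = budget
    ranked = sorted(candidates, key=lambda v: v.benefit / v.size, reverse=True)
    for view in ranked:
        if view.size <= remaining:
            chosen.append(view)
            remaining -= view.size
    return chosen


candidates = [
    View("v_authors", 40, 120.0),
    View("v_titles", 10, 50.0),
    View("v_full", 100, 200.0),
]
selected = select_views(candidates, budget=60)
```

With a budget of 60, the heuristic keeps `v_titles` and `v_authors` and skips `v_full`, which does not fit. A greedy ratio rule like this is simple but not guaranteed to be optimal, which is one reason dedicated selection strategies are studied.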
Intermediate Results Materialization Selection and Format for Data-Intensive Flows
Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results, shared among multiple flows, brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, where they are studied as the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service level agreements, which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for automatic selection of multi-objective materialization of intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state-of-the-art, as well as an improvement on disk access time of 18% as compared to fixed-format solutions.