Tight Lower Bound for Comparison-Based Quantile Summaries
Quantiles, such as the median or percentiles, provide concise and useful
information about the distribution of a collection of items, drawn from a
totally ordered universe. We study data structures, called quantile summaries,
which keep track of all quantiles, up to an error of at most ε.
That is, an ε-approximate quantile summary first processes a stream
of items and then, given any quantile query 0 ≤ φ ≤ 1, returns an item
from the stream, which is a φ'-quantile for some φ' = φ ± ε. We focus on comparison-based quantile summaries that can only
compare two items and are otherwise completely oblivious of the universe.
The best such deterministic quantile summary to date, due to Greenwald and
Khanna (SIGMOD '01), stores at most O((1/ε) · log(εN)) items, where N is the number of items in the stream. We prove
that this space bound is optimal by showing a matching lower bound. Our result
thus rules out the possibility of constructing a deterministic comparison-based
quantile summary in space f(ε) · o(log N), for any function f
that does not depend on N. As a corollary, we improve the lower bound for
biased quantiles, which provide a stronger, relative-error guarantee of (1 ± ε)φ, and for other related computational tasks.
Comment: 20 pages, 2 figures, major revision of the construction (Sec. 3) and
some other parts of the paper
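The ±ε rank guarantee above can be illustrated with a small, non-streaming sketch (the names and the sampling scheme here are illustrative, not the paper's or GK's construction): keeping every ⌈2εn⌉-th element of the sorted input gives a summary in which some stored rank is within εn of any queried rank.

```python
def build_summary(stream, eps):
    """Toy offline 'summary': keep (rank, item) pairs for every step-th rank
    of the sorted input, where step <= 2*eps*n, plus the maximum. A real
    streaming summary (e.g. Greenwald-Khanna) achieves a comparable guarantee
    in one pass using O((1/eps) log(eps*n)) space."""
    data = sorted(stream)
    n = len(data)
    step = max(1, int(2 * eps * n))
    kept = [(r, data[r]) for r in range(0, n, step)]
    if kept[-1][0] != n - 1:
        kept.append((n - 1, data[n - 1]))  # always keep the maximum
    return kept, n

def query(summary, n, phi):
    """Return a stored item whose rank is within eps*n of the phi-quantile's
    rank: the nearest stored rank is at most step/2 <= eps*n away."""
    target = phi * (n - 1)
    rank, item = min(summary, key=lambda p: abs(p[0] - target))
    return item
```

For n = 1000 and ε = 0.05 this stores 11 pairs instead of 1000 items, and any query is answered within 50 rank positions.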
Streaming algorithms for bin packing and vector scheduling
Problems involving the efficient arrangement of simple objects, as captured by bin packing and makespan scheduling, are fundamental tasks in combinatorial optimization. These are well understood in the traditional online and offline cases, but have been less well-studied when the volume of the input is truly massive, and cannot even be read into memory. This is captured by the streaming model of computation, where the aim is to approximate the cost of the solution in one pass over the data, using small space. As a result, streaming algorithms produce concise input summaries that approximately preserve the optimum value. We design the first efficient streaming algorithms for these fundamental problems in combinatorial optimization. For BIN PACKING, we provide a streaming asymptotic (1 + ε)-approximation with …
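A minimal one-pass sketch of the size-rounding idea behind such summaries (a simplification for illustration, not the paper's algorithm; the grid width `delta` and the small-item cutoff are assumptions):

```python
import math
from collections import Counter

def stream_summary(sizes, delta=0.1):
    """One pass over item sizes in (0, 1]: items below delta contribute only
    their total volume; larger items are rounded up to a multiple of delta
    and counted. Space is O(1/delta), independent of the stream length."""
    hist = Counter()           # grid multiple k -> count of items of size ~ k*delta
    small_volume = 0.0
    for s in sizes:
        if s < delta:
            small_volume += s
        else:
            hist[math.ceil(s / delta - 1e-9)] += 1  # 1e-9 guards float error
    return hist, small_volume

def estimate_bins(hist, small_volume, delta=0.1):
    """First-fit the rounded large items, then pour the small volume into the
    leftover capacity; returns an estimate of the number of unit bins."""
    bins = []  # remaining capacity of each open bin
    for k in sorted(hist, reverse=True):
        size = k * delta
        for _ in range(hist[k]):
            for i, cap in enumerate(bins):
                if cap >= size - 1e-9:
                    bins[i] = cap - size
                    break
            else:
                bins.append(1.0 - size)
    free = sum(bins)
    if small_volume > free + 1e-9:
        return len(bins) + math.ceil(small_volume - free - 1e-9)
    return len(bins)
```

Because only the rounded histogram and one running total survive the pass, the summary is tiny, at the cost of the rounding error that the (1 + ε) analysis has to control.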
Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs
Estimating quantiles, like the median or percentiles, is a fundamental task
in data mining and data science. A (streaming) quantile summary is a data
structure that can process a set S of n elements in a streaming fashion and at
the end, for any phi in (0,1], return a phi-quantile of S up to an eps error,
i.e., return a phi'-quantile with phi'=phi +- eps. We are particularly
interested in comparison-based summaries that only compare elements of the
universe under a total ordering and are otherwise completely oblivious of the
universe. The best known deterministic quantile summary is the 20-year old
Greenwald-Khanna (GK) summary that uses O((1/eps) log(eps n)) space
[SIGMOD'01]. This bound was recently proved to be optimal for all deterministic
comparison-based summaries by Cormode and Veselý [PODS'20].
In this paper, we study weighted quantiles, a generalization of the quantiles
problem, where each element arrives with a positive integer weight which
denotes the number of copies of that element being inserted. The only known
method of handling weighted inputs via GK summaries is the naive approach of
breaking each weighted element into multiple unweighted items and feeding them
one by one to the summary, which results in a prohibitively large update time
(proportional to the maximum weight of input elements).
We give the first non-trivial extension of GK summaries for weighted inputs
and show that it takes O((1/eps) log(eps n)) space and O(log(1/eps)+ log
log(eps n)) update time per element to process a stream of length n (under some
quite mild assumptions on the range of weights and eps). En route to this, we
also simplify the original GK summaries for unweighted quantiles.
Comment: 33 pages, 7 figures, International Conference on Database Theory 202…
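The interface difference that motivates the paper can be shown with a toy structure (exact and uncompressed, so it only illustrates weighted inserts and weighted rank queries; the sublinear-space GK-style compression is the paper's actual contribution, and all names here are illustrative):

```python
import bisect

class ToyWeightedSummary:
    """Exact weighted quantile structure. The naive reduction would call
    insert(x, 1) w times per weighted element, i.e. O(max weight) update
    time; here each weighted element is inserted once."""

    def __init__(self):
        self.items = []  # kept sorted as (element, weight) pairs

    def insert(self, x, w):
        # One insertion per element regardless of w (O(n) list shift in this
        # toy; a real summary would also compress).
        bisect.insort(self.items, (x, w))

    def quantile(self, phi):
        """Return the element at cumulative weight phi * (total weight)."""
        total = sum(w for _, w in self.items)
        target = phi * total
        cum = 0
        for x, w in self.items:
            cum += w
            if cum >= target:
                return x
        return self.items[-1][0]
```

Inserting an element of weight one million is a single operation here, whereas feeding a GK summary the expanded stream costs a million updates, which is exactly the prohibitive cost the paper's extension avoids.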
Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods
This paper has been replaced with http://digitalcommons.ilr.cornell.edu/ldi/37.
We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner’s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.
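The paper's technology set comes from the Private Multiplicative Weights mechanism; as a much simpler stand-in, the basic Laplace mechanism already exhibits the accuracy-privacy trade-off that the planner's problem prices: larger ε (weaker privacy) means noise of smaller scale 1/(εn), hence more accurate published statistics. This sketch is not the paper's mechanism.

```python
import random

def release_proportion(values, eps, rng=None):
    """Release the fraction of 1s in a 0/1 column under eps-differential
    privacy (delta = 0) via the Laplace mechanism. Changing one person's
    value shifts the proportion by at most 1/n, so Laplace noise of scale
    1/(eps*n) suffices; the expected absolute error is 1/(eps*n)."""
    rng = rng or random.Random()
    n = len(values)
    true_value = sum(values) / n
    scale = 1.0 / (eps * n)
    # A Laplace(scale) sample is the difference of two Exponential(scale) samples.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise
```

With n = 1000 respondents and ε = 1, the published proportion is off by about 0.001 in expectation; halving ε doubles that error, which is the frontier the social planner moves along.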
Distributed top-k aggregation queries at large
Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
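The TPUT framework referenced above can be sketched in memory (a simplified rendering of its three phases over m nodes, each holding a local item-to-score map; a real deployment exchanges these lists over the network, and the node/score data here is hypothetical):

```python
def tput_topk(node_scores, k):
    """Simplified TPUT: (1) collect local top-k lists and derive a lower
    bound tau1 on the k-th best aggregate score, (2) fetch every local score
    >= tau1/m and prune items whose upper bound falls below the refined
    bound tau2, (3) look up exact sums for the surviving candidates."""
    m = len(node_scores)

    # Phase 1: each node reports its local top-k; sum the reported scores.
    partial = {}
    for scores in node_scores:
        for item, s in sorted(scores.items(), key=lambda kv: -kv[1])[:k]:
            partial[item] = partial.get(item, 0) + s
    tau1 = sorted(partial.values(), reverse=True)[k - 1] if len(partial) >= k else 0

    # Phase 2: nodes send every item with local score >= tau1 / m.
    threshold = tau1 / m
    partial2, seen_at = {}, {}
    for i, scores in enumerate(node_scores):
        for item, s in scores.items():
            if s >= threshold:
                partial2[item] = partial2.get(item, 0) + s
                seen_at.setdefault(item, set()).add(i)
    tau2 = sorted(partial2.values(), reverse=True)[k - 1] if len(partial2) >= k else 0
    # An unseen node contributes less than the threshold, giving an upper bound.
    candidates = [item for item, s in partial2.items()
                  if s + (m - len(seen_at[item])) * threshold >= tau2]

    # Phase 3: fetch exact aggregate scores for the surviving candidates.
    exact = {item: sum(scores.get(item, 0) for scores in node_scores)
             for item in candidates}
    return sorted(exact.items(), key=lambda kv: -kv[1])[:k]
```

The optimizations in the paper refine exactly these knobs, e.g. how deep each node scans and which nodes are consulted at all, guided by the statistical cost model.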