New Algorithms and Lower Bounds for Sequential-Access Data Compression
This thesis concerns sequential-access data compression, i.e., by algorithms
that read the input one or more times from beginning to end. In one chapter we
consider adaptive prefix coding, for which we must read the input character by
character, outputting each character's self-delimiting codeword before reading
the next one. We show how to encode and decode each character in constant
worst-case time while producing an encoding whose length is worst-case optimal.
In another chapter we consider one-pass compression with memory bounded in
terms of the alphabet size and context length, and prove a nearly tight
tradeoff between the amount of memory we can use and the quality of the
compression we can achieve. In a third chapter we consider compression in the
read/write streams model, which allows a number of passes and an amount of
memory that are both polylogarithmic in the size of the input. We first show how to achieve
universal compression using only one pass over one stream. We then show that
one stream is not sufficient for achieving good grammar-based compression.
Finally, we show that two streams are necessary and sufficient for achieving
entropy-only bounds.
Comment: draft of PhD thesis
Smooth heaps and a dual view of self-adjusting data structures
We present a new connection between self-adjusting binary search trees (BSTs)
and heaps, two fundamental, extensively studied, and practically relevant
families of data structures. Roughly speaking, we map an arbitrary heap
algorithm within a natural model to a corresponding BST algorithm with the
same cost on a dual sequence of operations (i.e. the same sequence with the
roles of time and key-space switched). This is the first general transformation
between the two families of data structures.
There is a rich theory of dynamic optimality for BSTs (i.e. the theory of
competitiveness between BST algorithms). The lack of an analogous theory for
heaps has been noted in the literature. Through our connection, we transfer all
instance-specific lower bounds known for BSTs to a general model of heaps,
initiating a theory of dynamic optimality for heaps.
On the algorithmic side, we obtain a new, simple and efficient heap
algorithm, which we call the smooth heap. We show the smooth heap to be the
heap-counterpart of Greedy, the BST algorithm with the strongest proven and
conjectured properties from the literature, widely believed to be
instance-optimal. Assuming the optimality of Greedy, the smooth heap is also
optimal within our model of heap algorithms. As corollaries of results known
for Greedy, we obtain instance-specific upper bounds for the smooth heap, with
applications in adaptive sorting.
Intriguingly, the smooth heap, although derived from a non-practical BST
algorithm, is simple and easy to implement (e.g. it stores no auxiliary data
besides the keys and tree pointers). It can be seen as a variation on the
popular pairing heap data structure, extending it with a "power-of-two-choices"
type of heuristic.
Comment: Presented at STOC 2018, light revision, additional figure
Diamond Dicing
In OLAP, analysts often select an interesting sample of the data. For
example, an analyst might focus on products bringing revenues of at least 100
000 dollars, or on shops having sales greater than 400 000 dollars. However,
current systems do not allow the application of both of these thresholds
simultaneously, selecting products and shops satisfying both thresholds. For
such purposes, we introduce the diamond cube operator, filling a gap among
existing data warehouse operations.
Because of the interaction between dimensions, the computation of diamond
cubes is challenging. We compare and test various algorithms on large data sets
of more than 100 million facts. We find that while it is possible to implement
diamonds in SQL, it is inefficient. Indeed, our custom implementation can be a
hundred times faster than popular database engines (including a row-store and a
column-store).
Comment: 29 pages
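The interaction between dimensions mentioned in this abstract can be illustrated with a minimal sketch. It assumes a two-dimensional cube of (product, shop, amount) facts and an "at least" test on both totals, and it repeatedly prunes until both thresholds hold at once. The function name `diamond` and its parameters are hypothetical, and this naive fixed-point loop is not claimed to be any of the algorithms the authors compare.

```python
from collections import defaultdict


def diamond(facts, min_product_total, min_shop_total):
    """Return the facts that survive when both thresholds are enforced together.

    facts: iterable of (product, shop, amount) tuples (hypothetical layout).
    A fact survives only if its product's total and its shop's total, computed
    over the surviving facts, both reach their thresholds.
    """
    facts = list(facts)
    while True:
        product_total = defaultdict(float)
        shop_total = defaultdict(float)
        for product, shop, amount in facts:
            product_total[product] += amount
            shop_total[shop] += amount
        kept = [
            (product, shop, amount)
            for product, shop, amount in facts
            if product_total[product] >= min_product_total
            and shop_total[shop] >= min_shop_total
        ]
        if len(kept) == len(facts):  # nothing pruned: fixed point reached
            return kept
        facts = kept  # pruning one dimension may invalidate the other


# Pruning cascades: dropping product "b" pushes shop "y" below its threshold,
# which in turn pushes product "a" below its own threshold.
cascade = diamond([("a", "x", 80), ("a", "y", 30), ("b", "y", 50)], 100, 70)
stable = diamond([("a", "x", 100), ("a", "y", 10), ("b", "x", 5)], 100, 100)
```

Each pass recomputes the totals from scratch, which keeps the sketch short but would be slow on the data sizes the abstract describes.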
Maximally Consistent Sampling and the Jaccard Index of Probability Distributions
We introduce simple, efficient algorithms for computing a MinHash of a
probability distribution, suitable for both sparse and dense data, with
running times equivalent to the state of the art for both cases. The collision
probability of these algorithms is a new measure of the similarity of positive
vectors which we investigate in detail. We describe the sense in which this
collision probability is optimal for any Locality Sensitive Hash based on
sampling. We argue that this similarity measure is more useful for probability
distributions than the similarity pursued by other algorithms for weighted
MinHash, and is the natural generalization of the Jaccard index.
Comment: To appear in ICDMW 201
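To make the idea of a sampling-based hash of a probability distribution concrete, here is a minimal sketch of one well-known construction, an exponential race driven by shared pseudo-randomness. It is an illustration under stated assumptions and is not claimed to be the algorithm of this paper. The helper names (`_uniform`, `consistent_sample`, `collision_rate`) and the dictionary representation of a distribution are hypothetical.

```python
import hashlib
import math


def _uniform(seed, key):
    """Deterministic pseudo-random number in (0, 1], identical for every caller."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)


def consistent_sample(dist, seed):
    """Pick one key of `dist` (a dict mapping key to positive weight).

    Each key gets an exponential arrival time with rate equal to its weight;
    the earliest arrival wins. Because the underlying uniforms depend only on
    (seed, key), two distributions hashed with the same seed share randomness.
    """
    best_key, best_time = None, math.inf
    for key, weight in dist.items():
        if weight <= 0:
            continue
        arrival = -math.log(_uniform(seed, key)) / weight
        if arrival < best_time:
            best_key, best_time = key, arrival
    return best_key


def collision_rate(p, q, num_hashes=256):
    """Fraction of seeds for which p and q produce the same sample."""
    hits = sum(
        consistent_sample(p, seed) == consistent_sample(q, seed)
        for seed in range(num_hashes)
    )
    return hits / num_hashes


p = {"a": 0.5, "b": 0.3, "c": 0.2}
same = collision_rate(p, dict(p))               # identical inputs always collide
disjoint = collision_rate({"a": 1.0}, {"b": 1.0})  # no shared support, no collisions
```

The empirical collision rate between two distributions then serves as a similarity estimate; the abstract's claims about which similarity this is, and in what sense it is optimal, are the paper's and are not verified by this sketch.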
Budget Feasible Mechanisms
We study a novel class of mechanism design problems in which the outcomes are
constrained by the payments. This basic class of mechanism design problems
captures many common economic situations, and yet it has not been studied, to
our knowledge, in the past. We focus on the case of procurement auctions in
which sellers have private costs, and the auctioneer aims to maximize a utility
function on subsets of items, under the constraint that the sum of the payments
provided by the mechanism does not exceed a given budget. Standard mechanism
design ideas such as the VCG mechanism and its variants are not applicable
here. We show that, for general functions, the budget constraint can render
mechanisms arbitrarily bad in terms of the utility of the buyer. However, our
main result shows that for the important class of submodular functions, a
bounded approximation ratio is achievable. Better approximation results are
obtained for subclasses of the submodular functions. We explore the space of
budget feasible mechanisms in other domains and give a characterization under
more restricted conditions.