Fast evaluation of union-intersection expressions
We show how to represent sets in a linear space data structure such that
expressions involving unions and intersections of sets can be computed in a
worst-case efficient way. This problem has applications in e.g. information
retrieval and database systems. We mainly consider the RAM model of
computation, and sets of machine words, but also state our results in the I/O
model. On a RAM with word size , a special case of our result is that the
intersection of (preprocessed) sets, containing elements in total, can
be computed in expected time , where is the
number of elements in the intersection. If the first of the two terms
dominates, this is a factor faster than the standard solution of
merging sorted lists. We show a cell probe lower bound of time , meaning that our upper bound is nearly
optimal for small . Our algorithm uses a novel combination of approximate
set representations and word-level parallelism.
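As a point of reference for the abstract above, the following minimal sketch shows the "standard solution of merging sorted lists" for two sets, next to a toy illustration of word-level parallelism using bitmaps. It is not the paper's algorithm; the function names and the use of Python integers as bitmaps are assumptions made for illustration only.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of distinct integers by a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def to_bitmap(elements):
    """Encode a set of small non-negative integers as one integer bitmap."""
    bits = 0
    for x in elements:
        bits |= 1 << x
    return bits


def from_bitmap(bits):
    """Decode an integer bitmap back into a sorted list of elements."""
    out = []
    x = 0
    while bits:
        if bits & 1:
            out.append(x)
        bits >>= 1
        x += 1
    return out


# A single AND handles many elements at once, which is the basic idea
# behind word-level parallelism (illustration only, small universes).
common = from_bitmap(to_bitmap({1, 3, 5}) & to_bitmap({3, 5, 9}))
```

The merge touches every element once, whereas the bitmap AND processes a whole word of candidate elements per operation; the bitmap variant is only practical here because the universe is tiny.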
Finding Associations and Computing Similarity via Biased Pair Sampling
This version is ***superseded*** by a full version that can be found at
http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger
theoretical results and fixes a mistake in the reporting of experiments.
Abstract: Sampling-based methods have previously been proposed for the
problem of finding interesting associations in data, even for low-support
items. While these methods do not guarantee precise results, they can be vastly
more efficient than approaches that rely on exact counting. However, for many
similarity measures no such methods have been known. In this paper we show how
a wide variety of measures can be supported by a simple biased sampling method.
The method also extends to find high-confidence association rules. We
demonstrate theoretically that our method is superior to exact methods when the
threshold for "interesting similarity/confidence" is above the average pairwise
similarity/confidence, and the average support is not too low. Our method is
particularly good when transactions contain many items. We confirm in
experiments on standard association mining benchmarks that this gives a
significant speedup on real data sets (sometimes much larger than the
theoretical guarantees). Reductions in computation time of over an order of
magnitude, and significant savings in space, are observed.
Comment: This is an extended version of a paper that appeared at the IEEE
International Conference on Data Mining, 2009. The conference version is (c)
2009 IEE
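For context on the exact-counting baseline that the abstract above compares against, here is a minimal sketch that counts item and pair supports exactly and derives two measures from them. Jaccard similarity is used as an example measure and is an assumption, as are all function names; this is not the paper's biased sampling method.

```python
from collections import Counter
from itertools import combinations


def pair_statistics(transactions):
    """Count exact supports of single items and of unordered item pairs."""
    item_support = Counter()
    pair_support = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_support.update(items)
        # Quadratic in the transaction length: every pair is enumerated.
        pair_support.update(combinations(items, 2))
    return item_support, pair_support


def jaccard(a, b, item_support, pair_support):
    """Jaccard similarity: co-occurrences over occurrences of either item."""
    both = pair_support[tuple(sorted((a, b)))]
    return both / (item_support[a] + item_support[b] - both)


def confidence(a, b, item_support, pair_support):
    """Confidence of the rule a -> b: support(a, b) / support(a)."""
    return pair_support[tuple(sorted((a, b)))] / item_support[a]


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
items, pairs = pair_statistics(transactions)
```

Because every pair in every transaction is enumerated, the cost grows quadratically with transaction length, which is the regime where the abstract says sampling helps most.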
Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes
Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string
over S. A "secondary index" for x answers alphabet range queries of the form:
Given a range [a_l,a_r] over S, return the set I_{[a_l;a_r]} = {i |x_i \in
[a_l; a_r]}. Secondary indexes are heavily used in relational databases and
scientific data analysis. It is well-known that the obvious solution, storing a
dictionary for the position set associated with each character, does not always
give optimal query time. In this paper we give the first theoretically optimal
data structure for the secondary indexing problem. In the I/O model, the amount
of data read when answering a query is within a constant factor of the minimum
space needed to represent I_{[a_l;a_r]}, assuming that the size of internal
memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space
usage of the data structure is O(n log |S|) bits in the worst case, and we
further show how to bound the size of the data structure in terms of the 0-th
order entropy of x. We show how to support updates achieving various time-space
trade-offs.
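The "obvious solution" mentioned above, one position set per character, can be sketched as follows. This is only the baseline the abstract argues is not always optimal, not the paper's data structure; the function names are hypothetical, and positions are 1-based to match x_1 ... x_n.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(x):
    """Map each character of x to the sorted list of its 1-based positions."""
    index = defaultdict(list)
    for i, ch in enumerate(x, start=1):
        index[ch].append(i)
    return dict(index)


def range_query(index, a_l, a_r):
    """Return all positions i with a_l <= x_i <= a_r, in increasing order."""
    chars = sorted(index)
    lo = bisect_left(chars, a_l)
    hi = bisect_right(chars, a_r)
    positions = []
    for ch in chars[lo:hi]:
        positions.extend(index[ch])
    return sorted(positions)


index = build_index("banana")
result = range_query(index, "a", "b")
```

A wide range forces this baseline to read and merge one list per character in the range, which hints at why it can be far from the size of the answer's most compact encoding.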
We also consider an approximate version of the basic secondary indexing
problem where a query reports a superset of I_{[a_l;a_r]} containing each
element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon >
0 is the false positive probability. For this problem the amount of data that
needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}|
log(1/epsilon)) bits.
Comment: 16 page
Thresholds for Extreme Orientability
Multiple-choice load balancing has been a topic of intense study since the
seminal paper of Azar, Broder, Karlin, and Upfal. Questions in this area can be
phrased in terms of orientations of a graph, or more generally a k-uniform
random hypergraph. A (d,b)-orientation is an assignment of each edge to d of
its vertices, such that no vertex has more than b edges assigned to it.
Conditions for the existence of such orientations have been completely
documented except for the "extreme" case of (k-1,1)-orientations. We consider
this remaining case, and establish:
- The density threshold below which an orientation exists with high
probability, and above which it does not exist with high probability.
- An algorithm for finding an orientation that runs in linear time with high
probability, with explicit polynomial bounds on the failure probability.
Previously, the only known algorithms for constructing (k-1,1)-orientations
worked for k<=3, and were only shown to have expected linear running time.
Comment: Corrected description of relationship to the work of LeLarg
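The definition of a (d,b)-orientation given above can be made concrete with a small checker. This sketch only verifies a given assignment; it does not implement the paper's linear-time construction, and the function name and input layout are assumptions.

```python
from collections import Counter


def is_orientation(edges, assignment, d, b):
    """Check that each edge is assigned to d of its own vertices and that
    no vertex receives more than b edges.

    edges[i] is a tuple of vertices; assignment[i] holds the vertices that
    edge i is assigned to.
    """
    if len(edges) != len(assignment):
        return False
    load = Counter()
    for edge, chosen in zip(edges, assignment):
        chosen_set = set(chosen)
        # Exactly d distinct vertices, all belonging to the edge itself.
        if len(chosen) != d or len(chosen_set) != d:
            return False
        if not chosen_set <= set(edge):
            return False
        load.update(chosen_set)
    return all(count <= b for count in load.values())


# k = 3, so the "extreme" case is a (2,1)-orientation.
edges = [(0, 1, 2), (2, 3, 4)]
ok = is_orientation(edges, [(0, 1), (2, 3)], d=2, b=1)
clash = is_orientation(edges, [(0, 2), (2, 3)], d=2, b=1)
```

In the second assignment vertex 2 is used by both edges, so the load bound b = 1 is violated.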
Towards Interactive, Incremental Programming of ROS Nodes
Writing software for controlling robots is a complex task, usually demanding
command of many programming languages and requiring significant
experimentation. We believe that a bottom-up development process that
complements traditional component- and MDSD-based approaches can facilitate
experimentation. We propose the use of an internal DSL providing both a tool to
interactively create ROS nodes and a behaviour-replacement mechanism to
interactively reshape existing ROS nodes by wrapping the external interfaces
(the publish/subscribe topics), dynamically controlled using the Python command
line interface.
Comment: Presented at DSLRob 2014 (arXiv:cs/1411.7148
Linear probing with constant independence
Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated and space-consuming hash functions, or on the unrealistic assumption of free access to a truly random hash function. Already Carter and Wegman, in their seminal paper on universal hashing, raised the question of extending their analysis to linear probing. However, we show in this paper that linear probing using a pairwise independent family may have expected logarithmic cost per operation. On the positive side, we show that 5-wise independence is enough to ensure constant expected time per operation. This resolves the question of finding a space- and time-efficient hash function that provably ensures good performance for linear probing.
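To illustrate the setting of the abstract above, here is a minimal sketch of a linear probing set whose hash function is a random degree-4 polynomial over a prime field, a standard way to obtain a 5-wise independent family. The class name, the choice of prime, the load-factor bound, and the final reduction modulo the table size (which makes the family only approximately 5-wise independent on table slots) are all assumptions, not details taken from the paper.

```python
import random


class LinearProbingSet:
    PRIME = (1 << 61) - 1  # Mersenne prime; keys are assumed to be in [0, PRIME)

    def __init__(self, capacity=16, seed=None):
        rng = random.Random(seed)
        # Five random coefficients define a degree-4 polynomial.
        self.coeffs = [rng.randrange(self.PRIME) for _ in range(5)]
        self.slots = [None] * capacity
        self.size = 0

    def _hash(self, key):
        h = 0
        for c in self.coeffs:  # Horner's rule
            h = (h * key + c) % self.PRIME
        return h % len(self.slots)

    def _probe(self, key):
        """Scan forward from the hash slot until key or an empty slot."""
        i = self._hash(key)
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % len(self.slots)
        return i

    def add(self, key):
        if 2 * (self.size + 1) > len(self.slots):
            self._grow()  # keep the load factor at most 1/2
        i = self._probe(key)
        if self.slots[i] is None:
            self.slots[i] = key
            self.size += 1

    def __contains__(self, key):
        return self.slots[self._probe(key)] == key

    def _grow(self):
        old = [k for k in self.slots if k is not None]
        self.slots = [None] * (2 * len(self.slots))
        self.size = 0
        for k in old:
            self.slots[self._probe(k)] = k
            self.size += 1


s = LinearProbingSet(capacity=8, seed=1)
for k in range(100):
    s.add(7 * k)
```

Keeping the table at most half full guarantees that every probe sequence reaches an empty slot, so lookups of absent keys always terminate.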