More Analysis of Double Hashing for Balanced Allocations
With double hashing, for a key $x$, one generates two hash values $f(x)$ and
$g(x)$, and then uses combinations $(f(x) + i g(x)) \bmod n$ for $i = 0, 1, 2, \ldots$
to generate multiple hash values in the range $[0, n-1]$ from the initial two.
For balanced allocations, keys are hashed into a hash table where each bucket
can hold multiple keys, and each key is placed in the least loaded of $d$
choices. It has been shown previously that asymptotically the performance of
double hashing and fully random hashing is the same in the balanced allocation
paradigm using fluid limit methods. Here we extend a coupling argument used by
Lueker and Molodowitch to show that double hashing and ideal uniform hashing
are asymptotically equivalent in the setting of open address hash tables to the
balanced allocation setting, providing further insight into this phenomenon. We
also discuss the potential for and bottlenecks limiting the use of this approach
for other multiple choice hashing schemes. Comment: 13 pages; current draft; will be submitted to conference shortly
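The allocation rule described in this abstract can be illustrated with a minimal sketch. The function names are hypothetical, and seeded pseudo-random draws stand in for the two hash values of each key. The sketch assumes the table size n is prime, so that a nonzero stride yields d distinct buckets; it is not taken from the paper.

```python
import random


def double_hash_choices(f, g, d, n):
    """Return the d bucket indices (f + i*g) mod n for i = 0, ..., d-1."""
    return [(f + i * g) % n for i in range(d)]


def balanced_allocation(num_keys, n, d, seed=0):
    """Place each key in the least loaded of its d double-hashing choices.

    Assumption: n is prime and d < n, so the d choices of a key are distinct.
    """
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(num_keys):
        f = rng.randrange(n)       # stand-in for the first hash value
        g = rng.randrange(1, n)    # stand-in for the second hash value (nonzero stride)
        choices = double_hash_choices(f, g, d, n)
        target = min(choices, key=lambda b: loads[b])  # ties go to the earliest choice
        loads[target] += 1
    return loads
```

Calling `balanced_allocation(1000, 101, 3)` returns the list of bucket loads, whose maximum can then be compared against a run with d independent random choices.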
Dense peelable random uniform hypergraphs
We describe a new family of k-uniform hypergraphs with independent random edges. The hypergraphs have a high probability of being peelable, i.e. to admit no sub-hypergraph of minimum degree 2, even when the edge density (number of edges over vertices) is close to 1.
In our construction, the vertex set is partitioned into linearly arranged segments and each edge is incident to random vertices of k consecutive segments. Quite surprisingly, the linear geometry allows our graphs to be peeled "from the outside in". The density thresholds f_k for peelability of our hypergraphs (f_3 ~ 0.918, f_4 ~ 0.977, f_5 ~ 0.992, ...) are well beyond the corresponding thresholds (c_3 ~ 0.818, c_4 ~ 0.772, c_5 ~ 0.702, ...) of standard k-uniform random hypergraphs.
To get a grip on f_k, we analyse an idealised peeling process on the random weak limit of our hypergraph family. The process can be described in terms of an operator on [0,1]^Z and f_k can be linked to thresholds relating to the operator. These thresholds are then tractable with numerical methods.
Random hypergraphs underlie the construction of various data structures based on hashing, for instance invertible Bloom filters, perfect hash functions, retrieval data structures, error correcting codes and cuckoo hash tables, where inputs are mapped to edges using hash functions. Frequently, the data structures rely on peelability of the hypergraph or peelability allows for simple linear time algorithms. Memory efficiency is closely tied to edge density while worst and average case query times are tied to maximum and average edge size.
To demonstrate the usefulness of our construction, we used our 3-uniform hypergraphs as a drop-in replacement for the standard 3-uniform hypergraphs in a retrieval data structure by Botelho et al. [Fabiano Cupertino Botelho et al., 2013]. This reduces memory usage from 1.23m bits to 1.12m bits (m being the input size) with almost no change in running time. Using k > 3 attains, at small sacrifices in running time, further improvements to memory usage.
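Peeling and the segmented edge layout can be sketched as follows. This is a loose illustration under stated assumptions: `segmented_edges` picks a uniformly random window of k consecutive segments and one uniformly random vertex per segment, which may differ in detail from the paper's distribution, and all names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def peel(num_vertices, edges):
    """Repeatedly remove a vertex of degree 1 together with its only edge.

    Returns the removal order as (edge_index, vertex) pairs, or None if a
    non-empty sub-hypergraph of minimum degree at least 2 remains.
    """
    incident = defaultdict(set)
    for idx, edge in enumerate(edges):
        for v in edge:
            incident[v].add(idx)
    stack = [v for v in range(num_vertices) if len(incident[v]) == 1]
    removed = [False] * len(edges)
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:   # degree changed since v was pushed
            continue
        (idx,) = incident[v]
        removed[idx] = True
        order.append((idx, v))
        for u in edges[idx]:
            incident[u].discard(idx)
            if len(incident[u]) == 1:
                stack.append(u)
    return order if all(removed) else None


def segmented_edges(num_edges, num_segments, segment_size, k, rng):
    """Sample edges with one random vertex in each of k consecutive segments."""
    edges = []
    for _ in range(num_edges):
        start = rng.randrange(num_segments - k + 1)
        edges.append(tuple((start + j) * segment_size + rng.randrange(segment_size)
                           for j in range(k)))
    return edges
```

Running `peel` on the output of `segmented_edges` at various edge densities gives a rough empirical feel for where peelability starts to fail.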
Random hypergraphs for hashing-based data structures
This thesis concerns dictionaries and related data structures that rely on providing several random possibilities for storing each key. Imagine information on a set S of m = |S| keys should be stored in n memory locations, indexed by [n] = {1,…,n}. Each object x ∈ S is assigned a small set e(x) ⊆ [n] of locations by a random hash function, independent of other objects. Information on x must then be stored in the locations from e(x) only. It is possible that too many objects compete for the same locations, in particular if the load c = m/n is high. Successfully storing all information may then be impossible. For most distributions of e(x), however, success or failure can be predicted very reliably, since the success probability is close to 1 for loads c less than a certain load threshold c^* and close to 0 for loads greater than this load threshold. We mainly consider two types of data structures:
• A cuckoo hash table is a dictionary data structure where each key x ∈ S is stored together with an associated value f(x) in one of the memory locations with an index from e(x). The distribution of e(x) is controlled by the hashing scheme. We analyse three known hashing schemes, and determine their exact load thresholds. The schemes are unaligned blocks, double hashing and a scheme for dynamically growing key sets.
• A retrieval data structure also stores a value f(x) for each x ∈ S. This time, the values stored in the memory locations from e(x) must satisfy a linear equation that characterises the value f(x). The resulting data structure is extremely compact, but unusual. It cannot answer questions of the form "is y ∈ S?". Given a key y it returns a value z. If y ∈ S, then z = f(y) is guaranteed, otherwise z may be an arbitrary value. We consider two new hashing schemes, where the elements of e(x) are contained in one or two contiguous blocks. This yields good access times on a word RAM and high cache efficiency.
An important question is whether these types of data structures can be constructed in linear time. The success probability of a natural linear time greedy algorithm exhibits, once again, threshold behaviour with respect to the load c. We identify a hashing scheme that leads to a particularly high threshold value in this regard. In the mathematical model, the memory locations [n] correspond to vertices, and the sets e(x) for x ∈ S correspond to hyperedges. Three properties of the resulting hypergraphs turn out to be important: peelability, solvability and orientability. Therefore, large parts of this thesis examine how hyperedge distribution and load affect the probabilities with which these properties hold and derive corresponding thresholds. Translated back into the world of data structures, we achieve low access times, high memory efficiency and low construction times. We complement and support the theoretical results by experiments.
Balanced Allocations and Double Hashing
Double hashing has recently found more common usage in schemes that use
multiple hash functions. In double hashing, for an item $x$, one generates two
hash values $f(x)$ and $g(x)$, and then uses combinations $(f(x) + k g(x)) \bmod n$ for $k = 0, 1, 2, \ldots$ to generate multiple hash values from the initial two. We
first perform an empirical study showing that, surprisingly, the performance
difference between double hashing and fully random hashing appears negligible
in the standard balanced allocation paradigm, where each item is placed in the
least loaded of $d$ choices, as well as several related variants. We then
provide theoretical results that explain the behavior of double hashing in this
context. Comment: Further updated, small improvements/typos fixed
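An empirical comparison of this kind can be approximated with a small harness. The sketch below is only an illustration: `max_load` is a hypothetical name, seeded pseudo-random draws replace real hash functions, and the double-hashing branch assumes a prime table size so that the choices of an item are distinct.

```python
import random


def max_load(num_items, n, d, use_double_hashing, seed=0):
    """Throw num_items items into n bins, each into the least loaded of d choices.

    With use_double_hashing=True the choices form an arithmetic progression
    (f + k*g) mod n; otherwise they are d independent uniform draws.
    """
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(num_items):
        if use_double_hashing:
            f = rng.randrange(n)
            g = rng.randrange(1, n)   # nonzero stride; distinct choices if n is prime
            choices = [(f + k * g) % n for k in range(d)]
        else:
            choices = [rng.randrange(n) for _ in range(d)]
        best = min(choices, key=loads.__getitem__)
        loads[best] += 1
    return max(loads)
```

Comparing `max_load(n, n, d, True, seed)` with `max_load(n, n, d, False, seed)` over many seeds gives two empirical distributions of the maximum load to set side by side.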
Linear Programming Relaxations for Goldreich's Generators over Non-Binary Alphabets
Goldreich suggested candidates of one-way functions and pseudorandom
generators included in $\mathsf{NC}^0$. It is known that randomly generated
Goldreich's generator using -wise independent predicates with input
variables and output variables is not a pseudorandom generator with
high probability for sufficiently large constant . Most of the previous
works assume that the alphabet is binary and use techniques available only for
the binary alphabet. In this paper, we deal with a non-binary generalization of
Goldreich's generator and derive the tight threshold for the linear programming
relaxation attack using the local marginal polytope for randomly generated
Goldreich's generators. We assume that input
variables are known. In that case, we show that when , there is an
exact threshold
such
that for , the LP relaxation can determine
linearly many input variables of Goldreich's generator if
, and that the LP relaxation cannot determine
input variables of Goldreich's generator if
. This paper uses a characterization of LP solutions by
combinatorial structures called stopping sets on a bipartite graph, which is
related to a simple algorithm called the peeling algorithm. Comment: 14 pages, 1 figure
Load thresholds for cuckoo hashing with double hashing
In k-ary cuckoo hashing, each of cn objects is associated with k random buckets in a hash table of size n. An l-orientation is an assignment of objects to associated buckets such that each bucket receives at most l objects. Several works have determined load thresholds c^* = c^*(k,l) for k-ary cuckoo hashing; that is, for c < c^* an l-orientation exists with high probability, and for c > c^* no l-orientation exists with high probability.
A natural variant of k-ary cuckoo hashing utilizes double hashing, where, when the buckets are numbered 0,1,...,n-1, the k choices of random buckets form an arithmetic progression modulo n. Double hashing simplifies implementation and requires less randomness, and it has been shown that double hashing has the same behavior as fully random hashing in several other data structures that similarly use multiple hashes for each object. Interestingly, previous work has come close to but has not fully shown that the load threshold for k-ary cuckoo hashing is the same when using double hashing as when using fully random hashing. Specifically, previous work has shown that the thresholds for both settings coincide, except that for double hashing it was possible that o(n) objects would have been left unplaced. Here we close this open question by showing the thresholds are indeed the same, by providing a combinatorial argument that reconciles this stubborn difference.
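The notions of arithmetic-progression buckets and l-orientations can be made concrete with a short sketch. The names `progression_buckets` and `find_orientation` are hypothetical, and the orientation search is a textbook augmenting-path matching with bucket capacities, chosen here for brevity; it is not the argument or algorithm of the paper, and its recursion is only suitable for small inputs.

```python
def progression_buckets(start, stride, k, n):
    """The k buckets start, start+stride, ..., taken modulo n."""
    return [(start + i * stride) % n for i in range(k)]


def find_orientation(choices, n, capacity):
    """Try to assign every object to one of its buckets, at most `capacity` per bucket.

    choices[obj] lists the candidate buckets of object obj. Returns the list of
    bucket contents, or None if no such orientation exists.
    """
    assigned = [[] for _ in range(n)]

    def place(obj, visited):
        for b in choices[obj]:
            if b in visited:
                continue
            visited.add(b)
            if len(assigned[b]) < capacity:
                assigned[b].append(obj)
                return True
            # Bucket is full: try to move one occupant elsewhere.
            for pos, other in enumerate(assigned[b]):
                if place(other, visited):
                    assigned[b][pos] = obj
                    return True
        return False

    for obj in range(len(choices)):
        if not place(obj, set()):
            return None
    return assigned
```

Feeding `find_orientation` with choices built from `progression_buckets` versus fully random buckets at a fixed load c is one way to probe the threshold behaviour on small instances.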
Simple Set Sketching
Imagine handling collisions in a hash table by storing, in each cell, the
bit-wise exclusive-or of the set of keys hashing there. This appears to be a
terrible idea: For $\alpha n$ keys and $n$ buckets, where $\alpha$ is constant,
we expect that a constant fraction of the keys will be unrecoverable due to
collisions.
We show that if this collision resolution strategy is repeated three times
independently the situation reverses: If $\alpha$ is below a threshold of
$\approx 0.81$ then we can recover the set of all inserted keys in linear time
with high probability.
Even though the description of our data structure is simple, its analysis is
nontrivial. Our approach can be seen as a variant of the Invertible Bloom
Filter (IBF) of Eppstein and Goodrich. While IBFs involve an explicit checksum
per bucket to decide whether the bucket stores a single key, we exploit the
idea of quotienting, namely that some bits of the key are implicit in the
location where it is stored. We let those serve as an implicit checksum. These
bits are not quite enough to ensure that no errors occur and the main technical
challenge is to show that decoding can recover from these errors. Comment: To be published at SIAM Symposium on Simplicity in Algorithms
(SOSA23)
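The XOR-based storage and peeling-style recovery can be sketched in a simplified form. Note the substitution: instead of the implicit checksum obtained by quotienting, this sketch keeps an explicit per-cell counter, which is closer to an Invertible Bloom Filter without checksums. The function names, the use of BLAKE2b as the hash, and the layout as three separate tables are all assumptions made for illustration.

```python
import hashlib


def cell_index(key, table, cells_per_table):
    """Hash a key (an integer with 0 <= key < 2**64) to a cell of the given table."""
    data = bytes([table]) + key.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % cells_per_table


def build_sketch(keys, cells_per_table):
    """Store every key in three tables by XOR-ing it into one cell per table."""
    xors = [[0] * cells_per_table for _ in range(3)]
    counts = [[0] * cells_per_table for _ in range(3)]
    for key in keys:
        for t in range(3):
            c = cell_index(key, t, cells_per_table)
            xors[t][c] ^= key
            counts[t][c] += 1
    return xors, counts


def decode(xors, counts):
    """Recover the key set by peeling cells that hold exactly one key.

    Returns the set of keys, or None if peeling gets stuck.
    """
    xors = [row[:] for row in xors]       # work on copies
    counts = [row[:] for row in counts]
    m = len(xors[0])
    pending = [(t, c) for t in range(3) for c in range(m) if counts[t][c] == 1]
    recovered = set()
    while pending:
        t, c = pending.pop()
        if counts[t][c] != 1:
            continue
        key = xors[t][c]
        recovered.add(key)
        for t2 in range(3):               # remove the key from all three tables
            c2 = cell_index(key, t2, m)
            xors[t2][c2] ^= key
            counts[t2][c2] -= 1
            if counts[t2][c2] == 1:
                pending.append((t2, c2))
    if any(count for row in counts for count in row):
        return None
    return recovered
```

The explicit counter costs extra space per cell; avoiding that cost is exactly what the implicit checksum in the abstract is for.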
A Novel Approach to Finding Near-Cliques: The Triangle-Densest Subgraph Problem
Many graph mining applications rely on detecting subgraphs which are
near-cliques. There exists a dichotomy between the results in the existing work
related to this problem: on the one hand the densest subgraph problem (DSP)
which maximizes the average degree over all subgraphs is solvable in polynomial
time but for many networks fails to find subgraphs which are near-cliques. On
the other hand, formulations that are geared towards finding near-cliques are
NP-hard and frequently inapproximable due to connections with the Maximum
Clique problem.
In this work, we propose a formulation which combines the best of both
worlds: it is solvable in polynomial time and finds near-cliques when the DSP
fails. Surprisingly, our formulation is a simple variation of the DSP.
Specifically, we define the triangle densest subgraph problem (TDSP): given
$G(V,E)$, find a subset of vertices $S^*$ such that $\tau(S^*) = \max_{S \subseteq V} \frac{t(S)}{|S|}$, where $t(S)$ is the number of triangles induced
by the set $S$. We provide various exact and approximation algorithms which
solve the TDSP efficiently. Furthermore, we show how our algorithms adapt to
the more general problem of maximizing the $k$-clique average density. Finally,
we provide empirical evidence that the TDSP should be used whenever the output
of the DSP fails to output a near-clique. Comment: 42 pages
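The objective, the number of induced triangles divided by the number of vertices, can be computed directly on small graphs. The sketch below also includes a greedy heuristic that repeatedly deletes the vertex lying in the fewest triangles and keeps the best subset seen. It is written in the spirit of greedy peeling for the DSP and is not claimed to be one of the paper's algorithms; the brute-force triangle counting is only meant for small inputs, and all names are hypothetical.

```python
from itertools import combinations


def triangles(adj, subset):
    """Number of triangles induced by `subset`; adj maps a vertex to its neighbour set."""
    return sum(1 for u, v, w in combinations(sorted(subset), 3)
               if v in adj[u] and w in adj[u] and w in adj[v])


def triangle_density(adj, subset):
    """Triangles induced by a non-empty subset, divided by its size."""
    return triangles(adj, subset) / len(subset)


def local_triangles(adj, v, current):
    """Number of triangles inside `current` that contain vertex v."""
    neighbours = sorted(adj[v] & current)
    return sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])


def greedy_peel(adj):
    """Delete a vertex in the fewest triangles each round; return the best subset seen."""
    current = set(adj)
    best, best_density = set(current), triangle_density(adj, current)
    while len(current) > 1:
        victim = min(sorted(current), key=lambda v: local_triangles(adj, v, current))
        current.remove(victim)
        density = triangle_density(adj, current)
        if density > best_density:
            best, best_density = set(current), density
    return best, best_density
```

On a 4-clique with a short path attached, for example, the heuristic strips the path first and ends up reporting the clique.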