
    Wear Minimization for Cuckoo Hashing: How Not to Throw a Lot of Eggs into One Basket

    We study wear-leveling techniques for cuckoo hashing, showing that it is possible to achieve a memory wear bound of log log n + O(1) after the insertion of n items into a table of size Cn, for a suitable constant C, using cuckoo hashing. Moreover, we study our cuckoo hashing method empirically, showing that it significantly improves on the memory wear of classic cuckoo hashing and linear probing in practice. Comment: 13 pages, 1 table, 7 figures; to appear at the 13th Symposium on Experimental Algorithms (SEA 2014).
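    To make the notion of memory wear concrete, the following minimal Python sketch (not the paper's wear-leveling scheme) counts how often each cell of a plain two-choice cuckoo table is written during n insertions; the table-size factor C = 2.5 and the salted hash functions are illustrative assumptions.

```python
import random

class CuckooTable:
    """Toy two-choice cuckoo hash table that records per-cell write counts,
    i.e. the wear each memory location accumulates."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size            # stored keys
        self.writes = [0] * size              # wear counter per cell
        self.salts = (random.getrandbits(64), random.getrandbits(64))

    def _bucket(self, key, i):
        # two hash functions simulated by salted Python hashing
        return hash((self.salts[i], key)) % self.size

    def insert(self, key, max_kicks=500):
        i = 0
        for _ in range(max_kicks):
            pos = self._bucket(key, i)
            self.writes[pos] += 1             # every placement/displacement wears the cell
            key, self.slots[pos] = self.slots[pos], key
            if key is None:
                return
            # the evicted key tries its other bucket next
            i = 1 if self._bucket(key, 0) == pos else 0
        raise RuntimeError("insertion cycle; a real implementation would rehash")

n = 10_000
table = CuckooTable(size=int(2.5 * n))        # table of size Cn, here C = 2.5
for k in range(n):
    table.insert(k)
print("maximum cell wear:", max(table.writes))
```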

    Tight Thresholds for Cuckoo Hashing via XORSAT

    We settle the question of tight thresholds for offline cuckoo hashing. The problem can be stated as follows: we have n keys to be hashed into m buckets, each capable of holding a single key. Each key has k >= 3 (distinct) associated buckets chosen uniformly at random and independently of the choices of other keys. A hash table can be constructed successfully if each key can be placed into one of its buckets. We seek thresholds alpha_k such that, as n goes to infinity, if n/m <= alpha for some alpha < alpha_k, then a hash table can be constructed successfully with high probability, and if n/m >= alpha for some alpha > alpha_k, then with high probability a hash table cannot be constructed successfully. Here we consider the offline version of the problem, where all keys and hash values are given, so the problem is equivalent to previous models of multiple-choice hashing. We find the thresholds for all values of k > 2 by showing that they are in fact the same as the previously known thresholds for the random k-XORSAT problem. We then extend these results to the setting where keys can have differing numbers of choices, and provide evidence, in the form of an algorithm, for a conjecture extending this result to cuckoo hash tables that store multiple keys in a bucket. Comment: Revision 3 adds previously missing proof details as an appendix.
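    The offline feasibility question here is a bipartite matching problem: a table can be built if and only if every key can be matched to one of its k buckets. The following sketch (an illustrative toy, assuming fully random bucket choices and using Kuhn's augmenting-path matching, not the paper's analysis) lets one probe the threshold empirically; for k = 3 the known threshold is roughly 0.917.

```python
import random
import sys

sys.setrecursionlimit(10_000)                 # augmenting paths can recurse deeply

def can_place_all(n, m, k, rng=random):
    """Offline feasibility: each of n keys gets k distinct random buckets out of m;
    a table exists iff a matching covers every key (Kuhn's algorithm)."""
    choices = [rng.sample(range(m), k) for _ in range(n)]
    owner = [-1] * m                          # bucket -> key currently assigned

    def assign(key, seen):
        for b in choices[key]:
            if b not in seen:
                seen.add(b)
                if owner[b] == -1 or assign(owner[b], seen):
                    owner[b] = key
                    return True
        return False

    return all(assign(key, set()) for key in range(n))

# Probe the threshold for k = 3 (known to be about 0.917):
m = 1_000
for load in (0.90, 0.94):
    trials = [can_place_all(int(load * m), m, 3) for _ in range(3)]
    print(f"load {load}: {sum(trials)}/3 constructions succeeded")
```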

    SicHash -- Small Irregular Cuckoo Tables for Perfect Hashing

    A Perfect Hash Function (PHF) is a hash function that has no collisions on a given input set. PHFs can be used for space-efficient storage of data in an array, or for determining a compact representative of each object in the set. In this paper, we present the PHF construction algorithm SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known technique: it places objects in a cuckoo hash table and then stores the final hash function choice of each object in a retrieval data structure. We combine this idea with irregular cuckoo hashing, where each object has a different number of hash functions. Additionally, we use many small tables that we overload beyond their asymptotic maximum load factor. The most space-efficient competitors often use brute-force methods to determine the PHFs. SicHash provides a more direct construction algorithm that only rarely needs to recompute parts. Our implementation improves the state of the art in terms of space usage versus construction time for a wide range of configurations. At the same time, it provides very fast queries.
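    A minimal sketch of the underlying idea (not the actual SicHash construction, which uses irregular choices, many small overloaded tables, and a compressed retrieval structure): place each key in a cuckoo table with d hash-function choices and remember which choice won; evaluating that choice's hash function is then collision-free on the input set. The plain dict standing in for the retrieval structure, the load factor, and d are illustrative assumptions.

```python
import random

def build_phf(keys, d=3, load=0.85, max_kicks=5_000, seed=0):
    """Toy cuckoo-based perfect hash function: insert every key with d choices,
    record the winning choice per key; h_{choice(x)}(x) is then collision-free
    on `keys`. A dict stands in for the compact retrieval structure."""
    m = int(len(keys) / load) + 1
    salt = random.Random(seed).getrandbits(64)
    h = lambda key, i: hash((salt, i, key)) % m

    slot = {}                                 # table position -> (key, choice)
    choice = {}                               # key -> winning choice index
    for key in keys:
        cur, i = key, 0
        for _ in range(max_kicks):
            pos = h(cur, i)
            if pos not in slot:
                slot[pos] = (cur, i)
                choice[cur] = i
                break
            # evict the occupant, place `cur`, retry the occupant with its next choice
            evicted, ev_i = slot[pos]
            slot[pos] = (cur, i)
            choice[cur] = i
            cur, i = evicted, (ev_i + 1) % d
        else:
            raise RuntimeError("construction failed; retry with another seed")

    return lambda key: h(key, choice[key])    # perfect (not minimal) on `keys`

keys = [f"key{j}" for j in range(1_000)]
phf = build_phf(keys)
assert len({phf(k) for k in keys}) == len(keys)   # no collisions on the input set
```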

    Fast Succinct Retrieval and Approximate Membership Using Ribbon

    A retrieval data structure for a static function f: S → {0,1}^r supports queries that return f(x) for any x ∈ S. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ), i.e., a Bloom filter alternative, with false positive rate 2^{-r}. The information-theoretic lower bound for both tasks is r|S| bits. While succinct theoretical constructions using (1+o(1))r|S| bits were known, these could not achieve very small overheads in practice because they have an unfavorable space-time tradeoff hidden in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation BuRR achieves space overheads well below 1% while being faster than most previously used retrieval data structures (typically with space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead ≥ 44%). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality. We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
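    To make the retrieval setting concrete, here is a small GF(2) sketch in the spirit of ribbon retrieval (not BuRR itself: no bumping, no clever banded solver, and the window width, overhead, and hashing are assumptions): each key selects a random bit pattern inside a window of w consecutive table positions, and the table is chosen so that XOR-ing the selected entries yields f(x).

```python
import random

def build_retrieval(data, w=64, overhead=0.10, seed=1):
    """Toy static retrieval over GF(2): for each key, row(key) . Z = f(key),
    where row(key) is a random w-bit pattern placed in a random window and
    Z is the table we solve for (here by plain Gaussian elimination).
    BuRR reaches far smaller overheads with a smarter, banded solver."""
    n = len(data)
    m = int((1 + overhead) * n) + w

    def row(key):
        r = random.Random(hash((seed, key)))
        start = r.randrange(m - w + 1)
        return (r.getrandbits(w) | 1) << start     # non-zero pattern as a big int

    pivots = {}                                    # pivot column -> (row, rhs)
    for key, value in data.items():
        a, b = row(key), value
        while a:
            p = a.bit_length() - 1                 # leading coefficient
            if p not in pivots:
                pivots[p] = (a, b)
                break
            pa, pb = pivots[p]
            a, b = a ^ pa, b ^ pb                  # eliminate against stored pivot
        else:
            raise RuntimeError("dependent rows; retry with another seed")

    Z = [0] * m                                    # free variables stay 0
    for p in sorted(pivots):                       # all other set bits lie below p
        a, b = pivots[p]
        acc, bits = b, a ^ (1 << p)
        while bits:
            j = bits.bit_length() - 1
            acc ^= Z[j]
            bits ^= 1 << j
        Z[p] = acc

    def query(key):
        a, acc = row(key), 0
        while a:
            j = a.bit_length() - 1
            acc ^= Z[j]
            a ^= 1 << j
        return acc
    return query

lookup = build_retrieval({f"k{j}": j % 256 for j in range(1_000)})
assert lookup("k17") == 17
```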

    On randomness in Hash functions

    In the talk, we shall discuss quality measures for hash functions used in data structures and algorithms, and survey positive and negative results. (This talk is not about cryptographic hash functions.) For the analysis of algorithms involving hash functions, it is often convenient to assume that the hash functions used behave fully randomly; in some cases there is no known analysis that avoids this assumption. In practice, one needs to get by with weaker hash functions that can be generated by randomized algorithms. A well-studied range of applications concerns realizations of dynamic dictionaries (linear probing, chained hashing, dynamic perfect hashing, cuckoo hashing and its generalizations) or Bloom filters and their variants. A particularly successful and useful means of classification is Carter and Wegman's notion of universal or k-wise independent classes, introduced in 1977. A natural and widely used approach to analyzing an algorithm involving hash functions is to show that it works if a sufficiently strong universal class of hash functions is used, and then to substitute one of the known constructions of such classes. This invites research into the question of just how much independence in the hash functions is necessary for an algorithm to work. Some recent analyses that gave impossibility results constructed rather artificial classes that would not work; other results pointed out natural, widely used hash classes that would not work in a particular application. Only recently was it shown that, under certain assumptions on the entropy present in the set of keys, even 2-wise independent hash classes lead to strong randomness properties in the hash values. The negative results show that this may not be taken as justification for using weak hash classes indiscriminately, in particular for key sets with structure. When stronger independence properties are needed for a theoretical analysis, one may resort to classic constructions. Only in 2003 was it shown how full randomness can be simulated using only linear space overhead (which is optimal). The "split-and-share" approach can be used to justify the full randomness assumption in some situations in which full randomness is needed for the analysis to go through, such as many applications involving multiple hash functions (e.g., generalized versions of cuckoo hashing with multiple hash functions or larger bucket sizes, load balancing, Bloom filters and variants, or minimal perfect hash function constructions). In practice, efficiency considerations beyond constant factors are important. It is not hard to construct very efficient 2-wise independent classes. Using k-wise independent classes for constant k bigger than 3 has become feasible in practice only through new constructions involving tabulation. This goes together well with the relatively new result that linear probing works with 5-independent hash functions. Recent developments suggest that classifying hash function constructions by their degree of independence alone may not be adequate in some cases. Thus, one may want to analyze the behavior of specific hash classes in specific applications, circumventing the concept of k-wise independence. Several such results were recently achieved for hash functions that utilize tabulation. In particular, if the analysis of the application involves using randomness properties of graphs and hypergraphs (generalized cuckoo hashing, also in the version with a "stash", or load balancing), a hash class combining k-wise independence with tabulation has turned out to be very powerful.
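    For concreteness, two of the standard constructions touched on above, sketched in Python: a Carter/Wegman-style polynomial class (k-wise independent over the prime field; the final reduction to the range [m] makes it only approximately so) and simple tabulation hashing. The Mersenne-prime modulus, the byte split, and the range size are illustrative choices.

```python
import random

MERSENNE_P = (1 << 61) - 1            # prime modulus for polynomial hashing

def k_independent_hash(k, m, rng=random):
    """Classic k-wise independent class: a random degree-(k-1) polynomial
    over a prime field, reduced to the range [m]."""
    coeffs = [rng.randrange(MERSENNE_P) for _ in range(k)]
    def h(x):
        acc, x = 0, x % MERSENNE_P
        for c in coeffs:                       # Horner evaluation mod p
            acc = (acc * x + c) % MERSENNE_P
        return acc % m
    return h

def simple_tabulation_hash(m, key_bytes=4, rng=random):
    """Simple tabulation: split the key into bytes and XOR one random table
    entry per byte. Only 3-wise independent, yet known to behave much better
    than that degree alone suggests in several applications."""
    tables = [[rng.getrandbits(64) for _ in range(256)] for _ in range(key_bytes)]
    def h(x):
        acc = 0
        for i in range(key_bytes):
            acc ^= tables[i][(x >> (8 * i)) & 0xFF]
        return acc % m
    return h

h5 = k_independent_hash(5, 1 << 20)            # e.g. 5-independence for linear probing
ht = simple_tabulation_hash(1 << 20)
print(h5(123456), ht(123456))
```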

    Random hypergraphs for hashing-based data structures

    This thesis concerns dictionaries and related data structures that rely on providing several random possibilities for storing each key. Imagine information on a set S of m = |S| keys should be stored in n memory locations, indexed by [n] = {1, 
, n}. Each object x ∈ S is assigned a small set e(x) ⊆ [n] of locations by a random hash function, independently of other objects. Information on x must then be stored in the locations from e(x) only. It is possible that too many objects compete for the same locations, in particular if the load c = m/n is high. Successfully storing all information may then be impossible. For most distributions of e(x), however, success or failure can be predicted very reliably, since the success probability is close to 1 for loads c less than a certain load threshold c* and close to 0 for loads greater than this threshold. We mainly consider two types of data structures:
    • A cuckoo hash table is a dictionary data structure where each key x ∈ S is stored together with an associated value f(x) in one of the memory locations with an index from e(x). The distribution of e(x) is controlled by the hashing scheme. We analyse three known hashing schemes and determine their exact load thresholds: unaligned blocks, double hashing, and a scheme for dynamically growing key sets.
    • A retrieval data structure also stores a value f(x) for each x ∈ S. This time, the values stored in the memory locations from e(x) must satisfy a linear equation that characterises the value f(x). The resulting data structure is extremely compact, but unusual: it cannot answer questions of the form "is y ∈ S?". Given a key y, it returns a value z. If y ∈ S, then z = f(y) is guaranteed; otherwise z may be an arbitrary value. We consider two new hashing schemes, where the elements of e(x) are contained in one or two contiguous blocks. This yields good access times on a word RAM and high cache efficiency.
    An important question is whether these types of data structures can be constructed in linear time. The success probability of a natural linear-time greedy algorithm exhibits, once again, threshold behaviour with respect to the load c. We identify a hashing scheme that leads to a particularly high threshold value in this regard. In the mathematical model, the memory locations [n] correspond to vertices, and the sets e(x) for x ∈ S correspond to hyperedges. Three properties of the resulting hypergraphs turn out to be important: peelability, solvability and orientability. Large parts of this thesis therefore examine how the hyperedge distribution and the load affect the probabilities with which these properties hold, and derive corresponding thresholds. Translated back into the world of data structures, we achieve low access times, high memory efficiency and low construction times. We complement and support the theoretical results with experiments.
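    The peelability property mentioned above is what the natural linear-time greedy construction exploits. The following sketch (assuming fully random hyperedges, so none of the thesis's specific schemes such as unaligned blocks or double hashing) peels a random k-uniform hypergraph and can be used to observe the threshold, which for k = 3 is roughly a load of 0.818.

```python
import random
from collections import deque

def peelable(n_vertices, n_edges, k=3, rng=random):
    """Repeatedly remove an edge that contains a vertex of degree 1.
    If every edge disappears, the hypergraph is peelable, which is what
    makes the greedy linear-time construction succeed."""
    edges = [rng.sample(range(n_vertices), k) for _ in range(n_edges)]
    incident = [[] for _ in range(n_vertices)]
    for e, vs in enumerate(edges):
        for v in vs:
            incident[v].append(e)

    degree = [len(incident[v]) for v in range(n_vertices)]
    alive = [True] * n_edges
    queue = deque(v for v in range(n_vertices) if degree[v] == 1)
    removed = 0
    while queue:
        v = queue.popleft()
        e = next((e for e in incident[v] if alive[e]), None)
        if e is None:
            continue                           # v's last edge was already peeled
        alive[e] = False
        removed += 1
        for u in edges[e]:
            degree[u] -= 1
            if degree[u] == 1:
                queue.append(u)
    return removed == n_edges

n = 100_000
for load in (0.80, 0.83):                      # threshold for k = 3 is about 0.818
    print(f"load {load}: peelable = {peelable(n, int(load * n))}")
```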
