    Balanced Allocations and Double Hashing

    Double hashing has recently found more common usage in schemes that use multiple hash functions. In double hashing, for an item $x$, one generates two hash values $f(x)$ and $g(x)$, and then uses combinations $(f(x) + k g(x)) \bmod n$ for $k = 0, 1, 2, \ldots$ to generate multiple hash values from the initial two. We first perform an empirical study showing that, surprisingly, the performance difference between double hashing and fully random hashing appears negligible in the standard balanced allocation paradigm, where each item is placed in the least loaded of $d$ choices, as well as several related variants. We then provide theoretical results that explain the behavior of double hashing in this context. Comment: Further updated, small improvements/typos fixe
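
The construction in this abstract can be illustrated with a minimal sketch. It is not the authors' experimental code: the function names, the use of Python's `random` module as a stand-in for the two base hash values $f(x)$ and $g(x)$, and the choice of a nonzero stride are all assumptions made for illustration.

```python
import random


def double_hash_choices(f_x, g_x, d, n):
    """Return the d bin indices (f(x) + k*g(x)) mod n for k = 0, ..., d-1."""
    return [(f_x + k * g_x) % n for k in range(d)]


def simulate_double_hashing(num_items, n, d, seed=0):
    """Place num_items items into n bins, each in the least loaded of d choices.

    Assumption: f(x) and g(x) are modelled as fresh random draws per item,
    and g(x) is drawn from 1..n-1 so the stride is never zero. If n is
    prime and d <= n, the d choices of an item are then all distinct.
    """
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(num_items):
        f_x = rng.randrange(n)
        g_x = rng.randrange(1, n)
        choices = double_hash_choices(f_x, g_x, d, n)
        # Ties go to the choice with the smallest k.
        target = min(choices, key=lambda b: loads[b])
        loads[target] += 1
    return loads


if __name__ == "__main__":
    loads = simulate_double_hashing(num_items=10_007, n=10_007, d=3)
    print("maximum load:", max(loads))
```

Only two base values are drawn per item regardless of $d$, which is the practical appeal of the scheme compared with $d$ independent hash evaluations.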

    On randomness in Hash functions

    In the talk, we shall discuss quality measures for hash functions used in data structures and algorithms, and survey positive and negative results. (This talk is not about cryptographic hash functions.) For the analysis of algorithms involving hash functions, it is often convenient to assume the hash functions used behave fully randomly; in some cases there is no analysis known that avoids this assumption. In practice, one needs to get by with weaker hash functions that can be generated by randomized algorithms. A well-studied range of applications concerns realizations of dynamic dictionaries (linear probing, chained hashing, dynamic perfect hashing, cuckoo hashing and its generalizations) or Bloom filters and their variants. A particularly successful and useful means of classification are Carter and Wegman's universal or k-wise independent classes, introduced in 1977. A natural and widely used approach to analyzing an algorithm involving hash functions is to show that it works if a sufficiently strong universal class of hash functions is used, and to substitute one of the known constructions of such classes. This invites research into the question of just how much independence in the hash functions is necessary for an algorithm to work. Some recent analyses that gave impossibility results constructed rather artificial classes that would not work; other results pointed out natural, widely used hash classes that would not work in a particular application. Only recently was it shown that, under certain assumptions on some entropy present in the set of keys, even 2-wise independent hash classes will lead to strong randomness properties in the hash values. The negative results show that these results may not be taken as justification for using weak hash classes indiscriminately, in particular for key sets with structure. When stronger independence properties are needed for a theoretical analysis, one may resort to classic constructions.
    Only in 2003 was it discovered how full randomness can be simulated using only linear space overhead (which is optimal). The "split-and-share" approach can be used to justify the full randomness assumption in some situations in which full randomness is needed for the analysis to go through, as in many applications involving multiple hash functions (e.g., generalized versions of cuckoo hashing with multiple hash functions or larger bucket sizes, load balancing, Bloom filters and variants, or minimal perfect hash function constructions). For practice, efficiency considerations beyond constant factors are important. It is not hard to construct very efficient 2-wise independent classes. Using k-wise independent classes for constant k bigger than 3 has become feasible in practice only by new constructions involving tabulation. This fits well with the quite new result that linear probing works with 5-independent hash functions. Recent developments suggest that the classification of hash function constructions by their degree of independence alone may not be adequate in some cases. Thus, one may want to analyze the behavior of specific hash classes in specific applications, circumventing the concept of k-wise independence. Several such results were recently achieved concerning hash functions that utilize tabulation. In particular, if the analysis of the application involves using randomness properties in graphs and hypergraphs (generalized cuckoo hashing, also in the version with a "stash", or load balancing), a hash class combining k-wise independence with tabulation has turned out to be very powerful.
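
As one concrete instance of an efficient 2-wise independent class, the sketch below uses the classic multiply-add construction over a prime field. The abstract does not name a specific construction, so the choice of this one, the prime $2^{61}-1$, and the function name are assumptions for illustration.

```python
import random

# Mersenne prime 2^61 - 1; keys are assumed to be integers in [0, P).
P = (1 << 61) - 1


def make_multiply_add_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m with a, b uniform in [0, P).

    Before the final reduction mod m, the values (a*x + b) mod P are
    pairwise independent and uniform over [0, P) for distinct keys below P.
    The reduction mod m adds a small bias when m does not divide P.
    """
    a = rng.randrange(P)
    b = rng.randrange(P)

    def h(x):
        return ((a * x + b) % P) % m

    return h


if __name__ == "__main__":
    h = make_multiply_add_hash(m=1024, rng=random.Random(1))
    print([h(x) for x in range(5)])
```

Drawing a fresh pair `(a, b)` selects a new member of the class; the function itself is deterministic once drawn.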

    Hashing for Similarity Search: A Survey

    Similarity search (nearest neighbor search) is the problem of finding the data items in a large database whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently much effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
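
To make the data-independent category concrete, here is a minimal sketch of one well-known locality sensitive hashing family, random-hyperplane hashing for cosine similarity. The abstract does not single out this family; it, the function names, and the parameters are assumptions chosen for illustration.

```python
import random


def make_hyperplane_hash(dim, num_bits, seed=0):
    """Return a function mapping a vector to a num_bits-long binary code.

    Each bit records on which side of a random hyperplane through the
    origin the vector lies. The hyperplanes are drawn without looking at
    any data, which is what makes the scheme data-independent.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

    def h(vec):
        return tuple(
            1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )

    return h


def hamming(a, b):
    """Number of positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    h = make_hyperplane_hash(dim=3, num_bits=16, seed=7)
    query = [1.0, 2.0, 3.0]
    near = [1.1, 2.0, 2.9]
    far = [-3.0, 0.5, -1.0]
    print("near:", hamming(h(query), h(near)))
    print("far: ", hamming(h(query), h(far)))
```

Vectors separated by a small angle tend to agree on most bits, so Hamming distance between codes serves as a cheap proxy for angular distance during search.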

    Framework and Algorithms for Operator-Managed Content Caching

    We propose a complete framework targeting operator-driven content caching that can be equally applied to both ISP-operated Content Delivery Networks (CDNs) and future Information-Centric Networks (ICNs). In contrast to previous proposals in this area, our solution leverages operators’ control over cache placement and content routing, managing to considerably reduce network operating costs by minimizing the amount of transit traffic and balancing load among available network resources. In addition, our solution provides two key advantages over previous proposals. First, it allows for a simple computation of the optimal cache placement. Second, it provides knobs for operators to fine-tune performance. We validate our design through both analytical modeling and trace-driven simulations and show that our proposed solution achieves on average twice as many cache hits in comparison to previously proposed techniques, without increasing delivery latency. In addition, we show that the proposed framework achieves 19-33% better load balancing across links and caching nodes, and is also robust to traffic spikes.

    Power of d Choices with Simple Tabulation

    Suppose that we are to place $m$ balls into $n$ bins sequentially using the $d$-choice paradigm: For each ball we are given a choice of $d$ bins, according to $d$ hash functions $h_1,\dots,h_d$, and we place the ball in the least loaded of these bins, breaking ties arbitrarily. Our interest is in the number of balls in the fullest bin after all $m$ balls have been placed. Azar et al. [STOC'94] proved that when $m=O(n)$ and when the hash functions are fully random, the maximum load is at most $\frac{\lg \lg n}{\lg d}+O(1)$ whp (i.e. with probability $1-O(n^{-\gamma})$ for any choice of $\gamma$). In this paper we suppose that the $h_1,\dots,h_d$ are simple tabulation hash functions. Generalising a result by Dahlgaard et al. [SODA'16], we show that for an arbitrary constant $d\geq 2$ the maximum load is $O(\lg \lg n)$ whp, and that the expected maximum load is at most $\frac{\lg \lg n}{\lg d}+O(1)$. We further show that by using a simple tie-breaking algorithm introduced by Vöcking [J.ACM'03] the expected maximum load drops to $\frac{\lg \lg n}{d\lg \varphi_d}+O(1)$, where $\varphi_d$ is the rate of growth of the $d$-ary Fibonacci numbers. Both of these expected bounds match those of the fully random setting. The analysis by Dahlgaard et al. relies on a proof by Pătraşcu and Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing. We need here a generalisation to $d>2$ hash functions, but the original proof is an 8-page tour de force of ad-hoc arguments that do not appear to generalise. Our main technical contribution is a shorter, simpler and more accessible proof of the result by Pătraşcu and Thorup, where the relevant parts generalise nicely to the analysis of $d$ choices. Comment: Accepted at ICALP 201
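
The setting of this abstract can be sketched as a small simulation: $d$ simple tabulation hash functions and greedy placement into the least loaded candidate bin. This is an illustration under assumptions, not the paper's construction: the 32-bit keys split into four bytes, the reduction of the hash value modulo the number of bins, and all names are hypothetical, and the tie-breaking shown is the arbitrary one, not Vöcking's scheme.

```python
import random


def make_simple_tabulation(n_bins, rng, key_bytes=4):
    """Build one simple tabulation hash function for key_bytes-byte keys.

    The key is split into bytes; each byte indexes its own table of
    random 32-bit words, and the looked-up words are XORed together.
    Assumption: the result is reduced mod n_bins, which is unbiased
    when n_bins is a power of two.
    """
    tables = [[rng.getrandbits(32) for _ in range(256)] for _ in range(key_bytes)]

    def h(x):
        value = 0
        for i, table in enumerate(tables):
            value ^= table[(x >> (8 * i)) & 0xFF]
        return value % n_bins

    return h


def d_choice_loads(keys, n_bins, d, seed=0):
    """Place each key in the least loaded of its d hashed bins."""
    rng = random.Random(seed)
    hashes = [make_simple_tabulation(n_bins, rng) for _ in range(d)]
    loads = [0] * n_bins
    for x in keys:
        candidates = [h(x) for h in hashes]
        # Arbitrary tie-breaking: the first least loaded candidate wins.
        target = min(candidates, key=lambda b: loads[b])
        loads[target] += 1
    return loads


if __name__ == "__main__":
    n = 1 << 14
    for d in (1, 2, 3):
        print("d =", d, "maximum load:", max(d_choice_loads(range(n), n, d)))
```

Running it with $m = n$ for a few values of $d$ gives a feel for how the maximum load shrinks once each ball has more than one choice.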