
    Characterizing Open Addressing Hash Functions

    Abstract In [1], we showed that different open addressing hash functions perform differently when the data elements are not uniformly distributed, which suggests that some mechanism governs the behavior of the hash functions. In this paper, a simple method of characterizing open addressing hash functions is presented. We show that the data-spreading ability of a hash function does indeed characterize its behavior. We measured and analyzed the spreading speed of a cluster of data elements under different open addressing hash functions. Our experimental results and theoretical analysis show that different hash functions differ in their ability to spread out clustered data elements: a hash function that spreads clustered data elements over the whole table space more uniformly and more quickly performs better when applied to clustered data. Experimental results supporting these claims are presented, followed by theoretical analysis.

    Hash function families

    In this paper we present hash functions in the form of nonlinear dynamical systems, so we first derive dynamical-system expressions for the different hashing families.

    Linear and quadratic hashing

    The family of linear hash functions can be expressed as

        H(k, i) = (h(k) + c_1 i) mod m,    (1)

    where h is an ordinary hash function, m is the table size, and i = 0, 1, ... is the probe number. This technique is known as linear hashing because the argument of the modulus operator is linearly dependent on the probe number. In general, c_1 needs to be chosen so that it is relatively prime to m if all slots in the hash table are to be examined by the probe sequence. In order to construct an equivalent transformation of equation (1), we use the fact that, for a, b, m ∈ ℝ,

        (a + b) mod m = ((a mod m) + b) mod m,

    so we may rewrite equation (1) as the recurrence

        H(k, i+1) = (H(k, i) + c_1) mod m,    H(k, 0) = h(k) mod m.    (2)

    Note that the dependence on k is specified in the initial condition H(k, 0). Quadratic hashing is a simple extension of linear hashing that makes the probe sequence nonlinearly dependent on the probe number. For any ordinary hash function h, the family of quadratic hash functions is given by

        H(k, i) = (h(k) + c_1 i + c_2 i^2) mod m,    (3)

    where c_1 and c_2 are positive constants. Once again, the specific values chosen for the constants are critical to the performance of this method. To obtain a recurrence-relation solution to equation (3), we note that (i+1)^2 = i^2 + 2i + 1, so that

        H(k, i+1) = (H(k, i) + c_1 + c_2 (2i + 1)) mod m,    H(k, 0) = h(k) mod m.    (4)
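    To make the two probe families concrete, here is a minimal sketch (not the authors' code) of the recurrences (2) and (4), together with a crude measure of how quickly a cluster of keys spreads over the table; the table size m and the constants c_1, c_2 are illustrative choices, not values from the paper.

```python
# Minimal sketch: linear and quadratic probing written as the
# recurrences (2) and (4), plus a rough proxy for spreading speed.

def probe_linear(h0, m, c1=1):
    """Yield the linear probe sequence H(k, i+1) = (H(k, i) + c1) mod m."""
    H = h0 % m                      # initial condition H(k, 0)
    while True:
        yield H
        H = (H + c1) % m

def probe_quadratic(h0, m, c1=1, c2=1):
    """Yield the quadratic probe sequence
    H(k, i+1) = (H(k, i) + c1 + c2*(2i + 1)) mod m."""
    H = h0 % m                      # initial condition H(k, 0)
    i = 0
    while True:
        yield H
        H = (H + c1 + c2 * (2 * i + 1)) % m
        i += 1

def spread_after(keys, m, probe, steps):
    """Number of distinct slots reached by a clustered key set after
    `steps` probes each -- a crude proxy for spreading speed."""
    slots = set()
    for k in keys:
        gen = probe(k, m)
        for _ in range(steps):
            slots.add(next(gen))
    return len(slots)

m = 101                      # prime table size (illustrative)
cluster = range(40, 50)      # a cluster of data elements
for name, p in [("linear", probe_linear), ("quadratic", probe_quadratic)]:
    print(name, [spread_after(cluster, m, p, s) for s in (1, 2, 4, 8)])
```

    Running the sketch shows the quadratic sequence reaching more distinct slots than the linear one after the same number of probes, which is the spreading behavior the paper measures.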

    A Taxonomy of Self-configuring Service Discovery Systems

    We analyze the fundamental concepts and issues in service discovery. This analysis places service discovery in the context of distributed systems by describing it as a third-generation naming system. We also describe the essential architectures and functionalities in service discovery. We then proceed to show how service discovery fits into a system by characterizing its operational aspects. Subsequently, we describe how the existing state of the art performs service discovery in relation to these operational aspects and functionalities, and identify areas for improvement.
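    As a concrete illustration of the core register/lookup functionality discussed above, the following toy sketch models a service registry with soft-state leases, a common self-configuration mechanism; all names and parameters are hypothetical and not drawn from any particular system in the taxonomy.

```python
# Hypothetical sketch of the core service-discovery operations
# (register / lookup); the lease mechanism is illustrative only.
import time

class ServiceRegistry:
    def __init__(self, lease_seconds=30.0):
        self.lease = lease_seconds
        self._services = {}   # service type -> {instance: lease expiry}

    def register(self, service_type, instance):
        """Announce an instance; it must re-register before its lease
        expires (soft state keeps the registry self-configuring)."""
        self._services.setdefault(service_type, {})[instance] = (
            time.monotonic() + self.lease)

    def lookup(self, service_type):
        """Return live instances of a service type, dropping expired leases."""
        now = time.monotonic()
        live = {i: t for i, t in self._services.get(service_type, {}).items()
                if t > now}
        self._services[service_type] = live
        return list(live)

registry = ServiceRegistry(lease_seconds=5.0)
registry.register("printer", "ipp://10.0.0.7:631")
print(registry.lookup("printer"))   # -> ['ipp://10.0.0.7:631']
```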

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory; we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.

    Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2
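    The following sketch illustrates the context-based remapping idea (not the paper's actual trie layout): each word following a context is re-encoded as its rank among that context's distinct successors, so the integers to be compressed come from a much smaller universe than vocabulary-wide identifiers.

```python
# Minimal sketch of context-based remapping: since a context typically
# has few distinct successor words, storing each word's rank within its
# context's successor list yields much smaller integers to compress.
from collections import defaultdict

def build_successor_maps(ngrams):
    """Map each context (tuple of k preceding words) to a sorted list
    of its distinct successor words."""
    succ = defaultdict(set)
    for *context, word in ngrams:
        succ[tuple(context)].add(word)
    return {c: sorted(ws) for c, ws in succ.items()}

def encode(ngram, succ):
    """Encode the last word of the n-gram as its rank within the
    context's successor list -- a small integer instead of a
    vocabulary-wide word id."""
    *context, word = ngram
    return succ[tuple(context)].index(word)

ngrams = [("the", "cat", "sat"), ("the", "cat", "ran"), ("a", "dog", "ran")]
succ = build_successor_maps(ngrams)
print(encode(("the", "cat", "sat"), succ))  # rank 1 of ['ran', 'sat']
```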