
    Characterizing Open Addressing Hash Functions

    Abstract In [1], we showed that different open addressing hash functions perform differently when the data elements are not uniformly distributed, which suggests that some mechanism governs the behavior of the hash functions. In this paper, a simple method of characterizing open addressing hash functions is presented. We show that the data-spreading ability of a hash function does indeed characterize its behavior. We measured and analyzed the spreading speed of a cluster of data elements under different open addressing hash functions. Our experimental results and theoretical analysis show that different hash functions differ in their ability to spread out clustered data elements: a hash function that spreads clustered data elements over the whole table space more uniformly and more quickly performs better when applied to clustered data. Experimental results supporting these claims are presented, followed by theoretical analysis.

    Hash function families

    In this paper we present hash functions in the form of nonlinear dynamical systems, so we first derive dynamical-system expressions for the different hashing families.

    Linear and quadratic hashing

    The family of linear hash functions can be expressed as

        H(k, i) = (h(k) + c_1 i) mod m,    (1)

    where h is an ordinary hash function, m is the table size, and i = 0, 1, ... is the probe number. This technique is known as linear hashing because the argument of the modulus operator is linearly dependent on the probe number. In general, c_1 needs to be chosen so that it is relatively prime to m if all slots in the hash table are to be examined by the probe sequence. In order to construct an equivalent transformation of equation (1), we use the fact that, for a, b, m ∈ ℝ,

        (a + b) mod m = ((a mod m) + b) mod m,

    so we may rewrite equation (1) as the recurrence

        H(k, i+1) = (H(k, i) + c_1) mod m,    H(k, 0) = h(k) mod m.    (2)

    Note that the dependence on k is specified in the initial condition H(k, 0). Quadratic hashing is a simple extension of linear hashing that makes the probe sequence nonlinearly dependent on the probe number. For any ordinary hash function h, the family of quadratic hash functions is given by

        H(k, i) = (h(k) + c_1 i + c_2 i^2) mod m,    (3)

    where c_1 and c_2 are positive constants. Once again, the specific values chosen for the constants are critical to the performance of this method. To obtain a recurrence-relation solution to equation (3), we note that (i+1)^2 = i^2 + 2i + 1, so that

        H(k, i+1) = (H(k, i) + c_1 + c_2 (2i + 1)) mod m,    H(k, 0) = h(k) mod m.    (4)
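    To make the two probe families concrete, here is a minimal sketch (not the authors' code) of the recurrences (2) and (4), together with a crude measure of how quickly a cluster of keys spreads over the table; the table size m and the constants c_1, c_2 are illustrative choices, not values from the paper.

```python
# Minimal sketch: linear and quadratic probing written as the
# recurrences (2) and (4), plus a rough proxy for spreading speed.

def probe_linear(h0, m, c1=1):
    """Yield the linear probe sequence H(k, i+1) = (H(k, i) + c1) mod m."""
    H = h0 % m                      # initial condition H(k, 0)
    while True:
        yield H
        H = (H + c1) % m

def probe_quadratic(h0, m, c1=1, c2=1):
    """Yield the quadratic probe sequence
    H(k, i+1) = (H(k, i) + c1 + c2*(2i + 1)) mod m."""
    H = h0 % m                      # initial condition H(k, 0)
    i = 0
    while True:
        yield H
        H = (H + c1 + c2 * (2 * i + 1)) % m
        i += 1

def spread_after(keys, m, probe, steps):
    """Number of distinct slots reached by a clustered key set after
    `steps` probes each -- a crude proxy for spreading speed."""
    slots = set()
    for k in keys:
        gen = probe(k, m)
        for _ in range(steps):
            slots.add(next(gen))
    return len(slots)

m = 101                      # prime table size (illustrative)
cluster = range(40, 50)      # a cluster of data elements
for name, p in [("linear", probe_linear), ("quadratic", probe_quadratic)]:
    print(name, [spread_after(cluster, m, p, s) for s in (1, 2, 4, 8)])
```

    Running the sketch shows the quadratic sequence reaching more distinct slots than the linear one after the same number of probes, which is the spreading behavior the paper measures.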

    A Taxonomy of Self-configuring Service Discovery Systems

    We analyze the fundamental concepts and issues in service discovery. This analysis places service discovery in the context of distributed systems by describing it as a third-generation naming system. We also describe the essential architectures and functionalities in service discovery. We then proceed to show how service discovery fits into a system by characterizing its operational aspects. Subsequently, we describe how the existing state of the art performs service discovery in relation to these operational aspects and functionalities, and identify areas for improvement.
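    As a concrete illustration of the core register/lookup functionality discussed above, the following toy sketch models a service registry with soft-state leases, a common self-configuration mechanism; all names and parameters are hypothetical and not drawn from any particular system in the taxonomy.

```python
# Hypothetical sketch of the core service-discovery operations
# (register / lookup); the lease mechanism is illustrative only.
import time

class ServiceRegistry:
    def __init__(self, lease_seconds=30.0):
        self.lease = lease_seconds
        self._services = {}   # service type -> {instance: lease expiry}

    def register(self, service_type, instance):
        """Announce an instance; it must re-register before its lease
        expires (soft state keeps the registry self-configuring)."""
        self._services.setdefault(service_type, {})[instance] = (
            time.monotonic() + self.lease)

    def lookup(self, service_type):
        """Return live instances of a service type, dropping expired leases."""
        now = time.monotonic()
        live = {i: t for i, t in self._services.get(service_type, {}).items()
                if t > now}
        self._services[service_type] = live
        return list(live)

registry = ServiceRegistry(lease_seconds=5.0)
registry.register("printer", "ipp://10.0.0.7:631")
print(registry.lookup("printer"))   # -> ['ipp://10.0.0.7:631']
```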

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory; we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.

    Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2
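    The following sketch illustrates the context-based remapping idea (not the paper's actual trie layout): each word following a context is re-encoded as its rank among that context's distinct successors, so the integers to be compressed come from a much smaller universe than vocabulary-wide identifiers.

```python
# Minimal sketch of context-based remapping: since a context typically
# has few distinct successor words, storing each word's rank within its
# context's successor list yields much smaller integers to compress.
from collections import defaultdict

def build_successor_maps(ngrams):
    """Map each context (tuple of k preceding words) to a sorted list
    of its distinct successor words."""
    succ = defaultdict(set)
    for *context, word in ngrams:
        succ[tuple(context)].add(word)
    return {c: sorted(ws) for c, ws in succ.items()}

def encode(ngram, succ):
    """Encode the last word of the n-gram as its rank within the
    context's successor list -- a small integer instead of a
    vocabulary-wide word id."""
    *context, word = ngram
    return succ[tuple(context)].index(word)

ngrams = [("the", "cat", "sat"), ("the", "cat", "ran"), ("a", "dog", "ran")]
succ = build_successor_maps(ngrams)
print(encode(("the", "cat", "sat"), succ))  # rank 1 of ['ran', 'sat']
```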