12,399 research outputs found

    GPUs as Storage System Accelerators

    Full text link
    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications.Comment: IEEE Transactions on Parallel and Distributed Systems, 201

    Strongly universal string hashing is fast

    Get PDF
    We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families.Comment: Software is available at http://code.google.com/p/variablelengthstringhashing/ and https://github.com/lemire/StronglyUniversalStringHashin

    Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

    Full text link
    The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution H\mathcal{H} over locality-sensitive hash functions that partition space. For a collection of nn points, after preprocessing, the query time is dominated by O(nρlogn)O(n^{\rho} \log n) evaluations of hash functions from H\mathcal{H} and O(nρ)O(n^{\rho}) hash table lookups and distance computations where ρ(0,1)\rho \in (0,1) is determined by the locality-sensitivity properties of H\mathcal{H}. It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to O(log2n)O(\log^2 n), leaving the query time to be dominated by O(nρ)O(n^{\rho}) distance computations and O(nρlogn)O(n^{\rho} \log n) additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely match the Indyk-Motwani framework, making it a viable replacement in practice. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of additional word-RAM operations to O(nρ)O(n^\rho).Comment: 15 pages, 3 figure

    Wear Minimization for Cuckoo Hashing: How Not to Throw a Lot of Eggs into One Basket

    Full text link
    We study wear-leveling techniques for cuckoo hashing, showing that it is possible to achieve a memory wear bound of loglogn+O(1)\log\log n+O(1) after the insertion of nn items into a table of size CnCn for a suitable constant CC using cuckoo hashing. Moreover, we study our cuckoo hashing method empirically, showing that it significantly improves on the memory wear performance for classic cuckoo hashing and linear probing in practice.Comment: 13 pages, 1 table, 7 figures; to appear at the 13th Symposium on Experimental Algorithms (SEA 2014

    Scalable and Sustainable Deep Learning via Randomized Hashing

    Full text link
    Current deep learning architectures are growing larger in order to learn from complex datasets. These architectures require giant matrix multiplication operations to train millions of parameters. Conversely, there is another growing trend to bring deep learning to low-power, embedded devices. The matrix operations, associated with both training and testing of deep networks, are very expensive from a computational and energy standpoint. We present a novel hashing based technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select the nodes with the highest activation efficiently. Our new algorithm for deep learning reduces the overall computational cost of forward and back-propagation by operating on significantly fewer (sparse) nodes. As a consequence, our algorithm uses only 5% of the total multiplications, while keeping on average within 1% of the accuracy of the original model. A unique property of the proposed hashing based back-propagation is that the updates are always sparse. Due to the sparse gradient updates, our algorithm is ideally suited for asynchronous and parallel training leading to near linear speedup with increasing number of cores. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations on several real datasets
    corecore