
    Analysing the Performance of GPU Hash Tables for State Space Exploration

    In the past few years, General Purpose Graphics Processors (GPUs) have been used to significantly speed up numerous applications. One of the areas in which GPUs have recently led to a significant speed-up is model checking. In model checking, state spaces, i.e., large directed graphs, are explored to verify whether models satisfy desirable properties. GPUexplore is a GPU-based model checker that uses a hash table to efficiently keep track of already explored states. As a large number of states are discovered and stored during such an exploration, the hash table should be able to quickly handle many inserts and queries concurrently. In this paper, we experimentally compare two different hash tables optimised for the GPU, one being the GPUexplore hash table, and the other using Cuckoo hashing. We compare the performance of both hash tables using random and non-random data obtained from model checking experiments, to analyse the applicability of the two hash tables for state space exploration. We conclude that Cuckoo hashing is three times faster than GPUexplore hashing for random data, and that Cuckoo hashing is five to nine times faster for non-random data. This suggests great potential to further speed up GPUexplore in the near future. Comment: In Proceedings GaM 2017, arXiv:1712.0834

    09491 Abstracts Collection -- Graph Search Engineering

    From the 29th November to the 4th December 2009, the Dagstuhl Seminar 09491 "Graph Search Engineering" was held in Schloss Dagstuhl - Leibniz Center for Informatics. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.

    Large scale parallel state space search utilizing graphics processing units and solid state disks

    The evolution of science is a double-track process composed of theoretical insights on the one hand and practical inventions on the other. While in most cases new theoretical insights motivate hardware developers to produce systems that follow the theory, in some cases the hardware solutions presented force theoretical research to forecast the results to expect. Progress in computer science relies on two aspects: processing information and storing it. Improving one side without touching the other will evidently impose new problems without producing a real alternative solution to the problem. While decreasing the time to solve a challenge may provide a solution to long-term problems, it will fail to solve problems which require much storage. In contrast, increasing the available amount of space for information storage will definitely allow harder problems to be solved, given enough time. This work studies two recent hardware developments in order to utilize them in the domain of graph searching. It analyzes the trend to discontinue information storage on magnetic disks and use electronic media instead, and the tendency to parallelize computation to speed up information processing. Storing information on rotating magnetic disks has been the standard for some years and has reached a point where the storage capacity can be seen as infinite, because new drives can be added instantly at low cost. However, while the possible storage capacity increases every year, the transfer speed does not. At the beginning of this work, solid state media appeared on the market, slowly displacing hard disks in speed-demanding applications. Today, at the completion of this work, solid state drives are replacing magnetic disks in mobile computing, and computing centers use them as caching media to increase information retrieval speed.
The reason is the huge advantage in random access, where the speed does not drop as significantly as with magnetic drives. While storing and retrieving huge amounts of information is one side of the coin, the other is processing speed. Here the trend of increasing the clock frequency of single processors stagnated in 2006, and manufacturers started to combine multiple cores in one processor. While a CPU is a general-purpose processor, the manufacturers of graphics processing units (GPUs) face the challenge of performing the same computation for a large number of image points. Here, parallelization offers huge advantages, so modern graphics cards have evolved into highly parallel computing devices with several hundred cores. The challenge is to utilize these processors in domains other than graphics processing. One of the most widely used tasks in computer science is search. Searching a graph is the crucial aspect not only in disciplines with an obvious search component but also in software testing. Strategies that make it possible to examine larger graphs, be it by reducing the number of considered nodes or by increasing the search speed, have to be developed to meet the rising challenges. This work enhances search in multiple scientific domains, such as explicit-state Model Checking, Action Planning, Game Solving and Probabilistic Model Checking, by proposing strategies to find solutions for the search problems. Providing a universal search strategy that can be used in all environments to utilize solid state media and graphics processing units is not possible due to the heterogeneous aspects of the domains. Thus, this work presents a tool kit of strategies tied together in a universal three-stage strategy. In the first stage, the edges leaving a node are determined; in the second stage, the algorithm follows the edges to generate nodes. The duplicate detection in stage three compares all newly generated nodes to existing ones and avoids multiple expansions.
For each stage at least two strategies are proposed, and decision hints are given to simplify the selection of the proper strategy. After describing the strategies, the kit is evaluated in four domains, explaining the choice of strategy, evaluating its outcome and giving pointers for future work on the topic.
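
The three-stage scheme described in this abstract can be illustrated with a minimal, purely sequential sketch. It is not the thesis's GPU or SSD implementation; the function names `three_stage_bfs`, `enabled_edges` and `apply_edge`, as well as the toy state space, are hypothetical and chosen only for illustration.

```python
def three_stage_bfs(start, enabled_edges, apply_edge):
    """Layered breadth-first search split into the three stages named above.

    enabled_edges(node) -> iterable of edge labels leaving the node (stage 1)
    apply_edge(node, edge) -> successor node (stage 2)
    Both callbacks are assumptions of this sketch.
    """
    visited = {start}
    frontier = [start]
    layers = [[start]]
    while frontier:
        # Stage 1: determine the edges leaving every node in the frontier.
        edges = [(node, e) for node in frontier for e in enabled_edges(node)]
        # Stage 2: follow the edges to generate successor nodes.
        successors = [apply_edge(node, e) for node, e in edges]
        # Stage 3: duplicate detection against all previously seen nodes.
        frontier = []
        for s in successors:
            if s not in visited:
                visited.add(s)
                frontier.append(s)
        if frontier:
            layers.append(frontier)
    return layers


# Toy state space (hypothetical): integers up to LIMIT, edges "inc" and "dbl".
LIMIT = 10

def toy_edges(n):
    return [op for op, nxt in (("inc", n + 1), ("dbl", 2 * n)) if nxt <= LIMIT]

def toy_apply(n, op):
    return n + 1 if op == "inc" else 2 * n

layers = three_stage_bfs(1, toy_edges, toy_apply)
print(layers)
```

Keeping the stages as separate passes over a whole layer, rather than interleaving them per node, is what makes each stage a candidate for bulk processing; the sketch only shows the data flow, not any parallel execution.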

    ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force

    A minimal perfect hash function (MPHF) maps a set $S$ of $n$ keys to the first $n$ integers without collisions. There is a lower bound of $n\log_2 e - O(\log n)$ bits of space needed to represent an MPHF. A matching upper bound is obtained using the brute-force algorithm that tries random hash functions until stumbling on an MPHF and stores that function's seed. In expectation, $e^n\,\textrm{poly}(n)$ seeds need to be tested. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash tables. ShockHash uses two hash functions $h_0$ and $h_1$, hoping for the existence of a function $f : S \rightarrow \{0,1\}$ such that $x \mapsto h_{f(x)}(x)$ is an MPHF on $S$. In graph terminology, ShockHash generates $n$-edge random graphs until stumbling on a pseudoforest - a graph where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval data structure to store $f$ using $n + o(n)$ bits. By carefully analyzing the probability that a random graph is a pseudoforest, we show that ShockHash needs to try only $(e/2)^n\,\textrm{poly}(n)$ hash function seeds in expectation, reducing the space for storing the seed by roughly $n$ bits. This makes ShockHash almost a factor $2^n$ faster than brute-force, while maintaining the asymptotically optimal space consumption. An implementation within the RecSplit framework yields the currently most space efficient MPHFs, i.e., competing approaches need about two orders of magnitude more work to achieve the same space.
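
The seed search described in this abstract can be sketched for a very small key set as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`_h`, `try_seed`, `build`, `mphf`) are hypothetical, BLAKE2b stands in for the hash functions $h_0$ and $h_1$, and a plain dictionary stands in for the 1-bit retrieval data structure that stores $f$.

```python
import hashlib


def _h(key, seed, which, n):
    """Hypothetical stand-in for h_0 / h_1: hash (seed, which, key) into [0, n)."""
    data = f"{seed}:{which}:{key}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % n


def try_seed(keys, seed):
    """Try to place all n keys into n slots by cuckoo insertion.

    Returns {key: bit} (the function f) on success, or None if some
    insertion does not finish within the eviction bound.
    """
    n = len(keys)
    slot_owner = {}  # slot -> key currently stored there
    choice = {}      # key -> 0 or 1, i.e. which hash function it uses
    for key in keys:
        cur, bit = key, 0
        for _ in range(2 * n + 2):  # bounded eviction walk
            slot = _h(cur, seed, bit, n)
            choice[cur] = bit
            evicted = slot_owner.get(slot)
            slot_owner[slot] = cur
            if evicted is None:
                break
            # The evicted key moves to the slot given by its other hash function.
            cur = evicted
            bit = 1 - choice[cur]
        else:
            return None  # walk did not terminate: reject this seed
    return choice


def build(keys, max_seeds=1_000_000):
    """Try seeds 0, 1, 2, ... until one admits a collision-free placement."""
    for seed in range(max_seeds):
        f = try_seed(keys, seed)
        if f is not None:
            return seed, f
    raise RuntimeError("no suitable seed found")


def mphf(key, seed, f, n):
    """Evaluate x -> h_{f(x)}(x)."""
    return _h(key, seed, f[key], n)


keys = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
seed, f = build(keys)
positions = sorted(mphf(k, seed, f, len(keys)) for k in keys)
print(seed, positions)
```

A successful run places every key in its own slot, so the evaluated positions are exactly the integers $0, \dots, n-1$. The sketch is only practical for small $n$, since the expected number of seeds grows as $(e/2)^n\,\textrm{poly}(n)$.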

    A low-power, high-performance speech recognition accelerator

    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that tightly power-budgeted mobile devices cannot afford. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, with negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that reduces off-chip memory accesses by 20 percent. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing the energy by 123-454x. Peer Reviewed. Postprint (author's final draft)

    Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs

    Pilz S, Porrmann F, Kaiser M, Hagemeyer J, Hogan JM, Rückert U. Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. Algorithms. 2020;13(2): 47. This paper is concerned with Field Programmable Gate Array (FPGA)-based systems for energy-efficient high-throughput string comparison. Modern applications which involve comparisons across large data sets, such as large sequence sets in molecular biology, are by their nature computationally intensive. In this work, we present a scalable FPGA-based system architecture to accelerate the comparison of binary strings. The current architecture supports arbitrary lengths in the range of 16 to 2048 bits, covering a wide range of possible applications. In our example application, we consider DNA sequences embedded in a binary vector space through Locality Sensitive Hashing (LSH), one of several possible encodings that enable us to avoid more costly character-based operations. Here the resulting encoding is a 512-bit binary signature with comparisons based on the Hamming distance. In this approach, most of the load arises from the calculation of the O(m * n) Hamming distances between the signatures, where m is the number of queries and n is the number of signatures contained in the database. Signature generation only needs to be performed once, and we do not consider it further, focusing instead on accelerating the signature comparisons. The proposed FPGA-based architecture is optimized for high throughput using hundreds of computing elements arranged in a systolic array. These core computing elements can be adapted to support other string comparison algorithms with little effort, while the other infrastructure stays the same. On a Xilinx Virtex UltraScale+ FPGA (XCVU9P-2), a peak throughput of 75.4 billion comparisons per second, of 512-bit signatures, was achieved using a design with 384 parallel processing elements and a clock frequency of 200 MHz.
This makes our FPGA design 86 times faster than a highly optimized CPU implementation. Compared to a GPU design, executed on an NVIDIA GTX1060, it performs nearly five times faster.
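
As a software illustration of the comparison workload described in this abstract, the following sketch computes all m * n Hamming distances between binary signatures held as Python integers. It says nothing about the FPGA design itself; the function names and the throughput helper are hypothetical, and the helper assumes one comparison per processing element per clock cycle, which the abstract does not state.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit strings stored as integers."""
    return bin(a ^ b).count("1")


def all_distances(queries, database):
    """All m * n distances: one row per query, one column per database entry."""
    return [[hamming(q, s) for s in database] for q in queries]


def peak_comparisons_per_second(elements: int, clock_hz: int) -> int:
    """Upper bound ASSUMING one comparison per element per clock cycle."""
    return elements * clock_hz


dist = all_distances([0b000, 0b011], [0b001, 0b011, 0b111])
bound = peak_comparisons_per_second(384, 200_000_000)
print(dist, bound)
```

Under that assumption, 384 elements at 200 MHz give an upper bound of 76.8 billion comparisons per second, which is in line with the reported peak of 75.4 billion. The same `hamming` function works unchanged for 512-bit signatures, since Python integers have arbitrary width.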

    Optimizing group-by and aggregation using GPU-CPU co-processing

    While GPU query processing is a well-studied area, real adoption is limited in practice, as typically GPU execution is only significantly faster than CPU execution if the data resides in GPU memory, which limits scalability to small data scenarios where performance tends to be less critical. Another problem is that not all query code (e.g. UDFs) will realistically be able to run on GPUs. We therefore investigate CPU-GPU co-processing, where both the CPU and GPU are involved in evaluating the query in scenarios where the data does not fit in the GPU memory. As we wish to deeply explore opportunities for optimizing execution speed, we narrow our focus further to a specific well-studied OLAP scenario, amenable to such co-processing, in the form of the TPC-H benchmark Query 1. For this query, and at large scale factors, we are able to improve performance significantly over the state-of-the-art for GPU implementations; we present competitive performance of a GPU versus a state-of-the-art multi-core CPU baseline, a novelty for data exceeding GPU memory size; and finally, we show that co-processing does provide significant additional speedup over any of the processors individually. We achieve this performance improvement by utilizing parallelism-friendly compression to alleviate the PCIe transfer bottleneck, query-compilation-like fusion of the processing operations, and a simple yet effective scheduling mechanism. We hope that some of these features can inspire future work on GPU-focused and heterogeneous analytic DBMSes.