793 research outputs found
Analysing the Performance of GPU Hash Tables for State Space Exploration
In the past few years, General Purpose Graphics Processors (GPUs) have been
used to significantly speed up numerous applications. One of the areas in which
GPUs have recently led to a significant speed-up is model checking. In model
checking, state spaces, i.e., large directed graphs, are explored to verify
whether models satisfy desirable properties. GPUexplore is a GPU-based model
checker that uses a hash table to efficiently keep track of already explored
states. As a large number of states is discovered and stored during such an
exploration, the hash table should be able to quickly handle many inserts and
queries concurrently. In this paper, we experimentally compare two different
hash tables optimised for the GPU, one being the GPUexplore hash table, and the
other using Cuckoo hashing. We compare the performance of both hash tables
using random and non-random data obtained from model checking experiments, to
analyse the applicability of the two hash tables for state space exploration.
We conclude that Cuckoo hashing is three times faster than GPUexplore hashing
for random data, and that Cuckoo hashing is five to nine times faster for
non-random data. This suggests great potential to further speed up GPUexplore
in the near future.Comment: In Proceedings GaM 2017, arXiv:1712.0834
09491 Abstracts Collection -- Graph Search Engineering
From the 29th November to the 4th December 2009, the Dagstuhl Seminar
09491 ``Graph Search Engineering \u27\u27 was held
in Schloss Dagstuhl~--~Leibniz Center for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available
Large scale parallel state space search utilizing graphics processing units and solid state disks
The evolution of science is a double-track process composed of theoretical insights on
the one hand and practical inventions on the other one. While in most cases new theoretical
insights motivate hardware developers to produce systems following the theory,
in some cases the shown hardware solutions force theoretical research to forecast the
results to expect.
Progress in computer science rely on two aspects, processing information and storing
it. Improving one side without touching the other will evidently impose new problems
without producing a real alternative solution to the problem. While decreasing
the time to solve a challenge may provide a solution to long term problems it will fail
in solving problems which require much storage. In contrast, increasing the available
amount of space for information storage will definitively allow harder problems to be
solved by offering enough time.
This work studies two recent developments in the hardware to utilize them in the
domain of graph searching. The trend to discontinue information storage on magnetic
disks and use electronic media instead and the tendency to parallelize the computation
to speed up information processing are analyzed.
Storing information on rotating magnetic disk has become the standard way since
a couple of years and has reached a point where the storage capacity can be seen as
infinite due to the possibility of adding new drives instantly with low costs. However,
while the possible storage capacity increases every year, the transferring speed does
not. At the beginning of this work, solid state media appeared on the market, slowly
suppressing hard disks in speed demanding applications. Today, when finishing this
work solid state drives are replacing magnetic disks in mobile computing, and computing
centers use them as caching media to increase information retrieving speed.
The reason is the huge advantage in random access where the speed does not drop so
significantly as with magnetic drives.
While storing and retrieving huge amounts of information is one side of the medal,
the other one is the processing speed. Here the trend from increasing the clock frequency
of single processors stagnated in 2006 and the manufacturers started to combine
multiple cores in one processor. While a CPU is a general purpose processor the
manufacturers of graphics processing units (GPUs) encounter the challenge to perform
the same computation for a large number of image points. Here, a parallelization offers
huge advantages, so modern graphics cards have evolved to highly parallel computing
instances with several hundreds of cores. The challenge is to utilize these processors
in other domains than graphics processing.
One of the vastly used tasks in computer science is search. Not only disciplines with
an obvious search but also in software testing searching a graph is the crucial aspect.
Strategies which enable to examine larger graphs, be it by reducing the number of
considered nodes or by increasing the searching speed, have to be developed to battle
the rising challenges. This work enhances searching in multiple scientific domains
like explicit state Model Checking, Action Planning, Game Solving and Probabilistic
Model Checking proposing strategies to find solutions for the search problems.
Providing an universal search strategy which can be used in all environments to
utilize solid state media and graphics processing units is not possible due to the
heterogeneous aspects of the domains. Thus, this work presents a tool kit of strategies tied
together in an universal three stage strategy. In the first stage the edges leaving a node
are determined, in the second stage the algorithm follows the edges to generate nodes.
The duplicate detection in stage three compares all newly generated nodes to existing
once and avoids multiple expansions.
For each stage at least two strategies are proposed and decision hints are given to
simplify the selection of the proper strategy. After describing the strategies the kit is
evaluated in four domains explaining the choice for the strategy, evaluating its outcome
and giving future clues on the topic
ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force
A minimal perfect hash function (MPHF) maps a set of keys to the
first integers without collisions. There is a lower bound of
bits of space needed to represent an MPHF. A matching
upper bound is obtained using the brute-force algorithm that tries random hash
functions until stumbling on an MPHF and stores that function's seed. In
expectation, seeds need to be tested. The most
space-efficient previous algorithms for constructing MPHFs all use such a
brute-force approach as a basic building block.
In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash
tables. ShockHash uses two hash functions and , hoping for the
existence of a function such that is an MPHF on . In graph terminology, ShockHash generates
-edge random graphs until stumbling on a pseudoforest - a graph where each
component contains as many edges as nodes. Using cuckoo hashing, ShockHash then
derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval
data structure to store using bits.
By carefully analyzing the probability that a random graph is a pseudoforest,
we show that ShockHash needs to try only hash
function seeds in expectation, reducing the space for storing the seed by
roughly bits. This makes ShockHash almost a factor faster than
brute-force, while maintaining the asymptotically optimal space consumption. An
implementation within the RecSplit framework yields the currently most space
efficient MPHFs, i.e., competing approaches need about two orders of magnitude
more work to achieve the same space
A low-power, high-performance speech recognition accelerator
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at high energy cost, not being affordable for the tiny power-budgeted mobile devices. Hardware acceleration reduces energy-consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for largevocabulary, speaker-independent, continuous speech-recognition. It focuses on the Viterbi search algorithm representing the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators' design. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, negligibly impacting area. Additionally, we introduce a novel bandwidth-saving technique that removes off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerators scratchpad memories, providing between 8.5 and 29.2 percent reduction in entire power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on Geforce-GTX-980 GPU, while reducing the energy by 123-454x.Peer ReviewedPostprint (author's final draft
Platform Dependent Verification: On Engineering Verification Tools for 21st Century
The paper overviews recent developments in platform-dependent explicit-state
LTL model checking.Comment: In Proceedings PDMC 2011, arXiv:1111.006
Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs
Pilz S, Porrmann F, Kaiser M, Hagemeyer J, Hogan JM, Rückert U. Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. Algorithms. 2020;13(2): 47.This paper is concerned with Field Programmable Gate Arrays (FPGA)-based systems for energy-efficient high-throughput string comparison. Modern applications which involve comparisons across large data sets—such as large sequence sets in molecular biology—are by their nature computationally intensive. In this work, we present a scalable FPGA-based system architecture to accelerate the comparison of binary strings. The current architecture supports arbitrary lengths in the range 16 to 2048-bit, covering a wide range of possible applications. In our example application, we consider DNA sequences embedded in a binary vector space through Locality Sensitive Hashing (LSH) one of several possible encodings that enable us to avoid more costly character-based operations. Here the resulting encoding is a 512-bit binary signature with comparisons based on the Hamming distance. In this approach, most of the load arises from the calculation of the O ( m ∗ n ) Hamming distances between the signatures, where m is the number of queries and n is the number of signatures contained in the database. Signature generation only needs to be performed once, and we do not consider it further, focusing instead on accelerating the signature comparisons. The proposed FPGA-based architecture is optimized for high-throughput using hundreds of computing elements, arranged in a systolic array. These core computing elements can be adapted to support other string comparison algorithms with little effort, while the other infrastructure stays the same. On a Xilinx Virtex UltraScale+ FPGA (XCVU9P-2), a peak throughput of 75.4 billion comparisons per second—of 512-bit signatures—was achieved, using a design with 384 parallel processing elements and a clock frequency of 200 MHz. This makes our FPGA design 86 times faster than a highly optimized CPU implementation. Compared to a GPU design, executed on an NVIDIA GTX1060, it performs nearly five times faster
Optimizing group-by and aggregation using GPU-CPU co-processing
While GPU query processing is a well-studied area, real adoption is limited in practice as typically GPU execution is only significantly faster than CPU execution if the data resides in GPU memory, which limits scalability to small data scenarios where performance tends to be less critical. Another problem is that not all query code (e.g. UDFs) will realistically be able to run on GPUs. We therefore investigate CPU-GPU co-processing, where both the CPU and GPU are involved in evaluating the query in scenarios where the data does not fit in the GPU memory.As we wish to deeply explore opportunities for optimizing execution speed, we narrow our focus further to a specific well-studied OLAP scenario, amenable to such co-processing, in the form of the TPC-H benchmark Query 1.For this query, and at large scale factors, we are able to improve performance significantly over the state-of-the-art for GPU implementations; we present competitive performance of a GPU versus a state-of-the-art multi-core CPU baseline a novelty for data exceeding GPU memory size; and finally, we show that co-processing does provide significant additional speedup over any of the processors individually.We achieve this performance improvement by utilizing parallelism-friendly compression to alleviate the PCIe transfer bottleneck, query-compilation-like fusion of the processing operations, and a simple yet effective scheduling mechanism. We hope that some of these features can inspire future work on GPU-focused and heterogeneous analytic DBMSes.</p
- …