3,041 research outputs found
Analysis of KECCAK Tree Hashing on GPU Architectures
In an effort to provide security and data integrity, hashing algorithms have been designed to consume an input of any length to produce a fixed length output. KECCAK was selected by NIST to become the next Secure Hashing Algorithm SHA-3) after nearly five years of competition. In addition to providing a sequential operating mode, there is also a tree mode that allows large input messages to be hashed in parallel.
This thesis focuses on the exploration and analysis of the KECCAK tree hashing mode on a GPU platform. Based on the implementation, there are core features of the GPU that could be used to accelerate the time it takes to complete a hash due to the massively parallel architecture of the device. In addition to analyzing the speed of the algorithm, the underlying hardware is profiled to identify the bottlenecks that limited the speed.
The results of this work show that tree hashing can hash data at rates of up to 3 GB/s for the fixed size tree mode. On a 3.40 GHz CPU, this is the equivalent of 1.03 cycles per byte, more than six times faster than a sequential implementation for a very large input. For the variable size tree mode, the throughput was 500 MB/s. Based on the performance analysis, modification of the input rate of the KECCAK sponge resulted in a negligible change to the overall speed. As a result of the hardware profiling, the register and L1 cache usage in the GPU was a major bottleneck to the overall throughput. In a simulated GPU environment, it was shown that increasing the L1 cache by 25 percent could increase the throughput by up to 30 percent for a small tree and 15 percent for a tree that will achieve the greatest throughput on a real GPU. When this modification is combined with an increase of the L2 cache, performance can be improved by up to 20 percent
Analysing the Performance of GPU Hash Tables for State Space Exploration
In the past few years, General Purpose Graphics Processors (GPUs) have been
used to significantly speed up numerous applications. One of the areas in which
GPUs have recently led to a significant speed-up is model checking. In model
checking, state spaces, i.e., large directed graphs, are explored to verify
whether models satisfy desirable properties. GPUexplore is a GPU-based model
checker that uses a hash table to efficiently keep track of already explored
states. As a large number of states is discovered and stored during such an
exploration, the hash table should be able to quickly handle many inserts and
queries concurrently. In this paper, we experimentally compare two different
hash tables optimised for the GPU, one being the GPUexplore hash table, and the
other using Cuckoo hashing. We compare the performance of both hash tables
using random and non-random data obtained from model checking experiments, to
analyse the applicability of the two hash tables for state space exploration.
We conclude that Cuckoo hashing is three times faster than GPUexplore hashing
for random data, and that Cuckoo hashing is five to nine times faster for
non-random data. This suggests great potential to further speed up GPUexplore
in the near future.Comment: In Proceedings GaM 2017, arXiv:1712.0834
GPUs as Storage System Accelerators
Massively multicore processors, such as Graphics Processing Units (GPUs),
provide, at a comparable price, a one order of magnitude higher peak
performance than traditional CPUs. This drop in the cost of computation, as any
order-of-magnitude drop in the cost per unit of performance for a class of
system components, triggers the opportunity to redesign systems and to explore
new ways to engineer them to recalibrate the cost-to-performance relation. This
project explores the feasibility of harnessing GPUs' computational power to
improve the performance, reliability, or security of distributed storage
systems. In this context, we present the design of a storage system prototype
that uses GPU offloading to accelerate a number of computationally intensive
primitives based on hashing, and introduce techniques to efficiently leverage
the processing power of GPUs. We evaluate the performance of this prototype
under two configurations: as a content addressable storage system that
facilitates online similarity detection between successive versions of the same
file and as a traditional system that uses hashing to preserve data integrity.
Further, we evaluate the impact of offloading to the GPU on competing
applications' performance. Our results show that this technique can bring
tangible performance gains without negatively impacting the performance of
concurrently running applications.Comment: IEEE Transactions on Parallel and Distributed Systems, 201
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging a LSH style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results
Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture
String matching algorithms are among one of the most widely used algorithms
in computer science. Traditional string matching algorithms efficiency of
underlaying string matching algorithm will greatly increase the efficiency of
any application. In recent years, Graphics processing units are emerged as
highly parallel processor. They out perform best of the central processing
units in scientific computation power. By combining recent advancement in
graphics processing units with string matching algorithms will allows to speed
up process of string matching. In this paper we proposed modified parallel
version of Rabin-Karp algorithm using graphics processing unit. Based on that,
result of CPU as well as parallel GPU implementations are compared for
evaluating effect of varying number of threads, cores, file size as well as
pattern size.Comment: Information and Communication Technology for Intelligent Systems
(ICTIS 2017
GPU LSM: A Dynamic Dictionary Data Structure for the GPU
We develop a dynamic dictionary data structure for the GPU, supporting fast
insertions and deletions, based on the Log Structured Merge tree (LSM). Our
implementation on an NVIDIA K40c GPU has an average update (insertion or
deletion) rate of 225 M elements/s, 13.5x faster than merging items into a
sorted array. The GPU LSM supports the retrieval operations of lookup, count,
and range query operations with an average rate of 75 M, 32 M and 23 M
queries/s respectively. The trade-off for the dynamic updates is that the
sorted array is almost twice as fast on retrievals. We believe that our GPU LSM
is the first dynamic general-purpose dictionary data structure for the GPU.Comment: 11 pages, accepted to appear on the Proceedings of IEEE International
Parallel and Distributed Processing Symposium (IPDPS'18
- …