Analysing the Performance of GPU Hash Tables for State Space Exploration
In the past few years, General Purpose Graphics Processors (GPUs) have been
used to significantly speed up numerous applications. One of the areas in which
GPUs have recently led to a significant speed-up is model checking. In model
checking, state spaces, i.e., large directed graphs, are explored to verify
whether models satisfy desirable properties. GPUexplore is a GPU-based model
checker that uses a hash table to efficiently keep track of already explored
states. As a large number of states is discovered and stored during such an
exploration, the hash table should be able to quickly handle many inserts and
queries concurrently. In this paper, we experimentally compare two different
hash tables optimised for the GPU, one being the GPUexplore hash table, and the
other using Cuckoo hashing. We compare the performance of both hash tables
using random and non-random data obtained from model checking experiments, to
analyse the applicability of the two hash tables for state space exploration.
We conclude that Cuckoo hashing is three times faster than GPUexplore hashing
for random data, and that Cuckoo hashing is five to nine times faster for
non-random data. This suggests great potential to further speed up GPUexplore
in the near future.
Comment: In Proceedings GaM 2017, arXiv:1712.0834
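For context, a minimal CUDA sketch of the eviction loop at the heart of GPU cuckoo hashing is given below. The hash constants, table capacity, and retry bound are illustrative assumptions, not the configuration used in the paper or in GPUexplore, and the table is assumed to be pre-filled with the reserved EMPTY value.

    // Minimal sketch of a cuckoo-hash insert on the GPU, assuming 32-bit
    // keys, two hash functions, and a bounded eviction chain.
    #include <cstdint>
    #include <cuda_runtime.h>

    constexpr uint32_t EMPTY    = 0xFFFFFFFFu;  // reserved "no key" value
    constexpr uint32_t CAPACITY = 1u << 20;     // illustrative table size
    constexpr int      MAX_ITER = 32;           // bound on eviction chains

    __device__ uint32_t h1(uint32_t k) { return (k * 2654435761u) % CAPACITY; }
    __device__ uint32_t h2(uint32_t k) { return (k * 40503u + 2531011u) % CAPACITY; }

    // One thread inserts one key. atomicExch claims a slot in a single step;
    // any evicted occupant is carried onward to its alternate location.
    __global__ void cuckooInsert(uint32_t *table, const uint32_t *keys,
                                 int n, int *failed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint32_t k   = keys[i];
        uint32_t loc = h1(k);
        for (int it = 0; it < MAX_ITER; ++it) {
            k = atomicExch(&table[loc], k);        // swap key into slot
            if (k == EMPTY) return;                // slot was free: done
            loc = (h1(k) == loc) ? h2(k) : h1(k);  // evictee goes elsewhere
        }
        atomicAdd(failed, 1);   // chain too long: a real table would rehash
    }

The appeal under massive concurrency is visible in the loop: each step is a single atomicExch, so an insert never holds a lock and thousands of threads can probe and evict independently.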
09491 Abstracts Collection -- Graph Search Engineering
From the 29th November to the 4th December 2009, the Dagstuhl Seminar
09491 "Graph Search Engineering" was held
in Schloss Dagstuhl – Leibniz Center for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
A structural analysis of the A5/1 state transition graph
We describe efficient algorithms to analyze the cycle structure of the graph
induced by the state transition function of the A5/1 stream cipher used in GSM
mobile phones and report on the results of the implementation. The analysis is
performed in five steps utilizing HPC clusters, GPGPU and external memory
computation. A great reduction of this huge state transition graph of 2^64
nodes is achieved by focusing on special nodes in the first step and removing
leaf nodes that can be detected with limited effort in the second step. This
step does not break the overall structure of the graph and keeps at least one
node on every cycle. In the third step the nodes of the reduced graph are
connected by weighted edges. Since the number of nodes is still huge, an
efficient bitslice approach is presented that is implemented with NVIDIA's CUDA
framework and executed on several GPUs concurrently. An external memory
algorithm based on the STXXL library and its parallel pipelining feature
further reduces the graph in the fourth step. The result is a graph containing
only cycles that can be further analyzed in internal memory to count the number
and size of the cycles. This full analysis, which previously would have taken
months, can now be completed within a few days, and allows structural results
for the full graph to be presented for the first time. The structure of the
A5/1 graph deviates notably from the theoretical results for random mappings.
Comment: In Proceedings GRAPHITE 2012, arXiv:1210.611
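To give a flavour of the bitslice technique used in the third step, the hedged CUDA sketch below advances 64 independent A5/1 states with one sequence of boolean word operations: bit j of each 64-bit word holds the corresponding bit of state j. The register lengths, feedback taps, and clocking bits follow the published A5/1 description; the data layout and function boundaries are assumptions of this sketch.

    // Bitsliced A5/1: one thread advances 64 cipher states branch-free.
    #include <cstdint>
    #include <cuda_runtime.h>

    typedef uint64_t slice;   // 64 bitsliced instances per machine word

    // Advance one LFSR by one clock in the lanes selected by mask e.
    // r[0] is the feedback end; r[len-1] is the output end.
    __device__ void clockReg(slice *r, int len, slice fb, slice e) {
        for (int i = len - 1; i > 0; --i)
            r[i] = (e & r[i - 1]) | (~e & r[i]);
        r[0] = (e & fb) | (~e & r[0]);
    }

    // One majority-clocked A5/1 step for 64 states at once.
    __device__ void a51Step(slice r1[19], slice r2[22], slice r3[23]) {
        slice c1 = r1[8], c2 = r2[10], c3 = r3[10];      // clocking bits
        slice maj = (c1 & c2) | (c1 & c3) | (c2 & c3);   // lane-wise majority
        slice fb1 = r1[13] ^ r1[16] ^ r1[17] ^ r1[18];   // feedback taps
        slice fb2 = r2[20] ^ r2[21];
        slice fb3 = r3[7] ^ r3[20] ^ r3[21] ^ r3[22];
        clockReg(r1, 19, fb1, ~(c1 ^ maj));   // a register clocks iff its
        clockReg(r2, 22, fb2, ~(c2 ^ maj));   // clocking bit equals the
        clockReg(r3, 23, fb3, ~(c3 ^ maj));   // majority
    }

Since the majority rule and the conditional clocking reduce to AND/OR/XOR on whole words, there are no data-dependent branches, which is exactly what makes the approach effective on SIMD hardware like GPUs.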
Performance Evaluation of Blocking and Non-Blocking Concurrent Queues on GPUs
The efficiency of concurrent data structures is crucial to the performance of multi-threaded programs in shared-memory systems. The arbitrary execution of concurrent threads, however, can result in incorrect behavior of these data structures. Graphics Processing Units (GPUs) have emerged as a powerful platform for high-performance computing. While regular data-parallel computations are straightforward to implement on traditional CPU architectures, it is challenging to implement them in a SIMD environment in the presence of thousands of active threads on GPU architectures. In this thesis, we implement a concurrent queue data structure and evaluate its performance on GPUs to understand how it behaves in a massively-parallel GPU environment. We implement both blocking and non-blocking approaches and compare their performance and behavior using both a micro-benchmark and a real-world application. We provide a complete evaluation and analysis of our implementations on an AMD Radeon R7 GPU. Our experiments show that the non-blocking approach outperforms the blocking approach by up to 15.1 times when sufficient thread-level parallelism is present.
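As a rough illustration of the two designs being compared (not the thesis's actual implementation), the CUDA sketch below contrasts a spinlock-guarded enqueue with a fetch-and-add enqueue on a bounded buffer. The capacity, and the assumption that producers never wrap over unconsumed slots, are simplifications.

    #include <cuda_runtime.h>

    constexpr int CAP = 1 << 20;   // illustrative bounded capacity

    struct Queue {
        int buf[CAP];
        int tail;   // next free slot
        int lock;   // 0 = free, 1 = held (blocking variant only)
    };

    // Blocking enqueue: one thread at a time. The lock is taken and released
    // inside a single loop iteration (the "if inside while" pattern), which
    // avoids the classic intra-warp livelock on pre-Volta GPUs.
    __device__ void enqueueBlocking(Queue *q, int v) {
        bool done = false;
        while (!done) {
            if (atomicCAS(&q->lock, 0, 1) == 0) {   // try to take the lock
                q->buf[q->tail++] = v;
                __threadfence();
                atomicExch(&q->lock, 0);            // release
                done = true;
            }
        }
    }

    // Non-blocking enqueue: every caller claims a distinct slot with one
    // fetch-and-add and writes it independently; no thread waits on another.
    __device__ void enqueueNonBlocking(Queue *q, int v) {
        int slot = atomicAdd(&q->tail, 1);          // unique slot per caller
        q->buf[slot % CAP] = v;
    }

The non-blocking path replaces a serialised critical section with one atomic per enqueue, which is why it scales when thread-level parallelism is plentiful.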
Doctor of Philosophy dissertation
Emerging trends such as growing architectural diversity and increased emphasis on energy and power efficiency motivate the need for code that adapts to its execution context (input dataset and target architecture). Unfortunately, writing such code remains difficult, and is typically attempted only by a small group of motivated expert programmers who are highly knowledgeable about the relationship between software and its hardware mapping. In this dissertation, we introduce novel abstractions and techniques based on automatic performance tuning that enable both experts and nonexperts (application developers) to produce adaptive code. We present two new frameworks for adaptive programming: Nitro and Surge. Nitro enables expert programmers to specify code variants, or alternative implementations of the same computation, together with meta-information for selecting among them. It then utilizes supervised classification to select an optimal code variant at runtime based on characteristics of the execution context. Surge, on the other hand, provides a high-level nested data-parallel programming interface for application developers to specify computations. It then employs a two-level mechanism to automatically generate code variants, which are tuned using Nitro. The resulting code performs on par with or better than handcrafted reference implementations on both CPUs and GPUs. In addition to abstractions for expressing code variants, this dissertation also presents novel strategies for adaptively tuning them. First, we introduce a technique for dynamically selecting an optimal code variant at runtime based on characteristics of the input dataset. On five high-performance GPU applications, variants tuned using this strategy achieve over 93% of the performance of variants selected through exhaustive search. Next, we present a novel approach based on multitask learning to develop a code variant selection model on a target architecture from training on different source architectures. We evaluate this approach on a set of six benchmark applications and a collection of six NVIDIA GPUs from three distinct architecture generations. Finally, we implement support for combined code variant and frequency selection based on multiple objectives, including power and energy efficiency. Using this strategy, we construct a GPU sorting implementation that provides improved energy and power efficiency with less than a proportional drop in sorting throughput.
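As a flavour of what runtime variant selection looks like, the sketch below uses hypothetical feature and variant names, with a hand-written rule standing in for the classifier that Nitro trains offline; none of this is the actual Nitro API.

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <vector>

    struct Features { size_t n; double sortedness; };  // example input features

    using Variant = std::function<void(std::vector<int>&)>;

    // Stand-in for the learned model: map input features to a variant index.
    size_t selectVariant(const Features &f) {
        if (f.n < (1u << 12))   return 0;  // tiny input: low-overhead variant
        if (f.sortedness > 0.9) return 1;  // nearly sorted: adaptive variant
        return 2;                          // default: bulk-parallel variant
    }

    int main() {
        // Three placeholder variants; a real tuner would register distinct
        // implementations (e.g. different sorting kernels) here.
        std::vector<Variant> variants(3,
            [](std::vector<int> &v) { std::sort(v.begin(), v.end()); });
        std::vector<int> data = {5, 3, 8, 1};
        Features f{data.size(), /*sortedness=*/0.2};
        variants[selectVariant(f)](data);  // dispatch on execution context
        printf("min element: %d\n", data[0]);
    }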
Inter-workgroup barrier synchronisation on graphics processing units
GPUs are parallel devices that are able to run thousands of
independent threads concurrently. Traditional GPU programs are
data-parallel, requiring little to no communication,
i.e. synchronisation, between threads. However, classical concurrency
in the context of CPUs often exploits synchronisation idioms that are
not supported on GPUs. By studying such idioms on GPUs, with an aim to
facilitate them in a portable way, a wider and more generic space of
GPU applications can be made possible.
While the breadth of this thesis extends to many aspects of GPU
systems, the common thread throughout is the global barrier: an
execution barrier that synchronises all threads executing a GPU
application. The idea of such a barrier might seem straightforward;
however, this investigation reveals many challenges and insights. In
particular, this thesis includes the following studies:
Execution models: while a general global barrier can deadlock due to
starvation on GPUs, it is shown that the scheduling guarantees of
current GPUs can be used to dynamically create an execution
environment that allows for a safe and portable global barrier
across a subset of the GPU threads.
Application optimisations: a set of GPU optimisations is examined that
are tailored for graph applications, including one optimisation
enabled by the global barrier. It is shown that these optimisations
can provide substantial performance improvements, e.g. the barrier
optimisation achieves over a 10X speedup on AMD and Intel GPUs. The
performance portability of these optimisations is investigated, as
their utility varies across input, application, and architecture.
Multitasking: because many GPUs do not support preemption,
long-running GPU compute tasks (e.g. applications that use the
global barrier) may block other GPU functions, including graphics. A
simple cooperative multitasking scheme is proposed that allows
graphics tasks to meet their deadlines with reasonable overheads.
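For concreteness, a minimal sketch of an occupancy-bound global barrier of the kind studied in the thesis follows. It is only safe when every workgroup is co-resident on the device, and production CUDA code would instead use cooperative groups (cooperative_groups::this_grid().sync()); the sense-reversal scheme here is one standard construction, not necessarily the thesis's exact protocol.

    #include <cuda_runtime.h>

    __device__ unsigned arrived = 0;   // blocks that reached the barrier
    __device__ unsigned sense   = 0;   // flips once per completed barrier

    __device__ void globalBarrier(unsigned numBlocks) {
        __syncthreads();                       // quiesce this block first
        if (threadIdx.x == 0) {
            unsigned mySense = atomicAdd(&sense, 0);   // read current phase
            if (atomicAdd(&arrived, 1) + 1 == numBlocks) {
                atomicExch(&arrived, 0);               // last one resets...
                __threadfence();
                atomicAdd(&sense, 1);                  // ...then releases all
            } else {
                while (atomicAdd(&sense, 0) == mySense) {}  // spin on flip
            }
        }
        __syncthreads();                       // release the whole block
    }

Flipping a phase word, rather than waiting for the counter to return to zero, is what makes the barrier safely reusable: a fast block entering the next barrier cannot confuse blocks still leaving the previous one.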
Parallel breadth first search on GPU clusters
Pre-print
Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many-core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi®, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low-level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on supercomputers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.
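A single-GPU building block for such systems is the level-synchronous BFS kernel sketched below for a CSR graph; the multi-node partitioning, communication, and frontier compaction that the paper's system adds on top are omitted.

    #include <cuda_runtime.h>

    // One thread scans one vertex of the current level's frontier.
    __global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                             int *depth, int numVertices,
                             int level, int *frontierNonEmpty) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= numVertices || depth[v] != level) return;  // not on frontier
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            // Claim each unvisited neighbour for the next level exactly once.
            if (depth[u] == -1 && atomicCAS(&depth[u], -1, level + 1) == -1)
                *frontierNonEmpty = 1;
        }
    }
    // Host side: initialise depth[] to -1, set depth[source] = 0, then
    // launch with level = 0, 1, 2, ... until frontierNonEmpty stays 0.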
Incremental closeness centrality in distributed memory
Networks are commonly used to model traffic patterns, social interactions, or web pages. The vertices in a network do not possess the same characteristics: some vertices are naturally more connected, and some vertices can be more important. Closeness centrality (CC) is a global metric that quantifies how important a given vertex is in the network. When the network is dynamic and keeps changing, the relative importance of the vertices also changes. The cost of the best known algorithm for computing CC scores makes it impractical to recompute them from scratch after each modification. In this paper, we propose Streamer, a distributed-memory framework for incrementally maintaining the closeness centrality scores of a network upon changes. It leverages pipelined, replicated parallelism and SpMM-based BFSs, and it takes NUMA effects into account. It makes maintaining the closeness centrality values of real-life networks with millions of interactions significantly faster, and obtains almost linear speedups on a cluster of 64 nodes with 8 threads per node.
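Under one common definition, cc(v) is the reciprocal of the sum of shortest-path distances from v, and the from-scratch baseline that becomes impractical on dynamic networks is one BFS per vertex, roughly as sketched below for an unweighted CSR graph.

    #include <queue>
    #include <vector>

    // cc(v) = 1 / sum over reachable u of d(v,u); one BFS from src.
    double closeness(const std::vector<int> &rowPtr,
                     const std::vector<int> &colIdx, int src) {
        int n = static_cast<int>(rowPtr.size()) - 1;
        std::vector<int> dist(n, -1);
        std::queue<int> q;
        dist[src] = 0;
        q.push(src);
        long long total = 0;                 // sum of shortest-path distances
        while (!q.empty()) {
            int v = q.front(); q.pop();
            total += dist[v];
            for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
                if (dist[colIdx[e]] == -1) { // first visit = shortest distance
                    dist[colIdx[e]] = dist[v] + 1;
                    q.push(colIdx[e]);
                }
        }
        return total > 0 ? 1.0 / static_cast<double>(total) : 0.0;
    }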
A Fine-grained Performance Model for GPU Architectures
The increasing programmability, performance, and cost effectiveness of GPUs have led to a widespread use of such many-core architectures to accelerate general-purpose applications. Nevertheless, tuning applications to efficiently exploit the GPU's potential is a very challenging task, especially for inexperienced programmers. This is due to the difficulty of developing a software application for the specific GPU architectural configuration, which includes managing the memory hierarchy and optimizing the execution of thousands of concurrent threads while maintaining the semantic correctness of the application. Even though several profiling tools exist that provide programmers with a large number of metrics and measurements, it is often difficult to interpret such information for effectively tuning the application. This paper presents a performance model that allows accurate estimation of the potential performance of the application under tuning on a given GPU device and, at the same time, provides programmers with interpretable profiling hints. The paper shows the results obtained by applying the proposed model to profile commonly used primitives and real codes.
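The paper's fine-grained model is not reproduced here, but a much coarser classic, the roofline bound, illustrates the general idea of estimating a kernel's attainable performance from a few kernel and device parameters; the numbers in main are illustrative only.

    #include <algorithm>
    #include <cstdio>

    // flops      : floating-point operations the kernel performs
    // bytes      : bytes it moves to/from DRAM
    // peakGflops : device peak compute throughput (GFLOP/s)
    // peakGBps   : device peak memory bandwidth (GB/s)
    double rooflineGflops(double flops, double bytes,
                          double peakGflops, double peakGBps) {
        double intensity = flops / bytes;                   // FLOP per byte
        return std::min(peakGflops, intensity * peakGBps);  // compute- or
    }                                                       // memory-bound

    int main() {
        // A streaming kernel (0.25 FLOP/byte) on a device with 10 TFLOP/s
        // peak compute and 900 GB/s DRAM bandwidth: memory-bound at 225.
        printf("bound: %.1f GFLOP/s\n", rooflineGflops(2e9, 8e9, 10000, 900));
    }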