2,255 research outputs found
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the
singular value decomposition (SVD), targeting both single and multiple graphics
processing units (GPUs). The blocking structure reflects the levels of GPU's
memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining
high relative accuracy. To this end, we developed a family of parallel pivot
strategies on GPU's shared address space, but applicable also to inter-GPU
communication. Unlike common hybrid approaches, our algorithm in a single GPU
setting needs a CPU for the controlling purposes only, while utilizing GPU's
resources to the fullest extent permitted by the hardware. When required by the
problem size, the algorithm, in principle, scales to an arbitrary number of GPU
nodes. The scalability is demonstrated by more than twofold speedup for
sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single
Fermi card.Comment: Accepted for publication in SIAM Journal on Scientific Computin
Analysing the Performance of GPU Hash Tables for State Space Exploration
In the past few years, General Purpose Graphics Processors (GPUs) have been
used to significantly speed up numerous applications. One of the areas in which
GPUs have recently led to a significant speed-up is model checking. In model
checking, state spaces, i.e., large directed graphs, are explored to verify
whether models satisfy desirable properties. GPUexplore is a GPU-based model
checker that uses a hash table to efficiently keep track of already explored
states. As a large number of states is discovered and stored during such an
exploration, the hash table should be able to quickly handle many inserts and
queries concurrently. In this paper, we experimentally compare two different
hash tables optimised for the GPU, one being the GPUexplore hash table, and the
other using Cuckoo hashing. We compare the performance of both hash tables
using random and non-random data obtained from model checking experiments, to
analyse the applicability of the two hash tables for state space exploration.
We conclude that Cuckoo hashing is three times faster than GPUexplore hashing
for random data, and that Cuckoo hashing is five to nine times faster for
non-random data. This suggests great potential to further speed up GPUexplore
in the near future.Comment: In Proceedings GaM 2017, arXiv:1712.0834
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
Constraints Propagation on GPU: A Case Study for AllDifferent
The AllDifferent constraint is a fundamental tool in Constraint Programming. It naturally arises in many problems, from puzzles to scheduling and routing applications. Such popularity has prompted an extensive literature on filtering and propagation for this constraint. Motivated by the benefits that GPUs offer to other branches of AI, this paper investigates the use of GPUs to accelerate filtering and propagation. In particular, we present an efficient parallelization of the AllDifferent constraint on GPU; we analyze different design and implementation choices and evaluates the performance of the resulting system on medium to large instances of the Travelling Salesman Problem with encouraging results
High performance graph analysis on parallel architectures
PhD ThesisOver the last decade pharmacology has been developing computational
methods to enhance drug development and testing. A computational
method called network pharmacology uses graph analysis
tools to determine protein target sets that can lead on better targeted
drugs for diseases as Cancer. One promising area of network-based
pharmacology is the detection of protein groups that can produce
better e ects if they are targeted together by drugs. However, the
e cient prediction of such protein combinations is still a bottleneck
in the area of computational biology.
The computational burden of the algorithms used by such protein
prediction strategies to characterise the importance of such proteins
consists an additional challenge for the eld of network pharmacology.
Such computationally expensive graph algorithms as the all pairs
shortest path (APSP) computation can a ect the overall drug discovery
process as needed network analysis results cannot be given on
time. An ideal solution for these highly intensive computations could
be the use of super-computing. However, graph algorithms have datadriven
computation dictated by the structure of the graph and this
can lead to low compute capacity utilisation with execution times
dominated by memory latency.
Therefore, this thesis seeks optimised solutions for the real-world
graph problems of critical node detection and e ectiveness characterisation
emerged from the collaboration with a pioneer company in the
eld of network pharmacology as part of a Knowledge Transfer Partnership
(KTP) / Secondment (KTS). In particular, we examine how
genetic algorithms could bene t the prediction of protein complexes
where their removal could produce a more e ective 'druggable' impact.
Furthermore, we investigate how the problem of all pairs shortest
path (APSP) computation can be bene ted by the use of emerging
parallel hardware architectures as GPU- and FPGA- desktop-based
accelerators.
In particular, we address the problem of critical node detection with
the development of a heuristic search method. It is based on a genetic
algorithm that computes optimised node combinations where their removal
causes greater impact than common impact analysis strategies.
Furthermore, we design a general pattern for parallel network analysis
on multi-core architectures that considers graph's embedded properties.
It is a divide and conquer approach that decomposes a graph
into smaller subgraphs based on its strongly connected components
and computes the all pairs shortest paths concurrently on GPU. Furthermore,
we use linear algebra to design an APSP approach based
on the BFS algorithm. We use algebraic expressions to transform the
problem of path computation to multiple independent matrix-vector
multiplications that are executed concurrently on FPGA. Finally, we
analyse how the optimised solutions of perturbation analysis and parallel
graph processing provided in this thesis will impact the drug
discovery process.This research was part of a Knowledge Transfer Partnership (KTP)
and Knowledge Transfer Secondment (KTS) between e-therapeutics
PLC and Newcastle University. It was supported as a collaborative
project by e-therapeutics PLC and Technology Strategy boar
- …