3,442 research outputs found
Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study
Manycores are consolidating in HPC community as a way of improving
performance while keeping power efficiency. Knights Landing is the recently
released second generation of Intel Xeon Phi architecture. While optimizing
applications on CPUs, GPUs and first Xeon Phi's has been largely studied in the
last years, the new features in Knights Landing processors require the revision
of programming and optimization techniques for these devices. In this work, we
selected the Floyd-Warshall algorithm as a representative case study of graph
and memory-bound applications. Starting from the default serial version, we
show how data, thread and compiler level optimizations help the parallel
implementation to reach 338 GFLOPS.Comment: Computer Science - CACIC 2017. Springer Communications in Computer
and Information Science, vol 79
High performance graph analysis on parallel architectures
PhD ThesisOver the last decade pharmacology has been developing computational
methods to enhance drug development and testing. A computational
method called network pharmacology uses graph analysis
tools to determine protein target sets that can lead on better targeted
drugs for diseases as Cancer. One promising area of network-based
pharmacology is the detection of protein groups that can produce
better e ects if they are targeted together by drugs. However, the
e cient prediction of such protein combinations is still a bottleneck
in the area of computational biology.
The computational burden of the algorithms used by such protein
prediction strategies to characterise the importance of such proteins
consists an additional challenge for the eld of network pharmacology.
Such computationally expensive graph algorithms as the all pairs
shortest path (APSP) computation can a ect the overall drug discovery
process as needed network analysis results cannot be given on
time. An ideal solution for these highly intensive computations could
be the use of super-computing. However, graph algorithms have datadriven
computation dictated by the structure of the graph and this
can lead to low compute capacity utilisation with execution times
dominated by memory latency.
Therefore, this thesis seeks optimised solutions for the real-world
graph problems of critical node detection and e ectiveness characterisation
emerged from the collaboration with a pioneer company in the
eld of network pharmacology as part of a Knowledge Transfer Partnership
(KTP) / Secondment (KTS). In particular, we examine how
genetic algorithms could bene t the prediction of protein complexes
where their removal could produce a more e ective 'druggable' impact.
Furthermore, we investigate how the problem of all pairs shortest
path (APSP) computation can be bene ted by the use of emerging
parallel hardware architectures as GPU- and FPGA- desktop-based
accelerators.
In particular, we address the problem of critical node detection with
the development of a heuristic search method. It is based on a genetic
algorithm that computes optimised node combinations where their removal
causes greater impact than common impact analysis strategies.
Furthermore, we design a general pattern for parallel network analysis
on multi-core architectures that considers graph's embedded properties.
It is a divide and conquer approach that decomposes a graph
into smaller subgraphs based on its strongly connected components
and computes the all pairs shortest paths concurrently on GPU. Furthermore,
we use linear algebra to design an APSP approach based
on the BFS algorithm. We use algebraic expressions to transform the
problem of path computation to multiple independent matrix-vector
multiplications that are executed concurrently on FPGA. Finally, we
analyse how the optimised solutions of perturbation analysis and parallel
graph processing provided in this thesis will impact the drug
discovery process.This research was part of a Knowledge Transfer Partnership (KTP)
and Knowledge Transfer Secondment (KTS) between e-therapeutics
PLC and Newcastle University. It was supported as a collaborative
project by e-therapeutics PLC and Technology Strategy boar
Cellular Automata Applications in Shortest Path Problem
Cellular Automata (CAs) are computational models that can capture the
essential features of systems in which global behavior emerges from the
collective effect of simple components, which interact locally. During the last
decades, CAs have been extensively used for mimicking several natural processes
and systems to find fine solutions in many complex hard to solve computer
science and engineering problems. Among them, the shortest path problem is one
of the most pronounced and highly studied problems that scientists have been
trying to tackle by using a plethora of methodologies and even unconventional
approaches. The proposed solutions are mainly justified by their ability to
provide a correct solution in a better time complexity than the renowned
Dijkstra's algorithm. Although there is a wide variety regarding the
algorithmic complexity of the algorithms suggested, spanning from simplistic
graph traversal algorithms to complex nature inspired and bio-mimicking
algorithms, in this chapter we focus on the successful application of CAs to
shortest path problem as found in various diverse disciplines like computer
science, swarm robotics, computer networks, decision science and biomimicking
of biological organisms' behaviour. In particular, an introduction on the first
CA-based algorithm tackling the shortest path problem is provided in detail.
After the short presentation of shortest path algorithms arriving from the
relaxization of the CAs principles, the application of the CA-based shortest
path definition on the coordinated motion of swarm robotics is also introduced.
Moreover, the CA based application of shortest path finding in computer
networks is presented in brief. Finally, a CA that models exactly the behavior
of a biological organism, namely the Physarum's behavior, finding the
minimum-length path between two points in a labyrinth is given.Comment: To appear in the book: Adamatzky, A (Ed.) Shortest path solvers. From
software to wetware. Springer, 201
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
Fast network centrality analysis using GPUs
<p>Abstract</p> <p>Background</p> <p>With the exploding volume of data generated by continuously evolving high-throughput technologies, biological network analysis problems are growing larger in scale and craving for more computational power. General Purpose computation on Graphics Processing Units (GPGPU) provides a cost-effective technology for the study of large-scale biological networks. Designing algorithms that maximize data parallelism is the key in leveraging the power of GPUs.</p> <p>Results</p> <p>We proposed an efficient data parallel formulation of the All-Pairs Shortest Path problem, which is the key component for shortest path-based centrality computation. A betweenness centrality algorithm built upon this formulation was developed and benchmarked against the most recent GPU-based algorithm. Speedup between 11 to 19% was observed in various simulated scale-free networks. We further designed three algorithms based on this core component to compute closeness centrality, eccentricity centrality and stress centrality. To make all these algorithms available to the research community, we developed a software package <it>gpu</it>-<it>fan </it>(GPU-based Fast Analysis of Networks) for CUDA enabled GPUs. Speedup of 10-50Ă— compared with CPU implementations was observed for simulated scale-free networks and real world biological networks.</p> <p>Conclusions</p> <p><it>gpu</it>-<it>fan </it>provides a significant performance improvement for centrality computation in large-scale networks. Source code is available under the GNU Public License (GPL) at <url>http://bioinfo.vanderbilt.edu/gpu-fan/</url>.</p
Graph Algorithms on GPUs
This chapter introduces the topic of graph algorithms on GPUs. It starts by presenting and comparing the main important data structures and techniques applied for representing and analysing graphs on GPUs at the state of the art.It then presents the theory and an updated review of the most efficient implementations of graph algorithms for GPUs. In particular, the chapter focuses on graph traversal algorithms (breadth-first search), single-source shortest path(Djikstra, Bellman-Ford, delta stepping, hybrids), and all-pair shortest path (Floyd-Warshall). By the end of the chapter, load balancing and memory access techniques are discussed through an overview of their main issues and management techniques
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix
multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the
popular compressed-sparse-row (CSR) format and thus do not require expensive
format conversion. While previous SpMM work concentrates on thread-level
parallelism, we additionally focus on latency hiding with instruction-level
parallelism and load-balancing. We show, both theoretically and experimentally,
that the proposed SpMM is a better fit for the GPU than previous approaches. We
identify a key memory access pattern that allows efficient access into both
input and output matrices that is crucial to getting excellent performance on
SpMM. By combining these two ingredients---(i) merge-based load-balancing and
(ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and
a 31.7% geomean speedup over state-of-the-art SpMM implementations on
real-world datasets.Comment: 16 pages, 7 figures, International European Conference on Parallel
and Distributed Computing (Euro-Par) 201
Reconfigurable computing for large-scale graph traversal algorithms
This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem.
The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows.
First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems.
Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation.
Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case-study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data.
Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.Open Acces
- …