13,583 research outputs found
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighborsearching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin
Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries
This work introduces a runtime model for managing communication with support
for latency-hiding. The model enables non-computer science researchers to
exploit communication latency-hiding techniques seamlessly. For compiled
languages, it is often possible to create efficient schedules for
communication, but this is not the case for interpreted languages. By
maintaining data dependencies between scheduled operations, it is possible to
aggressively initiate communication and lazily evaluate tasks to allow maximal
time for the communication to finish before entering a wait state. We implement
a heuristic of this model in DistNumPy, an auto-parallelizing version of
numerical Python that allows sequential NumPy programs to run on distributed
memory architectures. Furthermore, we present performance comparisons for eight
benchmarks with and without automatic latency-hiding. The results shows that
our model reduces the time spent on waiting for communication as much as 27
times, from a maximum of 54% to only 2% of the total execution time, in a
stencil application.Comment: PREPRIN
HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA
Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host
processor with programmable manycore accelerators (PMCAs) to combine
general-purpose computing with domain-specific, efficient processing
capabilities. While leading companies successfully advance their HESoC
products, research lags behind due to the challenges of building a prototyping
platform that unites an industry-standard host processor with an open research
PMCA architecture. In this work we introduce HERO, an FPGA-based research
platform that combines a PMCA composed of clusters of RISC-V cores, implemented
as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host
processor. The PMCA architecture mapped on the FPGA is silicon-proven,
scalable, configurable, and fully modifiable. HERO includes a complete software
stack that consists of a heterogeneous cross-compilation toolchain with support
for OpenMP accelerator programming, a Linux driver, and runtime libraries for
both host and PMCA. HERO is designed to facilitate rapid exploration on all
software and hardware layers: run-time behavior can be accurately analyzed by
tracing events, and modifications can be validated through fully automated hard
ware and software builds and executed tests. We demonstrate the usefulness of
HERO by means of case studies from our research
- …