Introducing Parallelism to the Ranges TS
The current interface provided by the C++17 parallel algorithms poses some limitations with respect to parallel data access and heterogeneous systems, such as personal computers and server nodes with GPUs, smartphones, and embedded System on a Chip chipsets. In this paper, we present a summary of why we believe the Ranges TS solves these problems, and also improves both programmability and performance on heterogeneous platforms.
The complete paper has been submitted to WG21 for consideration, and here we present a summary of the changes proposed alongside new performance results.
To the best of our knowledge, this is the first paper presented to WG21 that unifies the Ranges TS with the parallel algorithms introduced in C++17. Although there are various points of intersection, we focus on the composability of functions and the benefit that this brings to accelerator devices via kernel fusion.
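The kernel-fusion idea mentioned in the abstract can be illustrated with a minimal sketch in plain C++17. This is not the interface proposed in the paper: the function names `scale_offset_unfused` and `scale_offset_fused` are hypothetical, and sequential `std::transform` stands in for a parallel or accelerator-backed algorithm. The point is only that composing two element-wise functions into one removes a pass over the data and a temporary buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Unfused (hypothetical helper): two separate passes and one temporary buffer.
// On an accelerator this would correspond to two kernel launches.
void scale_offset_unfused(const std::vector<float>& in, std::vector<float>& out,
                          float a, float b) {
    assert(in.size() == out.size());
    std::vector<float> tmp(in.size());
    std::transform(in.begin(), in.end(), tmp.begin(),
                   [a](float x) { return a * x; });
    std::transform(tmp.begin(), tmp.end(), out.begin(),
                   [b](float x) { return x + b; });
}

// Fused (hypothetical helper): the two steps are composed into one function,
// so the data is traversed once and no temporary is needed.
void scale_offset_fused(const std::vector<float>& in, std::vector<float>& out,
                        float a, float b) {
    assert(in.size() == out.size());
    auto scale = [a](float x) { return a * x; };
    auto offset = [b](float x) { return x + b; };
    std::transform(in.begin(), in.end(), out.begin(),
                   [=](float x) { return offset(scale(x)); });
}
```

Both functions compute the same result; the fused version is the shape that composable range adaptors make easy to express, since the composition happens before the algorithm runs.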
Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques
Maximizing the level of parallelism in applications can be achieved by minimizing
overheads due to load imbalances and waiting time due to memory latencies.
Compiler optimization is one of the most effective solutions to tackle this
problem. The compiler is able to detect the data dependencies in an application
and is able to analyze the specific sections of code for parallelization
potential. However, all of these techniques provided with a compiler are
usually applied at compile time, so they rely on static analysis, which is
insufficient for achieving maximum parallelism and producing desired
application scalability. One solution to address this challenge is the use of
runtime methods. This strategy can be implemented by deferring a certain amount of
code analysis until runtime. In this research, we improve the performance of the
parallel applications generated by the OP2 compiler by leveraging HPX, a C++
runtime system, to provide runtime optimizations. These optimizations include
asynchronous tasking, loop interleaving, dynamic chunk sizing, and data
prefetching. The results of the research were evaluated using an Airfoil
application, which showed a 40-50% improvement in parallel performance.
Comment: 18th IEEE International Workshop on Parallel and Distributed
Scientific and Engineering Computing (PDSEC 2017)
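As a rough illustration of asynchronous tasking over loop chunks, the sketch below uses `std::async` and `std::future` from the C++ standard library as a stand-in for HPX. It is not the OP2 or HPX implementation: the helper name `chunked_async_sum` is hypothetical, and the chunk size is simply passed in by the caller, whereas the dynamic chunk sizing described in the abstract would choose it at runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums `data` by launching one asynchronous task per
// chunk of `chunk` elements, then combining the partial results.
double chunked_async_sum(const std::vector<double>& data, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<std::future<double>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        // Each task reads only its own [begin, end) slice, so no locking is needed.
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                                   data.begin() + static_cast<std::ptrdiff_t>(end),
                                   0.0);
        }));
    }
    double total = 0.0;
    for (auto& part : parts) {
        total += part.get();  // Blocks only until this particular chunk is done.
    }
    return total;
}
```

Smaller chunks give the runtime more freedom to balance load at the cost of more task overhead, which is the trade-off that a runtime-chosen chunk size tries to tune.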
funcX: A Federated Function Serving Fabric for Science
Exploding data volumes and velocities, new computational methods and
platforms, and ubiquitous connectivity demand new approaches to computation in
the sciences. These new approaches must enable computation to be mobile, so
that, for example, it can occur near data, be triggered by events (e.g.,
arrival of new data), be offloaded to specialized accelerators, or run remotely
where resources are available. They also require new design approaches in which
monolithic applications can be decomposed into smaller components that may, in
turn, be executed separately and on the most suitable resources. To address
these needs we present funcX, a distributed function as a service (FaaS)
platform that enables flexible, scalable, and high performance remote function
execution. funcX's endpoint software can transform existing clouds, clusters,
and supercomputers into function serving systems, while funcX's cloud-hosted
service provides transparent, secure, and reliable function execution across a
federated ecosystem of endpoints. We motivate the need for funcX with several
scientific case studies, present our prototype design and implementation, show
optimizations that deliver throughput in excess of 1 million functions per
second, and demonstrate, via experiments on two supercomputers, that funcX can
scale to more than 130,000 concurrent workers.
Comment: Accepted to ACM Symposium on High-Performance Parallel and
Distributed Computing (HPDC 2020). arXiv admin note: substantial text overlap
with arXiv:1908.0490
UPC++ v1.0 Programmer’s Guide, Revision 2020.3.0
UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single-program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are asynchronous by default, to enable programmers to write code that scales well even on hundreds of thousands of cores.
An extension to VORO++ for multithreaded computation of Voronoi cells
VORO++ is a software library written in C++ for computing the Voronoi
tessellation, a technique in computational geometry that is widely used for
analyzing systems of particles. VORO++ was released in 2009 and is based on
computing the Voronoi cell for each particle individually. Here, we take
advantage of modern computer hardware, and extend the original serial version
to allow for multithreaded computation of Voronoi cells via the OpenMP
application programming interface. We test the performance of the code, and
demonstrate that we can achieve parallel efficiencies greater than 95% in many
cases. The multithreaded extension follows standard OpenMP programming
paradigms, allowing it to be incorporated into other programs. We provide an
example of this using the VoroTop software library, performing a multithreaded
Voronoi cell topology analysis of up to 102.4 million particles.
Comment: Fix typo and section number
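The per-particle structure described in the abstract maps naturally onto an OpenMP parallel loop. The sketch below is not the VORO++ API: as a simple stand-in for a Voronoi cell computation, it finds each particle's squared nearest-neighbor distance, and the names `Particle`, `nearest_neighbor_dist2`, and `parallel_efficiency` are hypothetical. The efficiency helper uses the usual definition, serial time divided by the product of thread count and parallel time. Without OpenMP enabled, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Particle { double x, y, z; };  // Hypothetical particle type.

// Each iteration handles one particle and writes only result[i],
// so iterations are independent and can run on separate threads.
std::vector<double> nearest_neighbor_dist2(const std::vector<Particle>& p) {
    const long n = static_cast<long>(p.size());
    std::vector<double> result(p.size(), std::numeric_limits<double>::infinity());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        const Particle& a = p[static_cast<std::size_t>(i)];
        double best = std::numeric_limits<double>::infinity();
        for (long j = 0; j < n; ++j) {
            if (j == i) continue;
            const Particle& b = p[static_cast<std::size_t>(j)];
            const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
            const double d2 = dx * dx + dy * dy + dz * dz;
            if (d2 < best) best = d2;
        }
        result[static_cast<std::size_t>(i)] = best;
    }
    return result;
}

// Parallel efficiency: T_serial / (threads * T_parallel); 1.0 is ideal scaling.
double parallel_efficiency(double t_serial, double t_parallel, int threads) {
    assert(t_parallel > 0.0 && threads > 0);
    return t_serial / (static_cast<double>(threads) * t_parallel);
}
```

Under this definition, an efficiency above 95% on 8 threads would mean the parallel run takes less than about 1/7.6 of the serial time.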