Search CORE

1,193 research outputs found

Introducing Parallelism to the Ranges TS

Author: Brown Gordon
Di Bella Christopher
Haidl Michael
Remmelg Toomas
Reyes Ruyman
Steuwer Michel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018
Field of study

The current interface provided by the C++17 parallel algorithms poses some limitations with respect to parallel data access and heterogeneous systems, such as personal computers and server nodes with GPUs, smartphones, and embedded System on a Chip chipsets. In this paper, we present a summary of why we believe the Ranges TS solves these problems, and also improves both programmability and performance on heterogeneous platforms. The complete paper has been submitted to WG21 for consideration, and here we present a summary of the changes proposed alongside new performance results. To the best of our knowledge, this is the first paper presented to WG21 that unifies the Ranges TS with the parallel algorithms introduced in C++17. Although there are various points of intersection, we will focus on the composability of functions, and the benefit that this brings to accelerator devices via kernel fusion

Crossref

Enlighten

Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

Author: Kaiser Hartmut
Khatami Zahra
Ramanujam J.
Publication venue
Publication date: 27/03/2017
Field of study

Maximizing parallelism level in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided with a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and producing desired application scalability. One solution to address this challenge is the use of runtime methods. This strategy can be implemented by delaying certain amount of code analysis to be done at runtime. In this research, we improve the parallel application performance generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results of the research were evaluated using an Airfoil application which showed a 40-50% improvement in parallel performance.Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017

arXiv.org e-Print Archive

Crossref

funcX: A Federated Function Serving Fabric for Science

Author: Akkus I. E.
CMS
Forde J.
Fox G.
Hightower K.
Hindman B.
Malawski M.
Merkel D.
Spillner J.
Stubbs J.
Wang L.
Waterman D. G.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 07/05/2020
Field of study

Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components, that may in turn be executed separately and on the most suitable resources. To address these needs we present funcX---a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. funcX's endpoint software can transform existing clouds, clusters, and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints. We motivate the need for funcX with several scientific case studies, present our prototype design and implementation, show optimizations that deliver throughput in excess of 1 million functions per second, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than more than 130000 concurrent workers.Comment: Accepted to ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2020). arXiv admin note: substantial text overlap with arXiv:1908.0490

arXiv.org e-Print Archive

Crossref

Recommended from our members

UPC++ v1.0 Programmer’s Guide, Revision 2020.3.0

Author: Bachan J
Baden S
Bonachea Dan
Grossman J
Hargrove P
Hofmeyr S
Jacquelin M
Kamil A
Van Straalen B
Publication venue: eScholarship, University of California
Publication date: 12/03/2020
Field of study

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores

eScholarship - University of California

An extension to VORO++ for multithreaded computation of Voronoi cells

Author: Lazar Emanuel A.
Lu Jiayin
Rycroft Chris H.
Publication venue
Publication date: 08/07/2023
Field of study

VORO++ is a software library written in C++ for computing the Voronoi tessellation, a technique in computational geometry that is widely used for analyzing systems of particles. VORO++ was released in 2009 and is based on computing the Voronoi cell for each particle individually. Here, we take advantage of modern computer hardware, and extend the original serial version to allow for multithreaded computation of Voronoi cells via the OpenMP application programming interface. We test the performance of the code, and demonstrate that we can achieve parallel efficiencies greater than 95% in many cases. The multithreaded extension follows standard OpenMP programming paradigms, allowing it to be incorporated into other programs. We provide an example of this using the VoroTop software library, performing a multithreaded Voronoi cell topology analysis of up to 102.4 million particles.Comment: Fix typo and section number

arXiv.org e-Print Archive