
    Using GPI-2 for Distributed Memory Parallelization of the Caffe Toolbox to Speed up Deep Neural Network Training

    Deep Neural Networks (DNNs) are currently of great interest in research and application. The training of these networks is a compute-intensive and time-consuming task. To reduce training times to a bearable amount at reasonable cost, we extend the popular Caffe toolbox for DNNs with an efficient distributed-memory communication pattern. To achieve good scalability we emphasize the overlap of computation and communication and prefer fine-granular synchronization patterns over global barriers. To implement these communication patterns we rely on the Global address space Programming Interface version 2 (GPI-2) communication library. This interface provides a light-weight set of asynchronous one-sided communication primitives, supplemented by non-blocking, fine-granular data synchronization mechanisms. CaffeGPI is the name of our parallel version of Caffe. First benchmarks demonstrate better scaling behavior compared with other extensions, e.g., the Intel Caffe. Even within a single symmetric multiprocessing machine with four graphics processing units, CaffeGPI scales better than the standard Caffe toolbox. These first results demonstrate that the use of standard High Performance Computing (HPC) hardware is a valid cost-saving approach to train large DNNs. I/O is another bottleneck to working with DNNs in a standard parallel HPC setting, which we will consider in more detail in a forthcoming paper.
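
    The abstract centers on asynchronous one-sided writes paired with fine-granular notifications instead of global barriers. The C sketch below is not taken from CaffeGPI; it only illustrates the general GPI-2 (GASPI) write-and-notify pattern the abstract refers to, with segment sizes, ids, and the ring-neighbor exchange chosen as assumptions for the example.

    /* Hedged sketch of the GPI-2 (GASPI) write-and-notify pattern described in
     * the abstract; segment ids, sizes, and the ring-neighbor exchange are
     * assumptions, not CaffeGPI's actual communication scheme. */
    #include <GASPI.h>
    #include <stdlib.h>

    #define SEG_ID   0      /* one RDMA segment holding a send half and a receive half */
    #define NOTIF_ID 0      /* one notification per incoming gradient block            */

    int main(int argc, char *argv[])
    {
        gaspi_proc_init(GASPI_BLOCK);

        gaspi_rank_t rank, nprocs;
        gaspi_proc_rank(&rank);
        gaspi_proc_num(&nprocs);

        const gaspi_size_t bytes = 1 << 20;            /* 1 MiB of gradients (illustrative) */
        gaspi_segment_create(SEG_ID, 2 * bytes, GASPI_GROUP_ALL,
                             GASPI_BLOCK, GASPI_MEM_INITIALIZED);

        gaspi_pointer_t seg_ptr;
        gaspi_segment_ptr(SEG_ID, &seg_ptr);           /* [0,bytes) = send, [bytes,2*bytes) = recv */
        double *grad = (double *)seg_ptr;
        for (gaspi_size_t i = 0; i < bytes / sizeof(double); i++)
            grad[i] = 0.0;                             /* fill the send half with local gradients */

        gaspi_rank_t right = (rank + 1) % nprocs;

        /* One-sided put into the right neighbor's receive half, plus a notification
         * the neighbor can wait on; the call returns immediately so compute proceeds. */
        gaspi_write_notify(SEG_ID, 0, right,
                           SEG_ID, bytes, bytes,
                           NOTIF_ID, 1 /* value */,
                           0 /* queue */, GASPI_BLOCK);

        /* ... overlap: continue with the next layer's computation here ... */

        /* Fine-granular synchronization: wait only for the one expected notification. */
        gaspi_notification_id_t got;
        gaspi_notification_t    val;
        gaspi_notify_waitsome(SEG_ID, NOTIF_ID, 1, &got, GASPI_BLOCK);
        gaspi_notify_reset(SEG_ID, got, &val);

        gaspi_wait(0, GASPI_BLOCK);                    /* flush the local queue before reuse/exit */
        gaspi_proc_term(GASPI_BLOCK);
        return EXIT_SUCCESS;
    }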

    The AXIOM software layers

    The AXIOM project aims at developing a heterogeneous computing board (SMP-FPGA). The software layers developed in the AXIOM project are explained. OmpSs provides an easy way to execute heterogeneous codes on multiple cores. People and objects will soon share the same digital network for information exchange in what has been called the age of cyber-physical systems. The general expectation is that people and systems will interact in real time. This puts pressure on systems design to support increasing demands on computational power while keeping a low power envelope. Additionally, modular scaling and easy programmability are also important to ensure that these systems become widespread. The whole set of expectations imposes scientific and technological challenges that need to be properly addressed. The AXIOM project (Agile, eXtensible, fast I/O Module) will research new hardware/software architectures for cyber-physical systems to meet these expectations. The technical approach aims at solving fundamental problems to enable easy programmability of heterogeneous multi-core, multi-board systems. AXIOM proposes the use of the task-based OmpSs programming model, leveraging low-level communication interfaces provided by the hardware. Modular scalability will be possible thanks to a fast interconnect embedded into each module. To this aim, an innovative ARM- and FPGA-based board will be designed, with enhanced capabilities for interfacing with the physical world. Its effectiveness will be demonstrated with key scenarios such as Smart Video-Surveillance and Smart Living/Home (domotics).
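
    To give a concrete flavor of the task-based OmpSs model mentioned in the abstract, here is a minimal C sketch, assuming OmpSs-1 style directives and array-shaping syntax as accepted by the Mercurium compiler. The kernel, block size, and data are illustrative and are not AXIOM project code.

    /* Minimal sketch of OmpSs-style task parallelism (not AXIOM project code);
     * assumes OmpSs-1 directives compiled with the Mercurium compiler (mcc). */
    #include <stdio.h>

    #define N  (1 << 16)
    #define BS 1024

    /* Each call becomes a task that reads one block of a and updates one block
     * of b; the runtime builds the dependence graph from the in/inout clauses
     * and schedules the tasks on the available cores. */
    #pragma omp task in([BS]a) inout([BS]b)
    void axpy_block(const double *a, double *b, double alpha)
    {
        for (int i = 0; i < BS; i++)
            b[i] += alpha * a[i];
    }

    int main(void)
    {
        static double a[N], b[N];

        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Spawn one task per block; tasks execute asynchronously and in parallel. */
        for (int i = 0; i < N; i += BS)
            axpy_block(&a[i], &b[i], 0.5);

        /* Wait for all outstanding tasks before using the results. */
        #pragma omp taskwait

        printf("b[0] = %f\n", b[0]);
        return 0;
    }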

    Teaching Parallel Programming Using Java

    This paper presents an overview of the "Applied Parallel Computing" course taught to final-year Software Engineering undergraduate students in Spring 2014 at NUST, Pakistan. The main objective of the course was to introduce practical parallel programming tools and techniques for shared- and distributed-memory concurrent systems. A unique aspect of the course was that Java was used as the principal programming language. The course was divided into three sections. The first section covered parallel programming techniques for shared-memory systems, including multicore and Symmetric Multi-Processor (SMP) systems. In this section, Java threads were taught as a viable programming API for such systems. The second section was dedicated to parallel programming tools for distributed-memory systems, including clusters and networks of computers. We used MPJ Express, a Java MPI library, for programming assignments and lab work in this section. The third and final section covered advanced topics, including the MapReduce programming model using Hadoop and General-Purpose Computing on Graphics Processing Units (GPGPU).

    Overlapping communication and computation by using a hybrid MPI/SMPSs approach

    A previous version of this document was submitted for publication in October 2008. Communication overhead is one of the dominant factors that affect performance in high-performance computing systems. To reduce the negative impact of communication, programmers overlap communication and computation by using asynchronous communication primitives. This increases code complexity, requiring more effort to write parallel code and making the code less readable. This paper presents the hybrid use of MPI and SMPSs (SMP superscalar), a task-based shared-memory programming model, enhanced with a restart mechanism that allows the programmer to introduce the asynchrony necessary for effective communication/computation overlap in a productive way. We demonstrate the hybrid use of MPI/SMPSs with the high-performance LINPACK benchmark, which uses the lookahead technique to overlap communication and computation. MPI/SMPSs improves the performance of pure MPI with lookahead by 7.6% on a 1024-processor machine. In addition to better performance, hybrid MPI/SMPSs substantially reduces code complexity, is less sensitive to network bandwidth and operating system noise, and improves the use of main memory.
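
    The abstract's starting point is the manual overlap that asynchronous MPI primitives force onto the programmer. The hedged C sketch below shows that baseline pattern only: post nonblocking calls, compute on data that does not depend on the messages, then wait. It is the style of code the MPI/SMPSs hybrid aims to simplify; the 1-D halo exchange and sizes are assumptions, not taken from the LINPACK code used in the paper.

    /* Illustrative baseline only: manual communication/computation overlap with
     * nonblocking MPI, the pattern the paper's MPI/SMPSs hybrid aims to hide.
     * The 1-D halo exchange and sizes are assumptions for the sketch. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 4096    /* local array size (illustrative) */

    static void compute_interior(double *u)
    {
        for (int i = 2; i < N - 2; i++) u[i] *= 0.5;   /* needs no halo data */
    }

    static void compute_boundary(double *u, double left, double right)
    {
        u[1] += left;                                  /* needs the received halos */
        u[N - 2] += right;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        double *u = calloc(N, sizeof(double));
        double halo_l, halo_r;
        MPI_Request req[4];

        /* 1) Post asynchronous communication first ... */
        MPI_Irecv(&halo_l, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&halo_r, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[N - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

        /* 2) ... overlap it with work that does not need the halos ... */
        compute_interior(u);

        /* 3) ... then synchronize and finish the dependent part. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        compute_boundary(u, halo_l, halo_r);

        free(u);
        MPI_Finalize();
        return EXIT_SUCCESS;
    }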

    Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime

    A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, a large memory footprint, and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs. 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
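
    The abstract mentions node-aware parallel I/O without detailing it. The hedged C sketch below is not the CHARM++-based scheme used by NAMD; it only illustrates the general node-aware aggregation idea in plain MPI, where ranks sharing a node funnel their data to one per-node writer so far fewer processes touch the file system. The payload size and file naming are assumptions.

    /* Hedged illustration of node-aware I/O aggregation in plain MPI, not the
     * CHARM++-based scheme used by NAMD. One designated rank per node gathers
     * the node's data and performs the write. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LOCAL_COUNT 1024    /* illustrative per-rank payload */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Build one communicator per physical node (ranks that share memory). */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);

        double local[LOCAL_COUNT];
        for (int i = 0; i < LOCAL_COUNT; i++) local[i] = (double)world_rank;

        /* Aggregate onto node rank 0, the designated writer for this node. */
        double *node_buf = NULL;
        if (node_rank == 0)
            node_buf = malloc((size_t)node_size * LOCAL_COUNT * sizeof(double));

        MPI_Gather(local, LOCAL_COUNT, MPI_DOUBLE,
                   node_buf, LOCAL_COUNT, MPI_DOUBLE, 0, node_comm);

        if (node_rank == 0) {
            /* One writer per node: far fewer concurrent clients for the file
             * system. A real code could instead write to a shared file via MPI-IO. */
            char name[64];
            snprintf(name, sizeof(name), "out.node-%d.bin", world_rank);
            FILE *f = fopen(name, "wb");
            if (f) {
                fwrite(node_buf, sizeof(double), (size_t)node_size * LOCAL_COUNT, f);
                fclose(f);
            }
            free(node_buf);
        }

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }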