An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor
This paper reports the implementation and performance evaluation of the
OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the
Parallella single-board computer. The Epiphany architecture exhibits massive
many-core scalability with a physically compact 2D array of RISC CPU cores and
a fast network-on-chip (NoC). While fully capable of MPMD execution, the
physical topology and memory-mapped capabilities of the core and network
translate well to Partitioned Global Address Space (PGAS) programming models
and SPMD execution with SHMEM.
Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third Workshop on OpenSHMEM and Related Technologies
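
To make the SPMD-with-SHMEM style concrete, the following is a minimal OpenSHMEM 1.3 sketch in generic C (not Epiphany-specific, and not code from the paper; the neighbor-exchange pattern is purely illustrative):

#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();                       /* OpenSHMEM 1.2+ initialization */
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric heap allocation: every PE allocates the same object,
       so remote PEs can address it directly (the PGAS model). */
    int *dest = shmem_malloc(sizeof(int));
    *dest = -1;
    shmem_barrier_all();

    /* One-sided put: deposit my PE number into the right neighbor's
       copy of dest; the target issues no matching receive. */
    int right = (me + 1) % npes;
    shmem_int_p(dest, me, right);
    shmem_barrier_all();                /* all puts complete and visible */

    printf("PE %d of %d received %d\n", me, npes, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}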
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
A Partitioned Global Address Space (PGAS) approach treats a distributed
system as if its memory were shared at a global level. Given such a global view
of memory, the user can program applications much as they would on a
shared-memory system. This greatly simplifies the task of developing parallel
applications, because no explicit communication has to be specified in the
program for data exchange between different computing nodes. In this paper we
present DART, a runtime environment that implements the PGAS paradigm on large-scale
high-performance computing clusters. A specific feature of our implementation
is the use of one-sided communication of the Message Passing Interface (MPI)
version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated
the performance of the implementation with several low-level kernels in order
to determine overheads and limitations in comparison to the underlying MPI-3.
Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14)
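
As a rough illustration of the MPI-3 one-sided substrate the paper builds on, here is a minimal standalone sketch in C (this is not DART code; it only shows the RMA window/put pattern, using fence synchronization for brevity):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int me, npes;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    /* Expose one int per process through an RMA window; the window
       plays the role of a globally addressable PGAS segment. */
    int *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = -1;

    /* One-sided put into the right neighbor's window: no matching
       receive is required on the target, as in the PGAS model. */
    MPI_Win_fence(0, win);
    int right = (me + 1) % npes;
    MPI_Put(&me, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("rank %d of %d holds %d\n", me, npes, *base);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}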
Implementation of MPICH on top of MP_Lite
The goal of this thesis is to develop a new Channel Interface device for the MPICH implementation of the MPI (Message Passing Interface) standard using MP_Lite. MP_Lite is a lightweight message-passing library that is not a full MPI implementation but offers high performance. MPICH (Message Passing Interface CHameleon) is a full implementation of the MPI standard that uses the p4 library as the underlying communication device for TCP/IP networks. By integrating MP_Lite as a Channel Interface device in MPICH, a parallel programmer can utilize the full MPI implementation of MPICH as well as the high bandwidth offered by MP_Lite. There are several layers in the MPICH library where one can tie in a new device. The Channel Interface is the lowest layer and requires very few functions to add a new device. By attaching MP_Lite to MPICH at this lowest level, almost all of the performance of the MP_Lite library can be delivered to applications using MPICH. MP_Lite can be implemented either as a blocking or a non-blocking Channel Interface device. Performance was measured on two separate test clusters with Gigabit Ethernet connections: a PC mini-cluster with two 1.8 GHz Pentium 4 PCs and an Alpha mini-cluster with two 500 MHz Compaq DS20 workstations. Different network interface cards, including Netgear, TrendNet, and SysKonnect Gigabit Ethernet cards, were used for the measurements. Both the blocking and non-blocking MPICH-MP_Lite Channel Interface devices perform close to raw TCP, whereas a performance loss of 25-30% is seen in the MPICH-p4 Channel Interface device for larger messages. The superior performance of the MPICH-MP_Lite device over the MPICH-p4 device is most evident on the SysKonnect cards using jumbo frames. The throughput curve also improves considerably when the Eager/Rendezvous threshold is increased.
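
The Eager/Rendezvous distinction mentioned at the end works roughly as sketched below. This is a hypothetical, self-contained illustration; every name in it is invented, and the real MPICH channel device API is different and more involved:

#include <stdio.h>
#include <stddef.h>

#define EAGER_THRESHOLD (128 * 1024)   /* assumed, tunable value */

/* Stand-ins for the underlying transport (e.g. MP_Lite or p4). */
static void send_eager(const void *buf, size_t len) {
    (void)buf;
    printf("eager: %zu bytes sent along with the envelope\n", len);
}

static void send_rendezvous(const void *buf, size_t len) {
    (void)buf;
    printf("rendezvous: handshake first, then %zu bytes\n", len);
}

/* Small messages are sent eagerly (one network trip; the receiver
   buffers unexpected data). Large messages handshake first so the
   payload lands directly in the posted receive buffer, avoiding an
   extra copy. Raising the threshold keeps more messages on the
   cheaper eager path, which is why the throughput curve improves. */
static void channel_send(const void *buf, size_t len) {
    if (len <= EAGER_THRESHOLD)
        send_eager(buf, len);
    else
        send_rendezvous(buf, len);
}

int main(void) {
    static char big[1 << 20];          /* 1 MiB: above the threshold */
    char small[64] = {0};              /* 64 B: below the threshold */
    channel_send(small, sizeof small);
    channel_send(big, sizeof big);
    return 0;
}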
Performance Evaluation of Specialized Hardware for Fast Global Operations on Distributed Memory Multicomputers
Workstation cluster multicomputers are increasingly being applied to scientific problems that require massive computing power. Parallel Virtual Machine (PVM) is a popular message-passing model used to program these clusters. One of the major performance-limiting factors for cluster multicomputers is their inefficiency in performing parallel program operations involving collective communications. These operations include synchronization, global reduction, broadcast/multicast operations, and orderly access to shared global variables. Hall has demonstrated that a secondary network with a wide tree topology and centralized coordination processors (COP) could improve the performance of global operations on a variety of distributed architectures [Hall94a]. My hypothesis was that the efficiency of many PVM applications on workstation clusters could be significantly improved by utilizing a COP system for collective communication operations. To test this hypothesis, I interfaced the COP system with PVM. The interface software includes a virtual memory-mapped secondary network interface driver and a function library that allows the COP system to be used in place of PVM function calls in application programs. My implementation makes it possible to easily port any existing PVM application to perform fast global operations using the COP system. To evaluate the performance improvements of using a COP system, I measured the cost of various PVM global functions, derived the cost of equivalent COP library global functions, and compared the results. To analyze the effect of global operations on the overall execution time of applications, I instrumented a complex molecular dynamics PVM application and performed measurements. The measurements were performed for a sample cluster size of 5 and for message sizes up to 16 kilobytes. The comparison of PVM and COP system global operation performance clearly demonstrates that the COP system can speed up a variety of global operations involving small-to-medium sized messages by factors of 5-25. Analysis of the example application for a sample cluster size of 5 shows that the speedup provided by my global function libraries and the COP system reduces overall execution time for this and similar applications by more than a factor of 1.5. Additionally, the performance improvement seen by applications increases as the cluster size increases, thus providing a scalable solution for performing global operations.
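
For reference, the global operations in question map onto a handful of PVM group calls; the sketch below shows the PVM-side pattern that the COP function library replaces with secondary-network equivalents (the group name, message tag, and reduction are illustrative, not taken from the thesis):

#include <stdio.h>
#include <pvm3.h>

#define NTASKS 5   /* matches the sample cluster size measured above */

int main(void) {
    pvm_mytid();                            /* enroll this task in PVM */
    int inst = pvm_joingroup("workers");    /* dynamic group membership */

    /* Barrier: block until all NTASKS group members arrive. */
    pvm_barrier("workers", NTASKS);

    /* Global reduction: sum each task's value onto group instance 0. */
    int sum = inst;
    pvm_reduce(PvmSum, &sum, 1, PVM_INT, 42, "workers", 0);

    if (inst == 0)
        printf("global sum over %d tasks = %d\n", NTASKS, sum);

    pvm_lvgroup("workers");
    pvm_exit();
    return 0;
}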