10,050 research outputs found

    An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor

    This paper reports the implementation and performance evaluation of the OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the Parallella single-board computer. The Epiphany architecture exhibits massive many-core scalability with a physically compact 2D array of RISC CPU cores and a fast network-on-chip (NoC). While fully capable of MPMD execution, the physical topology and memory-mapped capabilities of the cores and network translate well to Partitioned Global Address Space (PGAS) programming models and SPMD execution with SHMEM.
    Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and Related Technologies
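    The SPMD/PGAS style described above can be illustrated with a minimal OpenSHMEM 1.3 program. This is a generic sketch of the programming model, not code from the paper or from the Epiphany port; the neighbour-exchange pattern is chosen purely for illustration.

        #include <shmem.h>
        #include <stdio.h>

        int main(void) {
            shmem_init();
            int me   = shmem_my_pe();
            int npes = shmem_n_pes();

            /* Global/static data is symmetric: it exists at the same
             * address on every processing element (PE). */
            static int value = -1;

            /* Each PE writes its rank directly into the memory of its
             * right-hand neighbour with a one-sided put. */
            int next = (me + 1) % npes;
            shmem_int_put(&value, &me, 1, next);

            /* The barrier also ensures completion of outstanding puts. */
            shmem_barrier_all();
            printf("PE %d received %d\n", me, value);

            shmem_finalize();
            return 0;
        }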

    Nonblocking collectives for scalable Java communications

    This is the peer reviewed version of the following article: Ramos, S., Taboada, G. L., Expósito, R. R., & Touriño, J. (2015). Nonblocking collectives for scalable Java communications. Concurrency and Computation: Practice and Experience, 27(5), 1169-1187, which has been published in final form at https://doi.org/10.1002/cpe.3279. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.
    [Abstract] This paper presents a Java implementation of the recently published MPI 3.0 nonblocking message passing collectives in order to analyze and assess the feasibility of taking advantage of these operations in shared memory systems using Java. Nonblocking collectives aim to exploit the overlapping between computation and communication for collective operations to increase the scalability of message passing codes, as has been done for nonblocking point-to-point primitives. This scalability has become crucial not only for clusters but also for shared memory systems because of the current trend of increasing the number of cores per chip, which is leading to the generalization of multi-core and many-core processors. Message passing libraries based on remote direct memory access, thread-based progression, or pure multi-threaded shared memory support could potentially benefit from the lack of imposed synchronization in nonblocking collectives. However, although the distributed memory scenario has been well studied, the shared memory one has not yet been tackled. Hence, nonblocking collectives support has been included in FastMPJ, a Message Passing in Java (MPJ) implementation, and evaluated on a representative shared memory system, obtaining significant improvements because of overlapping and the lack of implicit synchronization, with barely any overhead imposed over common blocking operations.
    Funding: Ministerio de Ciencia e Innovación, TIN2010-16735; Xunta de Galicia, CN2012/211; Xunta de Galicia, GRC2013/05
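    The overlap pattern evaluated in the paper can be sketched with the MPI 3.0 C bindings; the paper itself targets FastMPJ and the Java MPJ API, so this is an illustrative equivalent rather than the actual Java interface. The idea is to start a nonblocking collective, perform independent computation, and only then complete the operation.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            int data = (rank == 0) ? 42 : 0;
            MPI_Request req;

            /* Start the collective; it progresses in the background. */
            MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);

            /* Independent computation overlaps with the broadcast. */
            double local = 0.0;
            for (int i = 0; i < 1000000; i++)
                local += i * 0.5;

            /* Complete the collective before using its result. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            printf("rank %d: data=%d local=%.1f\n", rank, data, local);

            MPI_Finalize();
            return 0;
        }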

    DART-MPI: An MPI-based Implementation of a PGAS Runtime System

    A Partitioned Global Address Space (PGAS) approach treats a distributed system as if its memory were shared on a global level. Given such a global view on memory, the user may program applications very much like shared memory systems. This greatly simplifies the task of developing parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of one-sided communication of the Message Passing Interface (MPI) version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order to determine overheads and limitations in comparison to the underlying MPI-3.
    Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14)
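    The MPI-3 one-sided substrate that such a runtime builds on can be sketched as follows; this is a plain MPI-3 example of a put into a remote memory window, not the DART API itself.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Expose one integer per process as part of a global address space. */
            int *base;
            MPI_Win win;
            MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                             MPI_COMM_WORLD, &base, &win);
            *base = -1;

            int target = (rank + 1) % size;

            /* One-sided put: the target posts no matching receive. */
            MPI_Win_fence(0, win);
            MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
            MPI_Win_fence(0, win);   /* completes all puts in this epoch */

            printf("rank %d holds %d\n", rank, *base);

            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }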

    Model-driven load balancing of data-parallel kernels on high-performance heterogeneous platforms

    Get PDF
    Data-parallel applications are composed of several processes that apply the same computation (kernel) to different amounts of data. During their execution, these applications need to communicate partial results. Heterogeneous platforms are those where each computation resource of the system is likely different from the others, and they include accelerators. The connection between the elements is made through networks of differing performance and characteristics. These resources have to work together to execute an application or solve a problem, which is what makes this scenario complicated. Therefore, the load balancing problem of data-parallel applications on heterogeneous platforms is investigated and solved by means of non-uniform distributions of the workload among all available resources. The objective is to find a partition that minimizes the cost of computation and communication, which is not trivial; this problem has been shown to be NP-Complete. The literature has developed several heuristics to find optimal solutions, in which computation and communication performance models are used as metrics in the partitioning algorithms. The models allow us to describe the functioning of the system, while the heuristics are the approach used to find a satisfactory solution. We discuss the role of these models and, finally, to improve these heuristic approaches, we replace metrics based on communication volume with a metric based on communication times. These times are obtained analytically, through a symbolic tool that manipulates, evaluates, and represents the communication cost of a partition as an analytic expression using the τ-Lop communication performance model.
    Máster Universitario en Ingeniería Informática, Universidad de Extremadura
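    The partitioning idea can be illustrated with a small sketch: workload is distributed in proportion to device speed, and a candidate partition is then scored by a modelled communication time rather than by communication volume. All numbers and the comm_time() model below are hypothetical placeholders; the thesis derives such times analytically with the τ-Lop model.

        #include <stdio.h>

        #define NDEV 4

        /* Hypothetical link model: time to move `bytes` between devices i and j.
         * The thesis obtains such times analytically (tau-Lop); this is only a
         * placeholder latency/bandwidth estimate. */
        static double comm_time(int i, int j, double bytes) {
            double latency   = (i == j) ? 0.0  : 5e-6;   /* seconds, assumed */
            double bandwidth = (i == j) ? 40e9 : 10e9;   /* bytes/s, assumed */
            return latency + bytes / bandwidth;
        }

        int main(void) {
            /* Relative compute speeds of the heterogeneous devices (assumed). */
            double speed[NDEV] = {1.0, 2.0, 4.0, 8.0};
            long total_rows = 100000;

            /* Non-uniform partition: rows proportional to speed. */
            double sum = 0.0;
            for (int i = 0; i < NDEV; i++) sum += speed[i];

            long rows[NDEV], assigned = 0;
            for (int i = 0; i < NDEV; i++) {
                rows[i] = (long)(total_rows * speed[i] / sum);
                assigned += rows[i];
            }
            rows[NDEV - 1] += total_rows - assigned;  /* absorb rounding remainder */

            /* Score the partition by predicted halo-exchange time between
             * neighbouring devices (1024 doubles per boundary, assumed). */
            double t_comm = 0.0;
            for (int i = 0; i + 1 < NDEV; i++)
                t_comm += comm_time(i, i + 1, 1024 * sizeof(double));

            for (int i = 0; i < NDEV; i++)
                printf("device %d: %ld rows\n", i, rows[i]);
            printf("predicted halo-exchange time: %g s\n", t_comm);
            return 0;
        }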

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries, and they are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth.
    This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping.
    This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered.
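    Early binding and overlapping as described here map naturally onto MPI persistent requests; the following is a generic C sketch of that pattern (a neighbour exchange with pre-bound requests), not MPI/Pro-specific code.

        #include <mpi.h>
        #include <stdio.h>

        #define N     1024
        #define ITERS 10

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            double sendbuf[N], recvbuf[N];
            for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

            int next = (rank + 1) % size;
            int prev = (rank + size - 1) % size;

            /* Early binding: communication parameters are bound to the
             * requests once, outside the iteration loop. */
            MPI_Request reqs[2];
            MPI_Send_init(sendbuf, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Recv_init(recvbuf, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[1]);

            double local = 0.0;
            for (int it = 0; it < ITERS; it++) {
                MPI_Startall(2, reqs);        /* start the pre-bound transfers */

                /* Computation not touching the buffers overlaps with them. */
                for (int i = 0; i < N; i++)
                    local += i * 1e-3;

                MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            }

            printf("rank %d: local=%.1f recv[0]=%.1f\n", rank, local, recvbuf[0]);

            MPI_Request_free(&reqs[0]);
            MPI_Request_free(&reqs[1]);
            MPI_Finalize();
            return 0;
        }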