904 research outputs found

    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    Full text link
    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighborsearching, and we discuss the present and future challenges we see for exascale simulation - in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin

    Deterministic Consistency: A Programming Model for Shared Memory Parallelism

    Full text link
    The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code, and runtime systems can impose synthetic schedules on legacy parallel code. To parallelize existing serial code, however, we would like a programming model that is naturally deterministic without language restrictions or artificial scheduling. We propose "deterministic consistency", a parallel programming model as easy to understand as the "parallel assignment" construct in sequential languages such as Perl and JavaScript, where concurrent threads always read their inputs before writing shared outputs. DC supports common data- and task-parallel synchronization abstractions such as fork/join and barriers, as well as non-hierarchical structures such as producer/consumer pipelines and futures. A preliminary prototype suggests that software-only implementations of DC can run applications written for popular parallel environments such as OpenMP with low (<10%) overhead for some applications.Comment: 7 pages, 3 figure

    Large-Eddy Simulations of Flow and Heat Transfer in Complex Three-Dimensional Multilouvered Fins

    Get PDF
    The paper describes the computational procedure and results from large-eddy simulations in a complex three-dimensional louver geometry. The three-dimensionality in the louver geometry occurs along the height of the fin, where the angled louver transitions to the flat landing and joins with the tube surface. The transition region is characterized by a swept leading edge and decreasing flow area between louvers. Preliminary results show a high energy compact vortex jet forming in this region. The jet forms in the vicinity of the louver junction with the flat landing and is drawn under the louver in the transition region. Its interaction with the surface of the louver produces vorticity of the opposite sign, which aids in augmenting heat transfer on the louver surface. The top surface of the louver in the transition region experiences large velocities in the vicinity of the surface and exhibits higher heat transfer coefficients than the bottom surface.Air Conditioning and Refrigeration Project 9

    HDOT — An approach towards productive programming of hybrid applications

    Get PDF
    bulk synchronous parallel (BSP) communication model can hinder performance increases. This is due to the complexity to handle load imbalances, to reduce serialisation imposed by blocking communication patterns, to overlap communication with computation and, finally, to deal with increasing memory overheads. The MPI specification provides advanced features such as non-blocking calls or shared memory to mitigate some of these factors. However, applying these features efficiently usually requires significant changes on the application structure. Task parallel programming models are being developed as a means of mitigating the abovementioned issues but without requiring extensive changes on the application code. In this work, we present a methodology to develop hybrid applications based on tasks called hierarchical domain over-decomposition with tasking (HDOT). This methodology overcomes most of the issues found on MPI-only and traditional hybrid MPI+OpenMP applications. However, by emphasising the reuse of data partition schemes from process-level and applying them to task-level, it enables a natural coexistence between MPI and shared-memory programming models. The proposed methodology shows promising results in terms of programmability and performance measured on a set of applications.This work has been developed with the support of the European Union H2020 program through the INTERTWinE project (agreement number 671602); the Severo Ochoa Program awarded by the Spanish Government (SEV-2015-0493); the Generalitat de Catalunya (contract 2017-SGR-1414); and the Spanish Ministry of Science and Innovation (TIN2015-65316-P, Computaci on de Altas Prestaciones VII). The authors gratefully acknowledge Dr. Arnaud Mura, CNRS researcher at Institut PPRIME in France, for the numerical tool CREAMS. Finally, the manuscript has greatly bene ted from the precise comments of the reviewers.Peer ReviewedPostprint (author's final draft

    Methodology for malleable applications on distributed memory systems

    Get PDF
    A la portada logo BSC(English) The dominant programming approach for scientific and industrial computing on clusters is MPI+X. While there are a variety of approaches within the node, denoted by the ``X'', Message Passing interface (MPI) is the standard for programming multiple nodes with distributed memory. This thesis argues that the OmpSs-2 tasking model can be extended beyond the node to naturally support distributed memory, with three benefits: First, at small to medium scale the tasking model is a simpler and more productive alternative to MPI. It eliminates the need to distribute the data explicitly and convert all dependencies into explicit message passing. It also avoids the complexity of hybrid programming using MPI+X. Second, the ability to offload parts of the computation among the nodes enables the runtime to automatically balance the loads in a full-scale MPI+X program. This approach does not require a cost model, and it is able to transparently balance the computational loads across the whole program, on all its nodes. Third, because the runtime handles all low-level aspects of data distribution and communication, it can change the resource allocation dynamically, in a way that is transparent to the application. This thesis describes the design, development and evaluation of OmpSs-2@Cluster, a programming model and runtime system that extends the OmpSs-2 model to allow a virtually unmodified OmpSs-2 program to run across multiple distributed memory nodes. For well-balanced applications it provides similar performance to MPI+OpenMP on up to 16 nodes, and it improves performance by up to 2x for irregular and unbalanced applications like Cholesky factorization. This work also extended OmpSs-2@Cluster for interoperability with MPI and Barcelona Supercomputing Center (BSC)'s state-of-the-art Dynamic Load Balance (DLB) library in order to dynamically balance MPI+OmpSs-2 applications by transparently offloading tasks among nodes. This approach reduces the execution time of a microscale solid mechanics application by 46% on 64 nodes and on a synthetic benchmark, it is within 10% of perfect load balancing on up to 8 nodes. Finally, the runtime was extended to transparently support malleability for pure OmpSs-2@Cluster programs and interoperate with the Resources Management System (RMS). The only change to the application is to explicitly call an API function to control the addition or removal of nodes. In this regard we additionally provide the runtime with the ability to semi-transparently save and recover part of the application status to perform checkpoint and restart. Such a feature hides the complexity of data redistribution and parallel IO from the user while allowing the program to recover and continue previous executions. Our work is a starting point for future research on fault tolerance. In summary, OmpSs-2@Cluster expands the OmpSs-2 programming model to encompass distributed memory clusters. It allows an existing OmpSs-2 program, with few if any changes, to run across multiple nodes. OmpSs-2@Cluster supports transparent multi-node dynamic load balancing for MPI+OmpSs-2 programs, and enables semi-transparent malleability for OmpSs-2@Cluster programs. The runtime system has a high level of stability and performance, and it opens several avenues for future work.(Español) El modelo de programación dominante para clusters tanto en ciencia como industria es actualmente MPI+X. A pesar de que hay alguna variedad de alternativas para programar dentro de un nodo (indicado por la "X"), el estandar para programar múltiples nodos con memoria distribuida sigue siendo Message Passing Interface (MPI). Esta tesis propone la extensión del modelo de programación basado en tareas OmpSs-2 para su funcionamiento en sistemas de memoria distribuida, destacando 3 beneficios principales: En primer lugar; a pequeña y mediana escala, un modelo basado en tareas es más simple y productivo que MPI y elimina la necesidad de distribuir los datos explícitamente y convertir todas las dependencias en mensajes. Además, evita la complejidad de la programacion híbrida MPI+X. En segundo lugar; la capacidad de enviar partes del cálculo entre los nodos permite a la librería balancear la carga de trabajo en programas MPI+X a gran escala. Este enfoque no necesita un modelo de coste y permite equilibrar cargas transversalmente en todo el programa y todos los nodos. En tercer lugar; teniendo en cuenta que es la librería quien maneja todos los aspectos relacionados con distribución y transferencia de datos, es posible la modificación dinámica y transparente de los recursos que utiliza la aplicación. Esta tesis describe el diseño, desarrollo y evaluación de OmpSs-2@Cluster; un modelo de programación y librería que extiende OmpSs-2 permitiendo la ejecución de programas OmpSs-2 existentes en múltiples nodos sin prácticamente necesidad de modificarlos. Para aplicaciones balanceadas, este modelo proporciona un rendimiento similar a MPI+OpenMP hasta 16 nodos y duplica el rendimiento en aplicaciones irregulares o desbalanceadas como la factorización de Cholesky. Este trabajo incluye la extensión de OmpSs-2@Cluster para interactuar con MPI y la librería de balanceo de carga Dynamic Load Balancing (DLB) desarrollada en el Barcelona Supercomputing Center (BSC). De este modo es posible equilibrar aplicaciones MPI+OmpSs-2 mediante la transferencia transparente de tareas entre nodos. Este enfoque reduce el tiempo de ejecución de una aplicación de mecánica de sólidos a micro-escala en un 46% en 64 nodos; en algunos experimentos hasta 8 nodos se pudo equilibrar perfectamente la carga con una diferencia inferior al 10% del equilibrio perfecto. Finalmente, se implementó otra extensión de la librería para realizar operaciones de maleabilidad en programas OmpSs-2@Cluster e interactuar con el Sistema de Manejo de Recursos (RMS). El único cambio requerido en la aplicación es la llamada explicita a una función de la interfaz que controla la adición o eliminación de nodos. Además, se agregó la funcionalidad de guardar y recuperar parte del estado de la aplicación de forma semitransparente con el objetivo de realizar operaciones de salva-reinicio. Dicha funcionalidad oculta al usuario la complejidad de la redistribución de datos y las operaciones de lectura-escritura en paralelo, mientras permite al programa recuperar y continuar ejecuciones previas. Este es un punto de partida para futuras investigaciones en tolerancia a fallos. En resumen, OmpSs-2@Cluster amplía el modelo de programación de OmpSs-2 para abarcar sistemas de memoria distribuida. El modelo permite la ejecución de programas OmpSs-2 en múltiples nodos prácticamente sin necesidad de modificarlos. OmpSs-2@Cluster permite además el balanceo dinámico de carga en aplicaciones híbridas MPI+OmpSs-2 ejecutadas en varios nodos y es capaz de realizar maleabilidad semi-transparente en programas OmpSs-2@Cluster puros. La librería tiene un niveles de rendimiento y estabilidad altos y abre varios caminos para trabajos futuro.Arquitectura de computador

    Extreme scale parallel NBody algorithm with event driven constraint based execution model

    Get PDF
    Traditional scientific applications such as Computational Fluid Dynamics, Partial Differential Equations based numerical methods (like Finite Difference Methods, Finite Element Methods) achieve sufficient efficiency on state of the art high performance computing systems and have been widely studied / implemented using conventional programming models. For emerging application domains such as Graph applications scalability and efficiency is significantly constrained by the conventional systems and their supporting programming models. Furthermore technology trends like multicore, manycore, heterogeneous system architectures are introducing new challenges and possibilities. Emerging technologies are requiring a rethinking of approaches to more effectively expose the underlying parallelism to the applications and the end-users. This thesis explores the space of effective parallel execution of ephemeral graphs that are dynamically generated. The standard particle based simulation, solved using the Barnes-Hut algorithm is chosen to exemplify the dynamic workloads. In this thesis the workloads are expressed using sequential execution semantics, a conventional parallel programming model - shared memory semantics and semantics of an innovative execution model designed for efficient scalable performance towards Exascale computing called ParalleX. The main outcomes of this research are parallel processing of dynamic ephemeral workloads, enabling dynamic load balancing during runtime, and using advanced semantics for exposing parallelism in scaling constrained applications

    A review of parallel computing for large-scale remote sensing image mosaicking

    Get PDF
    Interest in image mosaicking has been spurred by a wide variety of research and management needs. However, for large-scale applications, remote sensing image mosaicking usually requires significant computational capabilities. Several studies have attempted to apply parallel computing to improve image mosaicking algorithms and to speed up calculation process. The state of the art of this field has not yet been summarized, which is, however, essential for a better understanding and for further research of image mosaicking parallelism on a large scale. This paper provides a perspective on the current state of image mosaicking parallelization for large scale applications. We firstly introduce the motivation of image mosaicking parallel for large scale application, and analyze the difficulty and problem of parallel image mosaicking at large scale such as scheduling with huge number of dependent tasks, programming with multiple-step procedure, dealing with frequent I/O operation. Then we summarize the existing studies of parallel computing in image mosaicking for large scale applications with respect to problem decomposition and parallel strategy, parallel architecture, task schedule strategy and implementation of image mosaicking parallelization. Finally, the key problems and future potential research directions for image mosaicking are addressed

    Dataflow Programming Paradigms for Computational Chemistry Methods

    Get PDF
    The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods, and compares different dataflow executions in terms of programmability, resource utilization, and scalability. This effort is driven by computational chemistry applications, considering that they comprise one of the driving forces of HPC. In particular, many-body methods, such as Coupled Cluster methods (CC), which are the gold standard to compute energies in quantum chemistry, are of particular interest for the applied chemistry community. On that account, the latest development for CC methods is used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains. Two programming paradigms for expressing CC methods into a dataflow form, in order to make them capable of utilizing task scheduling systems, are presented. Explicit dataflow, is the programming model where the dataflow is explicitly specified by the developer, is contrasted with implicit dataflow, where a task scheduling runtime derives the dataflow. An abstract model is derived to explore the limits of the different dataflow programming paradigms
    • …
    corecore