125 research outputs found

    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    Full text link
    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighborsearching, and we discuss the present and future challenges we see for exascale simulation - in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin

    Implementation and scaling of the fully coupled Terrestrial Systems Modeling Platform (TerrSysMP) in a massively parallel supercomputing environment – a case study on JUQUEEN (IBM Blue Gene/Q)

    Get PDF
    Continental-scale hyper-resolution simulations constitute a grand challenge in characterizing non-linear feedbacks of states and fluxes of the coupled water, energy, and biogeochemical cycles of terrestrial systems. Tackling this challenge requires advanced coupling and supercomputing technologies for earth system models that are discussed in this study, utilizing the example of the implementation of the newly developed Terrestrial Systems Modeling Platform (TerrSysMP) on JUQUEEN (IBM Blue Gene/Q) of the Jülich Supercomputing Centre, Germany. The applied coupling strategies rely on the Multiple Program Multiple Data (MPMD) paradigm and require memory and load balancing considerations in the exchange of the coupling fields between different component models and allocation of computational resources, respectively. These considerations can be reached with advanced profiling and tracing tools leading to the efficient use of massively parallel computing environments, which is then mainly determined by the parallel performance of individual component models. However, the problem of model I/O and initialization in the peta-scale range requires major attention, because this constitutes a true big data challenge in the perspective of future exa-scale capabilities, which is unsolved

    Discovery of Potential Parallelism in Sequential Programs

    Get PDF
    In the era of multicore processors, the responsibility for performance gains has been shifted onto software developers. Once improvements of the sequential algorithm have been exhausted, software-managed parallelism is the only option left. However, writing parallel code is still difficult, especially when parallelizing sequential code written by someone else. A key task in this process is the identification of suitable parallelization targets in the source code. Parallelism discovery tools help developers to find such targets automatically. Unfortunately, tools that identify parallelism during compilation are usually conservative due to the lack of runtime information, and tools relying on runtime information primarily suffer from high overhead in terms of both time and memory. This dissertation presents a generic framework for parallelism discovery based on dynamic program analysis, supporting various types of parallelism while incurring practically affordable overhead. The framework contains two main components: an efficient data-dependence profiler and a set of parallelism discovery algorithms based on a language-independent concept called Computational Unit. The data-dependence profiler serves as the foundation of the parallelism discovery framework. Traditional dependence profiling approaches introduce a tremendous amount of time and memory overhead. To lower the overhead, current methods limit their scope to the subset of the dependence information needed for the analysis they have been created for, sacrificing generality and discouraging reuse. In contrast, the profiler shown in this thesis addresses the problem via signature-based memory management and a lock-free parallel design. It produces detailed dependences not only for sequential but also for multi-threaded code without causing prohibitive overhead, allowing it to serve as a generic base for various program analysis techniques. Computational Units (CUs) provide a language-independent foundation for parallelism discovery. CUs are computations that follow the read-compute-write pattern. Unlike other concepts, they are not restricted to predefined language constructs. A program is represented as a CU graph, in which vertexes are CUs and edges are data dependences. This allows parallelism to be detected that spreads across multiple language constructs, taking code refactoring into consideration. The parallelism discovery algorithms cover both loop and task parallelism. Results of our experiments show that 1) the efficient data-dependence profiler has a very competitive average slowdown of around 80× with accuracy higher than 99.6%; 2) the framework discovers parallelism with high accuracy, identifying 92.5% of the parallel loops in NAS benchmarks; 3) when parallelizing well-known open-source software following the outputs of the framework, reasonable speedups are obtained. Finally, use cases beyond parallelism discovery are briefly demonstrated to show the generality of the framework

    Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs

    Get PDF
    Hybrid parallel programming models that combine message passing (MP) and shared- memory multithreading (MT) are becoming more popular, especially with applications requiring higher degrees of parallelism and scalability. Consequently, coupled parallel programs, those built via the integration of independently developed and optimized software libraries linked into a single application, increasingly comprise message-passing libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult, and the challenge is exacerbated because contemporary middleware services provide only static scheduling policies over entire program executions, necessitating suboptimal, over-subscribed or under-subscribed, configurations. In coupled applications, a poorly configured component can lead to overall poor application performance, suboptimal resource utilization, and increased time-to-solution. So it is critical that each library executes in a manner consistent with its design and tuning for a particular system architecture and workload. Therefore, there is a need for techniques that address dynamic, conflicting configurations in coupled multithreaded message-passing (MT-MP) programs. Our thesis is that we can achieve significant performance improvements over static under-subscribed approaches through reconfigurable execution environments that consider compute phase parallelization strategies along with both hardware and software characteristics. In this work, we present new ways to structure, execute, and analyze coupled MT- MP programs. Our study begins with an examination of contemporary approaches used to accommodate thread-level heterogeneity in coupled MT-MP programs. Here we identify potential inefficiencies in how these programs are structured and executed in the high-performance computing domain. We then present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application’s execution by providing programmable facilities with modest overheads to dynamically reconfigure runtime environments for compute phases with differing threading factors and affinities. Our performance results show that for a majority of the tested scientific workloads our approach and corresponding open-source reference implementation render speedups greater than 50 % over the static under-subscribed baseline. Motivated by our examination of reconfigurable execution environments and their memory overhead, we also study the memory attribution problem: the inability to predict or evaluate during runtime where the available memory is used across the software stack comprising the application, reusable software libraries, and supporting runtime infrastructure. Specifically, dynamic adaptation requires runtime intervention, which by its nature introduces additional runtime and memory overhead. To better understand the latter, we propose and evaluate a new way to quantify component-level memory usage from unmodified binaries dynamically linked to a message-passing communication library. Our experimental results show that our approach and corresponding implementation accurately measure memory resource usage as a function of time, scale, communication workload, and software or hardware system architecture, clearly distinguishing between application and communication library usage at a per-process level

    Type Oriented Parallel Programming

    Get PDF
    Context: Parallel computing is an important field within the sciences. With the emergence of multi, and soon many, core CPUs this is moving more and more into the domain of general computing. HPC programmers want performance, but at the moment this comes at a cost; parallel languages are either efficient or conceptually simple, but not both. Aim: To develop and evaluate a novel programming paradigm which will address the problem of parallel programming and allow for languages which are both conceptually simple and efficient. Method: A type-based approach, which allows the programmer to control all aspects of parallelism by the use and combination of types has been developed. As a vehicle to present and analyze this new paradigm a parallel language, Mesham, and associated compilation tools have also been created. By using types to express parallelism the programmer can exercise efficient, flexible control in a high level abstract model yet with a sufficiently rich amount of information in the source code upon which the compiler can perform static analysis and optimization. Results: A number of case studies have been implemented in Mesham. Official benchmarks have been performed which demonstrate the paradigm allows one to write code which is comparable, in terms of performance, with existing high performance solutions. Sections of the parallel simulation package, Gadget-2, have been ported into Mesham, where substantial code simplifications have been made. Conclusions: The results obtained indicate that the type-based approach does satisfy the aim of the research described in this thesis. By using this new paradigm the programmer has been able to write parallel code which is both simple and efficient
    • …
    corecore