14,477 research outputs found

    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    Full text link
    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighborsearching, and we discuss the present and future challenges we see for exascale simulation - in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin

    Workload Equity in Vehicle Routing Problems: A Survey and Analysis

    Full text link
    Over the past two decades, equity aspects have been considered in a growing number of models and methods for vehicle routing problems (VRPs). Equity concerns most often relate to fairly allocating workloads and to balancing the utilization of resources, and many practical applications have been reported in the literature. However, there has been only limited discussion about how workload equity should be modeled in VRPs, and various measures for optimizing such objectives have been proposed and implemented without a critical evaluation of their respective merits and consequences. This article addresses this gap with an analysis of classical and alternative equity functions for biobjective VRP models. In our survey, we review and categorize the existing literature on equitable VRPs. In the analysis, we identify a set of axiomatic properties that an ideal equity measure should satisfy, collect six common measures, and point out important connections between their properties and those of the resulting Pareto-optimal solutions. To gauge the extent of these implications, we also conduct a numerical study on small biobjective VRP instances solvable to optimality. Our study reveals two undesirable consequences when optimizing equity with nonmonotonic functions: Pareto-optimal solutions can consist of non-TSP-optimal tours, and even if all tours are TSP optimal, Pareto-optimal solutions can be workload inconsistent, i.e. composed of tours whose workloads are all equal to or longer than those of other Pareto-optimal solutions. We show that the extent of these phenomena should not be underestimated. The results of our biobjective analysis are valid also for weighted sum, constraint-based, or single-objective models. Based on this analysis, we conclude that monotonic equity functions are more appropriate for certain types of VRP models, and suggest promising avenues for further research.Comment: Accepted Manuscrip

    Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation

    Full text link
    The increasing number of processing elements and decreas- ing memory to core ratio in modern high-performance platforms makes efficient strong scaling a key requirement for numerical algorithms. In order to achieve efficient scalability on massively parallel systems scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of partial differential equations. Using large matrices generated by Fluidity, an open source CFD application code which uses PETSc as its linear solver engine, we evaluate the effect of explicit communication overlap using task-based parallelism and show how to further improve performance by explicitly load balancing threads within MPI processes. We demonstrate a significant speedup over the pure-MPI mode and efficient strong scaling of sparse matrix-vector multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems

    Scalable Layer-2/Layer-3 Multistage Switching Architectures for Software Routers

    Get PDF
    Software routers are becoming an important alternative to proprietary and expensive network devices, because they exploit the economy of scale of the PC market and open-source software. When considering maximum performance in terms of throughput, PC-based routers suffer from limitations stemming from the single PC architecture, e.g., limited bus bandwidth, and high memory access latency. To overcome these limitations, in this paper we present a multistage architecture that combines a layer-2 load-balancer front-end and a layer-3 routing back-end, interconnected by standard Ethernet switches. Both the front-end and the back-end are implemented using standard PCs and open- source software. After describing the architecture, evaluation is performed on a lab test-bed, to show its scalability. While the proposed solution allows to increase performance of PC- based routers, it also allows to distribute packet manipulation functionalities, and to automatically recover from component failures

    Multistage Switching Architectures for Software Routers

    Get PDF
    Software routers based on personal computer (PC) architectures are becoming an important alternative to proprietary and expensive network devices. However, software routers suffer from many limitations of the PC architecture, including, among others, limited bus and central processing unit (CPU) bandwidth, high memory access latency, limited scalability in terms of number of network interface cards, and lack of resilience mechanisms. Multistage PC-based architectures can be an interesting alternative since they permit us to i) increase the performance of single software routers, ii) scale router size, iii) distribute packet manipulation and control functionality, iv) recover from single-component failures, and v) incrementally upgrade router performance. We propose a specific multistage architecture, exploiting PC-based routers as switching elements, to build a high-speed, largesize,scalable, and reliable software router. A small-scale prototype of the multistage router is currently up and running in our labs, and performance evaluation is under wa

    Automated problem scheduling and reduction of synchronization delay effects

    Get PDF
    It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable preformance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because: (1) they provide a useful model problem for use in exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acrylic graphs, and (2) because such efficient methods can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs

    PPF - A Parallel Particle Filtering Library

    Full text link
    We present the parallel particle filtering (PPF) software library, which enables hybrid shared-memory/distributed-memory parallelization of particle filtering (PF) algorithms combining the Message Passing Interface (MPI) with multithreading for multi-level parallelism. The library is implemented in Java and relies on OpenMPI's Java bindings for inter-process communication. It includes dynamic load balancing, multi-thread balancing, and several algorithmic improvements for PF, such as input-space domain decomposition. The PPF library hides the difficulties of efficient parallel programming of PF algorithms and provides application developers with the necessary tools for parallel implementation of PF methods. We demonstrate the capabilities of the PPF library using two distributed PF algorithms in two scenarios with different numbers of particles. The PPF library runs a 38 million particle problem, corresponding to more than 1.86 GB of particle data, on 192 cores with 67% parallel efficiency. To the best of our knowledge, the PPF library is the first open-source software that offers a parallel framework for PF applications.Comment: 8 pages, 8 figures; will appear in the proceedings of the IET Data Fusion & Target Tracking Conference 201
    corecore