131,170 research outputs found

    Iso-energy-efficiency: An approach to power-constrained parallel computation

    Get PDF
    Future large scale high performance supercomputer systems require high energy efficiency to achieve exaflops computational power and beyond. Despite the need to understand energy efficiency in high-performance systems, there are few techniques to evaluate energy efficiency at scale. In this paper, we propose a system-level iso-energy-efficiency model to analyze, evaluate and predict energy-performance of data intensive parallel applications with various execution patterns running on large scale power-aware clusters. Our analytical model can help users explore the effects of machine and application dependent characteristics on system energy efficiency and isolate efficient ways to scale system parameters (e.g. processor count, CPU power/frequency, workload size and network bandwidth) to balance energy use and performance. We derive our iso-energy-efficiency model and apply it to the NAS Parallel Benchmarks on two power-aware clusters. Our results indicate that the model accurately predicts total system energy consumption within 5% error on average for parallel applications with various execution and communication patterns. We demonstrate effective use of the model for various application contexts and in scalability decision-making

    Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study

    Full text link
    Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") destroy the assumptions of regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long delays that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. We analyze the dependence of the propagation speed of idle waves emanating from injected delays with respect to the execution and communication properties of the application, study how such delays decay under increased noise levels, and how they interact with each other. We also show how fine-grained noise can make a system immune against the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications.Comment: 10 pages, 9 figures; title change

    WMTrace : a lightweight memory allocation tracker and analysis framework

    Get PDF
    The diverging gap between processor and memory performance has been a well discussed aspect of computer architecture literature for some years. The use of multi-core processor designs has, however, brought new problems to the design of memory architectures - increased core density without matched improvement in memory capacity is reduc- ing the available memory per parallel process. Multiple cores accessing memory simultaneously degrades performance as a result of resource con- tention for memory channels and physical DIMMs. These issues combine to ensure that memory remains an on-going challenge in the design of parallel algorithms which scale. In this paper we present WMTrace, a lightweight tool to trace and analyse memory allocation events in parallel applications. This tool is able to dynamically link to pre-existing application binaries requiring no source code modification or recompilation. A post-execution analysis stage enables in-depth analysis of traces to be performed allowing memory allocations to be analysed by time, size or function. The second half of this paper features a case study in which we apply WMTrace to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as memory use per-function over time. An in-depth analysis is provided for an unstructured mesh benchmark which reveals significant memory allocation imbalance across its participating processes

    RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors

    Get PDF
    Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance

    Garbage collection auto-tuning for Java MapReduce on Multi-Cores

    Get PDF
    MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests

    FT-GReLoSSS: a Skeletal-Based Approach towards Application Parallelization and Low-Overhead Fault Tolerance

    Get PDF
    International audienceFT-GReLoSSS (FTG) is a C++/MPI framework to ease the development of fault-tolerant parallel applications belonging to a SPMD family termed GReLoSSS. The originality of FTG is to rely on the MoLOToF programming model principles to facilitate the addition of an efficient checkpoint-based fault tolerance at the application level. Main features of MoLOToF encompass a structured application development based on fault tolerant "skeletons" and lay emphasis on collaborations. The latter exist between the programmer, the framework and the underlying runtime middleware/environment. Together with the structured approach they contribute into achieving reduced checkpoint sizes, as well as reduced checkpoint and recovery overhead at runtime. This paper introduces the main principles of MoLOToF and the design of the FTG framework. To properly assess the framework's ease of use for a programmer as well as fault tolerance efficiency, a series of benchmarks were conducted up to 128 nodes on a multicore PC cluster. These benchmarks involved an existing parallel financial application for gas storage valuation, originally developed in collaboration with EDF company, and a rewritten version which made use of the FTG framework and its features. Experiments results display low-overhead compared to existing system-level counterparts

    STR2RTS: Refactored StreamIT Benchmarks into Statically Analyzable Parallel Benchmarks for WCET Estimation & Real-Time Scheduling

    Get PDF
    International audienceWe all had quite a time to find non-proprietary architecture-independent exploitable parallel benchmarks for Worst-Case Execution Time (WCET) estimation and real-time scheduling. However , there is no consensus on a parallel benchmark suite, when compared to the single-core era and the Mälardalen benchmark suite [12]. This document bridges part of this gap, by presenting a collection of benchmarks with the following good properties: (i) easily analyzable by static WCET estimation tools (written in structured C language, in particular neither goto nor dynamic memory allocation, containing flow information such as loop bounds); (ii) independent from any particular run-time system (MPI, OpenMP) or real-time operating system. Each benchmark is composed of the C source code of its tasks, and an XML description describing the structure of the application (tasks and amount of data exchanged between them when applicable). Each benchmark can be integrated in a full end-to-end empirical method validation protocol on multi-core architecture. This proposed collection of benchmarks is derived from the well known StreamIT [21] benchmark suite and will be integrated in the TACleBench suite [11] in a near future. All these benchmarks are available at https://gitlab.inria.fr/brouxel/STR2RTS
    corecore