238 research outputs found

    Exploring performance and power properties of modern multicore chips via simple machine models

    Full text link
    Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the well-known roofline model, since it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multicore chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different requirements to the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multicore processors, and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice-Boltzmann flow solver code.Comment: 23 pages, 10 figures. Typos corrected, DOI adde

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM

    Stack-less SIMT reconvergence at low cost

    Get PDF
    Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose a technique to handle SIMT control divergence that operates in constant space and handles indirect jumps and recursion. We describe a possible implementation which leverage the existing memory divergence management unit, ensuring a low hardware cost. In terms of performance, this solution is at least as efficient as existing techniques

    Parallel HEVC Decoding on Multi- and Many-core Architectures : A Power and Performance Analysis

    Get PDF
    The Joint Collaborative Team on Video Decoding is developing a new standard named High Efficiency Video Coding (HEVC) that aims at reducing the bitrate of H.264/AVC by another 50 %. In order to fulfill the computational demands of the new standard, in particular for high resolutions and at low power budgets, exploiting parallelism is no longer an option but a requirement. Therefore, HEVC includes several coding tools that allows to divide each picture into several partitions that can be processed in parallel, without degrading the quality nor the bitrate. In this paper we adapt one of these approaches, the Wavefront Parallel Processing (WPP) coding, and show how it can be implemented on multi- and many-core processors. Our approach, named Overlapped Wavefront (OWF), processes several partitions as well as several pictures in parallel. This has the advantage that the amount of (thread-level) parallelism stays constant during execution. In addition, performance and power results are provided for three platforms: a server Intel CPU with 8 cores, a laptop Intel CPU with 4 cores, and a TILE-Gx36 with 36 cores from Tilera. The results show that our parallel HEVC decoder is capable of achieving an average frame rate of 116 fps for 4k resolution on a standard multicore CPU. The results also demonstrate that exploiting more parallelism by increasing the number of cores can improve the energy efficiency measured in terms of Joules per frame substantially

    Temperature Regulation in Multicore Processors Using Adjustable-Gain Integral Controllers

    Full text link
    This paper considers the problem of temperature regulation in multicore processors by dynamic voltage-frequency scaling. We propose a feedback law that is based on an integral controller with adjustable gain, designed for fast tracking convergence in the face of model uncertainties, time-varying plants, and tight computing-timing constraints. Moreover, unlike prior works we consider a nonlinear, time-varying plant model that trades off precision for simple and efficient on-line computations. Cycle-level, full system simulator implementation and evaluation illustrates fast and accurate tracking of given temperature reference values, and compares favorably with fixed-gain controllers.Comment: 8 pages, 6 figures, IEEE Conference on Control Applications 2015, Accepted Versio

    Measuring the Energy Consumption of Software written in C on x86-64 Processors

    Get PDF
    In 2016 German data centers consumed 12.4 terawatt-hours of electrical energy, which accounts for about 2% of Germany’s total energy consumption in that year. In 2020 this rose to 16 terawatt-hours or 2.9% of Germany’s total energy consumption in that year. The ever-increasing energy consumption of computers consequently leads to considerations to reduce it to save energy, money and to protect the environment. This thesis aims to answer fundamental questions about the energy consumption of software, e. g. how and how precise can a measurement be taken or if CPU load and energy consumption are correlated. An overview of measurement methods and the related software tooling was created. The most promising approach using software called 'Scaphandre' was chosen as the main basis and further developed. Different sorting algorithms were benchmarked to study their behavior regarding energy consumption. The resulting dataset was also used to answer the fundamental questions stated in the beginning. A replication and reproduction package was provided to enable the reproducibility of the results.Im Jahr 2016 verbrauchten deutsche Rechenzentren 12,4 Terawattstunden elektrische Energie, was etwa 2 % des gesamten Energieverbrauchs in Deutschland in diesem Jahr ausmacht. Im Jahr 2020 stieg dieser Wert auf 16 Terawattstunden bzw. 2,9 % des Gesamtenergieverbrauchs in Deutschland. Der stetig steigende Energieverbrauch von Computern fĂŒhrt folglich zu Überlegungen, diesen zu reduzieren, um Energie und Geld zu sparen und die Umwelt zu schĂŒtzen. Ziel dieser Arbeit ist es, grundlegende Fragen zum Energieverbrauch von Software zu beantworten, z. B. wie und mit welcher Genauigkeit gemessen werden kann oder ob CPU-Last und Energieverbrauch korrelieren. Es wurde eine Übersicht ĂŒber Messmethoden und die dazugehörigen Softwaretools erstellt. Der vielversprechendste Ansatz mit der Software 'Scaphandre' wurde als Hauptgrundlage ausgewĂ€hlt und weiterentwickelt. Verschiedene Sortieralgorithmen wurden einem Benchmarking unterzogen, um ihr Verhalten hinsichtlich des Energieverbrauchs zu untersuchen. Der resultierende Datensatz wurde auch zur Beantwortung der eingangs gestellten grundlegenden Fragen verwendet. Ein Replikations- und Reproduktionspaket wurde bereitgestellt, um die Reproduzierbarkeit der Ergebnisse zu ermöglichen
    • 

    corecore