Search CORE

3,413 research outputs found

Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time

Author: Benjamin Klenk
Holger Fröning
Lena Oden
Publication venue
Publication date: 11/04/2020
Field of study

Abstract-Accelerated computing has become pervasive for increasing the computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with highest demands, for instance high performance computing, data warehousing and high performance analytics, accelerators like GPUs or Intel's MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement will be the main contributor to energy consumption, we are entering an era of communication-centric heterogeneous systems that are operating with hard power constraints. In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs. Thread-collaborative processors like GPUs differ significantly in their execution model from generalpurpose processors like CPUs, but available communication models are still designed and optimized for CPUs. Similar to heterogeneity in processing, heterogeneity in communication can have a huge impact on energy and time. To analyze this impact, we use multiple workloads with distinct properties regarding computational intensity and communication characteristics. We show for which workloads tailored communication models are essential, not only reducing execution time but also saving energy. Exposing the impact in terms of energy and time for communication-centric heterogeneous systems is crucial for future optimizations, and this work is a first step in this direction

CiteSeerX

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Author: Azad Ariful
Ballard Grey
Buluc Aydin
Demmel James
Grigori Laura
Schwartz Oded
Toledo Sivan
Williams Samuel
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2016
Field of study

Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

eScholarship - University of California

Hal-Diderot

MULTI-SCALE SCHEDULING TECHNIQUES FOR SIGNAL PROCESSING SYSTEMS

Author: Zhou Zheng
Publication venue
Publication date: 01/01/2013
Field of study

A variety of hardware platforms for signal processing has emerged, from distributed systems such as Wireless Sensor Networks (WSNs) to parallel systems such as Multicore Programmable Digital Signal Processors (PDSPs), Multicore General Purpose Processors (GPPs), and Graphics Processing Units (GPUs) to heterogeneous combinations of parallel and distributed devices. When a signal processing application is implemented on one of those platforms, the performance critically depends on the scheduling techniques, which in general allocate computation and communication resources for competing processing tasks in the application to optimize performance metrics such as power consumption, throughput, latency, and accuracy. Signal processing systems implemented on such platforms typically involve multiple levels of processing and communication hierarchy, such as network-level, chip-level, and processor-level in a structural context, and application-level, subsystem-level, component-level, and operation- or instruction-level in a behavioral context. In this thesis, we target scheduling issues that carefully address and integrate scheduling considerations at different levels of these structural and behavioral hierarchies. The core contributions of the thesis include the following. Considering both the network-level and chip-level, we have proposed an adaptive scheduling algorithm for wireless sensor networks (WSNs) designed for event detection. Our algorithm exploits discrepancies among the detection accuracy of individual sensors, which are derived from a collaborative training process, to allow each sensor to operate in a more energy efficient manner while the network satisfies given constraints on overall detection accuracy. Considering the chip-level and processor-level, we incorporated both temperature and process variations to develop new scheduling methods for throughput maximization on multicore processors. In particular, we studied how to process a large number of threads with high speed and without violating a given maximum temperature constraint. We targeted our methods to multicore processors in which the cores may operate at different frequencies and different levels of leakage. We develop speed selection and thread assignment schedulers based on the notion of a core's steady state temperature. Considering the application-level, component-level and operation-level, we developed a new dataflow based design flow within the targeted dataflow interchange format (TDIF) design tool. Our new multiprocessor system-on-chip (MPSoC)-oriented design flow, called TDIF-PPG, is geared towards analysis and mapping of embedded DSP applications on MPSoCs. An important feature of TDIF-PPG is its capability to integrate graph level parallelism and actor level parallelism into the application mapping process. Here, graph level parallelism is exposed by the dataflow graph application representation in TDIF, and actor level parallelism is modeled by a novel model for multiprocessor dataflow graph implementation that we call the Parallel Processing Group (PPG) model. Building on the contribution above, we formulated a new type of parallel task scheduling problem called Parallel Actor Scheduling (PAS) for chip-level MPSoC mapping of DSP systems that are represented as synchronous dataflow (SDF) graphs. In contrast to traditional SDF-based scheduling techniques, which focus on exploiting graph level (inter-actor) parallelism, the PAS problem targets the integrated exploitation of both intra- and inter-actor parallelism for platforms in which individual actors can be parallelized across multiple processing units. We address a special case of the PAS problem in which all of the actors in the DSP application or subsystem being optimized can be parallelized. For this special case, we develop and experimentally evaluate a two-phase scheduling framework with three work flows --- particle swarm optimization with a mixed integer programming formulation, particle swarm optimization with a simulated annealing engine, and particle swarm optimization with a fast heuristic based on list scheduling. Then, we extend our scheduling framework to support general PAS problem which considers the actors cannot be parallelized

Proceedings of the First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014): Porto, Portugal

Author: Barbosa Jorge
Carretero Pérez Jesús
García Blas Francisco Javier
Morla Ricardo
Publication venue
Publication date: 01/01/2014
Field of study

Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014). Porto (Portugal), August 27-28, 2014

Universidad Carlos III de Madrid e-Archivo

The Hipeac Vision, 2010

Author: Cohen Albert
De Bosschere Koen
De Sutter Bjorn
Duranton Marc
Falsafi Babak
Gaydadjiev Georgi
Katevenis Manolis
Maebe Jonas
Munk Harm
Navarro Nacho
Ramirez Alex
Temam Olivier
Valero Matero
Yehia Sami
Publication venue: HiPEAC
Publication date: 01/01/2010
Field of study

Archivsystem Ask23

Warp-Aware Adaptive Energy Efficiency Calibration for Multi-GPU Systems

Author: Cheng Lianglun
Song Xiaoyu
Wan Hai
Wang Tao
Wang Zhuowei
Zhao Wuqing
Publication venue: PDXScholar
Publication date: 01/08/2022
Field of study

Massive GPU acceleration processors have been used in high-performance computing systems. The Dennard-scaling has led to power and thermal constraints limiting the performance of such systems. The demand for both increased performance and energy-efficiency is highly desired. This paper presents a multi-layer low-power optimisation method for warps and tasks parallelisms. We present a dynamic frequency regulation scheme for performance parameters in terms of load balance and load imbalance. The method monitors the energy parameters in runtime and adjusts adaptively the voltage level to ensure the performance efficiency with energy reduction. The experimental results show that the multi-layer low-power optimisation with dynamic frequency regulation can achieve 40% energy consumption reduction with only 1.6% performance degradation, thus reducing 59% maximum energy consumption. It can further save about 30% energy consumption in comparison with the single-layer energy optimisation

PDXScholar (Portland State University)

Recommended from our members

Preparing sparse solvers for exascale computing.

Author: Anzt Hartwig
Boman Erik
Curfman McInnes Lois
Falgout Rob
Ghysels Pieter
Heroux Michael
Li Xiaoye
Meier Yang Ulrike
Rajamanickam Sivasankaran
Rupp Karl
Smith Barry
Tran Mills Richard
Yamazaki Ichitaro
Publication venue: eScholarship, University of California
Publication date: 01/03/2020
Field of study

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'

eScholarship - University of California

Energy-Aware System-Level Design of Cyber-Physical Systems

Author: Esparza Isasa José Antonio
Publication venue: 'Aarhus University Library'
Publication date: 12/08/2015
Field of study

Cyber-Physical Systems (CPSs) are heterogeneous systems in which one or several computational cores interact with the physical environment. This interaction is typically performed through electromechanical elements such as sensors and actuators. Many CPSs operate as part of a network and some of them present a constrained energy budget (for example, they are battery powered). Examples of energy constrained CPSs could be a mobile robot, the nodes that compose a Body Area Network or a pacemaker. The heterogeneity present in the composition of CPSs together with the constrained energy availability makes these systems challenging to design. A way to tackle both complexity and costs is the application of abstract modelling and simulation. This thesis proposed the application of modelling at the system level, taking energy consumption in the different kinds of subsystems into consideration. By adopting this cross disciplinary approach to energy consumption it is possible to decrease it effectively. The results of this thesis are a number of modelling guidelines and tool improvements to support this kind of holistic analysis, covering energy consumption in electromechanical, computation and communication subsystems. From a methodological point of view these have been framed within a V-lifecycle. Finally, this approach has been demonstrated on two case studies from the medical domain enabling the exploration of alternative systems architectures and producing energy consumption estimates to conduct trade-off analysis