DRLCap: Runtime GPU Frequency Capping with Deep Reinforcement Learning
Power and energy consumption are the limiting factors of modern computing systems. As the GPU becomes a mainstream computing device, power management for GPUs becomes increasingly important. Prior work focuses on GPU kernel-level power management, which suffers from poor portability due to architecture-specific considerations. We present DRLCap, a general runtime power management framework intended to support power management across various GPU architectures. It periodically monitors system-level information to dynamically detect program phase changes and model the workload and GPU system behavior. Freeing the framework from kernel-specific constraints enhances its adaptability and responsiveness. The framework leverages dynamic GPU frequency capping, the most widely used power knob, to control power consumption. DRLCap employs deep reinforcement learning (DRL) to adapt to program phase changes by automatically adjusting its power policy through online learning, aiming to reduce GPU power consumption without significantly compromising application performance. We evaluate DRLCap on three NVIDIA and one AMD GPU architectures. Experimental results show that DRLCap improves on prior GPU power optimization strategies by a large margin. On average, it reduces GPU energy consumption by 22% with less than 3% performance slowdown on NVIDIA GPUs. This translates to a 20% improvement in energy efficiency, measured by the energy-delay product (EDP), over the NVIDIA default GPU power management strategy. On the AMD GPU architecture, DRLCap saves 10% energy on average, with a 4% performance loss, and improves energy efficiency by 8%.
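A minimal sketch of the kind of runtime capping loop described above, assuming NVML clock locking as exposed by the pynvml bindings; the monitoring interval, candidate caps, and the choose_cap heuristic (a stand-in for the DRL policy) are invented for illustration and are not taken from the paper.

```python
# Illustrative runtime frequency-capping loop (not the DRLCap implementation).
# Assumes the nvidia-ml-py (pynvml) bindings; the DRL agent is stubbed out.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
CAPS_MHZ = [900, 1100, 1300, 1500]          # hypothetical candidate frequency caps

def observe(handle):
    """System-level state: utilization and power, as the abstract suggests."""
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    return (util.gpu, util.memory, power_w)

def choose_cap(state):
    """Placeholder for the DRL policy; here just a trivial heuristic."""
    gpu_util, mem_util, _ = state
    return CAPS_MHZ[-1] if gpu_util > 80 else CAPS_MHZ[0]

try:
    while True:                               # periodic monitoring loop
        state = observe(handle)
        cap = choose_cap(state)
        # Lock the GPU clock range to [min, cap]; requires sufficient privileges.
        pynvml.nvmlDeviceSetGpuLockedClocks(handle, CAPS_MHZ[0], cap)
        time.sleep(0.1)                       # phase-detection period (illustrative)
finally:
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)
    pynvml.nvmlShutdown()
```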
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators either depend on user input to choose a subset of hard-coded optimisations or rely on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, in which a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewrites. The constraints restrict the search space to valid parallel mappings only, by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided, hand-written kernels of the ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
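As a toy illustration of the constraint-satisfaction reframing described in this abstract (not LIFT's actual solver), the sketch below enumerates candidate work-group shapes for a hypothetical 2D kernel and prunes them with typical OpenCL-style scheduling limits; all sizes and limits are assumptions for the example.

```python
# Toy constraint-satisfaction view of choosing a valid GPU parallel mapping
# (illustrative only; LIFT's real constraints are derived from the IR).
from itertools import product

GLOBAL_SIZE = (1024, 1024)          # iteration space of a hypothetical 2D kernel
MAX_WG_THREADS = 1024               # typical per-work-group thread limit
MAX_WG_DIM = (1024, 1024)           # per-dimension limits assumed for the example

def valid(local):
    lx, ly = local
    return (
        lx * ly <= MAX_WG_THREADS                        # hardware thread budget
        and lx <= MAX_WG_DIM[0] and ly <= MAX_WG_DIM[1]  # per-dimension caps
        and GLOBAL_SIZE[0] % lx == 0                     # global size divisible by
        and GLOBAL_SIZE[1] % ly == 0                     # local size (no remainder)
    )

candidates = [(1 << i, 1 << j) for i, j in product(range(11), repeat=2)]
mappings = [c for c in candidates if valid(c)]
print(f"{len(mappings)} valid work-group shapes out of {len(candidates)} candidates")
```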
ACiS: smart switches with application-level acceleration
Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster and denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network that transports data) to take on additional functionality (e.g., also operating on that data). More generally, integration enables compute-everywhere: not just in the CPU and accelerator, but also in the network and, more specifically, in the communication switches.
In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline, introducing composable plugins to successively add ACiS capabilities.
In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems.
In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate. This method addresses functions requiring loops and state. The proposed in-switch accelerator is based on a RISC-V-compatible Coarse-Grained Reconfigurable Array (CGRA).
To facilitate usability, we have developed a framework to compile user-provided C/C++ code into the appropriate back-end instructions for configuring the accelerator.
In the third work, we extend ACiS to support fused collectives and the combination of collectives with map operations. We observe that there is an opportunity to fuse communication (collectives) with computation. Since the computation varies across applications, the ACiS support in this method must be programmable.
In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes.
We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, with a Graph Convolutional Network (GCN) application as a case study, an average speedup of 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities.
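For the third work, the host-side pattern that in-switch fusion targets can be pictured with a short mpi4py sketch: an allreduce followed by an element-wise map, executed separately here; the arrays and the averaging step are invented for the example, and ACiS itself would perform both inside the switch.

```python
# Baseline pattern that in-switch fusion targets: a collective followed by a map.
# Illustrative only; with ACiS the reduction and the map happen in the switch.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1024, float(rank))        # each rank's contribution (made-up data)
summed = np.empty_like(local)

# Step 1: communication -- global sum across all ranks.
comm.Allreduce(local, summed, op=MPI.SUM)

# Step 2: computation -- an element-wise map applied to the reduced result
# (here: averaging); fusing this into the collective saves a pass over the data.
averaged = summed / comm.Get_size()

if rank == 0:
    print(averaged[:4])
```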
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has established power as a first-class design concern. As a result, the computing systems community is forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing: it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of state-of-the-art software and hardware approximation techniques.
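As a concrete taste of the software-level techniques such a survey covers, the sketch below shows loop perforation, i.e., skipping a fraction of loop iterations to trade accuracy for work; the workload and skip factor are invented and not taken from the survey.

```python
# Loop perforation: trade accuracy for work by executing only every k-th iteration.
# Purely illustrative; real perforation is applied by compilers/runtimes to hot loops.
import math

def exact_mean(xs):
    return sum(xs) / len(xs)

def perforated_mean(xs, skip=4):
    """Visit every `skip`-th element and rescale; roughly skip-fold less work."""
    sampled = xs[::skip]
    return sum(sampled) / len(sampled)

data = [math.sqrt(i) for i in range(1, 100_001)]
exact = exact_mean(data)
approx = perforated_mean(data, skip=4)
print(f"exact={exact:.4f}  approx={approx:.4f}  "
      f"rel.err={abs(approx - exact) / abs(exact):.2%}")
```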
A parsimonious, computationally efficient machine learning method for spatial regression
We introduce the modified planar rotator method (MPRS), a physically inspired machine learning method for spatial/temporal regression. MPRS is a non-parametric model which incorporates spatial or temporal correlations via short-range, distance-dependent "interactions" without assuming a specific form for the underlying probability distribution. Predictions are obtained by means of a fully autonomous learning algorithm which employs equilibrium conditional Monte Carlo simulations. MPRS is able to handle scattered data and arbitrary spatial dimensions. We report tests on various synthetic and real-world data in one, two and three dimensions which demonstrate that the MPRS prediction performance (without parameter tuning) is competitive with standard interpolation methods such as ordinary kriging and inverse distance weighting. In particular, MPRS is an effective gap-filling method for rough and non-Gaussian data (e.g., daily precipitation time series). MPRS shows superior computational efficiency and scalability for large samples: massive data sets involving millions of nodes can be processed in a few seconds on a standard personal computer.
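A toy sketch of the general idea behind such physics-inspired gap filling, assuming a planar-rotator-style nearest-neighbour coupling on a 1D series and Metropolis updates restricted to the missing sites; this is not the authors' MPRS algorithm, and the value-to-angle mapping, temperature, and data are invented for illustration.

```python
# Toy conditional Monte Carlo gap filling on a 1D series (not the MPRS algorithm).
# Observed values are mapped to angles, missing sites are relaxed against their
# nearest neighbours with a planar-rotator-style cost, then mapped back to values.
import math
import random

random.seed(0)

truth = [math.sin(0.2 * i) for i in range(50)]            # made-up smooth series
data = [None if i % 7 == 3 else v for i, v in enumerate(truth)]  # punch gaps

lo = min(v for v in data if v is not None)
hi = max(v for v in data if v is not None)
to_angle = lambda v: math.pi * (v - lo) / (hi - lo)        # map values to [0, pi]
to_value = lambda a: lo + (hi - lo) * a / math.pi

angles = [to_angle(v) if v is not None else random.uniform(0, math.pi) for v in data]
missing = [i for i, v in enumerate(data) if v is None]

def local_cost(i, a):
    """Planar-rotator-style mismatch with the nearest neighbours."""
    cost = 0.0
    for j in (i - 1, i + 1):
        if 0 <= j < len(angles):
            cost += 1.0 - math.cos(a - angles[j])
    return cost

T = 0.01                                      # made-up fixed "temperature"
for sweep in range(200):                      # Metropolis sweeps over missing sites
    for i in missing:
        proposal = min(max(angles[i] + random.gauss(0.0, 0.3), 0.0), math.pi)
        dC = local_cost(i, proposal) - local_cost(i, angles[i])
        if dC <= 0 or random.random() < math.exp(-dC / T):
            angles[i] = proposal

filled = [to_value(angles[i]) if i in missing else data[i] for i in range(len(data))]
err = sum(abs(filled[i] - truth[i]) for i in missing) / len(missing)
print(f"mean absolute gap-filling error: {err:.3f}")
```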
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth requirements of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU benchmarks and 61% for 24 GPU benchmarks, compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.
Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications
The challenging deployment of compute-intensive applications from domains such as Artificial Intelligence (AI) and Digital Signal Processing (DSP) forces the community of computing systems to explore new design approaches. Approximate Computing appears as an emerging solution, allowing designers to tune the quality of results of a system in order to improve its energy efficiency and/or performance. This radical paradigm shift has attracted interest from both academia and industry, resulting in significant research on approximation techniques and methodologies at different design layers (from system down to integrated circuits). Motivated by the wide appeal of Approximate Computing over the last 10 years, we conduct a two-part survey to cover key aspects (e.g., terminology and applications) and review the state-of-the-art approximation techniques from all layers of the traditional computing stack. In Part II of our survey, we classify and present the technical details of application-specific and architectural approximation techniques, which both target the design of resource-efficient processors/accelerators and systems. Moreover, we present a detailed analysis of the application spectrum of Approximate Computing and discuss open challenges and future directions.
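To make the architectural side concrete, the sketch below illustrates precision scaling, one common approximation knob discussed in such surveys: the same matrix product is evaluated with a float64 reference and a float16 datapath to expose the accuracy cost; the workload is invented for illustration.

```python
# Precision scaling: rerun a kernel with a narrower numeric type and compare results.
# Illustrative of one approximation knob; not taken from the survey.
import numpy as np

rng = np.random.default_rng(42)
a = rng.random((256, 256))
b = rng.random((256, 256))

exact = a.astype(np.float64) @ b.astype(np.float64)           # reference precision
approx = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float64)

rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
print(f"max relative error with a float16 datapath: {rel_err:.2e}")
```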
Optimization for Deep Learning Systems Applied to Computer Vision
Since the DL revolution and especially over the last years (2010-2022), DNNs have become an essential part of the CV field, and they are present in all its sub-fields (video-surveillance, industrial manufacturing, autonomous driving, ...) and in almost every new state-of-the-art application that is developed. However, DNNs are very complex and the architecture needs to be carefully selected and adapted in order to maximize its efficiency. In many cases, networks are not specifically designed for the considered use case; they are simply recycled from other applications and slightly adapted, without taking into account the particularities of the use case or the interaction with the rest of the system components, which usually results in a performance drop. This research work aims at providing knowledge and tools for the optimization of systems based on Deep Learning applied to different real use cases within the field of Computer Vision, in order to maximize their effectiveness and efficiency.
Lattice Boltzmann Methods for Partial Differential Equations
Lattice Boltzmann methods provide a robust and highly scalable numerical technique in modern computational fluid dynamics. Besides the discretization procedure, the relaxation principles form the basis of any lattice Boltzmann scheme and render the method a bottom-up approach, which obstructs its development for approximating broad classes of partial differential equations. This work introduces a novel coherent mathematical path to jointly approach the topics of constructability, stability, and limit consistency for lattice Boltzmann methods. A new constructive ansatz for lattice Boltzmann equations is introduced, which highlights the concept of relaxation in a top-down procedure starting at the targeted partial differential equation. Modular convergence proofs are used at each step to identify the key ingredients of relaxation frequencies, equilibria, and moment bases in the ansatz, which determine linear and nonlinear stability as well as consistency orders of relaxation and space-time discretization. For the latter, conventional techniques are employed and extended to determine the impact of the kinetic limit at the very foundation of lattice Boltzmann methods. To computationally analyze nonlinear stability, extensive numerical tests are enabled by combining the intrinsic parallelizability of lattice Boltzmann methods with the platform-agnostic and scalable open-source framework OpenLB. Through upscaling the number and quality of computations, large variations in the parameter spaces of classical benchmark problems are considered for the exploratory indication of methodological insights. Finally, the introduced mathematical and computational techniques are applied for the proposal and analysis of new lattice Boltzmann methods. Based on stabilized relaxation, limit-consistent discretizations, and consistent temporal filters, novel numerical schemes are developed for approximating initial value problems and initial boundary value problems as well as coupled systems thereof. In particular, lattice Boltzmann methods are proposed and analyzed for temporal large eddy simulation, for simulating homogenized nonstationary fluid flow through porous media, for binary fluid flow simulations with higher-order free energy models, and for the combination with Monte Carlo sampling to approximate statistical solutions of the incompressible Euler equations in three dimensions.
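For readers unfamiliar with the relaxation principle the abstract builds on, here is a generic, textbook-style D2Q9 BGK sketch (second-order equilibrium, single-relaxation-time collision, periodic streaming); it is not OpenLB code and not one of the schemes proposed in this work, and the grid size, relaxation time, and initial data are arbitrary.

```python
# Minimal D2Q9 lattice Boltzmann step with BGK (single-relaxation-time) collision.
# Generic textbook sketch; not OpenLB and not the schemes proposed in the thesis.
import numpy as np

NX, NY, TAU = 64, 64, 0.8                     # arbitrary grid size and relaxation time
c = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, u):
    """Second-order Maxwellian equilibrium for each of the 9 discrete velocities."""
    cu = np.einsum('qd,xyd->xyq', c, u)
    usq = np.einsum('xyd,xyd->xy', u, u)[..., None]
    return rho[..., None] * w * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

# Initialise with unit density and a small sinusoidal shear velocity (arbitrary).
x = np.arange(NX)
rho = np.ones((NX, NY))
u = np.zeros((NX, NY, 2))
u[..., 1] = 0.01 * np.sin(2 * np.pi * x / NX)[:, None]
f = equilibrium(rho, u)

for step in range(100):
    rho = f.sum(axis=-1)                                  # zeroth and first moments
    u = np.einsum('xyq,qd->xyd', f, c) / rho[..., None]
    f += -(f - equilibrium(rho, u)) / TAU                 # BGK collision (relaxation)
    for q in range(9):                                    # periodic streaming
        f[..., q] = np.roll(f[..., q], shift=tuple(c[q]), axis=(0, 1))

print("mean density after 100 steps:", f.sum(axis=-1).mean())
```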
A Quantum Optimization Case Study for a Transport Robot Scheduling Problem
We present a comprehensive case study comparing the performance of D-Wave's quantum-classical hybrid framework, Fujitsu's quantum-inspired digital annealer, and Gurobi's state-of-the-art classical solver in solving a transport robot scheduling problem. This problem originates from an industrially relevant real-world scenario. We provide three different models for our problem, following different design philosophies. In our benchmark, we focus on the solution quality and end-to-end runtime of different model and solver combinations. We find promising results for the digital annealer and some opportunities for the hybrid quantum annealer in direct comparison with Gurobi. Our study provides insights into the workflow for solving an application-oriented optimization problem with different strategies, and can be useful for evaluating the strengths and weaknesses of different approaches.
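The kind of model such solvers consume can be illustrated with a tiny QUBO: a toy two-job, two-slot assignment is encoded as a binary quadratic objective and solved by brute force below; the costs and penalty weight are invented and unrelated to the paper's three models.

```python
# Toy QUBO for a 2-job / 2-slot assignment, solved by exhaustive enumeration.
# Illustrates the modelling style only; the paper's scheduling models are far larger.
from itertools import product
import numpy as np

# Variables x[j, t] = 1 if job j runs in slot t; flattened index i = 2*j + t.
cost = np.array([[1.0, 3.0],      # made-up cost of job 0 in slot 0 / slot 1
                 [2.0, 0.5]])     # made-up cost of job 1 in slot 0 / slot 1
P = 10.0                          # penalty weight enforcing one slot per job

Q = np.zeros((4, 4))
for j, t in product(range(2), range(2)):
    i = 2 * j + t
    Q[i, i] += cost[j, t] - P     # linear part: cost plus expansion of P*(x_j0+x_j1-1)^2
for j in range(2):
    Q[2 * j, 2 * j + 1] += 2 * P  # quadratic penalty between a job's two slot variables

best = min(product([0, 1], repeat=4), key=lambda x: np.array(x) @ Q @ np.array(x))
print("best assignment (x00, x01, x10, x11):", best)
```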