2,228 research outputs found
The Memory Perturbation Equation: Understanding Model's Sensitivity to Data
Understanding a model's sensitivity to its training data is crucial but can
also be challenging and costly, especially during training. To simplify such
analyses, we present the Memory-Perturbation Equation (MPE), which relates a model's
sensitivity to perturbations in its training data. Derived using Bayesian
principles, the MPE unifies existing sensitivity measures, generalizes them to
a wide variety of models and algorithms, and reveals useful properties
regarding sensitivities. Our empirical results show that sensitivity estimates
obtained during training can be used to faithfully predict generalization on
unseen test data. The proposed equation is expected to be useful for future
research on robust and adaptive learning.
Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
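A minimal illustration of the kind of sensitivity the MPE unifies: for ridge regression, the leave-one-out change in a fitted value reduces to prediction error times leverage, one of the classical measures referenced above. The NumPy sketch below, on invented toy data, shows only that special case, not the general Bayesian equation.

```python
import numpy as np

# Toy special case (not the general MPE): for ridge regression, the
# leave-one-out sensitivity of a fitted value is prediction error x leverage.
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

A = X.T @ X + lam * np.eye(d)
w = np.linalg.solve(A, X.T @ y)                              # ridge solution
leverage = np.einsum("ij,ij->i", X, np.linalg.solve(A, X.T).T)  # diag of hat matrix
residual = y - X @ w

# Approximate change of the i-th fitted value if example i were removed.
sensitivity = residual * leverage / (1.0 - leverage)
print(np.argsort(-np.abs(sensitivity))[:5])  # most influential training points
```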
Twelve times faster yet accurate: a new state-of-the-art in radiation schemes via performance and spectral optimization
Radiation schemes are critical components of Earth system models that need to be both efficient and accurate. Despite the use of approximations such as 1D radiative transfer, radiation can account for a large share of the runtime of expensive climate simulations. Here we seek a new state-of-the-art in speed and accuracy by combining code optimization with improved algorithms. To fully benefit from new spectrally reduced gas optics schemes, we restructure code to avoid short vectorized loops where possible by collapsing the spectral and vertical dimensions. Our main focus is the ecRad radiation scheme, where this requires batching of adjacent cloudy layers, trading some simplicity for improved vectorization and instruction-level parallelism. When combined with common optimization techniques for serial code and porting widely used two-stream kernels fully to single precision, we find that ecRad with the TripleClouds solver becomes 12 times faster than the operational radiation scheme in ECMWF's Integrated Forecast System (IFS) cycle 47r3, which uses a less accurate gas optics model (RRTMG) and a noisier solver (McICA). After applying the spectral reduction and extensive optimizations to the more sophisticated SPARTACUS solver, we find that it is 2.5 times faster than the IFS cycle 47r3 radiation scheme, making cloud 3D radiative effects affordable to compute in large-scale models. The code optimization itself gave a threefold speedup for both solvers. While SPARTACUS is still under development, preliminary experiments show slightly improved medium-range forecasts of 2-m temperature in the tropics, and in year-long coupled atmosphere-ocean simulations the 3D effects warm the surface substantially.
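To make the "collapse the spectral and vertical dimensions" idea concrete, here is a small NumPy sketch with invented shapes and names (this is not ecRad code): the short per-level spectral loop is replaced by one pass over a long fused level-by-g-point axis, which is the access pattern that enables long vectorized loops.

```python
import numpy as np

# Toy illustration of collapsing spectral and vertical dimensions.
ncol, nlev, ngpt = 8, 137, 32
rng = np.random.default_rng(0)
optical_depth = rng.random((ncol, nlev, ngpt))

# Short-vector version: one operation per level over only `ngpt` elements.
trans_short = np.empty_like(optical_depth)
for k in range(nlev):
    trans_short[:, k, :] = np.exp(-optical_depth[:, k, :])

# Fused version: a single pass over nlev * ngpt contiguous elements per column.
fused = optical_depth.reshape(ncol, nlev * ngpt)
trans_fused = np.exp(-fused).reshape(ncol, nlev, ngpt)

assert np.allclose(trans_short, trans_fused)
```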
Backpropagation Beyond the Gradient
Automatic differentiation is a key enabler of deep learning: previously, practitioners were limited to models
for which they could manually compute derivatives. Now, they can create sophisticated models with almost
no restrictions and train them using first-order, i.e., gradient, information. Popular libraries like PyTorch
and TensorFlow compute this gradient efficiently, automatically, and conveniently with a single line of
code. Under the hood, reverse-mode automatic differentiation, or gradient backpropagation, powers the
gradient computation in these libraries. Their entire design centers around gradient backpropagation.
These frameworks are specialized around one specific task: computing the average gradient in a mini-batch.
This specialization often complicates the extraction of other information like higher-order statistical moments
of the gradient, or higher-order derivatives like the Hessian. It limits practitioners and researchers to methods
that rely on the gradient. Arguably, this hampers the field from exploring the potential of higher-order
information, and there is evidence that focusing solely on the gradient has not led to significant recent
advances in deep learning optimization.
To advance algorithmic research and inspire novel ideas, information beyond the batch-averaged gradient
must be made available at the same level of computational efficiency, automation, and convenience.
This thesis presents approaches to simplify experimentation with rich information beyond the gradient
by making it more readily accessible. We present an implementation of these ideas as an extension to the
backpropagation procedure in PyTorch. Using this newly accessible information, we demonstrate possible use
cases by (i) showing how it can inform our understanding of neural network training by building a diagnostic
tool, and (ii) enabling novel methods to efficiently compute and approximate curvature information.
First, we extend gradient backpropagation for sequential feedforward models to Hessian backpropagation
which enables computing approximate per-layer curvature. This perspective unifies recently proposed block-
diagonal curvature approximations. Like gradient backpropagation, the computation of these second-order
derivatives is modular, and therefore simple to automate and extend to new operations.
Based on the insight that rich information beyond the gradient can be computed efficiently and at the
same time, we extend the backpropagation in PyTorch with the BackPACK library. It provides efficient and
convenient access to statistical moments of the gradient and approximate curvature information, often at a
small overhead compared to computing just the gradient.
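As a usage sketch, BackPACK's documented interface exposes such quantities as extra attributes on the parameters after a backward pass inside its context manager. The toy model and data below are made up, and extension names may differ across library versions.

```python
import torch
from torch import nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad, Variance, DiagGGNExact

# Toy model and data; extend() makes the modules known to BackPACK.
model = extend(nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2)))
lossfunc = extend(nn.CrossEntropyLoss())

X, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = lossfunc(model(X), y)

# Request per-sample gradients, gradient variances, and a GGN diagonal
# in the same backward pass that computes the average gradient.
with backpack(BatchGrad(), Variance(), DiagGGNExact()):
    loss.backward()

for p in model.parameters():
    print(p.grad.shape, p.grad_batch.shape, p.variance.shape, p.diag_ggn_exact.shape)
```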
Next, we showcase the utility of such information to better understand neural network training. We build
the Cockpit library that visualizes what is happening inside the model during training through various
instruments that rely on BackPACK's statistics. We show how Cockpit provides a meaningful statistical
summary report to the deep learning engineer to identify bugs in their machine learning pipeline, guide
hyperparameter tuning, and study deep learning phenomena.
Finally, we use BackPACK's extended automatic differentiation functionality to develop ViViT, an approach
to efficiently compute curvature information, in particular curvature noise. It uses the low-rank structure
of the generalized Gauss-Newton approximation to the Hessian and addresses shortcomings in existing
curvature approximations. Through monitoring curvature noise, we demonstrate how ViViT's information
helps in understanding the challenges of making second-order optimization methods work in practice.
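The low-rank structure mentioned above can be illustrated with the standard Gram-matrix identity: if the GGN factors as G = V V^T with few columns in V, its nonzero eigenvalues are those of the small matrix V^T V. The NumPy toy below demonstrates only this identity, not ViViT's actual pipeline.

```python
import numpy as np

# Gram-matrix identity behind low-rank GGN spectra (illustration only):
# nonzero eigenvalues of V @ V.T equal those of the much smaller V.T @ V.
rng = np.random.default_rng(0)
D, K = 2_000, 16                      # many parameters, few (sample, class) pairs
V = rng.normal(size=(D, K)) / np.sqrt(K)

evals_gram = np.linalg.eigvalsh(V.T @ V)        # K x K: cheap
evals_full = np.linalg.eigvalsh(V @ V.T)[-K:]   # D x D: what is avoided in practice

assert np.allclose(np.sort(evals_gram), np.sort(evals_full))
print(np.sort(evals_gram)[::-1][:5])            # leading curvature directions
```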
This work develops new tools to experiment more easily with higher-order information in complex deep
learning models. These tools have impacted works on Bayesian applications with Laplace approximations,
out-of-distribution generalization, differential privacy, and the design of automatic differentiation
systems. They constitute one important step towards developing and establishing more efficient deep
learning algorithms.
Architecture and Circuit Design Optimization for Compute-In-Memory
The objective of the proposed research is to optimize compute-in-memory (CIM) design for accelerating Deep Neural Network (DNN) algorithms. As compute peripheries such as analog-to-digital converters (ADCs) introduce significant overhead in CIM inference designs, the research first focuses on circuit optimization for inference acceleration and proposes a resistive random access memory (RRAM) based ADC-free in-memory compute scheme. We comprehensively explore the trade-offs involving different types of ADCs and investigate a new ADC design especially suited for CIM, which performs the analog shift-add for multiple weight significance bits, improving throughput and energy efficiency under similar area constraints. Furthermore, we prototype an ADC-free CIM inference chip design with fully analog data processing between sub-arrays, which significantly improves hardware performance over conventional CIM designs and achieves near-software classification accuracy on the ImageNet and CIFAR-10/-100 datasets.
Secondly, the research focuses on hardware support for CIM on-chip training. To maximize hardware reuse of the CIM weight-stationary dataflow, we propose CIM training architectures with a transpose weight mapping strategy. The cell design and peripheral circuitry are modified to efficiently support bi-directional compute. A novel solution for signed-number multiplication is also proposed to handle negative inputs in backpropagation. Finally, we propose an SRAM-based CIM training architecture and comprehensively explore the system-level hardware performance for DNN on-chip training based on silicon measurement results.
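The "analog shift-add over weight significance bits" can be pictured with plain integer arithmetic: split the weights into bit planes, take a per-plane dot product (the part a CIM array would evaluate in analog), and recombine the partial sums with shifts. The sketch below models only the arithmetic, with invented sizes; it is not the proposed circuitry.

```python
import numpy as np

# Bit-sliced compute-in-memory arithmetic, modeled digitally.
rng = np.random.default_rng(1)
n_bits = 4
W = rng.integers(0, 2**n_bits, size=(16, 8))   # 4-bit unsigned weights
x = rng.integers(0, 16, size=16)               # input activations

acc = np.zeros(8, dtype=np.int64)
for b in range(n_bits):
    bit_plane = (W >> b) & 1                   # one weight-significance bit
    partial = x @ bit_plane                    # per-column MAC (analog in a real array)
    acc += partial << b                        # shift-add across bit significance

assert np.array_equal(acc, x @ W)              # matches the full-precision result
```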
Parameterizing Vertical Mixing Coefficients in the Ocean Surface Boundary Layer using Neural Networks
Vertical mixing parameterizations in ocean models are formulated on the basis
of the physical principles that govern turbulent mixing. However, many
parameterizations include ad hoc components that are not well constrained by
theory or data. One such component is the eddy diffusivity model, where
vertical turbulent fluxes of a quantity are parameterized from a variable eddy
diffusion coefficient and the mean vertical gradient of the quantity. In this
work, we improve a parameterization of vertical mixing in the ocean surface
boundary layer by enhancing its eddy diffusivity model using data-driven
methods, specifically neural networks. The neural networks are designed to take
extrinsic and intrinsic forcing parameters as input to predict the eddy
diffusivity profile and are trained using output data from a second moment
closure turbulent mixing scheme. The modified vertical mixing scheme predicts
the eddy diffusivity profile through online inference of neural networks and
maintains the conservation principles of the standard ocean model equations,
which is particularly important for its targeted use in climate simulations. We
describe the development and stable implementation of neural networks in an
ocean general circulation model and demonstrate that the enhanced scheme
outperforms its predecessor by reducing biases in the mixed-layer depth and
upper ocean stratification. Our results demonstrate the potential for
data-driven, physics-aware parameterizations to improve global climate models.
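A schematic of the setup described above, with invented sizes and a placeholder standing in for the trained network: the predicted diffusivity profile enters a flux-form diffusion step, so the column integral of the tracer is conserved regardless of what the network outputs.

```python
import numpy as np

# Hypothetical stand-in for the trained neural network: any positive K(z) works here.
def predict_diffusivity(forcing, nz):
    z = np.linspace(0, 1, nz + 1)[1:-1]            # interior interfaces
    return 1e-3 * (1 + forcing) * np.exp(-5 * z)   # illustrative diffusivity profile

nz, dz, dt = 50, 2.0, 60.0
tracer = np.linspace(20.0, 10.0, nz)               # e.g. a temperature profile
K = predict_diffusivity(forcing=0.5, nz=nz)        # K at the nz-1 interior interfaces

flux = -K * np.diff(tracer) / dz                   # downgradient turbulent flux
flux = np.concatenate(([0.0], flux, [0.0]))        # no-flux boundaries
tracer_new = tracer - dt * np.diff(flux) / dz      # flux-divergence update

assert np.isclose(tracer.sum(), tracer_new.sum())  # column integral conserved
```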
GHOST: A Graph Neural Network Accelerator using Silicon Photonics
Graph neural networks (GNNs) have emerged as a powerful approach for
modelling and learning from graph-structured data. Multiple fields have since
benefitted enormously from the capabilities of GNNs, such as recommendation
systems, social network analysis, drug discovery, and robotics. However,
accelerating and efficiently processing GNNs require a unique approach that
goes beyond conventional artificial neural network accelerators, due to the
substantial computational and memory requirements of GNNs. The slowdown of
scaling in CMOS platforms also motivates a search for alternative
implementation substrates. In this paper, we present GHOST, the first
silicon-photonic hardware accelerator for GNNs. GHOST efficiently alleviates
the costs associated with both vertex-centric and edge-centric operations. It
implements separately the three main stages involved in running GNNs in the
optical domain, allowing it to be used for the inference of various widely used
GNN models and architectures, such as graph convolution networks and graph
attention networks. Our simulation studies indicate that GHOST exhibits at
least 10.2x better throughput and 3.8x better energy efficiency when compared
to GPU, TPU, and CPU baselines and multiple state-of-the-art GNN hardware accelerators.
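For readers unfamiliar with why GNN inference maps onto matrix-vector hardware, a single graph-convolution layer is sparse aggregation followed by a dense transform and a nonlinearity, as in the toy NumPy sketch below (random graph and weights; this shows only the numerics such accelerators target, not the photonic implementation).

```python
import numpy as np

# One GCN layer: aggregate over neighbours, transform features, apply ReLU.
rng = np.random.default_rng(0)
N, F_in, F_out = 6, 4, 3
A = (rng.random((N, N)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                                      # undirected random graph
A_hat = A + np.eye(N)                            # add self-loops
deg = A_hat.sum(1)
A_hat = A_hat / np.sqrt(np.outer(deg, deg))      # symmetric normalization

X = rng.normal(size=(N, F_in))                   # node features
W = rng.normal(size=(F_in, F_out))               # layer weights

H = np.maximum(A_hat @ X @ W, 0.0)               # aggregation, transformation, activation
print(H.shape)                                   # (N, F_out)
```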
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on user input to choose a subset of hard-coded optimisations, or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
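To illustrate the flavour of such tuning constraints, the hypothetical Python sketch below prunes a tile-size search space with a local-memory and divisibility constraint; the limits and parameter names are invented for illustration, whereas LIFT derives the real constraints automatically from its functional patterns.

```python
from itertools import product

# Invented device limits for the sketch.
LOCAL_MEM_BYTES = 48 * 1024
MAX_WORK_GROUP = 256

def footprint(tile_m, tile_n, bytes_per_elem=4):
    # Two tiles staged in local memory for a blocked matrix multiply.
    return 2 * tile_m * tile_n * bytes_per_elem

valid = [
    (tm, tn, wg)
    for tm, tn, wg in product([8, 16, 32, 64], [8, 16, 32, 64], [64, 128, 256])
    if footprint(tm, tn) <= LOCAL_MEM_BYTES
    and wg <= MAX_WORK_GROUP
    and tm * tn % wg == 0          # each work-item gets a whole number of elements
]
print(len(valid), "valid tuning points out of", 4 * 4 * 3)
```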
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewritings. The constraints prune the search space to valid parallel mappings only, capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided, hand-written kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
Utilized System Model Using Channel State Information Network with Gated Recurrent Units (CsiNet-GRUs)
Multiple-input multiple-output (MIMO) technology uses multiple antennas to exploit reflected signals for channel robustness and throughput gains. It is advantageous in applications such as cellular and mobile systems, where users are distributed over a wide coverage area, and efficient channel state information (CSI) processing is particularly important in massive MIMO systems. This chapter proposes two channel-based deep learning methods to enhance performance in a massive MIMO system and compares the proposed technique with previous methods. The proposed technique combines the channel state information network with gated recurrent units (CsiNet-GRUs), which increases recovery efficiency. In addition, a fair balance between compression ratio (CR) and complexity is achieved by using the correlation time of the training samples. Simulation results show that the proposed CsiNet-GRUs technique improves performance compared with existing techniques from the literature, namely the CS-based methods LASSO and TVAL3 and the learning-based CsiNet and Conv-LSTM CsiNet.
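A schematic PyTorch sketch of an encoder / GRU / decoder arrangement in the spirit of CsiNet-GRU follows; all layer shapes and the use of plain linear layers are illustrative assumptions, not the architecture evaluated in the chapter.

```python
import torch
from torch import nn

# Illustrative sizes: angular-delay CSI of 2 x H x W (real/imag), T time steps.
H, W, code_dim, T = 32, 32, 64, 4

encoder = nn.Sequential(nn.Flatten(), nn.Linear(2 * H * W, code_dim))   # compress CSI
decoder = nn.Sequential(nn.Linear(code_dim, 2 * H * W))                 # reconstruct CSI
gru = nn.GRU(input_size=code_dim, hidden_size=code_dim, batch_first=True)

csi = torch.randn(8, T, 2, H, W)                          # batch of CSI sequences
codes = encoder(csi.reshape(8 * T, 2, H, W)).reshape(8, T, code_dim)
refined, _ = gru(codes)                                   # exploit temporal correlation
recon = decoder(refined.reshape(8 * T, code_dim)).reshape(8, T, 2, H, W)
print(recon.shape)
```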
FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication
The Discrete Fourier Transform (DFT) is essential for various applications
ranging from signal processing to convolution and polynomial multiplication.
The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time
complexity from the naive O(n^2) to O(n log n), and recent works have sought
further acceleration through parallel architectures such as GPUs.
Unfortunately, accelerators such as GPUs cannot exploit their full computing
capabilities as memory access becomes the bottleneck. Therefore, this paper
accelerates the FFT algorithm using digital Processing-in-Memory (PIM)
architectures that shift computation into the memory by exploiting physical
devices capable of storage and logic (e.g., memristors). We propose an O(log n)
in-memory FFT algorithm that can also be performed in parallel across multiple
arrays for high-throughput batched execution, supporting both fixed-point and
floating-point numbers. Through the convolution theorem, we extend this
algorithm to O(log n) polynomial multiplication - a fundamental task for
applications such as cryptography. We evaluate FourierPIM on a
publicly-available cycle-accurate simulator that verifies both correctness and
performance, and demonstrate 5-15x throughput and 4-13x energy improvements over
the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial
multiplication.
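The convolution-theorem route to O(n log n) polynomial multiplication, the same algorithmic path FourierPIM moves into memory, looks like this on the host with NumPy (a plain software sketch, unrelated to the PIM mapping).

```python
import numpy as np

# Polynomial multiplication via the convolution theorem: FFT, pointwise
# multiply, inverse FFT, then round back to integer coefficients.
def poly_mul(a, b):
    n = len(a) + len(b) - 1
    m = 1 << (n - 1).bit_length()          # pad to a power of two for the FFT
    fa, fb = np.fft.rfft(a, m), np.fft.rfft(b, m)
    return np.rint(np.fft.irfft(fa * fb, m)[:n]).astype(int)

a = [1, 2, 3]          # 1 + 2x + 3x^2
b = [4, 0, 5]          # 4 + 5x^2
print(poly_mul(a, b))  # [4, 8, 17, 10, 15]
```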
- …