Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients
Applying differentiable programming techniques and machine learning
algorithms to foreign programs requires developers to either rewrite their code
in a machine learning framework, or otherwise provide derivatives of the
foreign code. This paper presents Enzyme, a high-performance automatic
differentiation (AD) compiler plugin for the LLVM compiler framework capable of
synthesizing gradients of statically analyzable programs expressed in the LLVM
intermediate representation (IR). Enzyme synthesizes gradients for programs
written in any language whose compiler targets LLVM IR, including C, C++,
Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD
capabilities in these languages. Unlike traditional source-to-source and
operator-overloading tools, Enzyme performs AD on optimized IR. On a
machine-learning-focused benchmark suite including Microsoft's ADBench, AD on optimized IR achieves a geometric mean speedup of 4.5x over AD on IR before optimization, allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the-art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.
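As a hedged illustration of how such synthesized gradients are consumed from one of the listed LLVM-targeting languages, the sketch below uses the Enzyme.jl Julia bindings; the function f and the input value are illustrative, and the exact return layout of autodiff may differ between Enzyme.jl versions.

    using Enzyme

    # A scalar function whose optimized LLVM IR Enzyme differentiates directly.
    f(x) = x * x + 3.0 * x

    # Reverse-mode AD with respect to the Active argument: d/dx (x^2 + 3x) = 2x + 3.
    dfdx = autodiff(Reverse, f, Active, Active(2.0))[1][1]
    @assert dfdx ≈ 7.0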
Productivity meets Performance: Julia on A64FX
The Fujitsu A64FX ARM-based processor is used in supercomputers such as Fugaku in Japan and Isambard 2 in the UK, and provides an interesting combination of hardware features such as the Scalable Vector Extension (SVE) and native support for reduced-precision floating-point arithmetic. The goal of this paper is to explore the performance of the Julia programming language on the A64FX processor, with a particular focus on reduced precision. We present a performance study of axpy to verify the compilation pipeline, demonstrating that Julia can match the performance of tuned libraries. Additionally, we investigate Message Passing Interface (MPI) scalability and throughput on Fugaku, showing next to no significant overhead from Julia's MPI interface. To explore the usability of Julia for targeting various floating-point precisions, we present results from ShallowWaters.jl, a shallow water model that can be executed at various levels of precision. Even for such complex applications, Julia's type-flexible programming paradigm offers both productivity and performance.
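As a concrete, minimal sketch of the type-flexible point (the kernel name and problem size are illustrative, and whether the loop actually maps to SVE fused multiply-adds depends on the LLVM back end), the same generic axpy source below compiles for Float64, Float32, or Float16:

    # Type-generic axpy: one definition covers all floating-point precisions.
    function axpy!(a::T, x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
        @inbounds @simd for i in eachindex(x, y)
            y[i] = muladd(a, x[i], y[i])   # candidate for a vectorized fused multiply-add
        end
        return y
    end

    x, y = rand(Float16, 1024), rand(Float16, 1024)
    axpy!(Float16(0.5), x, y)   # reduced precision with no code changes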
Dynamic automatic differentiation of GPU broadcast kernels
We show how forward-mode automatic differentiation (AD) can be employed within larger reverse-mode computations to dynamically differentiate broadcast operations in a GPU-friendly manner. Our technique fully exploits the broadcast Jacobian's inherent sparsity structure, and unlike a pure reverse-mode approach, this "mixed-mode" approach does not require a backwards pass over the broadcasted operation's subgraph, obviating the need for several reverse-mode-specific programmability restrictions on user-authored broadcast operations. Most notably, this approach allows broadcast fusion in primal code despite the presence of data-dependent control flow. We discuss an experiment in which a Julia implementation of our technique outperformed pure reverse-mode TensorFlow and Julia implementations for differentiating through broadcast operations within an HM-LSTM cell update calculation.
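A hedged, simplified sketch of the mixed-mode idea (not the paper's implementation) is to seed forward-mode dual numbers elementwise inside a single fused broadcast, store the per-element partials, and let the surrounding reverse pass consume them; ForwardDiff stands in here for the forward-mode machinery, and f is an illustrative elementwise operation.

    using ForwardDiff: Dual, value, partials

    # Elementwise operation with data-dependent control flow; still fuses into one broadcast.
    f(x, y) = ifelse(x > 0.5, x * y, y)

    x, y = rand(4), rand(4)
    out = f.(Dual.(x, 1.0, 0.0), Dual.(y, 0.0, 1.0))   # one forward primal-plus-tangent pass

    primal = value.(out)
    dx = getindex.(partials.(out), 1)   # sparse broadcast Jacobian diagonal with respect to x
    dy = getindex.(partials.(out), 2)   # ... and with respect to y
    # A surrounding reverse pass multiplies its incoming cotangent by these stored partials,
    # so no backwards pass over the broadcast's own subgraph is required.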
Bring the BitCODE -- Moving Compute and Data in Distributed Heterogeneous Systems
In this paper, we present a framework for moving compute and data between
processing elements in a distributed heterogeneous system. The implementation
of the framework is based on the LLVM compiler toolchain combined with the UCX
communication framework. The framework can generate binary machine code or LLVM
bitcode for multiple CPU architectures and move the code to remote machines
while dynamically optimizing and linking the code on the target platform. The
remotely injected code can recursively propagate itself to other remote
machines or generate new code. The goal of this paper is threefold: (a) to
present an architecture and implementation of the framework that provides
essential infrastructure to program a new class of disaggregated systems
wherein heterogeneous programming elements such as compute nodes and data
processing units (DPUs) are distributed across the system, (b) to demonstrate
how the framework can be integrated with modern, high-level programming
languages such as Julia, and (c) to demonstrate and evaluate a new class of
eXtended Remote Direct Memory Access (X-RDMA) communication operations that are
enabled by this framework. To evaluate the capabilities of the framework, we
used a cluster with Fujitsu CPUs and a heterogeneous cluster with Intel CPUs and BlueField-2 DPUs interconnected by a high-performance RDMA fabric. We
demonstrated an X-RDMA pointer chase application that outperforms an RDMA
GET-based implementation by 70% and is as fast as Active Messages, but does not
require function predeployment on remote platforms.
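As a hedged sketch related to goal (b): from Julia one can already obtain the compiled LLVM IR of a kernel, which is the kind of artifact such a framework would package as bitcode and inject remotely; the shipping, dynamic optimization, and X-RDMA machinery are specific to the paper's framework and are not reproduced here.

    using InteractiveUtils

    # A small kernel a runtime might want to execute near the data (e.g. on a DPU).
    saxpy(a::Float32, x::Float32, y::Float32) = muladd(a, x, y)

    # Capture the optimized LLVM IR as text; a real framework would emit LLVM bitcode
    # and re-optimize and link it on the target platform.
    ir = sprint(io -> code_llvm(io, saxpy, Tuple{Float32,Float32,Float32}))
    println(first(split(ir, '\n')))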
Batched Second-Order Adjoint Sensitivity for Reduced Space Methods
This paper presents an efficient method for extracting second-order sensitivities from a system of implicit nonlinear equations on upcoming computer systems dominated by graphics processing units (GPUs). We design a custom automatic differentiation (AutoDiff) backend that targets highly parallel architectures by extracting the second-order information in batch. When the nonlinear equations are associated with a reduced-space optimization problem, we leverage parallel reverse-mode accumulation in a batched adjoint-adjoint algorithm to efficiently compute the reduced Hessian of the problem. We apply the method to extract the reduced Hessian associated with the balance equations of a power network, and show on the largest instances that a parallel GPU implementation is 30 times faster than a sequential CPU reference based on UMFPACK.
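As a hedged, CPU-only illustration of nesting AD to obtain second-order sensitivities (forward-over-reverse for a single Hessian-vector product), the sketch below uses Zygote and ForwardDiff as stand-ins; the paper instead batches many tangent directions at once with a custom GPU AutoDiff backend in its adjoint-adjoint scheme, and rosenbrock is only a placeholder objective.

    using ForwardDiff, Zygote

    rosenbrock(u) = sum(100 .* (u[2:end] .- u[1:end-1] .^ 2) .^ 2 .+ (1 .- u[1:end-1]) .^ 2)

    grad(u) = Zygote.gradient(rosenbrock, u)[1]                      # reverse-mode (adjoint) pass
    hvp(u, v) = ForwardDiff.derivative(t -> grad(u .+ t .* v), 0.0)  # forward over reverse

    u, v = rand(5), rand(5)
    Hv = hvp(u, v)   # the Hessian applied to one direction; the paper batches many directions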
Automated Translation and Accelerated Solving of Differential Equations on Multiple GPU Platforms
We demonstrate a high-performance vendor-agnostic method for massively
parallel solving of ensembles of ordinary differential equations (ODEs) and
stochastic differential equations (SDEs) on GPUs. The method is integrated with
a widely used differential equation solver library in a high-level language
(Julia's DifferentialEquations.jl) and enables GPU acceleration without
requiring code changes by the user. Our approach achieves state-of-the-art
performance compared to hand-optimized CUDA-C++ kernels, while performing
faster than the vectorized-map (vmap) approach
implemented in JAX and PyTorch. Performance evaluation on NVIDIA, AMD, Intel,
and Apple GPUs demonstrates performance portability and vendor-agnosticism. We
show composability with MPI to enable distributed multi-GPU workflows. The
implemented solvers are fully featured, supporting event handling, automatic
differentiation, and incorporation of datasets via the GPU's texture memory,
allowing scientists to take advantage of GPU acceleration on all major current
architectures without changing their model code and without loss of
performance.
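A hedged sketch of the user-facing workflow, following the public DiffEqGPU.jl interface (the Lorenz system and parameter perturbation are illustrative, and the backend object assumes an NVIDIA GPU via CUDA.jl):

    using OrdinaryDiffEq, DiffEqGPU, StaticArrays, CUDA

    lorenz(u, p, t) = SVector(p[1] * (u[2] - u[1]),
                              u[1] * (p[2] - u[3]) - u[2],
                              u[1] * u[2] - p[3] * u[3])

    u0 = SA_F32[1.0, 0.0, 0.0]
    p  = SA_F32[10.0, 28.0, 8/3]
    prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)

    # Perturb the parameters per trajectory and solve the whole ensemble in batched GPU kernels.
    eprob = EnsembleProblem(prob, prob_func = (prob, i, repeat) ->
                remake(prob, p = p .* (0.9f0 + 0.2f0 * rand(Float32))))
    sol = solve(eprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
                trajectories = 10_000, adaptive = false, dt = 0.01f0)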
Oceananigans.jl: A model that achieves breakthrough resolution, memory and energy efficiency in global ocean simulations
Climate models must simulate hundreds of future scenarios for hundreds of
years at coarse resolutions, and a handful of high-resolution decadal
simulations to resolve localized extreme events. Using Oceananigans.jl, written
from scratch in Julia, we report several achievements: First, a global ocean
simulation with breakthrough horizontal resolution -- 488m -- reaching 15
simulated days per day (0.04 simulated years per day; SYPD). Second,
Oceananigans simulates the global ocean at 488m with breakthrough memory
efficiency on just 768 Nvidia A100 GPUs, a fraction of the resources available
on current and upcoming exascale supercomputers. Third, and arguably most
significant for climate modeling, Oceananigans achieves breakthrough energy
efficiency reaching 0.95 SYPD at 1.7 km on 576 A100s and 9.9 SYPD at 10 km on
68 A100s -- the latter representing the highest horizontal resolutions employed
by current IPCC-class ocean models. Routine climate simulations with 10 km
ocean components are within reach.
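For context on the user-facing side, a hedged, toy-sized sketch of setting up a GPU-resident global configuration with Oceananigans' public API is shown below; the grid size, time step, and stop time are illustrative and far coarser than the paper's runs, and GPU() assumes a functional CUDA device.

    using Oceananigans
    using Oceananigans.Units: minutes, days

    # Deliberately coarse illustrative grid; the paper's simulations use sub-kilometer global grids.
    grid = LatitudeLongitudeGrid(GPU(); size = (360, 160, 10),
                                 longitude = (-180, 180), latitude = (-80, 80), z = (-1000, 0))
    model = HydrostaticFreeSurfaceModel(; grid)
    simulation = Simulation(model; Δt = 5minutes, stop_time = 1days)
    run!(simulation)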
Transparent distributed programming in Julia
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019. Includes bibliographical references (pages 39-44).
Scientific and engineering problems grow ever larger and more challenging; solving them requires taking advantage of domain expertise and modern compute capabilities, which encourages efficient use of GPUs and large-scale cluster environments. Domain experts should not need to acquire the deep knowledge required to develop applications that scale, but rather should be able to express data science and engineering problems in terms of vectorized operations and linear algebra, that is, in the language inherent to the field. The approach introduced here gives performance engineers access to low-level capabilities of the hardware, allowing them to collaborate with domain experts in the same language. This removes the need to rewrite scientific code in a low-level language, speeding up the iteration cycle and allowing for rapid prototyping. We investigate composable, layered abstractions for scientific computing. They separate the user intent, the what, from the how of the implementation and the where of the execution. The focus is on the distributed aspects: how array abstractions for distributed and accelerated computing can compose with each other, and how we can provide access to low-level capabilities in a transparent fashion. Building and debugging these abstractions is challenging. This work introduces Cthulhu, a unique debugging tool for abstractions that takes into consideration the dynamic execution model and the static compilation process of Julia. "This research is supported in part by NSF DMS-1312831, NSF OAC-1835443, Darpa XDATA, and an ARAMCO MITEI grant." By Valentin Churavy.
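A hedged sketch of the layered-abstraction idea described above: the same vectorized user intent runs on a CPU Array or a GPU CuArray, while Cthulhu's interactive @descend lets a performance engineer inspect how each specialization is typed and compiled (the kernel and sizes are illustrative).

    using CUDA, Cthulhu

    # User intent expressed as vectorized operations: the "what", not the "how" or "where".
    saxpy(a, x, y) = a .* x .+ y

    x, y = rand(Float32, 1024), rand(Float32, 1024)
    saxpy(2f0, x, y)                        # executes on the CPU
    saxpy(2f0, CuArray(x), CuArray(y))      # same code, dispatched to the GPU

    # Descend through the abstraction layers interactively (run in the REPL):
    # @descend saxpy(2f0, CuArray(x), CuArray(y))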