SPH-EXA: Enhancing the Scalability of SPH codes Via an Exascale-Ready SPH Mini-App
Numerical simulations of fluids in astrophysics and computational fluid
dynamics (CFD) are among the most computationally demanding calculations in
terms of sustained floating-point operations per second (FLOP/s). These
simulations are expected to benefit significantly from future Exascale
computing infrastructures, which will perform 10^18 FLOP/s. The performance
of SPH codes is, in general, adversely affected by several factors, such as
multiple time-stepping, long-range interactions, and/or boundary conditions.
In this work, an extensive study of three SPH implementations (SPHYNX,
ChaNGa, and XXX) is performed to gain insights and to expose the
limitations and characteristics of each code. These codes are the
starting point of an interdisciplinary co-design project, SPH-EXA, for the
development of an Exascale-ready SPH mini-app. We implemented a rotating square
patch as a joint test simulation for the three SPH codes and analyzed their
performance on a modern HPC system, Piz Daint. The performance profiling and
scalability analysis conducted on the three parent codes allowed us to
expose their performance issues, such as load imbalance, in both MPI and OpenMP.
Two-level load balancing has been successfully applied to SPHYNX to overcome
its load imbalance. The performance analysis shapes and drives the design of
the SPH-EXA mini-app towards the use of efficient parallelization methods,
fault-tolerance mechanisms, and load-balancing approaches.
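The core computation shared by all three codes is the SPH density estimate: each particle's density is a kernel-weighted sum over its neighbors' masses. A minimal Python sketch of that idea follows (the cubic spline kernel and its 3D normalization are standard SPH conventions, not code taken from SPHYNX, ChaNGa, or the mini-app):

```python
import math
import numpy as np

def cubic_spline_w(r, h):
    """3D cubic spline kernel, a classic SPH smoothing kernel; q = r/h."""
    sigma = 1.0 / (math.pi * h**3)   # 3D normalization constant
    q = r / h
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q**2 + 0.75 * q**3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q)**3
    return 0.0                        # compact support: W = 0 beyond 2h

def sph_density(positions, masses, h):
    """Density estimate rho_i = sum_j m_j * W(|r_i - r_j|, h).
    O(N^2) pairwise loop for clarity; production SPH codes use tree or
    cell-list neighbor search instead."""
    positions = np.asarray(positions, dtype=float)
    rho = np.zeros(len(positions))
    for i, ri in enumerate(positions):
        for j, rj in enumerate(positions):
            rho[i] += masses[j] * cubic_spline_w(np.linalg.norm(ri - rj), h)
    return rho
```

The neighbor-sum structure is exactly why load balancing matters at scale: particles cluster, so a naive domain decomposition leaves some MPI ranks (and OpenMP threads within them) with far more neighbor interactions than others.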
Decoupled Model Schedule for Deep Learning Training
Recent years have seen an increase in the development of large deep learning
(DL) models, which makes training efficiency crucial. Common practice
struggles with the trade-off between usability and performance. On one
hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model
development, at the price of sub-optimal training performance. On the other
hand, practitioners propose various approaches to improve training
efficiency by sacrificing some flexibility, ranging from making the graph
static for more thorough optimization (e.g., XLA) to customizing
optimizations for large-scale distributed training (e.g., DeepSpeed and
Megatron-LM).
In this paper, we aim to address the tension between usability and training
efficiency through separation of concerns. Inspired by DL compilers that
decouple the platform-specific optimizations of a tensor-level operator from
its arithmetic definition, this paper proposes a schedule language to decouple
model execution from definition. Specifically, the schedule works on a PyTorch
model and uses a set of schedule primitives to convert the model for common
model training optimizations such as high-performance kernels, effective 3D
parallelism, and efficient activation checkpointing. Compared to existing
optimization solutions, we optimize the model as needed through high-level
primitives and thus largely preserve programmability and debuggability for
users. Our evaluation shows that by scheduling the existing hand-crafted
optimizations in a systematic way, we are able to improve training
throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and
by up to 1.32x on multiple machines with up to 64 GPUs, compared to the
out-of-the-box performance of DeepSpeed and Megatron-LM.
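The schedule/definition split can be illustrated with a toy sketch (hypothetical names and primitives; the paper's actual schedule operates on real PyTorch modules): the model definition stays untouched while a schedule records transformations such as kernel replacement and activation checkpointing, applying them only when the optimized model is built.

```python
class Schedule:
    """Toy decoupled schedule: transformations are recorded against an
    unchanged model definition and applied lazily at build time."""

    def __init__(self, ops):
        self.ops = list(ops)       # model definition: ordered op names
        self.transforms = []       # recorded schedule primitives

    def replace(self, old, new):
        """Swap an op for a high-performance kernel (recorded, not applied)."""
        self.transforms.append(("replace", old, new))
        return self

    def checkpoint(self, op):
        """Mark an op for activation checkpointing (recorded, not applied)."""
        self.transforms.append(("checkpoint", op))
        return self

    def build(self):
        """Apply the recorded transforms to a copy of the definition."""
        ops = list(self.ops)
        for kind, *args in self.transforms:
            if kind == "replace":
                old, new = args
                ops = [new if o == old else o for o in ops]
            elif kind == "checkpoint":
                ops = [f"ckpt({o})" if o == args[0] else o for o in ops]
        return ops

sch = Schedule(["embed", "attention", "mlp"])
sch.replace("attention", "flash_attention").checkpoint("mlp")
print(sch.build())  # ['embed', 'flash_attention', 'ckpt(mlp)']
```

The point of the separation is visible in the last two lines: `sch.ops` is never mutated, so the same definition can be rebuilt under different schedules for different hardware.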
Decompiling x86 Deep Neural Network Executables
Due to their widespread use on heterogeneous hardware devices, deep learning
(DL) models are compiled into executables by DL compilers to fully leverage
low-level hardware primitives. This approach allows DL computations to be
undertaken at low cost across a variety of computing platforms, including CPUs,
GPUs, and various hardware accelerators.
We present BTD (Bin to DNN), a decompiler for deep neural network (DNN)
executables. BTD takes DNN executables and outputs full model specifications,
including types of DNN operators, network topology, dimensions, and parameters
that are (nearly) identical to those of the input models. BTD delivers a
practical framework to process DNN executables compiled by different DL
compilers and with full optimizations enabled on x86 platforms. It employs
learning-based techniques to infer DNN operators, dynamic analysis to reveal
network architectures, and symbolic execution to facilitate inferring
dimensions and parameters of DNN operators.
Our evaluation reveals that BTD enables accurate recovery of full
specifications of complex DNNs with millions of parameters (e.g., ResNet). The
recovered DNN specifications can be re-compiled into a new DNN executable
exhibiting identical behavior to the input executable. We show that BTD can
boost two representative attacks, adversarial example generation and knowledge
stealing, against DNN executables. We also demonstrate cross-architecture
legacy code reuse using BTD, and envision BTD being used for other critical
downstream tasks like DNN security hardening and patching.

Comment: Extended version of a paper to appear in the Proceedings of the
32nd USENIX Security Symposium (USENIX Security '23), 2023; 25 pages.
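One ingredient of the dimension-inference step can be illustrated with a toy example (purely hypothetical and far simpler than BTD's symbolic execution): once an operator in the executable is classified as a dense layer with bias, its output width is fully determined by the input width and the number of parameter scalars read from the binary.

```python
def infer_dense_dims(n_in, n_params):
    """Toy dimension inference: a dense (fully connected) layer with bias,
    flattened to n_params scalars, satisfies n_params = n_in * m + m for
    output width m. Returns m, or None if no integer solution exists
    (i.e., the parameter count is inconsistent with this operator type)."""
    m, rem = divmod(n_params, n_in + 1)
    return m if rem == 0 else None

# A 784 -> 128 dense layer with bias stores 784*128 + 128 = 100480 scalars.
print(infer_dense_dims(784, 100480))   # 128
print(infer_dense_dims(784, 100481))   # None: count inconsistent with a dense layer
```

Real DNN executables require much more machinery (operator classification, tracing memory accesses, symbolic constraints over strides), but the same principle applies: parameter-buffer sizes over-constrain the unknown dimensions.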
A Compiler and Runtime Infrastructure for Automatic Program Distribution
This paper presents the design and the implementation of a compiler and runtime infrastructure for automatic program distribution. We are building a research infrastructure that enables experimentation with various program partitioning and mapping strategies and the study of automatic distribution's effect on resource consumption (e.g., CPU, memory, communication). Since many optimization techniques are faced with conflicting optimization targets (e.g., memory and communication), we believe that it is important to be able to study their interaction.
We present a set of techniques that enable flexible resource modeling and program distribution. These are: dependence analysis, weighted graph partitioning, code and communication generation, and profiling. We have developed these ideas in the context of the Java language. We present in detail the design and implementation of each of the techniques as part of our compiler and runtime infrastructure. Then, we evaluate our design and present preliminary experimental data for each component, as well as for the entire system.
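The weighted-graph-partitioning step can be sketched as follows (a hypothetical greedy heuristic for illustration, not the authors' algorithm): program units become nodes weighted by CPU cost, data dependences become edges weighted by communication volume, and each node is placed on the side that minimizes added communication plus a load-balance penalty.

```python
def greedy_bisect(node_w, edges):
    """Bisect a weighted program graph across two hosts.
    node_w: {node: cpu_weight}; edges: {(u, v): communication_weight}.
    Returns (assignment {node: 0 or 1}, total cut communication weight)."""
    part, load = {}, [0.0, 0.0]
    # Place the heaviest computations first so balance decisions matter most.
    for n in sorted(node_w, key=node_w.get, reverse=True):
        cost = [0.0, 0.0]   # communication incurred by each placement choice
        for (u, v), w in edges.items():
            other = v if u == n else (u if v == n else None)
            if other in part:
                # Placing n opposite an already-placed neighbor costs w.
                cost[1 - part[other]] += w
        # Trade edge cut against load balance with a simple additive penalty.
        side = min((0, 1), key=lambda s: cost[s] + load[s])
        part[n] = side
        load[side] += node_w[n]
    cut = sum(w for (u, v), w in edges.items() if part[u] != part[v])
    return part, cut

part, cut = greedy_bisect(
    {"a": 1, "b": 1, "c": 1, "d": 1},
    {("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 1},
)
print(part, cut)  # keeps the chatty pairs (a,b) and (c,d) together; cut = 1
```

This is exactly the conflicting-objectives situation the paper describes: the additive penalty is one arbitrary way to weigh memory/CPU balance against communication, and different weightings yield different distributions, which is what an experimentation infrastructure lets one study.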