Memory-efficient array redistribution through portable collective communication
Modern large-scale deep learning workloads highlight the need for parallel
execution across many devices in order to fit model data into hardware
accelerator memories. In these settings, array redistribution may be required
during a computation, but can also become a bottleneck if not done efficiently.
In this paper we address the problem of redistributing multi-dimensional array
data in SPMD computations, the most prevalent form of parallelism in deep
learning. We present a type-directed approach to synthesizing array
redistributions as sequences of MPI-style collective operations. We prove
formally that our synthesized redistributions are memory-efficient and perform
no excessive data transfers. Array redistribution for SPMD computations using
collective operations has also been implemented in the context of the XLA SPMD
partitioner, a production-grade tool for partitioning programs across
accelerator systems. We evaluate our approach against the XLA implementation
and find that our approach delivers a geometric mean speedup of ,
with maximum speedups as high as , while offering provable memory
guarantees, making our system particularly appealing for large-scale models.
Comment: minor errata fixed
BATS: Binary ArchitecTure Search
This paper proposes Binary ArchitecTure Search (BATS), a framework that
drastically reduces the accuracy gap between binary neural networks and their
real-valued counterparts by means of Neural Architecture Search (NAS). We show
that directly applying NAS to the binary domain provides very poor results. To
alleviate this, we describe, to our knowledge, for the first time, the 3 key
ingredients for successfully applying NAS to the binary domain. Specifically,
we (1) introduce and design a novel binary-oriented search space, (2) propose a
new mechanism for controlling and stabilising the resulting searched
topologies, (3) propose and validate a series of new search strategies for
binary networks that lead to faster convergence and lower search times.
Experimental results demonstrate the effectiveness of the proposed approach and
the necessity of searching in the binary space directly. Moreover, (4) we set a
new state-of-the-art for binary neural networks on CIFAR10, CIFAR100 and
ImageNet datasets. Code will be made available at
https://github.com/1adrianb/binary-nas
Comment: accepted to ECCV 202
Experimental comparison of features and classifiers for Android malware detection
National Research Foundation (NRF) Singapore
PartIR: Composing SPMD Partitioning Strategies for Machine Learning
Training of modern large neural networks (NN) requires a combination of
parallelization strategies encompassing data, model, or optimizer sharding.
When strategies increase in complexity, it becomes necessary for partitioning
tools to be 1) expressive, allowing the composition of simpler strategies, and
2) predictable to estimate performance analytically. We present PartIR, our
design for a NN partitioning system. PartIR is focused on an incremental
approach to rewriting and is hardware-and-runtime agnostic. We present a simple
but powerful API for composing sharding strategies and a simulator to validate
them. The process is driven by high-level programmer-issued partitioning
tactics, which can be both manual and automatic. Importantly, the tactics are
specified separately from the model code, making them easy to change. We
evaluate PartIR on several different models to demonstrate its predictability,
expressibility, and ability to reach peak performance.
DAMO: Deep Agile Mask Optimization for Full Chip Scale
Continuous scaling of VLSI systems poses a great challenge for
manufacturing, and optical proximity correction (OPC) is widely applied in
conventional design flows for manufacturability optimization. Traditional
techniques perform OPC by leveraging a lithography model and suffer from
prohibitive computational overhead; most focus on optimizing a single
clip without addressing how to tackle the full chip. In this paper, we present
DAMO, a high performance and scalable deep learning-enabled OPC system for full
chip scale. It is an end-to-end mask optimization paradigm which contains a
Deep Lithography Simulator (DLS) for lithography modeling and a Deep Mask
Generator (DMG) for mask pattern generation. Moreover, a novel layout splitting
algorithm customized for DAMO is proposed to handle the full chip OPC problem.
Extensive experiments show that DAMO outperforms the state-of-the-art OPC
solutions from both academia and commercial industrial toolkits.
Predicting the Propagation of Acoustic Waves using Deep Convolutional Neural Networks
A novel approach for numerically propagating acoustic waves in two-dimensional quiescent media has been developed through a fully convolutional multi-scale neural network. This data-driven method produced accurate results for long simulation times with a database of Lattice Boltzmann temporal simulations of propagating Gaussian pulses, even for initial conditions unseen during training, such as the plane wave configuration or two initial Gaussian pulses of opposed amplitudes. Two different choices of optimization objective are compared, with prediction accuracy improving when the spatial gradient difference error is added to the traditional mean squared error loss function. Further accuracy gains are observed when performing an a posteriori correction on the neural network prediction based on the conservation of acoustic energy, indicating the benefit of including physical information in data-driven methods.
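The combined objective mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the function names are hypothetical, the spatial gradients are taken as forward finite differences, the gradient term is a mean squared difference, and `weight` is an illustrative weighting factor. Fields are plain lists of rows to keep the example dependency-free.

```python
def mse(pred, target):
    """Mean squared error over a 2D field stored as a list of rows."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / n


def forward_diff(field):
    """Forward finite differences along x (columns) and y (rows)."""
    dx = [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in field]
    dy = [[field[i + 1][j] - field[i][j] for j in range(len(field[0]))]
          for i in range(len(field) - 1)]
    return dx, dy


def gradient_difference(pred, target):
    """Mean squared mismatch between the spatial gradients of two fields."""
    pdx, pdy = forward_diff(pred)
    tdx, tdy = forward_diff(target)
    return mse(pdx, tdx) + mse(pdy, tdy)


def combined_loss(pred, target, weight=1.0):
    """MSE plus a weighted spatial gradient difference term (assumed form)."""
    return mse(pred, target) + weight * gradient_difference(pred, target)
```

In this sketch, a prediction that differs from the target by a constant offset is penalized only by the MSE term, while a prediction with the wrong spatial structure is penalized by both terms.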
Constant Velocity Constraints for Self-Supervised Monocular Depth Estimation
We present a new method for self-supervised monocular depth estimation. Contemporary monocular depth estimation methods use a triplet of consecutive video frames to estimate the central depth image. We make the assumption that the ego-centric view progresses linearly in the scene, based on the kinematic and physical properties of the camera. During the training phase, we can exploit this assumption to create a depth estimation for each image in the triplet. We then apply a new geometry constraint that supports novel synthetic views, thus providing a strong supervisory signal. Our contribution is simple to implement, requires no additional trainable parameters, and produces competitive results when compared with other state-of-the-art methods on the popular KITTI corpus.