Energy-Aware DNN Graph Optimization
Unlike existing work on deep neural network (DNN) graph optimization for
inference performance, we explore DNN graph optimization for energy awareness
and savings for power- and resource-constrained machine learning devices. We
present a method that allows users to optimize energy consumption or balance
between energy and inference performance for DNN graphs. This method
efficiently searches through the space of equivalent graphs, and identifies a
graph and the corresponding algorithms that incur the least cost in execution.
We implement the method and evaluate it with multiple DNN models on a GPU-based
machine. Results show that our method achieves significant energy savings,
i.e., 24%, with negligible performance impact.
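As a rough sketch of the trade-off this enables (not the paper's implementation), the snippet below scores hypothetical equivalent-graph variants by a user-weighted, normalized combination of measured energy and latency and picks the cheapest; all variant names and numbers are invented.

```python
# Minimal sketch of energy/performance trade-off selection over equivalent
# DNN graph variants. Assumes each variant's energy and latency were measured.
from dataclasses import dataclass

@dataclass
class GraphVariant:
    name: str          # identifier for an equivalent rewritten graph
    energy_j: float    # measured energy per inference (joules)
    latency_ms: float  # measured inference latency (milliseconds)

def best_variant(variants, alpha=1.0):
    """Return the variant minimizing alpha*energy + (1-alpha)*latency.

    alpha=1.0 optimizes purely for energy; alpha=0.0 purely for speed.
    Both objectives are normalized so they are comparable.
    """
    max_e = max(v.energy_j for v in variants)
    max_l = max(v.latency_ms for v in variants)
    def cost(v):
        return alpha * v.energy_j / max_e + (1 - alpha) * v.latency_ms / max_l
    return min(variants, key=cost)

if __name__ == "__main__":
    candidates = [
        GraphVariant("original", energy_j=1.00, latency_ms=10.0),
        GraphVariant("fused-conv-bn", energy_j=0.80, latency_ms=10.2),
        GraphVariant("wider-tiles", energy_j=0.76, latency_ms=11.5),
    ]
    print(best_variant(candidates, alpha=1.0).name)  # energy-only
    print(best_variant(candidates, alpha=0.5).name)  # balanced
```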
A Learned Performance Model for Tensor Processing Units
Accurate hardware performance models are critical to efficient code
generation. They can be used by compilers to make heuristic decisions, by
superoptimizers as a minimization objective, or by autotuners to find an
optimal configuration for a specific program. However, they are difficult to
develop because contemporary processors are complex, and the recent
proliferation of deep learning accelerators has increased the development
burden. We demonstrate a method of learning performance models from a corpus of
tensor computation graph programs for Tensor Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical
performance model on two tasks -- tile-size selection and operator fusion --
and that it helps an autotuner discover faster programs in a setting where
access to TPUs is limited or expensive.
Comment: A version will appear in the Proceedings of the 4th MLSys Conference, San Jose, CA, USA, 2021.
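Schematically, a learned cost model is a regression from program features to measured runtime whose predictions then serve as a minimization objective. The toy below fits a linear least-squares model over invented kernel features and uses it to rank tile sizes, standing in for the paper's neural model over tensor computation graphs.

```python
# Toy learned cost model: fit runtime from kernel features, then use the
# model to pick a tile size, as an autotuner would. All data is invented.
import numpy as np

# Hypothetical training data: features = [flops, bytes_moved, tile_size],
# target = measured runtime (us) on the accelerator.
X = np.array([
    [1e6, 4.0e5,  8],
    [1e6, 3.0e5, 16],
    [1e6, 2.5e5, 32],
    [4e6, 1.2e6, 16],
    [4e6, 1.0e6, 32],
], dtype=float)
y = np.array([120.0, 95.0, 90.0, 330.0, 300.0])

# Fit least-squares weights (with a bias column).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_runtime(flops, bytes_moved, tile_size):
    return np.array([flops, bytes_moved, tile_size, 1.0]) @ w

# Use predictions as a minimization objective: choose the tile size with
# the lowest predicted runtime for a new kernel (assumed feature model).
candidates = [8, 16, 32, 64]
best = min(candidates, key=lambda t: predict_runtime(2e6, 6e5 / (t / 8), t))
print("predicted-best tile size:", best)
```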
Database Meets Deep Learning: Challenges and Opportunities
Deep learning has recently become very popular on account of its incredible
success in many complex data-driven applications, such as image classification
and speech recognition. The database community has worked on data-driven
applications for many years, and therefore should be playing a lead role in
supporting this new wave. However, databases and deep learning are different in
terms of both techniques and applications. In this paper, we discuss research
problems at the intersection of the two fields. In particular, we discuss
possible improvements for deep learning systems from a database perspective,
and analyze database applications that may benefit from deep learning
techniques.
Comment: The first version of this paper has appeared in SIGMOD Record. In this (third) version, we extend it to include the recent developments in this field and references to recent work (especially for section 3.2 and section 4.2).
Autonomous Navigation via Deep Reinforcement Learning for Resource Constraint Edge Nodes using Transfer Learning
Smart and agile drones are fast becoming ubiquitous at the edge of the cloud.
The use of these drones is constrained by their limited power and compute
capability. In this paper, we present a Transfer Learning (TL) based approach
to reduce on-board computation required to train a deep neural network for
autonomous navigation via Deep Reinforcement Learning for a target algorithmic
performance. A library of realistic 3D meta-environments is manually designed
using Unreal Engine, and the network is trained end-to-end. These trained
meta-weights are then used to initialize the network in a test environment,
where only the last few fully connected layers are fine-tuned. We vary drone
dynamics and environmental characteristics to show the robustness of the
approach. Profiling with the NVIDIA GPU profiler shows that energy consumption
and training latency are reduced by 3.7x and 1.8x, respectively, without
significant degradation in algorithmic performance measured as average distance
traveled before a crash, i.e., Mean Safe Flight (MSF). The approach is also
tested in a real environment using a DJI Tello drone, with similar results.
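A minimal sketch of the fine-tuning recipe, assuming a PyTorch policy network: initialize from the meta-weights, freeze the convolutional feature extractor, and train only the last fully connected layers. The architecture, sizes, and checkpoint path are placeholders, not the paper's network.

```python
# Sketch of transfer learning for on-board fine-tuning: freeze the feature
# extractor, train only the fully connected head. Placeholder architecture.
import torch
import torch.nn as nn

policy = nn.Sequential(
    nn.Conv2d(3, 16, 5), nn.ReLU(), nn.Flatten(),  # feature extractor
    nn.Linear(16 * 28 * 28, 128), nn.ReLU(),       # fully connected head
    nn.Linear(128, 4),                             # action logits
)

# 1) Initialize from meta-weights trained on the simulated meta-environments
#    ("meta_weights.pt" is a hypothetical checkpoint path).
# policy.load_state_dict(torch.load("meta_weights.pt"))

# 2) Freeze the convolutional layers so only the last fully connected
#    layers are updated, cutting on-board computation during fine-tuning.
for module in list(policy.children())[:3]:  # Conv2d, ReLU, Flatten
    for p in module.parameters():
        p.requires_grad = False

trainable = [p for p in policy.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "parameters left to fine-tune")
```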
IOS: Inter-Operator Scheduler for CNN Acceleration
To accelerate CNN inference, existing deep learning frameworks focus on
optimizing intra-operator parallelization. However, a single operator can no
longer fully utilize the available parallelism given the rapid advances in
high-performance hardware, resulting in a large gap between the peak
performance and the real performance. This performance gap is more severe under
smaller batch sizes. In this work, we extensively study the parallelism between
operators and propose Inter-Operator Scheduler (IOS) to automatically schedule
multiple operators' parallel execution through a novel dynamic programming
algorithm. IOS consistently outperforms state-of-the-art libraries (e.g.,
TensorRT) by 1.1x to 1.5x on modern CNN benchmarks. The code to reproduce each
experiment is available at:
https://github.com/mit-han-lab/inter-operator-scheduler
Comment: Accepted by MLSys 2021.
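To give the flavor of inter-operator scheduling (a toy, not IOS's actual dynamic program over computation-graph stages), the sketch below partitions a few independent operators into concurrent stages and picks the partition with the lowest total latency; the latencies, concurrency limit, and stage overhead are assumed values.

```python
# Toy dynamic program: group independent operators into parallel stages.
from functools import lru_cache

# Assumed per-operator latencies (ms) for four independent operators.
latency = {"convA": 4.0, "convB": 3.5, "pool": 1.0, "conv1x1": 1.5}
ops = tuple(latency)
MAX_PAR = 2     # assumed: device saturates beyond two concurrent operators
OVERHEAD = 0.3  # assumed per-stage launch/synchronization cost (ms)

def stage_cost(stage):
    if len(stage) <= MAX_PAR:
        return max(latency[o] for o in stage) + OVERHEAD
    return sum(latency[o] for o in stage) + OVERHEAD  # over-subscribed: serial

@lru_cache(maxsize=None)
def best(remaining):
    """Minimum latency to run `remaining`, plus the stage partition used."""
    if not remaining:
        return 0.0, ()
    first, rest = remaining[0], remaining[1:]
    best_cost, best_plan = float("inf"), ()
    # The first remaining op opens a new stage; try every subset of the rest
    # as its stage-mates (fine for a handful of ops, as in one CNN block).
    for mask in range(1 << len(rest)):
        stage = (first,) + tuple(o for i, o in enumerate(rest) if mask >> i & 1)
        leftover = tuple(o for i, o in enumerate(rest) if not mask >> i & 1)
        sub_cost, sub_plan = best(leftover)
        cost = stage_cost(stage) + sub_cost
        if cost < best_cost:
            best_cost, best_plan = cost, (stage,) + sub_plan
    return best_cost, best_plan

cost, plan = best(ops)
print(f"{cost:.1f} ms:", plan)  # e.g., convA||convB, then pool||conv1x1
```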
Ordering Chaos: Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Recent advances demonstrate that irregularly wired neural networks from
Neural Architecture Search (NAS) and Random Wiring can not only automate the
design of deep neural networks but also emit models that outperform previous
manual designs. These designs are especially effective for neural architectures
under hard resource constraints (memory, MACs, etc.), which highlights the
importance of this class of neural networks. However, such irregularity
complicates the previously streamlined pattern of execution. In fact, one of
the main challenges is that the order in which the nodes of the neural network
are executed significantly affects the memory footprint of the intermediate
activations. Current compilers do not schedule with regard to activation memory
footprint, so the peak footprint grows significantly beyond the optimum,
rendering such models inapplicable to edge devices. To address this
long-standing issue, we present a memory-aware compiler, dubbed SERENITY, that
utilizes dynamic programming to find a schedule with the optimal peak memory
footprint. Our solution also comprises a graph-rewriting technique that allows
further reduction beyond this optimum. As such, SERENITY achieves optimal peak
memory, and graph rewriting improves on it further, yielding a 1.68x
improvement with the dynamic-programming-based scheduler alone and 1.86x with
graph rewriting, against TensorFlow Lite, with less than one minute of
overhead.
Comment: Published as a conference paper at MLSys 2020 (Oral Presentation).
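The scheduling idea can be sketched compactly: a dynamic program over sets of already-executed nodes finds an execution order that minimizes peak live-activation memory. The example below is a minimal illustration on an invented graph, not SERENITY's implementation, and it omits the graph-rewriting step.

```python
# Memory-aware scheduling sketch: DP over "already-executed" node sets.
from functools import lru_cache

# Made-up irregular graph: producer -> consumers, activation sizes in KB.
succ = {"in": ["a", "b"], "a": ["c"], "b": ["c", "d"],
        "c": ["out"], "d": ["out"], "out": []}
pred = {n: [] for n in succ}
for u, vs in succ.items():
    for v in vs:
        pred[v].append(u)
size = {"in": 64, "a": 128, "b": 32, "c": 64, "d": 32, "out": 8}
nodes = tuple(succ)

def live(done):
    # An activation stays resident until all of its consumers have run.
    return sum(size[u] for u in done if any(v not in done for v in succ[u]))

@lru_cache(maxsize=None)
def best_peak(done):
    """Minimum achievable peak memory to finish the rest of the graph."""
    done_set = set(done)
    if len(done_set) == len(nodes):
        return 0
    best = float("inf")
    for n in nodes:
        if n not in done_set and all(p in done_set for p in pred[n]):
            # While n executes, its output coexists with everything live.
            here = size[n] + live(done_set)
            rest = best_peak(tuple(sorted(done_set | {n})))
            best = min(best, max(here, rest))
    return best

print("optimal peak activation memory:", best_peak(()), "KB")
```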
Data Movement Is All You Need: A Case Study on Optimizing Transformers
Transformers are one of the most important machine learning workloads today.
Training one is a very compute-intensive task, often taking days or weeks, and
significant attention has been given to optimizing transformers. Despite this,
existing implementations do not efficiently utilize GPUs. We find that data
movement is the key bottleneck when training. Due to Amdahl's Law and massive
improvements in compute performance, training has now become memory-bound.
Further, existing frameworks use suboptimal data layouts. Using these insights,
we present a recipe for globally optimizing data movement in transformers. We
reduce data movement by up to 22.91% and overall achieve a 1.30x performance
improvement over state-of-the-art frameworks when training a BERT encoder layer
and 1.19x for the entire BERT. Our approach is applicable more broadly to
optimizing deep neural networks, and offers insight into how to tackle emerging
performance bottlenecks.
Comment: 22 pages, 8 figures; MLSys 2021 camera-ready.
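A back-of-the-envelope illustration of why fusion pays off for the memory-bound elementwise operators around a matmul: unfused, each operator re-reads and re-writes the full activation tensor. The shapes and byte counts below are assumptions for illustration, not the paper's measurements.

```python
# Estimate data movement for y = gelu(x @ W + b), unfused vs. fused epilogue.
batch_seq, hidden = 32 * 512, 4096  # assumed feed-forward output shape
elems = batch_seq * hidden
bytes_per = 2  # fp16

# Unfused: matmul writes the activation, bias-add reads+writes it,
# GELU reads+writes it again: 5 full passes over memory.
unfused = elems * bytes_per * (1 + 2 + 2)
# Fused bias+GELU epilogue: the activation crosses memory once.
fused = elems * bytes_per * 1

gib = lambda b: b / 2**30
print(f"unfused: {gib(unfused):.2f} GiB moved, fused: {gib(fused):.2f} GiB")
print(f"reduction for this elementwise chain: {unfused / fused:.0f}x")
```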
Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training
Modern deep neural network (DNN) training jobs use complex and heterogeneous
software/hardware stacks. The efficacy of software-level optimizations can vary
significantly when used in different deployment configurations. It is onerous
and error-prone for ML practitioners and system developers to implement each
optimization separately, and determine which ones will improve performance in
their own configurations. Unfortunately, existing profiling tools do not aim to
answer predictive questions such as "How will optimization X affect the
performance of my model?". We address this critical limitation and propose a
new profiling tool, Daydream, to help programmers efficiently explore the
efficacy of DNN optimizations. Daydream models DNN execution with a
fine-grained dependency graph based on low-level traces collected by CUPTI, and
predicts runtime by simulating execution based on the dependency graph.
Daydream maps the low-level traces using DNN domain-specific knowledge, and
introduces a set of graph-transformation primitives that can easily model a
wide variety of optimizations. We show that Daydream is able to model most
mainstream DNN optimization techniques, and accurately predict the efficacy of
optimizations that will result in significant performance improvements.
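The what-if idea can be sketched in a few lines: model an iteration as a dependency graph of timed tasks, predict runtime by simulating earliest finish times, then apply a graph transformation (here, halving one kernel's duration) and re-simulate. The trace below is invented; Daydream itself builds its graph from real CUPTI traces.

```python
# Miniature what-if analysis over a dependency graph of timed tasks.
# Hypothetical trace: task -> (duration_ms, dependencies).
trace = {
    "load_batch": (2.0, []),
    "fwd_conv":   (5.0, ["load_batch"]),
    "fwd_fc":     (1.5, ["fwd_conv"]),
    "bwd_fc":     (2.0, ["fwd_fc"]),
    "bwd_conv":   (6.0, ["bwd_fc"]),
    "allreduce":  (4.0, ["bwd_fc"]),  # overlaps with bwd_conv
    "optimizer":  (1.0, ["bwd_conv", "allreduce"]),
}

def simulate(graph):
    """Earliest-finish-time simulation; returns predicted iteration time."""
    finish = {}
    def done(t):
        if t not in finish:
            dur, deps = graph[t]
            finish[t] = max((done(d) for d in deps), default=0.0) + dur
        return finish[t]
    return max(done(t) for t in graph)

baseline = simulate(trace)

# Graph-transformation primitive: scale a selected kernel (modeling a
# hand-optimized conv backward pass at half the duration).
whatif = {t: (d * (0.5 if t == "bwd_conv" else 1.0), deps)
          for t, (d, deps) in trace.items()}

# Because allreduce overlaps with bwd_conv, the predicted end-to-end gain
# is smaller than the 2x kernel speedup: exactly the question Daydream asks.
print(f"baseline {baseline:.1f} ms, predicted {simulate(whatif):.1f} ms")
```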