vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
The most widely used machine learning frameworks require users to carefully
tune their memory usage so that the deep neural network (DNN) fits into the
DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to
study different machine learning algorithms, forcing them to either use a less
desirable network architecture or parallelize the processing across multiple
GPUs. We propose a runtime memory manager that virtualizes the memory usage of
DNNs such that both GPU and CPU memory can simultaneously be utilized for
training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory
usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a
significant reduction in the memory requirements of DNNs. Similar experiments on
VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the
memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256
(requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card
containing 12 GB of memory, with 18% performance loss compared to a
hypothetical, oracular GPU with enough memory to hold the entire DNN.

Comment: Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016.
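The core mechanism is conceptually simple: feature maps produced in the forward pass are offloaded to pinned host memory, then prefetched back to the GPU just before the backward pass needs them, so transfers overlap with computation. Below is a minimal sketch of that idea, assuming a PyTorch-style API; the helper names are illustrative, and the paper's actual implementation operates at the cuDNN level.

```python
# Minimal sketch of vDNN's offload/prefetch idea, assuming a PyTorch-style
# API; helper names are illustrative, not the paper's implementation.
import torch

offload_stream = torch.cuda.Stream()

def offload(feature_map, store):
    # Copy a layer's output to pinned host memory asynchronously, so the
    # transfer overlaps with the forward pass of later layers.
    host_buf = torch.empty(feature_map.shape, dtype=feature_map.dtype,
                           device='cpu', pin_memory=True)
    with torch.cuda.stream(offload_stream):
        host_buf.copy_(feature_map, non_blocking=True)
    store.append(host_buf)

def prefetch(store):
    # Bring a feature map back to the GPU just before its backward pass.
    host_buf = store.pop()
    with torch.cuda.stream(offload_stream):
        return host_buf.to('cuda', non_blocking=True)
```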
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This setup allows us to study the effects of our
undervolting technique under both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by the FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.

Comment: To appear at the DSN 2020 conference.
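For rough intuition on why undervolting pays off, consider that dynamic power scales approximately with V² · f, so lowering the supply voltage at a fixed clock frequency improves GOPs/W roughly quadratically until timing faults appear. The sketch below is a back-of-envelope illustration of that scaling, not the paper's measurements; the voltage values are hypothetical.

```python
# Back-of-envelope sketch, not the paper's measurements: dynamic power
# scales roughly with V^2 * f, so at a fixed clock frequency the GOPs/W
# gain from undervolting is roughly (V_nominal / V)^2.
V_NOM = 0.85  # hypothetical nominal supply voltage (V)
V_MIN = 0.50  # hypothetical lowest voltage before timing faults (V)

def gops_per_watt_gain(v_nom, v):
    # Throughput is unchanged at fixed frequency; power drops with V^2.
    return (v_nom / v) ** 2

print(f"~{gops_per_watt_gain(V_NOM, V_MIN):.1f}x GOPs/W at {V_MIN} V")
```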
HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC
This paper presents HALO 1.0, an open-ended extensible multi-agent software
framework that implements a set of proposed hardware-agnostic accelerator
orchestration (HALO) principles. HALO implements a novel compute-centric
message passing interface (C^2MPI) specification for enabling the
performance-portable execution of a hardware-agnostic host application across
heterogeneous accelerators. The experimental results of evaluating eight widely
used HPC subroutines based on Intel Xeon E5-2620 CPUs, Intel Arria 10 GX FPGAs,
and NVIDIA GeForce RTX 2080 Ti GPUs show that HALO 1.0 allows for a unified
control flow for host programs to run across all the computing devices with a
consistently top performance portability score, which is up to five orders of
magnitude higher than the OpenCL-based solution.

Comment: 21 pages.
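The key idea behind the compute-centric interface is that the host addresses compute resources uniformly, by capability rather than by device type, so one control flow serves CPUs, FPGAs, and GPUs alike. The following is a hypothetical sketch of that pattern; every name in it is illustrative and is not HALO's actual C^2MPI API.

```python
# Hypothetical sketch in the spirit of C^2MPI; all names are illustrative,
# not HALO's actual API. The host dispatches the same kernel request to any
# endpoint, so the control flow is identical across device types.
class ComputeEndpoint:
    def __init__(self, name):
        self.name = name

    def send(self, kernel, payload):
        # A real runtime would marshal the payload to a device-specific
        # agent (OpenCL, CUDA, ...) behind this uniform interface.
        print(f"dispatch {kernel} -> {self.name}")
        return f"result of {kernel} on {self.name}"

def run_subroutine(endpoints, kernel, payload):
    # Same host code regardless of which accelerator serves the request.
    ep = endpoints[0]  # a real orchestrator would pick by load/capability
    return ep.send(kernel, payload)

endpoints = [ComputeEndpoint("fpga0"), ComputeEndpoint("gpu0"),
             ComputeEndpoint("cpu0")]
run_subroutine(endpoints, "gemm", payload=b"...")
```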
Multi-tier GPU virtualization for deep learning in cloud-edge systems
Accelerator virtualization offers several advantages in the context of cloud-edge computing. Relatively weak user devices can enhance performance when running workloads by accessing virtualized accelerators available on other resources in the cloud-edge continuum. However, cloud-edge systems are heterogeneous, often leading to compatibility issues arising from the various hardware and software stacks present in the system. One mechanism to alleviate this issue is using containers for deploying workloads. Containers isolate applications and their dependencies and store them as images that can run on any device. In addition, user devices may move during the course of application execution, so mechanisms such as container migration are required to move running workloads from one resource to another in the network. Furthermore, an optimal destination must be determined when migrating between virtual accelerators. Scheduling and placement strategies are incorporated to choose the best possible location depending on the workload requirements. This paper presents AVEC, a framework for accelerator virtualization in cloud-edge computing. The AVEC framework enables the offloading of deep learning workloads for inference from weak user devices to computationally more powerful devices in a cloud-edge network. AVEC incorporates a mechanism that efficiently manages and schedules the virtualization of accelerators. It also supports migration between accelerators to enable stateless container migration. The experimental analysis highlights that AVEC can achieve up to 7x speedup by offloading applications to remote resources, with a migration downtime of less than 5 seconds.
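A placement decision of the kind the abstract describes amounts to picking the accelerator that minimizes estimated end-to-end latency. The sketch below illustrates one simple way to frame that choice (network round trip + queueing + transfer + compute); it is an illustrative model, not AVEC's actual scheduling algorithm, and all numbers are hypothetical.

```python
# Illustrative placement sketch in the spirit of AVEC's scheduler (not the
# paper's algorithm): offload to the accelerator that minimizes estimated
# end-to-end latency = round trip + queueing + transfer + compute.
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    gops: float      # sustained throughput of the virtualized accelerator
    rtt_ms: float    # round-trip time from the user device
    queue_ms: float  # current queueing delay at the resource

def estimate_ms(acc, workload_gop, payload_mb, bw_mbps):
    transfer = payload_mb * 8.0 / bw_mbps * 1000.0   # input shipping time
    compute = workload_gop / acc.gops * 1000.0       # inference time
    return acc.rtt_ms + acc.queue_ms + transfer + compute

def place(accelerators, workload_gop, payload_mb, bw_mbps):
    return min(accelerators,
               key=lambda a: estimate_ms(a, workload_gop, payload_mb, bw_mbps))

edge = Accelerator("edge-gpu", gops=500, rtt_ms=5, queue_ms=2)
cloud = Accelerator("cloud-gpu", gops=4000, rtt_ms=40, queue_ms=10)
print(place([edge, cloud], workload_gop=30, payload_mb=2, bw_mbps=100).name)
```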