12,569 research outputs found
Scalable Methods for 8-bit Training of Neural Networks
Quantized Neural Networks (QNNs) are often used to improve network efficiency
during the inference phase, i.e. after the network has been trained. Extensive
research in the field suggests many different quantization schemes. Still, the
number of bits required, as well as the best quantization scheme, are yet
unknown. Our theoretical analysis suggests that most of the training process is
robust to substantial precision reduction, and points to only a few specific
operations that require higher precision. Armed with this knowledge, we
quantize the model parameters, activations and layer gradients to 8-bit,
leaving at a higher precision only the final step in the computation of the
weight gradients. Additionally, as QNNs require batch-normalization to be
trained at high precision, we introduce Range Batch-Normalization (BN) which
has significantly higher tolerance to quantization noise and improved
computational complexity. Our simulations show that Range BN is equivalent to
the traditional batch norm if a precise scale adjustment, which can be
approximated analytically, is applied. To the best of the authors' knowledge,
this work is the first to quantize the weights, activations, and a substantial
volume of the gradient stream, in all layers (including batch normalization) to
8-bit, while showing state-of-the-art results on the ImageNet-1K dataset.
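As a rough illustration of the Range BN idea described in this abstract (a minimal sketch, not the authors' implementation), the snippet below normalizes activations by a scaled per-channel range instead of the variance; the scale factor 1/sqrt(2 ln n), the NumPy layout, and the function name are assumptions for illustration.

```python
import numpy as np

def range_batch_norm(x, gamma, beta, eps=1e-5):
    """Range BN sketch: replace the variance estimate of standard batch norm
    with a scaled per-channel range (max - min), which is cheaper to compute
    and, per the abstract, more tolerant to quantization noise.

    x: activations of shape (N, C); gamma, beta: per-channel affine parameters.
    """
    n = x.shape[0]
    mean = x.mean(axis=0)
    centered = x - mean
    # For n roughly Gaussian samples, range / sqrt(2 * ln(n)) approximates the
    # standard deviation (assumed form of the analytic scale adjustment).
    scale = 1.0 / np.sqrt(2.0 * np.log(n))
    rng = centered.max(axis=0) - centered.min(axis=0)
    std_approx = scale * rng
    return gamma * centered / (std_approx + eps) + beta
```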
Mixed Precision Training With 8-bit Floating Point
Reduced precision computation for deep neural networks is one of the key
areas addressing the widening compute gap driven by an exponential growth in
model size. In recent years, deep learning training has largely migrated to
16-bit precision, with significant gains in performance and energy efficiency.
However, attempts to train DNNs at 8-bit precision have met with significant
challenges because of the higher precision and dynamic range requirements of
back-propagation. In this paper, we propose a method to train deep neural
networks using 8-bit floating point representation for weights, activations,
errors, and gradients. In addition to reducing compute precision, we also
reduced the precision requirements for the master copy of weights from 32-bit
to 16-bit. We demonstrate state-of-the-art accuracy across multiple data sets
(ImageNet-1K, WMT16) and a broader set of workloads (ResNet-18/34/50, GNMT,
Transformer) than previously reported. We propose an enhanced loss scaling
method to augment the reduced subnormal range of 8-bit floating point for
improved error propagation. We also examine the impact of quantization noise on
generalization and propose a stochastic rounding technique to address gradient
noise. As a result of applying all these techniques, we report slightly higher
validation accuracy compared to the full-precision baseline.
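A minimal sketch of the loss-scaling idea this abstract builds on: gradients are multiplied by a large constant before being cast to a narrow format so that small values stay above its subnormal range, then divided out again in full precision before the weight update. NumPy has no 8-bit float type, so float16 stands in here; the function name and the overflow handling are assumptions, and the paper's enhanced loss scaling and stochastic rounding are more elaborate than this.

```python
import numpy as np

def scaled_backward_cast(grad_fp32, loss_scale):
    """Loss-scaling sketch: scale, cast to a narrow format, then unscale."""
    scaled = grad_fp32 * loss_scale
    narrow = scaled.astype(np.float16)      # float16 stands in for an 8-bit float format
    if not np.all(np.isfinite(narrow)):
        return None                         # overflow: caller should lower the scale and retry
    return narrow.astype(np.float32) / loss_scale  # unscale for the master-weight update
```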
Deep Learning with Limited Numerical Precision
Training of large-scale deep neural networks is often constrained by the
available computational resources. We study the effect of limited precision
data representation and computation on neural network training. Within the
context of low-precision fixed-point computations, we observe the rounding
scheme to play a crucial role in determining the network's behavior during
training. Our results show that deep networks can be trained using only 16-bit
wide fixed-point number representation when using stochastic rounding, and
incur little to no degradation in the classification accuracy. We also
demonstrate an energy-efficient hardware accelerator that implements
low-precision fixed-point arithmetic with stochastic rounding.
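A minimal sketch of stochastic rounding to a fixed-point grid, the mechanism the abstract credits for enabling 16-bit training; the bit widths and the function name below are illustrative assumptions, not the accelerator's exact format.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16):
    """Round to a fixed-point grid stochastically: each value is rounded up or
    down with probability proportional to its distance from the two nearest
    grid points, so the rounding error is zero in expectation."""
    scale = 2.0 ** frac_bits
    scaled = x * scale
    floor = np.floor(scaled)
    prob_up = scaled - floor                        # distance to the lower grid point
    rounded = floor + (np.random.rand(*x.shape) < prob_up)
    # saturate to the representable range of a signed word_bits-wide integer
    limit = 2.0 ** (word_bits - 1)
    rounded = np.clip(rounded, -limit, limit - 1)
    return rounded / scale
```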
BitSplit-Net: Multi-bit Deep Neural Network with Bitwise Activation Function
Significant computational cost and memory requirements for deep neural
networks (DNNs) make it difficult to utilize DNNs in resource-constrained
environments. Binary neural network (BNN), which uses binary weights and binary
activations, has been gaining interest for its hardware-friendly
characteristics and minimal resource requirements. However, BNNs usually suffer
from accuracy degradation. In this paper, we introduce "BitSplit-Net", a neural
network which maintains the hardware-friendly characteristics of BNN while
improving accuracy by using multi-bit precision. In BitSplit-Net, each bit of
multi-bit activations propagates independently throughout the network before
being merged at the end of the network. Thus, each bit path of BitSplit-Net
resembles a BNN, and the hardware-friendly features of BNNs, such as the
bitwise binary activation function, are preserved in our scheme. We demonstrate that the
BitSplit version of LeNet-5, VGG-9, AlexNet, and ResNet-18 can be trained to
have similar classification accuracy at a lower computational cost compared to
conventional multi-bit networks with low bit precision (<= 4-bit). We further
evaluate BitSplit-Net on GPU with custom CUDA kernel, showing that BitSplit-Net
can achieve better hardware performance in comparison to conventional multi-bit
networks.
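A rough sketch of the bit-splitting idea, assuming activations have already been quantized to unsigned integer codes: each bit plane would propagate through its own binary path, and the paths are recombined with their positional weights at the end. The function names and the weighted-sum merge rule are assumptions, not the paper's exact scheme.

```python
import numpy as np

def bit_split(activations, bits=2):
    """Split an unsigned multi-bit activation tensor into per-bit binary planes,
    each of which can flow through its own BNN-like path."""
    a = activations.astype(np.int64)
    return [((a >> b) & 1).astype(np.float32) for b in range(bits)]

def bit_merge(path_outputs):
    """Recombine per-bit path outputs using their positional weights 2**b."""
    return sum((2 ** b) * out for b, out in enumerate(path_outputs))
```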
A Reconfigurable Low Power High Throughput Architecture for Deep Network Training
General purpose computing systems are used for a large variety of
applications. Extensive support for flexibility in these systems limits their
energy efficiency. Neural networks, including deep networks, are widely used
for signal processing and pattern recognition applications. In this paper we
propose a multicore architecture for deep neural network based processing.
Memristor crossbars are utilized to provide low power high throughput execution
of neural networks. The system has both training and recognition (evaluation of
new input) capabilities. The proposed system could be used for classification,
dimensionality reduction, feature extraction, and anomaly detection
applications. The system-level area and power benefits of the specialized
architecture are compared with those of the NVIDIA Tesla K20 GPGPU. Our experimental
evaluations show that the proposed architecture can provide up to five orders
of magnitude more energy efficiency over GPGPUs for deep neural network
processing.
Value-aware Quantization for Training and Inference of Neural Networks
We propose a novel value-aware quantization which applies aggressively
reduced precision to the majority of data while separately handling a small
amount of large data in high precision, thereby reducing total quantization error
at very low precision. We present new techniques to apply the proposed
quantization to training and inference. The experiments show that our method
with 3-bit activations (with 2% of large ones) can give the same training
accuracy as the full-precision baseline while offering significant (41.6% and 53.7%)
reductions in the memory cost of activations in ResNet-152 and Inception-v3
compared with the state-of-the-art method. Our experiments also show that deep
networks such as Inception-v3, ResNet-101 and DenseNet-121 can be quantized for
inference with 4-bit weights and activations (with 1% 16-bit data) within 1%
top-1 accuracy drop.
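A minimal sketch of the value-aware split described above: the largest roughly 2% of values by magnitude are kept in high precision while the remaining majority is uniformly quantized to a few bits. The thresholding rule, the uniform quantizer, and the function name are illustrative assumptions rather than the paper's exact method.

```python
import numpy as np

def value_aware_quantize(x, low_bits=3, large_frac=0.02):
    """Keep the largest values in full precision; quantize the rest to low_bits."""
    flat = np.abs(x).ravel()
    k = max(1, int(large_frac * flat.size))
    threshold = np.partition(flat, -k)[-k]       # magnitude cut-off for "large" values
    large_mask = np.abs(x) >= threshold
    # uniform quantization of the small majority to 2**low_bits - 1 levels
    small = np.where(large_mask, 0.0, x)
    small_max = np.abs(small).max() + 1e-12
    levels = 2 ** low_bits - 1
    q = np.round(np.clip(small / small_max, -1.0, 1.0) * levels) / levels * small_max
    return np.where(large_mask, x, q), large_mask
```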
CodeX: Bit-Flexible Encoding for Streaming-based FPGA Acceleration of DNNs
This paper proposes CodeX, an end-to-end framework that facilitates encoding,
bitwidth customization, fine-tuning, and implementation of neural networks on
FPGA platforms. CodeX incorporates nonlinear encoding into the computation flow
of neural networks to save memory. The encoded features demand significantly
lower storage compared to the raw full-precision activation values; therefore,
the execution flow of the CodeX hardware engine is performed entirely within the
FPGA using on-chip streaming buffers with no access to the off-chip DRAM. We
further propose a fully-automated algorithm inspired by reinforcement learning
which determines the customized encoding bitwidth across network layers. The CodeX
full-stack framework comprises a compiler that takes a high-level Python
description of an arbitrary neural network architecture. The compiler then
instantiates the corresponding elements from CodeX Hardware library for FPGA
implementation. Proof-of-concept evaluations on MNIST, SVHN, and CIFAR-10
datasets demonstrate an average of 4.65x throughput improvement compared to
stand-alone weight encoding. We further compare CodeX with six existing
full-precision DNN accelerators on ImageNet, showing an average of 3.6x and
2.54x improvement in throughput and performance-per-watt, respectively.
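A rough sketch of a nonlinear-encoding step in the spirit of CodeX, assuming a simple quantile codebook: activations are replaced by indices into a small per-layer codebook of 2^bitwidth entries, so only the code indices need to be buffered on chip. The codebook construction, function names, and uint8 index storage are assumptions; the framework's actual encoding and its reinforcement-learning bitwidth search are not reproduced here.

```python
import numpy as np

def encode_activations(x, bitwidth):
    """Map activations to indices into a small nonlinear codebook (quantile-based here)."""
    n_codes = 2 ** bitwidth
    codebook = np.quantile(x, np.linspace(0.0, 1.0, n_codes))   # per-layer codebook
    indices = np.argmin(np.abs(x[..., None] - codebook), axis=-1)
    return indices.astype(np.uint8), codebook                   # assumes bitwidth <= 8

def decode_activations(indices, codebook):
    """Recover approximate activation values from their code indices."""
    return codebook[indices]
```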
Mixed Precision Training of Convolutional Neural Networks using Integer Operations
The state-of-the-art (SOTA) for mixed precision training is dominated by
variants of low precision floating point operations, and in particular, FP16
accumulating into FP32 (Micikevicius et al., 2017). On the other hand, while a
lot of research has also happened in the domain of low and mixed-precision
Integer training, these works either present results for non-SOTA networks (for
instance, only AlexNet for ImageNet-1K) or for relatively small datasets (like
CIFAR-10). In this work, we train state-of-the-art visual understanding neural
networks on the ImageNet-1K dataset, with Integer operations on General Purpose
(GP) hardware. In particular, we focus on Integer Fused-Multiply-and-Accumulate
(FMA) operations which take two pairs of INT16 operands and accumulate results
into an INT32 output. We propose a shared exponent representation of tensors and
develop a Dynamic Fixed Point (DFP) scheme suitable for common neural network
operations. The nuances of developing an efficient integer convolution kernel
are examined, including methods to handle overflow of the INT32 accumulator. We
implement CNN training for ResNet-50, GoogLeNet-v1, VGG-16 and AlexNet; and
these networks achieve or exceed SOTA accuracy within the same number of
iterations as their FP32 counterparts without any change in hyper-parameters
and with a 1.8X improvement in end-to-end training throughput. To the best of
our knowledge these results represent the first INT16 training results on GP
hardware for the ImageNet-1K dataset using SOTA CNNs, and they achieve the
highest reported accuracy using half precision.
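A minimal sketch of a shared-exponent Dynamic Fixed Point representation of the kind the abstract describes: one exponent is chosen per tensor from its maximum magnitude, and each element is stored as an INT16 mantissa. The exponent-selection rule and the function names are assumptions for illustration, not necessarily the paper's exact scheme.

```python
import numpy as np

def to_dynamic_fixed_point(tensor, bits=16):
    """Quantize a tensor to INT16 mantissas plus one shared power-of-two exponent."""
    max_abs = float(np.max(np.abs(tensor)))
    # choose the shared exponent so the largest magnitude fits in (bits - 1) mantissa bits
    exp = int(np.floor(np.log2(max_abs + 1e-30))) + 1 - (bits - 1)
    mantissa = np.clip(np.round(tensor / 2.0 ** exp),
                       -(2 ** (bits - 1)), 2 ** (bits - 1) - 1).astype(np.int16)
    return mantissa, exp   # bits=16 corresponds to the int16 storage used here

def from_dynamic_fixed_point(mantissa, exp):
    """Recover approximate float values from mantissas and the shared exponent."""
    return mantissa.astype(np.float32) * 2.0 ** exp
```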
Hardware-oriented Approximation of Convolutional Neural Networks
High computational complexity hinders the widespread usage of Convolutional
Neural Networks (CNNs), especially in mobile devices. Hardware accelerators are
arguably the most promising approach for reducing both execution time and power
consumption. One of the most important steps in accelerator development is
hardware-oriented model approximation. In this paper we present Ristretto, a
model approximation framework that analyzes a given CNN with respect to
the numerical resolution used to represent the weights and outputs of convolutional
and fully connected layers. Ristretto can condense models by using fixed point
arithmetic and representation instead of floating point. Moreover, Ristretto
fine-tunes the resulting fixed point network. Given a maximum error tolerance
of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8-bit.
The code for Ristretto is available.
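A schematic sketch of the kind of bit-width search a tool like Ristretto performs: try successively narrower fixed-point formats and keep the smallest one whose accuracy drop stays within the tolerance. Here `model`, `eval_fn`, and `quantize_fn` are hypothetical stand-ins for a concrete framework's objects, and the fine-tuning step after the search is omitted.

```python
def condense_to_fixed_point(model, eval_fn, quantize_fn, tolerance=0.01):
    """Pick the smallest fixed-point bit width whose accuracy drop is within tolerance."""
    baseline = eval_fn(model)                      # full-precision reference accuracy
    chosen = 32
    for bits in (16, 8, 4):
        candidate = quantize_fn(model, bits)       # fixed-point weights/outputs at `bits`
        if baseline - eval_fn(candidate) <= tolerance:
            chosen = bits                          # still within tolerance; try narrower
        else:
            break
    return chosen
```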
Espresso: Efficient Forward Propagation for BCNNs
There are many application scenarios in which the computational performance
and memory footprint of the prediction phase of Deep Neural Networks (DNNs)
need to be optimized. Binary Deep Neural Networks (BDNNs) have been shown to be an
effective way of achieving this objective. In this paper, we show how
Convolutional Neural Networks (CNNs) can be implemented using binary
representations. Espresso is a compact, yet powerful library written in C/CUDA
that features all the functionalities required for the forward propagation of
CNNs, in a binary file less than 400KB, without any external dependencies.
Although it is mainly designed to take advantage of massive GPU parallelism,
Espresso also provides an equivalent CPU implementation for CNNs. Espresso
provides special convolutional and dense layers for BCNNs, leveraging
bit-packing and bit-wise computations for efficient execution. These techniques
provide a speed-up of matrix-multiplication routines, and at the same time,
reduce memory usage when storing parameters and activations. We experimentally
show that Espresso is significantly faster than existing implementations of
optimized binary neural networks (about two orders of magnitude). Espresso is
released under the Apache 2.0 license and is available at
http://github.com/fpeder/espresso.
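A minimal sketch of the bit-packing and XNOR/popcount arithmetic that libraries like Espresso rely on for binary layers, written in plain NumPy for illustration; the packing layout and function names are assumptions, and a real kernel would use hardware popcount instructions rather than unpacking bits.

```python
import numpy as np

def pack_bits(binary_matrix):
    """Pack a {0, 1} matrix (1 encodes +1, 0 encodes -1) along its last axis into bytes."""
    return np.packbits(binary_matrix.astype(np.uint8), axis=-1)

def binary_matmul(a_packed, b_packed, n_bits):
    """XNOR/popcount inner products: for +/-1 vectors, dot = n_bits - 2 * popcount(a XOR b).

    a_packed: (M, B) packed rows, b_packed: (K, B) packed rows; returns an (M, K) matrix.
    """
    xor = np.bitwise_xor(a_packed[:, None, :], b_packed[None, :, :])
    popcount = np.unpackbits(xor, axis=-1).sum(axis=-1).astype(np.int64)
    return n_bits - 2 * popcount
```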