Mixed Precision Training
Deep neural networks have enabled progress in a wide variety of applications.
Growing the size of the neural network typically results in improved accuracy.
As model sizes grow, the memory and compute requirements for training these
models also increase. We introduce a technique to train deep neural networks
using half precision floating point numbers. In our technique, weights,
activations and gradients are stored in IEEE half-precision format.
Half-precision floating point numbers have limited numerical range compared to
single-precision numbers. We propose two techniques to handle this loss of
information. Firstly, we recommend maintaining a single-precision copy of the
weights that accumulates the gradients after each optimizer step. This
single-precision copy is rounded to half-precision format during training.
Secondly, we propose scaling the loss appropriately to handle the loss of
information with half-precision gradients. We demonstrate that this approach
works for a wide variety of models including convolutional neural networks,
recurrent neural networks and generative adversarial networks. This technique
works for large scale models with more than 100 million parameters trained on
large datasets. Using this approach, we can reduce the memory consumption of
deep learning models by nearly 2x. In future processors, we can also expect a
significant computation speedup using half-precision hardware units.
Comment: Published as a conference paper at ICLR 2018
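As an illustration of the two techniques above, here is a minimal NumPy sketch
of one training step with an FP32 master copy of the weights and a constant
loss-scaling factor. The toy linear model, the function names, and the scale
value are our own assumptions for illustration, not the paper's implementation.

    import numpy as np

    LOSS_SCALE = 128.0  # assumed constant; choosing the scale is itself a design question

    def scaled_grad(w16, x16, y16):
        # Toy linear least-squares model standing in for a real network.
        # Forward and backward passes run entirely in FP16 on the scaled loss.
        err = x16 @ w16 - y16
        loss = np.float16(0.5) * np.mean(err * err)
        grad = np.float16(LOSS_SCALE) * (x16.T @ err) / np.float16(len(y16))
        return loss, grad

    def train_step(master_w32, x, y, lr=0.1):
        w16 = master_w32.astype(np.float16)           # round FP32 master copy to FP16
        loss, g16 = scaled_grad(w16, x.astype(np.float16), y.astype(np.float16))
        g32 = g16.astype(np.float32) / LOSS_SCALE     # unscale the gradient in FP32
        master_w32 -= lr * g32                        # accumulate the update in FP32
        return loss

    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 4)).astype(np.float32)
    w_true = np.array([1.0, -2.0, 0.5, 3.0], dtype=np.float32)
    y = x @ w_true
    w = np.zeros(4, dtype=np.float32)                 # FP32 master weights
    for _ in range(200):
        train_step(w, x, y)

Scaling the loss before the FP16 backward pass shifts small gradient values up
into FP16's representable range; dividing the scale back out in FP32 leaves the
update unchanged in expectation.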
A Study of BFLOAT16 for Deep Learning Training
This paper presents the first comprehensive empirical study demonstrating the
efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep
Learning training across image classification, speech recognition, language
modeling, generative networks and industrial recommendation systems. BFLOAT16
is attractive for Deep Learning training for two reasons: the range of values
it can represent is the same as that of the IEEE 754 single-precision format (FP32)
and conversion to/from FP32 is simple. Maintaining the same range as FP32 is
important to ensure that no hyper-parameter tuning is required for convergence;
e.g., IEEE 754-compliant half-precision floating point (FP16) requires
hyper-parameter tuning. In this paper, we discuss the flow of tensors and
various key operations in mixed precision training, and delve into details of
operations, such as the rounding modes for converting FP32 tensors to BFLOAT16.
We have implemented a method to emulate BFLOAT16 operations in TensorFlow,
Caffe2, IntelCaffe, and Neon for our experiments. Our results show that deep
learning training using BFLOAT16 tensors achieves the same state-of-the-art
(SOTA) results across domains as FP32 tensors in the same number of iterations
and with no changes to hyper-parameters.
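To make the conversion concrete, below is a small NumPy sketch of one common
way to emulate FP32-to-BFLOAT16 rounding: round-to-nearest-even on the upper
16 bits, with the result kept in an FP32 container. The function name is our
own and NaN/Inf handling is omitted; the paper's emulation in TensorFlow,
Caffe2, IntelCaffe, and Neon may differ in its details.

    import numpy as np

    def fp32_to_bf16(x):
        # View the FP32 bit pattern, round away the low 16 mantissa bits
        # (round-to-nearest-even), and keep the top 16 bits. BFLOAT16 keeps
        # all 8 of FP32's exponent bits, which is why the ranges match.
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        lsb = (bits >> np.uint32(16)) & np.uint32(1)  # lowest surviving mantissa bit
        rounded = bits + np.uint32(0x7FFF) + lsb      # ties round to even
        return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

    print(fp32_to_bf16(np.float32(3.14159265)))       # prints ~3.140625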
An Optimized Recurrent Unit for Ultra-Low-Power Keyword Spotting
There is growing interest in being able to run neural networks on sensors,
wearables and internet-of-things (IoT) devices. However, the computational
demands of neural networks make them difficult to deploy on
resource-constrained edge devices.
To meet this need, our work introduces a new recurrent unit architecture that
is specifically adapted for on-device low power acoustic event detection (AED).
The proposed architecture is based on the gated recurrent unit (GRU) but
features optimizations that make it implementable on ultra-low power
micro-controllers such as the Arm Cortex M0+.
Our new architecture, the Embedded Gated Recurrent Unit (eGRU), is
demonstrated to be highly efficient and suitable for short-duration AED and
keyword spotting tasks. A single eGRU cell is 60x faster and 10x smaller than a
GRU cell. Despite its optimizations, eGRU compares well with GRU across tasks
of varying complexities.
The practicality of eGRU is investigated in a wearable acoustic event
detection application. An eGRU model is implemented and tested on the Arm
Cortex M0-based Atmel ATSAMD21E18 processor. The Arm M0+ implementation of the
eGRU model compares favorably with a full precision GRU that is running on a
workstation. The embedded eGRU model achieves a classification accuracy of
95.3%, which is only 2% less than that of the full precision GRU.
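For context, the standard GRU cell that eGRU starts from is sketched below in
NumPy. This is only the textbook formulation; the eGRU's actual low-power
modifications, which the abstract does not detail, are not reproduced here.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_cell(x, h, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
        # Textbook GRU update (Cho et al. formulation).
        z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h) + bh)   # candidate state
        return (1.0 - z) * h + z * h_cand              # new hidden state

On a microcontroller without a floating point unit, the transcendental
activations and three gate matrix products above dominate cost, which is the
kind of overhead an embedded variant must cut down.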
Mixed Precision Training With 8-bit Floating Point
Reduced precision computation for deep neural networks is one of the key
areas addressing the widening compute gap driven by an exponential growth in
model size. In recent years, deep learning training has largely migrated to
16-bit precision, with significant gains in performance and energy efficiency.
However, attempts to train DNNs at 8-bit precision have met with significant
challenges because of the higher precision and dynamic range requirements of
back-propagation. In this paper, we propose a method to train deep neural
networks using 8-bit floating point representation for weights, activations,
errors, and gradients. In addition to reducing compute precision, we also
reduced the precision requirements for the master copy of weights from 32-bit
to 16-bit. We demonstrate state-of-the-art accuracy across multiple data sets
(ImageNet-1K, WMT16) and a broader set of workloads (ResNet-18/34/50, GNMT,
Transformer) than previously reported. We propose an enhanced loss scaling
method to augment the reduced subnormal range of 8-bit floating point for
improved error propagation. We also examine the impact of quantization noise on
generalization and propose a stochastic rounding technique to address gradient
noise. As a result of applying all these techniques, we report slightly higher
validation accuracy compared to the full precision baseline.
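As an illustration of stochastic rounding, here is a minimal NumPy sketch on a
uniform grid; it is unbiased in expectation, which is the property used to
combat gradient quantization noise. The 8-bit floating-point grid in the paper
is non-uniform, so treat this as the idea only, under our own naming.

    import numpy as np

    def stochastic_round(x, step, rng=None):
        # Round to a uniform grid of spacing `step`, rounding up with
        # probability equal to the fractional distance to the lower grid
        # point, so E[stochastic_round(x)] == x and the error has zero mean.
        rng = rng or np.random.default_rng()
        scaled = np.asarray(x, dtype=np.float64) / step
        floor = np.floor(scaled)
        frac = scaled - floor
        round_up = rng.random(size=np.shape(scaled)) < frac
        return (floor + round_up) * step

Unlike round-to-nearest, small gradient values below half a grid step are not
systematically flushed to zero; they survive on average across many updates.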
Synaptic efficacy shapes resource limitations in working memory
Working memory (WM) is limited in its temporal length and capacity. Classic
conceptions of WM capacity assume the system possesses a finite number of
slots, but recent evidence suggests WM may be a continuous resource. Resource
models typically assume there is no hard upper bound on the number of items
that can be stored, but WM fidelity decreases with the number of items. We
analyze a neural field model of multi-item WM that associates each item with
the location of a bump in a finite spatial domain, considering items that span
a one-dimensional continuous feature space. Our analysis relates the neural
architecture of the network to accumulated errors and capacity limitations
arising during the delay period of a multi-item WM task. Networks with stronger
synapses support wider bumps that interact more, whereas networks with weaker
synapses support narrower bumps that are more susceptible to noise
perturbations. There is an optimal synaptic strength that limits both bump
interaction events and the effects of noise perturbations. This optimum shifts
to weaker synapses as the number of items stored in the network is increased.
Our model not only provides a neural circuit explanation for WM capacity, but
also speaks to how capacity relates to the arrangement of stored items in a
feature space.
Comment: 26 pages, 12 figures
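For readers unfamiliar with the model class, a generic stochastic Amari-type
neural field equation for the synaptic activity u(x, t), with synaptic kernel
w and firing-rate nonlinearity f, has the form below; the specific kernel,
nonlinearity, and noise term used in the paper may differ from this sketch.

    \mathrm{d}u(x,t) = \left[ -u(x,t) + \int_{\Omega} w(x-y)\, f\bigl(u(y,t)\bigr)\, \mathrm{d}y \right] \mathrm{d}t
        + \varepsilon\, \mathrm{d}W(x,t)

In this picture, each stored item corresponds to a bump of activity centered
at its feature value, and the kernel w sets the bump width, which is how
synaptic strength enters the capacity analysis described above.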
The Neural Network Pushdown Automaton: Model, Stack and Learning Simulations
In order for neural networks to learn complex languages or grammars, they
must have sufficient computational power or resources to recognize or generate
such languages. Though many approaches have been discussed, one obvious
approach to enhancing the processing power of a recurrent neural network is to
couple it with an external stack memory - in effect creating a neural network
pushdown automaton (NNPDA). This paper discusses in detail this NNPDA - its
construction, how it can be trained and how useful symbolic information can be
extracted from the trained network.
In order to couple the external stack to the neural network, an optimization
method is developed which uses an error function that connects the learning of
the state automaton of the neural network to the learning of the operation of
the external stack. To minimize the error function using gradient descent
learning, an analog stack is designed such that the action and storage of
information in the stack are continuous. One interpretation of a continuous
stack is the probabilistic storage of and action on data. After training on
sample strings of an unknown source grammar, a quantization procedure extracts
from the analog stack and neural network a discrete pushdown automaton (PDA).
Simulations show that in learning deterministic context-free grammars - the
balanced parenthesis language, 1^n 0^n, and the deterministic palindrome - the
extracted PDA is correct in the sense that it can correctly recognize unseen
strings of arbitrary length. In addition, the extracted PDAs can be shown to be
identical or equivalent to the PDAs of the source grammars which were used to
generate the training strings.
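To make the continuous stack concrete, here is a small Python sketch of an
analog stack in which symbols are pushed and popped with real-valued
strengths, and a read returns a strength-weighted mixture of symbols. The
function names and the blending rule are illustrative assumptions, not the
paper's construction.

    def analog_pop(stack, strength):
        # Remove `strength` units from the top of [(value, strength), ...].
        remaining = strength
        while stack and remaining > 0:
            value, s = stack[-1]
            if s <= remaining:
                stack.pop()
                remaining -= s
            else:
                stack[-1] = (value, s - remaining)
                remaining = 0.0

    def analog_read(stack, amount=1.0):
        # Read the top `amount` of strength as a weighted average of symbols.
        total, acc = 0.0, 0.0
        for value, s in reversed(stack):
            take = min(s, amount - total)
            acc += take * value
            total += take
            if total >= amount:
                break
        return acc / total if total > 0 else 0.0

    stack = []
    stack.append((1.0, 0.7))   # push symbol 1 with strength 0.7
    stack.append((0.0, 0.6))   # push symbol 0 with strength 0.6
    analog_pop(stack, 0.8)     # removes all of the 0.6 and 0.2 of the 0.7

Because push, pop, and read are continuous in the strengths, the stack
operations are differentiable almost everywhere, which is what allows gradient
descent to train the controller and stack actions jointly.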
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech, two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
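The abstract describes Batch Dispatch only at a high level; the sketch below
shows the general pattern of batching concurrently arriving requests before a
single GPU forward pass. The names, batching window, and queue discipline are
hypothetical, not details from the paper.

    import queue

    def batch_dispatch(request_q, run_batch, max_batch=8, window_s=0.005):
        # Hypothetical serving loop: block for one request, gather any
        # others that arrive within a short window, then run them as one
        # batch so a single GPU call amortizes across concurrent users.
        while True:
            batch = [request_q.get()]
            while len(batch) < max_batch:
                try:
                    batch.append(request_q.get(timeout=window_s))
                except queue.Empty:
                    break
            run_batch(batch)

The design trade-off is latency against throughput: a longer window builds
larger, more GPU-efficient batches but delays the first request in each batch.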
Deep EHR: Chronic Disease Prediction Using Medical Notes
Early detection of preventable diseases is important for better disease
management, improved interventions, and more efficient healthcare resource
allocation. Various machine learning approaches have been developed to utilize
information in the Electronic Health Record (EHR) for this task. The majority
of previous attempts, however, focus on structured fields and lose the vast
amount of information in the unstructured notes. In this work we propose a
general multi-task framework for disease onset prediction that combines both
free-text medical notes and structured information. We compare performance of
different deep learning architectures including CNN, LSTM and hierarchical
models. In contrast to traditional text-based prediction models, our approach
does not require disease-specific feature engineering, and can handle
negations and numerical values that exist in the text. Our results on a
cohort of about 1 million patients show that models using text outperform
models using just structured data, and that models capable of using numerical
values and negations in the text, in addition to the raw text, further
improve performance. Additionally, we compare different visualization methods
for medical professionals to interpret model predictions.
Comment: Machine Learning for Health Care conference
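As a sketch of the kind of fusion such a framework needs, the NumPy snippet
below pools note-token vectors into a text representation and concatenates it
with structured features. The stand-in mean-pool encoder and the weight names
are our own assumptions, not the paper's architecture.

    import numpy as np

    def fuse_modalities(note_token_vecs, structured_feats, W_text, W_struct, b):
        # Mean-pool the note's word vectors as a stand-in text encoder
        # (the paper compares CNN, LSTM and hierarchical encoders), then
        # concatenate the projected text and structured representations.
        text_repr = note_token_vecs.mean(axis=0)
        fused = np.concatenate([W_text @ text_repr, W_struct @ structured_feats])
        return np.maximum(fused + b, 0.0)   # ReLU hidden layer before the prediction head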
A Survey on Methods and Theories of Quantized Neural Networks
Deep neural networks are the state-of-the-art methods for many real-world
tasks, such as computer vision, natural language processing and speech
recognition. For all their popularity, deep neural networks are also criticized
for consuming a lot of memory and draining battery life of devices during
training and inference. This makes it hard to deploy these models on mobile or
embedded devices which have tight resource constraints. Quantization is
recognized as one of the most effective approaches to satisfy the extreme
memory requirements that deep neural network models demand. Instead of adopting
32-bit floating point format to represent weights, quantized representations
store weights using more compact formats such as integers or even binary
numbers. Despite a possible degradation in predictive performance, quantization
provides a potential solution to greatly reduce the model size and the energy
consumption. In this survey, we give a thorough review of different aspects of
quantized neural networks. Current challenges and trends of quantized neural
networks are also discussed.
Comment: 17 pages, 8 figures
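One of the simplest schemes such a survey covers is uniform affine
quantization of a weight tensor to 8-bit integers; a minimal NumPy sketch
follows, using a per-tensor scale and zero-point (one of many possible
choices, under our own naming).

    import numpy as np

    def quantize_uint8(w):
        # Map [min(w), max(w)] linearly onto the 256 uint8 levels and keep
        # the scale/zero-point needed to map the integers back to floats.
        lo, hi = float(w.min()), float(w.max())
        scale = (hi - lo) / 255.0 if hi > lo else 1.0
        zero_point = np.round(-lo / scale).astype(np.int32)
        q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return (q.astype(np.float32) - zero_point) * scale

    w = np.random.randn(256).astype(np.float32)
    q, s, zp = quantize_uint8(w)
    w_hat = dequantize(q, s, zp)   # max absolute error is about scale / 2

Storing `q` instead of `w` cuts weight memory by 4x relative to FP32, at the
cost of the bounded rounding error shown in the last line.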
A Sequential Embedding Approach for Item Recommendation with Heterogeneous Attributes
Attributes, such as metadata and profile, carry useful information which in
principle can help improve accuracy in recommender systems. However, existing
approaches have difficulty in fully leveraging attribute information due to
practical challenges such as heterogeneity and sparseness. These approaches
also fail to combine recurrent neural networks which have recently shown
effectiveness in item recommendations in applications such as video and music
browsing. To overcome the challenges and to harvest the advantages of sequence
models, we present a novel approach, Heterogeneous Attribute Recurrent Neural
Networks (HA-RNN), which incorporates heterogeneous attributes and captures
sequential dependencies in both items and attributes. HA-RNN extends
recurrent neural networks with 1) a hierarchical attribute combination input
layer and 2) an output attribute embedding layer. We conduct extensive
experiments on two large-scale datasets. The new approach shows significant
improvements over the state-of-the-art models. Our ablation experiments
demonstrate the effectiveness of the two components to address heterogeneous
attribute challenges including variable lengths and attribute sparseness. We
further investigate why sequence modeling works well by conducting exploratory
studies and show that sequence models are more effective when data scale increases.
Comment: A shorter version appeared in ICDM 2017 SERecsys workshop
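A minimal sketch of the input-side idea, under our own naming: pool a
variable-length set of attribute embeddings and combine the result with the
item embedding to form one step of RNN input. The paper's hierarchical
attribute combination layer is more involved; this only illustrates the shape
of the computation.

    import numpy as np

    def harnn_input_step(item_emb, attr_embs):
        # Pool a variable-length set of attribute embeddings, then combine
        # with the item embedding to form one RNN input step. Additive
        # combination in a shared dimension is an assumption for illustration.
        if len(attr_embs) == 0:
            return item_emb
        return item_emb + np.mean(attr_embs, axis=0)

Pooling before combination is what lets a single input layer absorb items
with different numbers of attributes, the variable-length challenge the
ablation experiments examine.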