Tricks from Deep Learning
The deep learning community has devised a diverse set of methods to make practical the gradient-based optimization, over large datasets, of large and highly complex models with deeply cascaded nonlinearities. Taken as a whole, these methods constitute a breakthrough: computational structures that are quite wide, very deep, and endowed with an enormous number and variety of free parameters can now be optimized effectively. The result dominates much of practical machine learning, with applications in machine translation, computer vision, and speech recognition. Viewed through the lens of algorithmic differentiation (AD), many of these methods can be seen either as addressing issues with the gradient itself, or as achieving increased efficiency through tricks that are AD-related but not provided by current AD systems.
The goal of this paper is to explain not just those methods of most relevance to AD, but also the technical constraints and mindset that led to their discovery. After establishing this context, we present a "laundry list" of methods developed by the deep learning community. Two of these are discussed in further mathematical detail: a way to dramatically reduce the size of the tape when performing reverse-mode AD on a (theoretically) time-reversible process such as an ODE integrator, and a new mathematical insight that enables the implementation of a stochastic Newton's method.
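The tape-reduction idea can be illustrated with a toy sketch (our own construction, not code from the paper): a leapfrog ODE step is exactly invertible, so a reverse-mode pass can regenerate intermediate states by stepping backward instead of storing every state on the tape.

```python
def leapfrog_step(q, p, dt, grad_U):
    """One leapfrog step for dq/dt = p, dp/dt = -grad_U(q)."""
    p = p - 0.5 * dt * grad_U(q)
    q = q + dt * p
    p = p - 0.5 * dt * grad_U(q)
    return q, p

def inverse_step(q, p, dt, grad_U):
    """Exact inverse of leapfrog_step: negate momentum, step, negate again."""
    q2, p2 = leapfrog_step(q, -p, dt, grad_U)
    return q2, -p2

grad_U = lambda q: q          # quadratic potential U(q) = q**2 / 2
q, p = 1.0, 0.0
for _ in range(1000):         # forward integration: no intermediate states stored
    q, p = leapfrog_step(q, p, 0.01, grad_U)
for _ in range(1000):         # backward pass reconstructs earlier states on demand
    q, p = inverse_step(q, p, 0.01, grad_U)
print(abs(q - 1.0), abs(p))   # both near zero: the trajectory was recovered
```

Because each state can be recomputed from its successor, the adjoint sweep needs only O(1) stored states rather than one per time step, up to floating-point drift.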
Using Magic in Computing Education and Outreach
This special session explores the use of magic tricks based on computer science ideas; magic tricks help grab students' attention and can motivate them to invest more deeply in the underlying CS concepts. Error-detection ideas long used by computer scientists provide a particularly rich basis for working such "magic", with a CS Unplugged parity-check activity being a notable example. Prior work has shown that one can perform much more sophisticated tricks than the relatively well-known CS Unplugged activity; these tricks can motivate analyses across a wide variety of computer science concepts and are relevant to learning objectives at grade levels from 2nd grade through graduate school. These tricks have piqued the interest of past audiences and have been performed with the aid of online implementations; this conference session will demonstrate enhanced implementations used to illuminate the underlying concepts rather than just to perform the tricks. The audience will participate in puzzling out how to apply the relevant concepts as we work through a scaffolded series of tricks centering on error detection and correction. The implementations also provide a useful model for incorporating greater interaction than is typically found in current innovative online interactive textbooks. In addition, they serve as samples for possible programming assignments that can motivate students using CS Unplugged activities to pursue deep programming experiences.
Recurrent Neural Network Training with Dark Knowledge Transfer
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM) networks, have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This differs from most existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks to the learning scheme, this approach can train RNNs successfully even with limited training data.
Comment: ICASSP 201
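The core of this kind of knowledge transfer ("dark knowledge") is training the student against the teacher's softened output distribution rather than hard labels. A minimal sketch, with hypothetical logits and a temperature parameter of our own choosing:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; larger T yields softer distributions."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()               # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Teacher logits supervise the student via softened probabilities.
teacher_logits = np.array([4.0, 1.0, 0.2])
T = 2.0
soft_targets = softmax(teacher_logits, T)

def distill_loss(student_logits, soft_targets, T):
    """Cross-entropy between the teacher's soft targets and the student."""
    p = softmax(student_logits, T)
    return -np.sum(soft_targets * np.log(p + 1e-12))

# The loss is smallest when the student matches the teacher's distribution:
loss_matched = distill_loss(teacher_logits, soft_targets, T)
loss_uniform = distill_loss(np.zeros(3), soft_targets, T)
print(loss_matched < loss_uniform)
```

The soft targets carry the teacher's relative confidences over incorrect classes, which is the extra signal that makes the student trainable where hard labels alone fall short.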
A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation
Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
Comment: ICML 202
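The link between the successor representation and occupancy measures can be seen in the tabular case (an illustrative sketch with a made-up 3-state chain, not the paper's code): the SR matrix M = (I - γP)⁻¹ stores expected discounted future state visitations, and (1 - γ)·M normalizes these into the occupancy distribution that MIS density ratios are built from.

```python
import numpy as np

gamma = 0.9
# Transition matrix under the target policy (hypothetical 3-state chain).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])

# Tabular successor representation: M[s, s'] is the expected discounted
# number of visits to s' when starting from s.
M = np.linalg.inv(np.eye(3) - gamma * P)

d0 = np.array([1.0, 0.0, 0.0])      # start-state distribution
occupancy = (1 - gamma) * d0 @ M    # normalized discounted state occupancy
print(occupancy, occupancy.sum())   # a proper probability distribution
```

In deep RL the inverse is replaced by TD learning of M from sampled transitions, which is what makes the approach scale to high-dimensional domains.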