Training Recurrent Neural Networks by Diffusion
This work presents a new algorithm for training recurrent neural networks
(although ideas are applicable to feedforward networks as well). The algorithm
is derived from a theory in nonconvex optimization related to the diffusion
equation. The contributions made in this work are twofold. First, we show how
some seemingly disconnected mechanisms used in deep learning such as smart
initialization, annealed learning rate, layerwise pretraining, and noise
injection (as done in dropout and SGD) arise naturally and automatically from
this framework, without manually crafting them into the algorithms. Second, we
present some preliminary results on comparing the proposed method against SGD.
It turns out that the new algorithm can reach a level of generalization
accuracy similar to that of SGD in far fewer epochs.
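The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind diffusion-based optimization: smooth the objective with a Gaussian (which is what evolving it under the heat equation produces), minimize the smoothed version, then reduce the smoothing step by step. The toy objective, the constants `A` and `OMEGA`, the schedule `sigmas`, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math

# Hypothetical toy objective: f(x) = x**2 / 2 - A * cos(OMEGA * x).
# Its global minimum is at x = 0, with many local minima around it.
A, OMEGA = 1.0, 5.0

def smoothed_grad(x, sigma):
    """Gradient of f after convolution with a Gaussian of std sigma.

    Gaussian smoothing damps cos(OMEGA * x) by exp(-(OMEGA * sigma)**2 / 2)
    and only adds a constant to x**2 / 2, so the gradient is closed-form.
    """
    damping = math.exp(-0.5 * (OMEGA * sigma) ** 2)
    return x + A * OMEGA * damping * math.sin(OMEGA * x)

def descend(x, sigma, lr=0.01, steps=500):
    """Plain gradient descent on the objective smoothed at scale sigma."""
    for _ in range(steps):
        x -= lr * smoothed_grad(x, sigma)
    return x

def diffusion_continuation(x, sigmas=(2.0, 1.0, 0.5, 0.25, 0.1, 0.0)):
    """Minimize heavily smoothed objectives first, then sharpen gradually."""
    for sigma in sigmas:
        x = descend(x, sigma)
    return x

x_plain = descend(3.0, sigma=0.0)        # no smoothing at all
x_smooth = diffusion_continuation(3.0)   # smoothing annealed to zero
```

In this toy setting, unsmoothed gradient descent started at x = 3 stalls in a local minimum near x = 2.41, while the annealed run ends at the global minimum x = 0. The shrinking `sigmas` schedule plays a role loosely analogous to the annealing and noise-injection mechanisms the abstract mentions.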
A stochastic variational framework for fitting and diagnosing generalized linear mixed models
In stochastic variational inference, the variational Bayes objective function
is optimized using stochastic gradient approximation, where gradients computed
on small random subsets of data are used to approximate the true gradient over
the whole data set. This enables complex models to be fit to large data sets as
data can be processed in mini-batches. In this article, we extend stochastic
variational inference for conjugate-exponential models to nonconjugate models
and present a stochastic nonconjugate variational message passing algorithm for
fitting generalized linear mixed models that is scalable to large data sets. In
addition, we show that diagnostics for prior-likelihood conflict, which are
useful for Bayesian model criticism, can be obtained from nonconjugate
variational message passing automatically, as an alternative to
simulation-based Markov chain Monte Carlo methods. Finally, we demonstrate that
for moderate-sized data sets, convergence can be accelerated by using the
stochastic version of nonconjugate variational message passing in the initial
stage of optimization before switching to the standard version.
Comment: 42 pages, 13 figures, 9 tables
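The mini-batch gradient approximation described in the abstract above can be illustrated in isolation. The sketch below is not the nonconjugate variational message passing algorithm; it uses a hypothetical one-parameter least-squares problem with synthetic data, and the names `grad_full`, `make_batches`, and `sgd` are invented for this example.

```python
import random

def grad_full(w, data):
    """Gradient of the mean squared error (1/N) * sum((w * x - y)**2) w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def make_batches(data, batch_size, rng):
    """Shuffle a copy of the data and cut it into equal-sized mini-batches."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def sgd(data, batch_size=10, lr=0.1, epochs=20, seed=0):
    """Stochastic gradient descent: each step sees only one mini-batch."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in make_batches(data, batch_size, rng):
            w -= lr * grad_full(w, batch)  # same formula, evaluated on a subset
    return w

# Synthetic data (an assumption for illustration): y = 3 * x + small noise.
data_rng = random.Random(1)
data = []
for _ in range(1000):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.1)))

w_hat = sgd(data)  # should land close to the true slope 3.0
```

Averaged over an equal-sized partition of the data, the mini-batch gradients reproduce the full-data gradient. Each one is therefore a noisy but unbiased stand-in for it, at a fraction of the cost per step.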
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Policy gradient methods are widely used for control in reinforcement
learning, particularly for the continuous action setting. There have been a
host of theoretically sound algorithms proposed for the on-policy setting, due
to the existence of the policy gradient theorem which provides a simplified
form for the gradient. In off-policy learning, however, where the behaviour
policy is not necessarily attempting to learn and follow the optimal policy for
the given task, the existence of such a theorem has been elusive. In this work,
we solve this open problem by providing the first off-policy policy gradient
theorem. The key to the derivation is the use of emphatic weightings. We
develop a new actor-critic algorithm, called Actor Critic with
Emphatic weightings (ACE), that approximates the simplified
gradients provided by the theorem. We demonstrate in a simple counterexample
that previous off-policy policy gradient methods, particularly
OffPAC and DPG, converge to the wrong solution whereas ACE finds
the optimal solution.
Comment: Updated to final NeurIPS version
Iterative Residual Image Deconvolution
Image deblurring, a.k.a. image deconvolution, recovers a clear image from
pixel superposition caused by blur degradation. Few deep convolutional neural
networks (CNN) succeed in addressing this task. In this paper, we first
demonstrate that the minimum-mean-square-error (MMSE) solution to image
deblurring can be interestingly unfolded into a series of residual components.
Based on this analysis, we propose a novel iterative residual deconvolution
(IRD) algorithm. Further, IRD motivates us to take one step forward to design
an explicable and effective CNN architecture for image deconvolution.
Specifically, a sequence of residual CNN units are deployed, whose intermediate
outputs are then concatenated and integrated, resulting in concatenated
residual convolutional network (CRCNet). The experimental results demonstrate
that proposed CRCNet not only achieves better quantitative metrics but also
recovers more visually plausible texture details compared with state-of-the-art
methods.
Comment: rejected by AAAI 201
Few-shot Autoregressive Density Estimation: Towards Learning to Learn Distributions
Deep autoregressive models have shown state-of-the-art performance in density
estimation for natural images on large-scale datasets such as ImageNet.
However, such models require many thousands of gradient-based weight updates
and unique image examples for training. Ideally, the models would rapidly learn
visual concepts from only a handful of examples, similar to the manner in which
humans learn across many vision tasks. In this paper, we show how 1) neural
attention and 2) meta learning techniques can be used in combination with
autoregressive models to enable effective few-shot density estimation. Our
proposed modifications to PixelCNN result in state-of-the-art few-shot density
estimation on the Omniglot dataset. Furthermore, we visualize the learned
attention policy and find that it learns intuitive algorithms for simple tasks
such as image mirroring on ImageNet and handwriting on Omniglot without
supervision. Finally, we extend the model to natural images and demonstrate
few-shot image generation on the Stanford Online Products dataset.
Convolutional neural networks with fractional order gradient method
This paper proposes a fractional order gradient method for the backward
propagation of convolutional neural networks. To overcome the problem that
the fractional order gradient method cannot converge to the real extreme
point, a simplified fractional order gradient method is designed based on
Caputo's definition. The parameters within layers are updated by the designed
gradient method, but the propagation between layers still uses integer order
gradients, so the complicated derivatives of composite functions are avoided
and the chain rule is preserved. By connecting all layers in series and adding
loss functions, the proposed convolutional neural networks can be trained
smoothly for various tasks. Practical experiments are carried out to
demonstrate fast convergence, high accuracy, and the ability to escape local
optima.
Reinforcement Learning for Batch Bioprocess Optimization
Bioprocesses have received a lot of attention to produce clean and
sustainable alternatives to fossil-based materials. However, they are generally
difficult to optimize due to their unsteady-state operation modes and
stochastic behaviours. Furthermore, biological systems are highly complex,
therefore plant-model mismatch is often present. To address the aforementioned
challenges we propose a Reinforcement learning based optimization strategy for
batch processes.
In this work, we applied the Policy Gradient method from batch-to-batch to
update a control policy parametrized by a recurrent neural network. We assume
that a preliminary process model is available, which is exploited to obtain a
preliminary optimal control policy. Subsequently, this policy is updated based
on measurements from the true plant. The capabilities of our proposed approach
were tested on three case studies (one of which is nonsmooth) using a more
complex process model for the true system, embedded with adequate process
disturbance. Lastly, we discussed the advantages and disadvantages of this
strategy compared with existing approaches such as nonlinear model
predictive control.
How to iron out rough landscapes and get optimal performances: Averaged Gradient Descent and its application to tensor PCA
In many high-dimensional estimation problems the main task consists in
minimizing a cost function, which is often strongly non-convex when scanned in
the space of parameters to be estimated. A standard solution to flatten the
corresponding rough landscape consists in summing the losses associated with
different data points to obtain a smoother empirical risk. Here we propose a
complementary method that works for a single data point. The main idea is that
a large amount of the roughness is uncorrelated in different parts of the
landscape. One can then substantially reduce the noise by evaluating an
empirical average of the gradient obtained as a sum over many random
independent positions in the space of parameters to be optimized. We present an
algorithm, called Averaged Gradient Descent, based on this idea and we apply it
to tensor PCA, which is a very hard estimation problem. We show that Averaged
Gradient Descent outperforms physical algorithms such as gradient descent and
approximate message passing and matches the best algorithmic thresholds known
so far, obtained by tensor unfolding and methods based on sum-of-squares.
Comment: 23 pages, 16 figures, including Supplementary Material
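The core idea in the abstract above, averaging the gradient over many random positions so that uncorrelated roughness cancels, can be sketched on a toy problem. This is not the tensor PCA algorithm from the paper: the one-dimensional landscape, the choice to scatter positions with a Gaussian around the current point, and the parameters `n_samples` and `spread` are all assumptions made for illustration.

```python
import math
import random

def rough_grad(x):
    """Gradient of a hypothetical rough landscape f(x) = x**2 / 2 + 0.1 * sin(50 * x).

    The smooth part pulls towards x = 0; the oscillating part has a small
    amplitude in f but dominates the gradient (up to +/- 5).
    """
    return x + 5.0 * math.cos(50.0 * x)

def averaged_grad(x, rng, n_samples=200, spread=0.5):
    """Empirical average of the gradient over random positions scattered around x."""
    total = 0.0
    for _ in range(n_samples):
        total += rough_grad(x + rng.gauss(0.0, spread))
    return total / n_samples

def averaged_gradient_descent(x, lr=0.1, steps=200, seed=0):
    """Gradient descent driven by the averaged gradient instead of the raw one."""
    rng = random.Random(seed)
    for _ in range(steps):
        x -= lr * averaged_grad(x, rng)
    return x

x_final = averaged_gradient_descent(3.0)  # ends close to the smooth minimum at 0
```

Because `spread` is much larger than the period of the oscillation, the rough term averages out to nearly zero and the descent follows the smooth trend. The residual noise shrinks as `n_samples` grows.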
SGD on Neural Networks Learns Functions of Increasing Complexity
We perform an experimental study of the dynamics of Stochastic Gradient
Descent (SGD) in learning deep neural networks for several real and synthetic
classification tasks. We show that in the initial epochs, almost all of the
performance improvement of the classifier obtained by SGD can be explained by a
linear classifier. More generally, we give evidence for the hypothesis that, as
iterations progress, SGD learns functions of increasing complexity. This
hypothesis can be helpful in explaining why SGD-learned classifiers tend to
generalize well even in the over-parameterized regime. We also show that the
linear classifier learned in the initial stages is "retained" throughout the
execution even if training is continued to the point of zero training error,
and complement this with a theoretical result in a simplified model. Key to our
work is a new measure of how well one classifier explains the performance of
another, based on conditional mutual information.
Comment: Submitted to NeurIPS 201
On First-Order Meta-Learning Algorithms
This paper considers meta-learning problems, where there is a distribution of
tasks, and we would like to obtain an agent that performs well (i.e., learns
quickly) when presented with a previously unseen task sampled from this
distribution. We analyze a family of algorithms for learning a parameter
initialization that can be fine-tuned quickly on a new task, using only
first-order derivatives for the meta-learning updates. This family includes and
generalizes first-order MAML, an approximation to MAML obtained by ignoring
second-order derivatives. It also includes Reptile, a new algorithm that we
introduce here, which works by repeatedly sampling a task, training on it, and
moving the initialization towards the trained weights on that task. We expand
on the results from Finn et al. showing that first-order meta-learning
algorithms perform well on some well-established benchmarks for few-shot
classification, and we provide theoretical analysis aimed at understanding why
these algorithms work.
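The Reptile loop described in the abstract above (sample a task, train on it, move the initialization towards the trained weights) can be sketched with scalar weights. The quadratic task loss, the task centers, and the hyperparameters `inner_lr`, `inner_steps`, and `meta_lr` are hypothetical choices for this example, not values from the paper.

```python
import random

def train_on_task(w, center, inner_lr=0.1, inner_steps=5):
    """Run a few gradient steps on a hypothetical task loss (w - center)**2."""
    for _ in range(inner_steps):
        w -= inner_lr * 2.0 * (w - center)
    return w

def reptile_step(init, center, meta_lr=0.1):
    """Move the initialization a fraction meta_lr towards the task-trained weights."""
    trained = train_on_task(init, center)
    return init + meta_lr * (trained - init)

def reptile(task_centers, init=0.0, iterations=1000, seed=0):
    """Repeatedly sample a task, train on it, and nudge the initialization."""
    rng = random.Random(seed)
    for _ in range(iterations):
        center = rng.choice(task_centers)  # sample a task from the distribution
        init = reptile_step(init, center)
    return init

# Three toy tasks whose optima sit at -1, 0 and 4.
meta_init = reptile([-1.0, 0.0, 4.0])
```

Only first-order gradients of the task loss are used: the meta-update is a plain interpolation between the current initialization and the weights obtained after inner training. In this toy setting every update is a convex combination, so the initialization stays within the range spanned by the task optima.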