Scalable Natural Gradient Langevin Dynamics in Practice
Stochastic Gradient Langevin Dynamics (SGLD) is a sampling scheme for
Bayesian modeling adapted to large datasets and models. SGLD relies on the
injection of Gaussian noise at each step of a Stochastic Gradient Descent (SGD)
update. In this scheme, every component in the noise vector is independent and
has the same scale, whereas the parameters we seek to estimate exhibit strong
variations in scale and significant correlation structures, leading to poor
convergence and mixing times. We compare different preconditioning approaches
to the normalization of the noise vector and benchmark these approaches on the
following criteria: 1) mixing times of the multivariate parameter vector, 2)
regularizing effect on small datasets where it is easy to overfit, 3) covariate
shift detection, and 4) resistance to adversarial examples.
Comment: ICML 2018 Workshop on Non-Convex Optimization
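The SGLD update described above, and the preconditioned variant the abstract benchmarks, can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: the function names and the diagonal (per-coordinate) form of the preconditioner are illustrative assumptions.

```python
import numpy as np

def sgld_step(theta, grad, step_size, rng):
    """One SGLD update: an SGD step plus Gaussian noise scaled by sqrt(step_size).
    Every noise component is independent and has the same scale."""
    noise = rng.normal(size=theta.shape) * np.sqrt(step_size)
    return theta - 0.5 * step_size * grad + noise

def preconditioned_sgld_step(theta, grad, step_size, precond, rng):
    """Diagonal-preconditioned SGLD (illustrative): both the drift term and the
    injected noise are rescaled per coordinate, so parameters with very
    different scales receive appropriately scaled updates and noise."""
    noise = rng.normal(size=theta.shape) * np.sqrt(step_size * precond)
    return theta - 0.5 * step_size * precond * grad + noise
```

The key point the abstract makes is visible in the second function: the preconditioner must rescale the noise consistently with the gradient, not just the gradient alone.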
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
We present weight normalization: a reparameterization of the weight vectors
in a neural network that decouples the length of those weight vectors from
their direction. By reparameterizing the weights in this way we improve the
conditioning of the optimization problem and we speed up convergence of
stochastic gradient descent. Our reparameterization is inspired by batch
normalization but does not introduce any dependencies between the examples in a
minibatch. This means that our method can also be applied successfully to
recurrent models such as LSTMs and to noise-sensitive applications such as deep
reinforcement learning or generative models, for which batch normalization is
less well suited. Although our method is much simpler, it still provides much
of the speed-up of full batch normalization. In addition, the computational
overhead of our method is lower, permitting more optimization steps to be taken
in the same amount of time. We demonstrate the usefulness of our method on
applications in supervised image recognition, generative modelling, and deep
reinforcement learning
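The reparameterization described above is w = g · v / ||v||, which separates the norm g from the direction v / ||v||. A minimal sketch with the resulting chain-rule gradients (function names are illustrative):

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||, decoupling the length (g)
    of a weight vector from its direction (v / ||v||)."""
    return g * v / np.linalg.norm(v)

def weight_norm_grads(v, g, grad_w):
    """Gradients w.r.t. the new parameters, given the gradient w.r.t. w:
    dL/dg = (dL/dw) . v / ||v||
    dL/dv = (g / ||v||) dL/dw - (g dL/dg / ||v||^2) v"""
    norm = np.linalg.norm(v)
    grad_g = grad_w @ v / norm
    grad_v = (g / norm) * grad_w - (g * grad_g / norm ** 2) * v
    return grad_v, grad_g
```

Note that the gradient with respect to v is orthogonal to v itself, which is part of why the reparameterization improves the conditioning of the problem.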
Invariant backpropagation: how to train a transformation-invariant neural network
In many classification problems a classifier should be robust to small
variations in the input vector. This is a desired property not only for
particular transformations, such as translation and rotation in image
classification problems, but also for all others for which the change is small
enough to retain the object perceptually indistinguishable. We propose two
extensions of the backpropagation algorithm that train a neural network to be
robust to variations in the feature vector. While the first of them enforces
robustness of the loss function to all variations, the second method trains the
predictions to be robust to a particular variation which changes the loss
function the most. The second method demonstrates better results, but is
slightly slower. We analytically compare the proposed algorithm with the two
most similar approaches (Tangent BP and Adversarial Training) and propose
fast versions of them. In the experiments we compare all
algorithms in terms of classification accuracy and robustness to noise on MNIST
and CIFAR-10 datasets. Additionally we analyze how the performance of the
proposed algorithm depends on the dataset size and data augmentation
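The second method's idea of training against the single variation that changes the loss the most can be sketched on a tiny linear-logistic model. This is an illustrative sketch in the spirit of the abstract (and of the related Adversarial Training it cites), with all names and the model itself chosen for the example:

```python
import numpy as np

def loss(w, x, y):
    """Logistic loss of a linear model on one example (y in {-1, +1})."""
    return np.log1p(np.exp(-y * (w @ x)))

def input_grad(w, x, y):
    """Gradient of the loss w.r.t. the input x (closed form for this model)."""
    s = 1.0 / (1.0 + np.exp(y * (w @ x)))  # = sigmoid(-y * w@x)
    return -y * s * w

def worst_case_perturbed_loss(w, x, y, eps):
    """Evaluate the loss at the input shifted in the direction that increases
    the loss the most (sign of the input gradient), as in training the
    predictions to be robust to the most harmful small variation."""
    delta = eps * np.sign(input_grad(w, x, y))
    return loss(w, x + delta, y)
```

Training on `worst_case_perturbed_loss` instead of `loss` penalizes exactly the variation that hurts most, rather than all variations uniformly.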
Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks using PAC-Bayesian Analysis
The notion of flat minima has played a key role in the generalization studies
of deep learning models. However, existing definitions of the flatness are
known to be sensitive to the rescaling of parameters. The issue suggests that
the previous definitions of the flatness might not be a good measure of
generalization, because generalization is invariant to such rescalings. In this
paper, from the PAC-Bayesian perspective, we scrutinize the discussion
concerning the flat minima and introduce the notion of normalized flat minima,
which is free from the known scale dependence issues. Additionally, we
highlight the scale dependence of existing matrix-norm based generalization
error bounds, similar to the existing flat minima definitions. Our modified
notion of flatness does not suffer from this insufficiency either,
suggesting it might provide a better hierarchy over the hypothesis class
On the Importance of Consistency in Training Deep Neural Networks
We explain that the difficulties of training deep neural networks come from a
syndrome of three consistency issues. This paper describes our efforts in their
analysis and treatment. The first issue is the training speed inconsistency in
different layers. We propose to address it with an intuitive,
simple-to-implement, low-footprint second-order method. The second issue is the
scale inconsistency between the layer inputs and the layer residuals. We
explain how second-order information provides favorable convenience in removing
this roadblock. The third and most challenging issue is the inconsistency in
residual propagation. Based on the fundamental theorem of linear algebra, we
provide a mathematical characterization of the famous vanishing gradient
problem. Thus, an important design principle for future optimization and neural
network design is derived. We conclude this paper with the construction of a
novel contractive neural network
Automated Segmentation of Lesions in Ultrasound Using Semi-pixel-wise Cycle Generative Adversarial Nets
Breast cancer is the most common invasive cancer with the highest cancer
occurrence in females. Handheld ultrasound is one of the most efficient ways to
identify and diagnose breast cancer. The area and shape information of
a lesion are very helpful for clinicians to make diagnostic decisions. In this
study we propose a new deep-learning scheme, semi-pixel-wise cycle generative
adversarial net (SPCGAN) for segmenting the lesion in 2D ultrasound. The method
takes advantage of a fully convolutional network (FCN) and
a generative adversarial net to segment a lesion by using prior knowledge. We
compared the proposed method to the FCN alone and the level
set segmentation method on a test dataset consisting of 32 malignant lesions
and 109 benign lesions. Our proposed method achieved a Dice similarity
coefficient (DSC) of 0.92 while FCN and the level set achieved 0.90 and 0.79
respectively. In particular, for malignant lesions, our method significantly
increases the DSC of the FCN from 0.90 to 0.93 (p < 0.001).
The results show that our SPCGAN can obtain robust segmentation results and may
be used to relieve the radiologists' burden for annotation
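The Dice similarity coefficient (DSC) used to evaluate the segmentations above is DSC = 2|A ∩ B| / (|A| + |B|) for binary masks A and B. A minimal sketch (function name is illustrative):

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice similarity coefficient (DSC) between two binary masks.
    Returns 1.0 for identical masks and 0.0 for disjoint ones."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * intersection / denom if denom else 1.0
```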
Adaptive norms for deep learning with regularized Newton methods
We investigate the use of regularized Newton methods with adaptive norms for
optimizing neural networks. This approach can be seen as a second-order
counterpart of adaptive gradient methods, which we here show to be
interpretable as first-order trust region methods with ellipsoidal constraints.
In particular, we prove that the preconditioning matrix used in RMSProp and
Adam satisfies the necessary conditions for provable convergence of
second-order trust region methods with standard worst-case complexities on
general non-convex objectives. Furthermore, we run experiments across different
neural architectures and datasets to find that the ellipsoidal constraints
consistently outperform their spherical counterparts both in terms of number of
backpropagations and asymptotic loss value. Finally, we find comparable
performance to state-of-the-art first-order methods in terms of
backpropagations, but further advances in hardware are needed to render Newton
methods competitive in terms of computational time
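The interpretation above, that adaptive gradient methods are first-order trust-region methods with ellipsoidal constraints, can be seen in the RMSProp update itself: the preconditioner diag(sqrt(v) + eps) reshapes a spherical step into an ellipsoid aligned with per-coordinate gradient scales. A minimal sketch (names are illustrative):

```python
import numpy as np

def rmsprop_precond_step(grad, v, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp-style step. The running second-moment estimate v defines
    the ellipsoidal preconditioner diag(sqrt(v) + eps) that divides the
    gradient, equalizing per-coordinate step magnitudes."""
    v = beta * v + (1 - beta) * grad ** 2   # running second-moment estimate
    step = -lr * grad / (np.sqrt(v) + eps)  # ellipsoidally preconditioned step
    return step, v
```

With very differently scaled gradient coordinates, the resulting step magnitudes are nearly equal, which is the ellipsoidal constraint at work.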
Active Probabilistic Inference on Matrices for Pre-Conditioning in Stochastic Optimization
Pre-conditioning is a well-known concept that can significantly improve the
convergence of optimization algorithms. For noise-free problems, where good
pre-conditioners are not known a priori, iterative linear algebra methods offer
one way to efficiently construct them. For the stochastic optimization problems
that dominate contemporary machine learning, however, this approach is not
readily available. We propose an iterative algorithm inspired by classic
iterative linear solvers that uses a probabilistic model to actively infer a
pre-conditioner in situations where Hessian-projections can only be constructed
with strong Gaussian noise. The algorithm is empirically demonstrated to
efficiently construct effective pre-conditioners for stochastic gradient
descent and its variants. Experiments on problems of comparably low
dimensionality show improved convergence. In very high-dimensional problems,
such as those encountered in deep learning, the pre-conditioner effectively
becomes an automatic learning-rate adaptation scheme, which we also empirically
show to work well.
Comment: Conference
Sequence Training of DNN Acoustic Models With Natural Gradient
Deep Neural Network (DNN) acoustic models often use discriminative sequence
training that optimises an objective function that better approximates the word
error rate (WER) than frame-based training. Sequence training is normally
implemented using Stochastic Gradient Descent (SGD) or Hessian Free (HF)
training. This paper proposes an alternative batch style optimisation framework
that employs a Natural Gradient (NG) approach to traverse through the parameter
space. By correcting the gradient according to the local curvature of the
KL-divergence, the NG optimisation process converges more quickly than HF.
Furthermore, the proposed NG approach can be applied to any sequence
discriminative training criterion. The efficacy of the NG method is shown using
experiments on a Multi-Genre Broadcast (MGB) transcription task that
demonstrates both the computational efficiency and the accuracy of the
resulting DNN models.
Comment: In Proceedings of IEEE ASRU 201
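The core of the NG approach described above is to correct the gradient by the local curvature of the KL-divergence, i.e. by an (inverse) Fisher information estimate. A minimal sketch using an empirical Fisher built from per-example gradients; the function name, the empirical-Fisher choice, and the damping term are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def natural_gradient_step(theta, grads_per_example, lr=0.1, damping=1e-4):
    """Natural-gradient update: solve F d = g where F is an empirical Fisher
    estimate (mean of outer products of per-example gradients), then step
    against d. The damping term keeps the linear solve well conditioned."""
    g = grads_per_example.mean(axis=0)
    fisher = grads_per_example.T @ grads_per_example / len(grads_per_example)
    fisher += damping * np.eye(len(theta))
    return theta - lr * np.linalg.solve(fisher, g)
```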
A Stochastic LBFGS Algorithm for Radio Interferometric Calibration
We present a stochastic, limited-memory Broyden-Fletcher-Goldfarb-Shanno
(LBFGS) algorithm that is suitable for handling very large amounts of data. A
direct application of this algorithm is radio interferometric calibration of
raw data at fine time and frequency resolution. Almost all existing radio
interferometric calibration algorithms assume that it is possible to fit the
dataset being calibrated into memory. Therefore, the raw data is averaged in
time and frequency to reduce its size by many orders of magnitude before
calibration is performed. However, this averaging is detrimental for the
detection of some signals of interest that have narrow bandwidth and time
duration such as fast radio bursts (FRBs). Using the proposed algorithm, it is
possible to calibrate data at such a fine resolution that they cannot be
entirely loaded into memory, thus preserving such signals. As an additional
demonstration, we use the proposed algorithm for training deep neural networks
and compare the performance against the mainstream first order optimization
algorithms that are used in deep learning.
Comment: Draft; final version in IEEE Data Science Workshop 2019 proceedings
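The heart of any LBFGS variant, including the stochastic one above, is the two-loop recursion that applies the implicit inverse-Hessian estimate, stored as recent (s, y) = (parameter-difference, gradient-difference) pairs, to the current gradient. A minimal sketch of that recursion; in a stochastic setting the pairs would be built from minibatch gradients, and this is an illustrative sketch, not the paper's exact algorithm:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """LBFGS two-loop recursion: returns a descent direction -H_k @ grad,
    where H_k is the inverse-Hessian estimate implied by the stored
    (s, y) curvature pairs, without ever forming H_k explicitly."""
    q = grad.astype(float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q = q - a * y
        alphas.append((rho, a, s, y))
    if y_list:
        s, y = s_list[-1], y_list[-1]
        q = q * (s @ y) / (y @ y)                          # initial Hessian scaling
    for rho, a, s, y in reversed(alphas):                  # oldest to newest
        b = rho * (y @ q)
        q = q + (a - b) * s
    return -q
```

With an empty memory the recursion reduces to plain gradient descent, which is why LBFGS degrades gracefully when curvature pairs are scarce.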