Double Neural Counterfactual Regret Minimization
Counterfactual Regret Minimization (CFR) is a fundamental and effective
technique for solving Imperfect Information Games (IIG). However, the original
CFR algorithm only works for discrete state and action spaces, and the
resulting strategy is maintained as a tabular representation. Such tabular
representation limits the method from being directly applied to large games and
continuing to improve from a poor strategy profile. In this paper, we propose a
double neural representation for the imperfect information games, where one
neural network represents the cumulative regret, and the other represents the
average strategy. Furthermore, we adopt the counterfactual regret minimization
algorithm to optimize this double neural representation. To make neural
learning efficient, we also developed several novel techniques including a
robust sampling method, mini-batch Monte Carlo Counterfactual Regret
Minimization (MCCFR) and Monte Carlo Counterfactual Regret Minimization Plus
(MCCFR+), which may be of independent interest. Experimentally, we demonstrate
that the proposed double neural algorithm converges significantly better than
its reinforcement learning counterpart.
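To make the double neural representation concrete, the following is a minimal sketch (in PyTorch) of the idea described in this abstract, assuming a generic feed-forward encoding of information sets; the feature dimension, network sizes, and the InfoSetNet interface are illustrative assumptions rather than the authors' implementation. The current strategy is obtained by regret matching on the regrets predicted by the first network, while the second network directly outputs the average strategy.

    # Minimal sketch of the double neural representation: one network predicts
    # cumulative counterfactual regrets, the other the average strategy.
    # Feature encoding and network sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class InfoSetNet(nn.Module):
        """Maps an information-set feature vector to one value per action."""
        def __init__(self, in_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    regret_net = InfoSetNet(in_dim=32, n_actions=4)        # cumulative regrets
    avg_strategy_net = InfoSetNet(in_dim=32, n_actions=4)  # average strategy

    def current_strategy(infoset_features: torch.Tensor) -> torch.Tensor:
        """Regret matching on predicted cumulative regrets (CFR's policy rule)."""
        regrets = regret_net(infoset_features).clamp(min=0.0)
        total = regrets.sum(dim=-1, keepdim=True)
        uniform = torch.full_like(regrets, 1.0 / regrets.shape[-1])
        return torch.where(total > 0, regrets / total.clamp(min=1e-12), uniform)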
On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization
For SGD based distributed stochastic optimization, computation complexity,
measured by the convergence rate in terms of the number of stochastic gradient
calls, and communication complexity, measured by the number of inter-node
communication rounds, are the two most important performance metrics. The
classical data-parallel implementation of SGD over multiple workers can achieve linear speedup
of its convergence rate but incurs an inter-node communication round at each
batch. We study the benefit of using dynamically increasing batch sizes in
parallel SGD for stochastic non-convex optimization by characterizing the
attained convergence rate and the required number of communication rounds. We
show that for stochastic non-convex optimization under the P-L condition, the
classical data-parallel SGD with exponentially increasing batch sizes can
achieve the fastest known convergence with linear speedup using only a
logarithmic number of communication rounds. For general stochastic non-convex
optimization, we propose a Catalyst-like algorithm to achieve the fastest known
convergence while requiring substantially fewer
communication rounds.Comment: A short version is accepted to ICML 2019
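As a rough, single-process illustration of the dynamic batch-size schedule described above (not the authors' distributed implementation): every round, each worker computes a stochastic gradient on a local batch, the gradients are averaged in one communication round, and the batch size doubles, so the number of rounds grows only logarithmically in the total number of gradient calls. The toy quadratic objective and constants below are assumptions.

    # Schematic simulation of data-parallel SGD with exponentially increasing
    # batch sizes: one "communication round" per batch, batch size doubling
    # each round. Objective and constants are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_workers, lr = 10, 4, 0.1
    w = np.zeros(dim)
    w_star = rng.normal(size=dim)       # optimum of the toy stochastic objective

    def stochastic_grad(w, batch_size):
        """Noisy gradient of 0.5*||w - w_star||^2; noise shrinks with batch size."""
        noise = rng.normal(size=w.shape) / np.sqrt(batch_size)
        return (w - w_star) + noise

    batch_size, total_grads, rounds, budget = 1, 0, 0, 10_000
    while total_grads < budget:
        # Each worker's gradient is computed locally, then averaged -- this
        # averaging is the single inter-node communication round per batch.
        grads = [stochastic_grad(w, batch_size) for _ in range(n_workers)]
        w -= lr * np.mean(grads, axis=0)
        total_grads += n_workers * batch_size
        rounds += 1
        batch_size *= 2                 # exponentially increasing batch size

    print(f"rounds={rounds}, gradient calls={total_grads}, "
          f"error={np.linalg.norm(w - w_star):.3f}")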
The Global Convergence of the Alternating Minimization Algorithm for Deep Neural Network Problems
In recent years, stochastic gradient descent (SGD) and its variants have been
the dominant optimization methods for training deep neural networks. However,
SGD suffers from limitations such as the lack of theoretical guarantees,
vanishing gradients, excessive sensitivity to input, and difficulties solving
highly non-smooth constraints and functions. To overcome these drawbacks,
alternating minimization-based methods for deep neural network optimization
have attracted fast-increasing attention recently. As an emerging and open
domain, however, several new challenges need to be addressed, including: 1)
there is no guarantee of global convergence under mild, practical conditions,
and 2) cubic time complexity in the size of feature dimensions. We therefore
propose a novel Deep Learning Alternating Minimization (DLAM) algorithm to deal
with these two challenges. Our inequality-constrained formulation approximates
the original problem with non-convex equality constraints arbitrarily closely,
enabling our proof of global convergence of the DLAM algorithm under mild,
practical conditions. The cubic time complexity in the feature dimension is
reduced via a dedicated algorithm design for the subproblems that is enhanced
by iterative quadratic approximations and backtracking. Experiments
on benchmark datasets demonstrate the effectiveness of our proposed DLAM
algorithm.
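For concreteness, the kind of constrained reformulation that alternating-minimization training methods typically start from can be written as below; the exact DLAM formulation may differ in its details, so this is only a hedged sketch of the equality-constrained training problem and an inequality relaxation of it.

    % Equality-constrained view of training an L-layer network (generic sketch):
    \min_{\{W_l\},\{z_l\}} \; \ell(z_L, y)
      \quad \text{s.t.} \quad z_l = \sigma(W_l z_{l-1}), \quad l = 1, \dots, L.
    % Inequality-constrained relaxation, which approaches the problem above
    % arbitrarily closely as \epsilon \to 0 and is minimized block by block:
    \min_{\{W_l\},\{z_l\}} \; \ell(z_L, y)
      \quad \text{s.t.} \quad \|z_l - \sigma(W_l z_{l-1})\|^2 \le \epsilon, \quad l = 1, \dots, L.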
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
Recent works have shown that stochastic gradient descent (SGD) achieves the
fast convergence rates of full-batch gradient descent for over-parameterized
models satisfying certain interpolation conditions. However, the step-size used
in these works depends on unknown quantities and SGD's practical performance
heavily relies on the choice of this step-size. We propose to use line-search
techniques to automatically set the step-size when training models that can
interpolate the data. In the interpolation setting, we prove that SGD with a
stochastic variant of the classic Armijo line-search attains the deterministic
convergence rates for both convex and strongly-convex functions. Under
additional assumptions, SGD with Armijo line-search is shown to achieve fast
convergence for non-convex functions. Furthermore, we show that stochastic
extra-gradient with a Lipschitz line-search attains linear convergence for an
important class of non-convex functions and saddle-point problems satisfying
interpolation. To improve the proposed methods' practical performance, we give
heuristics to use larger step-sizes and acceleration. We compare the proposed
algorithms against numerous optimization methods on standard classification
tasks using both kernel methods and deep networks. The proposed methods result
in competitive performance across all models and datasets, while being robust
to the precise choices of hyper-parameters. For multi-class classification
using deep networks, SGD with Armijo line-search results in both faster
convergence and better generalization.Comment: Added a citation to the related work of Paul Tseng, and citations to
methods that had previously explored line-searches for deep learning
empirically
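A minimal sketch of the stochastic Armijo backtracking rule described above: the step-size is shrunk until the loss on the current mini-batch satisfies a sufficient-decrease condition at the candidate point, evaluated on that same mini-batch. The constants (c, the backtracking factor, the maximal step-size) are illustrative assumptions, not the paper's recommended settings.

    # Minimal sketch of SGD with a stochastic Armijo line-search: backtrack the
    # step-size until the *same* mini-batch loss shows sufficient decrease.
    # Constants (c, beta, eta_max) are illustrative assumptions.
    import numpy as np

    def sgd_armijo_step(w, batch_loss, batch_grad,
                        eta_max=1.0, c=0.1, beta=0.5, max_backtracks=30):
        """One step; batch_loss(w) and batch_grad(w) must use the same mini-batch."""
        loss0 = batch_loss(w)
        g = batch_grad(w)
        g_norm_sq = np.dot(g, g)
        eta = eta_max
        for _ in range(max_backtracks):
            w_new = w - eta * g
            # Armijo sufficient-decrease condition on the current mini-batch.
            if batch_loss(w_new) <= loss0 - c * eta * g_norm_sq:
                return w_new, eta
            eta *= beta                   # shrink the step-size and retry
        return w - eta * g, eta           # fall back to the smallest step tried

    # Toy usage on a quadratic "mini-batch" loss.
    w, _ = sgd_armijo_step(np.ones(5), lambda w: 0.5 * np.sum(w ** 2), lambda w: w)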
Fast learning rate of deep learning via a kernel perspective
We develop a new theoretical framework to analyze the generalization error of
deep learning, and derive a new fast learning rate for two representative
algorithms: empirical risk minimization and Bayesian deep learning. The series
of theoretical analyses of deep learning has revealed its high expressive power
and universal approximation capability. Although these analyses are highly
nonparametric, existing generalization error analyses have been developed
mainly in a fixed dimensional parametric model. To compensate for this gap, we
develop an infinite dimensional model that is based on an integral form as
performed in the analysis of the universal approximation capability. This
allows us to define a reproducing kernel Hilbert space corresponding to each
layer. Our point of view is to deal with the ordinary finite dimensional deep
neural network as a finite approximation of the infinite dimensional one. The
approximation error is evaluated by the degree of freedom of the reproducing
kernel Hilbert space in each layer. To estimate a good finite dimensional
model, we consider both empirical risk minimization and Bayesian deep
learning. We derive their generalization error bounds and show that a
bias-variance trade-off appears in terms of the number of parameters of the
finite dimensional approximation. We show that the optimal width of the
internal layers can be determined through the degree of freedom, and the
convergence rate can be faster than the rates shown
in the existing studies.Comment: 36 pages
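To make the "degree of freedom" quantity more concrete, one standard way to write the integral (infinite dimensional) form of a layer and the degrees of freedom of the induced reproducing kernel Hilbert space is sketched below in common kernel-methods notation; the paper's exact definitions may differ, so this is a hedged illustration only.

    % Integral (infinite dimensional) form of a layer, with feature map h and
    % a measure \mu over the layer's parameters w (generic notation):
    f^{(l)}(x) = \int h^{(l)}(w, x) \, d\mu^{(l)}(w),
    % which induces a reproducing kernel Hilbert space for each layer. One
    % common definition of its degrees of freedom at regularization level
    % \lambda, via the layer's covariance operator \Sigma_l, is
    N_l(\lambda) = \mathrm{tr}\!\left( \Sigma_l \left( \Sigma_l + \lambda I \right)^{-1} \right).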
A Unified Framework for Training Neural Networks
The lack of mathematical tractability of Deep Neural Networks (DNNs) has
hindered progress towards having a unified convergence analysis of training
algorithms, in the general setting. We propose a unified optimization framework
for training different types of DNNs, and establish its convergence for
arbitrary loss, activation, and regularization functions, assumed to be smooth.
We show that the framework generalizes well-known first- and second-order training
methods, and thus allows us to show the convergence of these methods for
various DNN architectures and learning tasks, as a special case of our
approach. We discuss some of its applications in training various DNN
architectures (e.g., feed-forward, convolutional, linear networks), to
regression and classification tasks.Comment: 15 pages, submitted to NIPS 201
When Does Stochastic Gradient Algorithm Work Well?
In this paper, we consider a general stochastic optimization problem which is
often at the core of supervised learning, such as deep learning and linear
classification. We consider a standard stochastic gradient descent (SGD) method
with a fixed, large step size and propose a novel assumption on the objective
function, under which this method has improved convergence rates (to a
neighborhood of the optimal solutions). We then empirically demonstrate that
these assumptions hold for logistic regression and standard deep neural
networks on classical data sets. Thus our analysis helps to explain when
efficient behavior can be expected from the SGD method in training
classification models and deep neural networks.
Deep Frank-Wolfe For Neural Network Optimization
Learning a deep neural network requires solving a challenging optimization
problem: it is a high-dimensional, non-convex and non-smooth minimization
problem with a large number of terms. The current practice in neural network
optimization is to rely on the stochastic gradient descent (SGD) algorithm or
its adaptive variants. However, SGD requires a hand-designed schedule for the
learning rate. In addition, its adaptive variants tend to produce solutions
that generalize less well on unseen data than SGD with a hand-designed
schedule. We present an optimization method that offers empirically the best of
both worlds: our algorithm yields good generalization performance while
requiring only one hyper-parameter. Our approach is based on a composite
proximal framework, which exploits the compositional nature of deep neural
networks and can leverage powerful convex optimization algorithms by design.
Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes
an optimal step-size in closed-form at each time-step. We further show that the
descent direction is given by a simple backward pass in the network, yielding
the same computational cost per iteration as SGD. We present experiments on the
CIFAR and SNLI data sets, where we demonstrate the significant superiority of
our method over Adam, Adagrad, as well as the recently proposed BPGrad and
AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed
learning rate schedule, and show that it provides similar generalization while
converging faster. The code is publicly available at
https://github.com/oval-group/dfw.Comment: Published as a conference paper at ICLR 2019
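The closed-form step-size is the part most easily illustrated in isolation. The sketch below shows a generic Frank-Wolfe step with an exact, closed-form step-size for a quadratic objective over the probability simplex (the structure of an SVM dual); it illustrates the mechanism the abstract refers to and is not the authors' DFW algorithm itself.

    # Generic Frank-Wolfe step with a closed-form (exact line-search) step-size
    # for a quadratic objective over the probability simplex. A hedged
    # illustration of the mechanism, not the DFW algorithm itself.
    import numpy as np

    def frank_wolfe_quadratic(Q, b, n_steps=100):
        """Minimize 0.5*a^T Q a - b^T a over the simplex {a >= 0, sum(a) = 1}."""
        n = len(b)
        a = np.full(n, 1.0 / n)                  # start at the simplex centre
        for _ in range(n_steps):
            grad = Q @ a - b
            s = np.zeros(n)
            s[np.argmin(grad)] = 1.0             # linear minimization oracle: best vertex
            d = s - a                            # Frank-Wolfe direction
            curvature = d @ Q @ d
            if curvature <= 1e-12:
                break
            # Exact minimizer of the quadratic along d, clipped to [0, 1].
            gamma = np.clip(-(grad @ d) / curvature, 0.0, 1.0)
            a += gamma * d
        return a

    # Toy usage with an assumed random positive semi-definite matrix.
    rng = np.random.default_rng(0)
    M = rng.normal(size=(5, 5))
    a = frank_wolfe_quadratic(M @ M.T, rng.normal(size=5))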
Functional Gradient Boosting based on Residual Network Perception
Residual Networks (ResNets) have become state-of-the-art models in deep
learning and several theoretical studies have been devoted to understanding why
ResNet works so well. One attractive viewpoint on ResNet is that it is
optimizing the risk in a functional space by combining an ensemble of effective
features. In this paper, we adopt this viewpoint to construct a new gradient
boosting method, which is known to be very powerful in data analysis. To do so,
we formalize the gradient boosting perspective of ResNet mathematically using
the notion of functional gradients and propose a new method called ResFGB for
classification tasks by leveraging ResNet perception. Two types of
generalization guarantees are provided from the optimization perspective: one
is the margin bound and the other is the expected risk bound by the
sample-splitting technique. Experimental results show superior performance of
the proposed method over state-of-the-art methods such as LightGBM.Comment: 22 pages, 1 figure, 1 table. An extended version of ICML 2018 paper
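A hedged sketch of the "ResNet as functional gradient boosting" viewpoint the abstract builds on: features are updated by residual steps that follow the negative functional gradient of a logistic loss with respect to the current features, with a linear classifier refit at each stage. The classifier, step size, and number of stages are illustrative assumptions, not the ResFGB algorithm as published.

    # Hedged sketch of functional gradient boosting with residual feature
    # updates. Ridge classifier, step size and stage count are assumptions.
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def resfgb_like(X, y, n_stages=10, eta=0.5, ridge=1e-2):
        """y in {0,1}; returns boosted features and the final linear classifier."""
        Z = X.copy()
        n, d = Z.shape
        w = np.zeros(d)
        for _ in range(n_stages):
            # Refit a simple ridge-regression classifier on the current features
            # against +/-1 labels (a least-squares surrogate, for simplicity).
            w = np.linalg.solve(Z.T @ Z + ridge * np.eye(d), Z.T @ (2 * y - 1))
            # Functional gradient of the logistic loss w.r.t. the features Z.
            p = sigmoid(Z @ w)
            grad_Z = np.outer(p - y, w)          # per-sample d(loss)/dZ
            # Residual-block update: move the features against the gradient.
            Z = Z - eta * grad_Z
        return Z, w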
A Dual-Dimer Method for Training Physics-Constrained Neural Networks with Minimax Architecture
Data sparsity is a common issue in training machine learning tools such as
neural networks for engineering and scientific applications, where experiments
and simulations are expensive. Recently physics-constrained neural networks
(PCNNs) were developed to reduce the required amount of training data. However,
the weights of different losses from data and physical constraints are adjusted
empirically in PCNNs. In this paper, a new physics-constrained neural network
with the minimax architecture (PCNN-MM) is proposed so that the weights of
different losses can be adjusted systematically. Training the PCNN-MM amounts
to searching for high-order saddle points of the objective function. A novel
saddle point search algorithm called Dual-Dimer method is developed. It is
demonstrated that the Dual-Dimer method is computationally more efficient than
the gradient descent ascent method for nonconvex-nonconcave functions and
provides additional eigenvalue information to verify search results. A heat
transfer example also shows that the convergence of PCNN-MMs is faster than
that of traditional PCNNs.Comment: 34 pages, 5 figures, accepted by Neural Networks
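To illustrate the minimax loss-weighting idea (though not the Dual-Dimer search itself), the sketch below trains a toy physics-constrained network with plain gradient descent ascent, the baseline method the abstract compares against: the model parameters are updated by descent while the weights on the data and physics losses are updated by ascent. The model, the placeholder physics residual, and all constants are assumptions.

    # Sketch of the minimax weighting of data and physics losses, trained with
    # plain gradient descent ascent (the baseline compared against above), not
    # the Dual-Dimer method. Model, residual and constants are assumptions.
    import torch

    model = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                                torch.nn.Linear(32, 1))
    log_weights = torch.zeros(2, requires_grad=True)  # unconstrained loss weights

    opt_min = torch.optim.SGD(model.parameters(), lr=1e-3)  # descent (model)
    opt_max = torch.optim.SGD([log_weights], lr=1e-3)       # ascent (weights)

    def losses(x_data, y_data, x_phys):
        data_loss = ((model(x_data) - y_data) ** 2).mean()
        # Placeholder "physics" residual: penalize du/dx - 1 at collocation
        # points (an assumed stand-in for the real governing-equation residual).
        x_phys = x_phys.clone().requires_grad_(True)
        du = torch.autograd.grad(model(x_phys).sum(), x_phys, create_graph=True)[0]
        phys_loss = ((du - 1.0) ** 2).mean()
        return data_loss, phys_loss

    for step in range(1000):
        x_d, y_d, x_p = torch.rand(16, 1), torch.rand(16, 1), torch.rand(32, 1)
        data_loss, phys_loss = losses(x_d, y_d, x_p)
        weights = torch.softmax(log_weights, dim=0)   # weights on the simplex
        objective = weights[0] * data_loss + weights[1] * phys_loss
        opt_min.zero_grad(); opt_max.zero_grad()
        objective.backward()
        opt_min.step()                                # descend in model parameters
        log_weights.grad.neg_()                       # flip sign to ascend in weights
        opt_max.step()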