227 research outputs found
Competitive Gradient Descent
We introduce a new algorithm for the numerical computation of Nash equilibria
of competitive two-player games. Our method is a natural generalization of
gradient descent to the two-player setting where the update is given by the
Nash equilibrium of a regularized bilinear local approximation of the
underlying game. It avoids oscillatory and divergent behaviors seen in
alternating gradient descent. Using numerical experiments and rigorous
analysis, we provide a detailed comparison to methods based on \emph{optimism}
and \emph{consensus} and show that our method avoids making any unnecessary
changes to the gradient dynamics while achieving exponential (local)
convergence for (locally) convex-concave zero sum games. Convergence and
stability properties of our method are robust to strong interactions between
the players, without adapting the stepsize, which is not the case with previous
methods. In our numerical experiments on non-convex-concave problems, existing
methods are prone to divergence and instability due to their sensitivity to
interactions among the players, whereas we never observe divergence of our
algorithm. The ability to choose larger stepsizes furthermore allows our
algorithm to achieve faster convergence, as measured by the number of model
evaluations.Comment: Appeared in NeurIPS 2019. This version corrects an error in theorem
2.2. Source code used for the numerical experiments can be found under
http://github.com/f-t-s/CGD. A high-level overview of this work can be found
under http://f-t-s.github.io/projects/cgd
Multi-Object Classification and Unsupervised Scene Understanding Using Deep Learning Features and Latent Tree Probabilistic Models
Deep learning has shown state-of-art classification performance on datasets
such as ImageNet, which contain a single object in each image. However,
multi-object classification is far more challenging. We present a unified
framework which leverages the strengths of multiple machine learning methods,
viz deep learning, probabilistic models and kernel methods to obtain
state-of-art performance on Microsoft COCO, consisting of non-iconic images. We
incorporate contextual information in natural images through a conditional
latent tree probabilistic model (CLTM), where the object co-occurrences are
conditioned on the extracted fc7 features from pre-trained Imagenet CNN as
input. We learn the CLTM tree structure using conditional pairwise
probabilities for object co-occurrences, estimated through kernel methods, and
we learn its node and edge potentials by training a new 3-layer neural network,
which takes fc7 features as input. Object classification is carried out via
inference on the learnt conditional tree model, and we obtain significant gain
in precision-recall and F-measures on MS-COCO, especially for difficult object
categories. Moreover, the latent variables in the CLTM capture scene
information: the images with top activations for a latent node have common
themes such as being a grasslands or a food scene, and on on. In addition, we
show that a simple k-means clustering of the inferred latent nodes alone
significantly improves scene classification performance on the MIT-Indoor
dataset, without the need for any retraining, and without using scene labels
during training. Thus, we present a unified framework for multi-object
classification and unsupervised scene understanding
Efficient approaches for escaping higher order saddle points in non-convex optimization
Local search heuristics for non-convex optimizations are popular in applied
machine learning. However, in general it is hard to guarantee that such
algorithms even converge to a local minimum, due to the existence of
complicated saddle point structures in high dimensions. Many functions have
degenerate saddle points such that the first and second order derivatives
cannot distinguish them with local optima. In this paper we use higher order
derivatives to escape these saddle points: we design the first efficient
algorithm guaranteed to converge to a third order local optimum (while existing
techniques are at most second order). We also show that it is NP-hard to extend
this further to finding fourth order local optima
Training Input-Output Recurrent Neural Networks through Spectral Methods
We consider the problem of training input-output recurrent neural networks
(RNN) for sequence labeling tasks. We propose a novel spectral approach for
learning the network parameters. It is based on decomposition of the
cross-moment tensor between the output and a non-linear transformation of the
input, based on score functions. We guarantee consistent learning with
polynomial sample and computational complexity under transparent conditions
such as non-degeneracy of model parameters, polynomial activations for the
neurons, and a Markovian evolution of the input sequence. We also extend our
results to Bidirectional RNN which uses both previous and future information to
output the label at each time point, and is employed in many NLP tasks such as
POS tagging
Score Function Features for Discriminative Learning: Matrix and Tensor Framework
Feature learning forms the cornerstone for tackling challenging learning
problems in domains such as speech, computer vision and natural language
processing. In this paper, we consider a novel class of matrix and
tensor-valued features, which can be pre-trained using unlabeled samples. We
present efficient algorithms for extracting discriminative information, given
these pre-trained features and labeled samples for any related task. Our class
of features are based on higher-order score functions, which capture local
variations in the probability density function of the input. We establish a
theoretical framework to characterize the nature of discriminative information
that can be extracted from score-function features, when used in conjunction
with labeled samples. We employ efficient spectral decomposition algorithms (on
matrices and tensors) for extracting discriminative components. The advantage
of employing tensor-valued features is that we can extract richer
discriminative information in the form of an overcomplete representations.
Thus, we present a novel framework for employing generative models of the input
for discriminative learning.Comment: 29 page
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
Training neural networks is a challenging non-convex optimization problem,
and backpropagation or gradient descent can get stuck in spurious local optima.
We propose a novel algorithm based on tensor decomposition for guaranteed
training of two-layer neural networks. We provide risk bounds for our proposed
method, with a polynomial sample complexity in the relevant parameters, such as
input dimension and number of neurons. While learning arbitrary target
functions is NP-hard, we provide transparent conditions on the function and the
input for learnability. Our training method is based on tensor decomposition,
which provably converges to the global optimum, under a set of mild
non-degeneracy conditions. It consists of simple embarrassingly parallel linear
and multi-linear operations, and is competitive with standard stochastic
gradient descent (SGD), in terms of computational complexity. Thus, we propose
a computationally efficient method with guaranteed risk bounds for training
neural networks with one hidden layer.Comment: The tensor decomposition analysis is expanded, and the analysis of
ridge regression is added for recovering the parameters of last layer of
neural networ
Compact Tensor Pooling for Visual Question Answering
Performing high level cognitive tasks requires the integration of feature
maps with drastically different structure. In Visual Question Answering (VQA)
image descriptors have spatial structures, while lexical inputs inherently
follow a temporal sequence. The recently proposed Multimodal Compact Bilinear
pooling (MCB) forms the outer products, via count-sketch approximation, of the
visual and textual representation at each spatial location. While this
procedure preserves spatial information locally, outer-products are taken
independently for each fiber of the activation tensor, and therefore do not
include spatial context. In this work, we introduce multi-dimensional sketch
({MD-sketch}), a novel extension of count-sketch to tensors. Using this new
formulation, we propose Multimodal Compact Tensor Pooling (MCT) to fully
exploit the global spatial context during bilinear pooling operations.
Contrarily to MCB, our approach preserves spatial context by directly
convolving the MD-sketch from the visual tensor features with the text vector
feature using higher order FFT. Furthermore we apply MCT incrementally at each
step of the question embedding and accumulate the multi-modal vectors with a
second LSTM layer before the final answer is chosen
A Scale Mixture Perspective of Multiplicative Noise in Neural Networks
Corrupting the input and hidden layers of deep neural networks (DNNs) with
multiplicative noise, often drawn from the Bernoulli distribution (or
'dropout'), provides regularization that has significantly contributed to deep
learning's success. However, understanding how multiplicative corruptions
prevent overfitting has been difficult due to the complexity of a DNN's
functional form. In this paper, we show that when a Gaussian prior is placed on
a DNN's weights, applying multiplicative noise induces a Gaussian scale
mixture, which can be reparameterized to circumvent the problematic likelihood
function. Analysis can then proceed by using a type-II maximum likelihood
procedure to derive a closed-form expression revealing how regularization
evolves as a function of the network's weights. Results show that
multiplicative noise forces weights to become either sparse or invariant to
rescaling. We find our analysis has implications for model compression as it
naturally reveals a weight pruning rule that starkly contrasts with the
commonly used signal-to-noise ratio (SNR). While the SNR prunes weights with
large variances, seeing them as noisy, our approach recognizes their robustness
and retains them. We empirically demonstrate our approach has a strong
advantage over the SNR heuristic and is competitive to retraining with soft
targets produced from a teacher model
- …