Unbiased scalable softmax optimization
Recent neural network and language models rely on softmax distributions with
an extremely large number of categories. Since calculating the softmax
normalizing constant in this context is prohibitively expensive, there is a
growing literature of efficiently computable but biased estimates of the
softmax. In this paper we propose the first unbiased algorithms for maximizing
the softmax likelihood whose work per iteration is independent of the number of
classes and datapoints (and no extra work is required at the end of each
epoch). We show that our proposed unbiased methods comprehensively outperform
the state of the art on seven real-world datasets.
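As a point of reference for the cost the paper targets, the sketch below (Python/NumPy, with illustrative names; it is not the authors' estimator) shows why the per-example log-likelihood of a full softmax scales with the number of classes K: the normalizing constant sums over every class.

```python
import numpy as np

def softmax_log_likelihood(W, x, y):
    """Full-softmax log-likelihood for one datapoint.

    The normalizing constant sums over all K classes, so every gradient
    step costs O(K) -- the dependence the paper's unbiased updates remove.
    """
    logits = W @ x                       # shape (K,)
    log_z = np.logaddexp.reduce(logits)  # log sum_k exp(logits_k), an O(K) reduction
    return logits[y] - log_z

# Toy usage with a large class count
K, d = 100_000, 64
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(K, d))
x = rng.normal(size=d)
print(softmax_log_likelihood(W, x, y=3))
```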
Pairwise Supervised Hashing with Bernoulli Variational Auto-Encoder and Self-Control Gradient Estimator
Semantic hashing has become a crucial component of fast similarity search in
many large-scale information retrieval systems, in particular, for text data.
Variational auto-encoders (VAEs) with binary latent variables as hashing codes
provide state-of-the-art performance in terms of precision for document
retrieval. We propose a pairwise loss function with discrete latent VAE to
reward within-class similarity and between-class dissimilarity for supervised
hashing. Instead of solving the optimization relying on existing biased
gradient estimators, an unbiased low-variance gradient estimator is adopted to
optimize the hashing function by evaluating the non-differentiable loss
function over two correlated sets of binary hashing codes to control the
variance of gradient estimates. This new semantic hashing framework achieves
superior performance compared to the state of the art, as demonstrated by our
comprehensive experiments. Comment: To appear in UAI 202
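To make the within-class/between-class structure concrete, here is a minimal, hypothetical pairwise loss on binary hashing codes; the Hamming-distance form and the margin value are assumptions for illustration, and the paper's actual loss and correlated-sample gradient estimator are not reproduced.

```python
import numpy as np

def pairwise_hash_loss(codes_a, codes_b, same_class, margin=8):
    """Illustrative pairwise loss on binary hashing codes.

    Rewards small Hamming distance for same-class pairs and pushes
    different-class pairs apart by at least `margin` bits.
    """
    hamming = np.sum(codes_a != codes_b, axis=-1).astype(float)
    pos = same_class * hamming                                   # pull same-class codes together
    neg = (1 - same_class) * np.maximum(0.0, margin - hamming)   # push different-class codes apart
    return np.mean(pos + neg)

# Two 16-bit codes per document; labels say whether each pair shares a class
a = np.array([[0, 1] * 8, [1, 0] * 8])
b = np.array([[0, 1] * 8, [0, 1] * 8])
y = np.array([1.0, 0.0])
print(pairwise_hash_loss(a, b, y))
```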
ARSM: Augment-REINFORCE-Swap-Merge Estimator for Gradient Backpropagation Through Categorical Variables
To address the challenge of backpropagating the gradient through categorical
variables, we propose the augment-REINFORCE-swap-merge (ARSM) gradient
estimator that is unbiased and has low variance. ARSM first uses variable
augmentation, REINFORCE, and Rao-Blackwellization to re-express the gradient as
an expectation under the Dirichlet distribution, then uses variable swapping to
construct differently expressed but equivalent expectations, and finally shares
common random numbers between these expectations to achieve significant
variance reduction. Experimental results show ARSM closely resembles the
performance of the true gradient for optimization in univariate settings;
outperforms existing estimators by a large margin when applied to categorical
variational auto-encoders; and provides a "try-and-see self-critic" variance
reduction method for discrete-action policy gradient, which removes the need to
estimate baselines by generating a random number of pseudo actions and
estimating their action-value functions. Comment: Published in ICML 2019. We have updated Section 4.2 and the Appendix
to reflect the improvements brought by fixing some bugs hidden in our
original code. Please find the errata on the authors' websites and check the
updated code on GitHub.
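For context, the following sketch implements only the plain score-function (REINFORCE) estimator for a categorical variable, i.e. the high-variance baseline that ARSM's augmentation, swapping, and merging are designed to improve; the function names and the toy reward are assumptions, and none of ARSM itself is reproduced.

```python
import numpy as np

def reinforce_grad(phi, f, rng, n_samples=1000):
    """Score-function estimate of d/dphi E_{z ~ Cat(softmax(phi))}[f(z)]."""
    probs = np.exp(phi - np.logaddexp.reduce(phi))
    grads = np.zeros_like(phi)
    for _ in range(n_samples):
        z = rng.choice(len(phi), p=probs)
        score = -probs.copy()
        score[z] += 1.0                  # d/dphi log Cat(z | softmax(phi))
        grads += f(z) * score
    return grads / n_samples

rng = np.random.default_rng(0)
phi = np.zeros(4)
f = lambda z: float(z == 2)              # reward only category 2
print(reinforce_grad(phi, f, rng))       # positive on index 2, negative elsewhere
```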
Bayesian Incremental Learning for Deep Neural Networks
In industrial machine learning pipelines, data often arrive in parts.
Particularly in the case of deep neural networks, it may be too expensive to
train the model from scratch each time, so one would rather use a previously
learned model and the new data to improve performance. However, deep neural
networks are prone to getting stuck in a suboptimal solution when trained only
on the new data, as compared to training on the full dataset. Our work focuses on a continuous
learning setup where the task is always the same and new parts of data arrive
sequentially. We apply a Bayesian approach to update the posterior
approximation with each new piece of data and find this method to outperform
the traditional approach in our experiments.
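The underlying principle, that the posterior from earlier data becomes the prior for the next batch, can be illustrated on a conjugate model; the sketch below does this for Bayesian linear regression, whereas the paper applies the same idea to deep network weights via approximate posteriors. All names and values here are illustrative.

```python
import numpy as np

def bayes_linreg_update(mean, prec, X, y, noise_prec=1.0):
    """One incremental Bayesian update of Gaussian weight beliefs.

    The posterior after the previous chunk of data (mean, prec) serves as
    the prior for the new chunk.
    """
    prec_new = prec + noise_prec * X.T @ X
    mean_new = np.linalg.solve(prec_new, prec @ mean + noise_prec * X.T @ y)
    return mean_new, prec_new

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
mean, prec = np.zeros(2), np.eye(2)           # prior N(0, I)
for _ in range(3):                             # data arrives in parts
    X = rng.normal(size=(50, 2))
    y = X @ w_true + 0.1 * rng.normal(size=50)
    mean, prec = bayes_linreg_update(mean, prec, X, y, noise_prec=100.0)
print(mean)                                    # approaches w_true
```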
Efficient variational Bayesian neural network ensembles for outlier detection
In this work we perform outlier detection using ensembles of neural networks
obtained by variational approximation of the posterior in a Bayesian neural
network setting. The variational parameters are obtained by sampling from the
true posterior by gradient descent. We show our outlier detection results are
comparable to those obtained using other efficient ensembling methods. Comment: Presented at Workshop track - ICLR 201
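One simple way to score outliers with such an ensemble is the entropy of the averaged predictive distribution; the sketch below shows only that scoring step (how the ensemble members are drawn from the variational posterior is not shown), and the numbers are made up for illustration.

```python
import numpy as np

def outlier_score(ensemble_probs):
    """Entropy of the ensemble-averaged prediction for one input.

    ensemble_probs: array of shape (n_members, n_classes).
    Higher entropy signals disagreement or low confidence, i.e. a
    candidate outlier.
    """
    mean_probs = ensemble_probs.mean(axis=0)
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12))

# Confident, agreeing members -> low score; disagreeing members -> high score
inlier = np.array([[0.97, 0.02, 0.01], [0.95, 0.04, 0.01]])
outlier = np.array([[0.80, 0.10, 0.10], [0.05, 0.90, 0.05]])
print(outlier_score(inlier), outlier_score(outlier))
```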
ADMM-SOFTMAX: An ADMM Approach for Multinomial Logistic Regression
We present ADMM-Softmax, an alternating direction method of multipliers
(ADMM) for solving multinomial logistic regression (MLR) problems. Our method
is geared toward supervised classification tasks with many examples and
features. It decouples the nonlinear optimization problem in MLR into three
steps that can be solved efficiently. In particular, each iteration of
ADMM-Softmax consists of a linear least-squares problem, a set of independent
small-scale smooth, convex problems, and a trivial dual variable update.
Solution of the least-squares problem can be accelerated by pre-computing a
factorization or preconditioner, and the separability in the smooth, convex
problem can be easily parallelized across examples. For two image
classification problems, we demonstrate that ADMM-Softmax leads to improved
generalization compared to Newton-Krylov, quasi-Newton, and stochastic
gradient descent methods.
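A rough sketch of that three-step structure is given below, with plain gradient steps standing in for the per-example convex sub-solver and no preconditioning; the variable names, toy data, and step sizes are assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def admm_softmax_sketch(X, Y, n_iters=50, rho=1.0, inner_steps=20, lr=0.2):
    """Simplified ADMM iteration for multinomial logistic regression.

    X: features (d, n); Y: one-hot labels (K, n).  Splits W X = Z, so each
    iteration is a least-squares step, per-example smooth convex problems,
    and a dual update.
    """
    d, n = X.shape
    K = Y.shape[0]
    Z, U = np.zeros((K, n)), np.zeros((K, n))
    A = X @ X.T + 1e-6 * np.eye(d)            # reusable least-squares system
    for _ in range(n_iters):
        # 1) linear least-squares step for the weights
        W = np.linalg.solve(A, X @ (Z - U).T).T
        # 2) independent smooth convex problem per example (a few gradient steps)
        V = W @ X + U
        for _ in range(inner_steps):
            Z -= lr * (softmax(Z) - Y + rho * (Z - V))
        # 3) trivial dual variable update
        U += W @ X - Z
    return W

# Tiny separable toy problem: three Gaussian clusters in 2-D
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 90)) + np.repeat(np.eye(3)[:, :2] * 4, 30, axis=0).T
Y = np.repeat(np.eye(3), 30, axis=1)
W = admm_softmax_sketch(X, Y)
print((softmax(W @ X).argmax(axis=0) == Y.argmax(axis=0)).mean())  # toy accuracy
```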
Safeguarded Dynamic Label Regression for Generalized Noisy Supervision
Learning with noisy labels, which aims to reduce the expensive labor of accurate
annotation, has become imperative in the Big Data era. Previous
noise-transition-based methods have achieved promising results and presented a
theoretical guarantee on performance in the case of class-conditional noise.
However, this type of approach critically depends on an accurate
pre-estimation of the noise transition, which is usually impractical.
Subsequent improvements adapt the pre-estimation along with the training
progress via a Softmax layer. However, the parameters in the Softmax layer are
heavily tweaked, and performance remains fragile due to the ill-posed stochastic
approximation. To address these issues, we propose a Latent Class-Conditional
Noise model (LCCN) that naturally embeds the noise transition under a Bayesian
framework. By projecting the noise transition into a Dirichlet-distributed
space, the learning is constrained on a simplex based on the whole dataset,
instead of some ad-hoc parametric space. We then derive a dynamic label
regression method for LCCN to iteratively infer the latent labels,
stochastically train the classifier, and model the noise. Our approach keeps
the update of the noise transition bounded, avoiding the arbitrary tuning from
a batch of samples that previous methods relied on. We further generalize LCCN for
open-set noisy labels and the semi-supervised setting. We perform extensive
experiments with the controllable-noise datasets CIFAR-10 and CIFAR-100 and
the agnostic-noise datasets Clothing1M and WebVision17. The experimental
results demonstrate that the proposed model outperforms several
state-of-the-art methods. Comment: Submitted to Transactions on Image Processing
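The role of the noise transition can be illustrated directly: a row-stochastic matrix T maps beliefs over latent (clean) labels to beliefs over observed (noisy) labels, and LCCN places a Dirichlet prior over such rows. The snippet below shows only that mapping and a Dirichlet draw, with made-up numbers; it is not the paper's inference procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class-conditional noise: T[i, j] = p(observed label j | true label i)
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

def noisy_label_probs(clean_probs, T):
    """p(observed label | x) given the classifier's clean-label posterior."""
    return clean_probs @ T

clean = np.array([0.7, 0.2, 0.1])        # classifier's belief over true labels
print(noisy_label_probs(clean, T))       # belief over the noisy labels we observe

# A Dirichlet-distributed transition row, matching the simplex constraint
print(rng.dirichlet(alpha=[10.0, 1.0, 1.0]))
```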
SHOPPER: A Probabilistic Model of Consumer Choice with Substitutes and Complements
We develop SHOPPER, a sequential probabilistic model of shopping data.
SHOPPER uses interpretable components to model the forces that drive how a
customer chooses products; in particular, we designed SHOPPER to capture how
items interact with other items. We develop an efficient posterior inference
algorithm to estimate these forces from large-scale data, and we analyze a
large dataset from a major chain grocery store. We are interested in answering
counterfactual queries about changes in prices. We found that SHOPPER provides
accurate predictions even under price interventions, and that it helps identify
complementary and substitutable pairs of products. Comment: Published at Annals of Applied Statistics. 27 pages, 4 figures
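As a toy illustration of item interactions and price counterfactuals (not SHOPPER's actual factorization), the sketch below scores candidate items by their interaction with the current basket minus a price term, then re-evaluates the choice probabilities after doubling one item's price; all embeddings, names, and parameters are invented.

```python
import numpy as np

def choice_probs(item_emb, interact_emb, basket, log_prices, price_sens):
    """Softmax choice distribution over items given the current basket."""
    basket_vec = interact_emb[basket].mean(axis=0) if basket else np.zeros(item_emb.shape[1])
    utility = item_emb @ basket_vec - price_sens * log_prices
    utility -= utility.max()
    p = np.exp(utility)
    return p / p.sum()

rng = np.random.default_rng(0)
n_items, dim = 6, 4
item_emb = rng.normal(size=(n_items, dim))
interact_emb = rng.normal(size=(n_items, dim))
log_prices = np.log(rng.uniform(1.0, 5.0, size=n_items))

p_before = choice_probs(item_emb, interact_emb, [2], log_prices, price_sens=1.0)
log_prices[3] += np.log(2.0)              # counterfactual: double item 3's price
p_after = choice_probs(item_emb, interact_emb, [2], log_prices, price_sens=1.0)
print(p_before[3], p_after[3])            # predicted demand for item 3 drops
```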
Multi-modal Geolocation Estimation Using Deep Neural Networks
Estimating the location where an image was taken based solely on the contents
of the image is a challenging task, even for humans, as properly labeling an
image in such a fashion relies heavily on contextual information, and is not as
simple as identifying a single object in the image. Thus any methods which
attempt to do so must somehow account for these complexities, and no single
model to date is completely capable of addressing all challenges. This work
contributes to the state of research in image geolocation inference by
introducing a novel global meshing strategy, outlining a variety of training
procedures to overcome the considerable data limitations when training these
models, and demonstrating how incorporating additional information can be used
to improve the overall performance of a geolocation inference model. In this
work, it is shown that Delaunay triangles are an effective type of mesh for
geolocation in relatively low-volume scenarios when compared to results from
state-of-the-art models that use quad trees and an order of magnitude more
training data. In addition, the time of posting, learned user albuming, and
other metadata are easily incorporated, improving geolocation by up to 11% for
country-level (750 km) locality accuracy and up to 3% for city-level (25 km)
localities.
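Scoring such predictions against the 25 km and 750 km locality thresholds reduces to a great-circle distance computation; a standard haversine implementation (not taken from the paper's code) is sketched below.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Predicted vs. true coordinates for one photo (Paris vs. London)
d = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
print(d, d <= 25, d <= 750)   # correct at the country level only
```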
Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
Gradient-based optimization is the foundation of deep learning and
reinforcement learning. Even when the mechanism being optimized is unknown or
not differentiable, optimization using high-variance or biased gradient
estimates is still often the best strategy. We introduce a general framework
for learning low-variance, unbiased gradient estimators for black-box functions
of random variables. Our method uses gradients of a neural network trained
jointly with model parameters or policies, and is applicable in both discrete
and continuous settings. We demonstrate this framework for training discrete
latent-variable models. We also give an unbiased, action-conditional extension
of the advantage actor-critic reinforcement learning algorithm. Comment: Published at ICLR 201
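A stripped-down version of the idea, a score-function estimator whose free baseline is trained jointly to shrink the estimator's magnitude, is sketched below for a single Bernoulli variable; it keeps the estimator unbiased but omits the paper's neural-network control variate and reparameterized construction, and the objective and step sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator(theta, c, f, n=1):
    """REINFORCE gradient for a Bernoulli(sigmoid(theta)) sample, with a
    learned scalar baseline c; also returns the gradient of the squared
    estimate with respect to c (a simple variance proxy)."""
    p = 1.0 / (1.0 + np.exp(-theta))
    b = rng.random(n) < p
    score = b - p                                     # d/dtheta log p(b | theta)
    g_theta = (f(b) - c) * score                      # unbiased for any constant c
    g_c = -2.0 * np.mean((f(b) - c) * score * score)  # d/dc of the squared estimate
    return np.mean(g_theta), g_c

f = lambda b: (b.astype(float) - 0.45) ** 2           # black-box objective
theta, c = 0.0, 0.0
for _ in range(2000):
    g_t, g_c = estimator(theta, c, f)
    theta -= 0.1 * g_t                                # minimize E[f]
    c -= 0.05 * g_c                                   # jointly reduce estimator variance
print(1.0 / (1.0 + np.exp(-theta)), c)                # p driven toward 0; c acts as a baseline
```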