Compositional Stochastic Average Gradient for Machine Learning and Related Applications
Many machine learning, statistical inference, and portfolio optimization
problems require minimization of a composition of expected value functions
(CEVF). Of particular interest are the finite-sum versions of such compositional
optimization problems (FS-CEVF). Compositional stochastic variance reduced
gradient (C-SVRG) methods that combine stochastic compositional gradient
descent (SCGD) and stochastic variance reduced gradient descent (SVRG) methods
are the state-of-the-art methods for FS-CEVF problems. We introduce
compositional stochastic average gradient descent (C-SAG), a novel extension of
the stochastic average gradient method (SAG) to minimize composition of
finite-sum functions. C-SAG, like SAG, estimates the gradient by incorporating
memory of previous gradient information. We present theoretical analyses of
C-SAG showing that C-SAG, like SAG and C-SVRG, achieves a linear convergence
rate when the objective function is strongly convex; however, C-SAG achieves
lower oracle query complexity per iteration than C-SVRG. Finally, we present
results of experiments showing that C-SAG converges substantially faster than
full gradient (FG), as well as C-SVRG.
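Below is a minimal numpy sketch of the idea the abstract describes: SAG-style
gradient memory applied to a finite-sum composition
F(x) = (1/n) sum_i f_i((1/m) sum_j g_j(x)). The toy problem (linear inner maps,
quadratic outer losses), the step size, and the bookkeeping details are
illustrative assumptions, not the paper's exact C-SAG algorithm.

```python
# Illustrative SAG-style memory for a finite-sum composition (assumed toy problem).
import numpy as np

rng = np.random.default_rng(0)
d, p, n, m = 5, 3, 20, 20
A = rng.normal(size=(m, p, d))            # inner maps: g_j(x) = A_j x + b_j
b = rng.normal(size=(m, p))
c = rng.normal(size=(n, p))               # outer losses: f_i(y) = 0.5 * ||y - c_i||^2

x = np.zeros(d)
G = np.array([A[j] @ x + b[j] for j in range(m)])   # memory of inner evaluations g_j
D = np.zeros((n, d))                                # memory of composite gradients
lr = 0.05

for t in range(2000):
    i, j = rng.integers(n), rng.integers(m)
    G[j] = A[j] @ x + b[j]                # refresh one stored inner evaluation
    y_bar = G.mean(axis=0)                # estimate of (1/m) sum_j g_j(x)
    D[i] = A[j].T @ (y_bar - c[i])        # refresh one stored composite gradient,
                                          # chained through the sampled Jacobian A_j
    x -= lr * D.mean(axis=0)              # SAG-style step: average of stored gradients
```

The two memory tables are the compositional analogue of SAG's single gradient
table: one caches inner-function evaluations, the other caches chained component
gradients, and each iteration refreshes only one entry of each.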
Nonparametric Compositional Stochastic Optimization for Risk-Sensitive Kernel Learning
In this work, we address optimization problems where the objective function
is a nonlinear function of an expected value, i.e., compositional stochastic
strongly convex programs. We consider the case where the decision variable is
not vector-valued but instead belongs to a reproducing kernel Hilbert space
(RKHS), motivated by risk-aware formulations of supervised learning and Markov
Decision Processes defined over continuous spaces.
We develop the first memory-efficient stochastic algorithm for this setting,
which we call Compositional Online Learning with Kernels (COLK). COLK, at its
core a two-time-scale stochastic approximation method, addresses the fact that
(i) compositions of expected value problems cannot be addressed by classical
stochastic gradient methods due to the presence of the inner expectation; and
(ii) the RKHS-induced parameterization has complexity proportional to the
iteration index, which is mitigated through greedily constructed subspace
projections. We establish almost sure convergence of COLK with attenuating
step-sizes, and linear convergence in mean to a neighborhood with constant
step-sizes, as well as the fact that its complexity is at worst finite. The
experiments with robust formulations of supervised learning demonstrate that
COLK reliably converges, attains consistent performance across training runs,
and thus overcomes overfitting.
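As a rough illustration of the two-time-scale mechanism described above, the
sketch below fits an RKHS regressor under a mean-variance (risk-aware) objective
E[l] + lam * Var[l]: a scalar auxiliary variable tracks the inner expectation
E[l] on the fast scale, functional quasi-gradient steps update the kernel
expansion on the slow scale, and a crude weight-threshold prune stands in for
the paper's greedily constructed subspace projections. Kernel, step sizes, and
data are assumptions for illustration.

```python
# Two-time-scale sketch for a risk-aware RKHS regression objective (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
gamma, lam = 10.0, 0.1                 # kernel bandwidth and variance (risk) weight
eta_f, eta_u, prune_tol = 0.5, 0.3, 1e-3

centers = np.empty(0)                  # kernel dictionary (grows, then is pruned)
weights = np.empty(0)
u = 0.0                                # fast time scale: tracks the inner mean E[l]

def f(x):                              # current RKHS iterate evaluated at scalar x
    if centers.size == 0:
        return 0.0
    return float(weights @ np.exp(-gamma * (centers - x) ** 2))

for t in range(500):
    x = rng.uniform(-1.0, 1.0)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal()
    err = f(x) - y
    loss = 0.5 * err ** 2
    u = (1 - eta_u) * u + eta_u * loss           # track E[l] on the fast time scale
    # stochastic quasi-gradient of E[l] + lam*Var[l] w.r.t. f is approximately
    # (1 + 2*lam*(loss - u)) * err * k(x, .), so add one weighted kernel center:
    centers = np.append(centers, x)
    weights = np.append(weights, -eta_f * (1 + 2 * lam * (loss - u)) * err)
    keep = np.abs(weights) > prune_tol           # crude stand-in for the greedy
    centers, weights = centers[keep], weights[keep]   # subspace projection step
```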
Routing Networks and the Challenges of Modular and Compositional Computation
Compositionality is a key strategy for addressing combinatorial complexity
and the curse of dimensionality. Recent work has shown that compositional
solutions can be learned and offer substantial gains across a variety of
domains, including multi-task learning, language modeling, visual question
answering, machine comprehension, and others. However, such models present
unique challenges during training when both the module parameters and their
composition must be learned jointly. In this paper, we identify several of
these issues and analyze their underlying causes. Our discussion focuses on
routing networks, a general approach to this problem, and examines empirically
the interplay of these challenges and a variety of design decisions. In
particular, we consider the effect of how the algorithm decides on module
composition, how the algorithm updates the modules, and whether the algorithm
uses regularization.
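For readers unfamiliar with the setting, the toy layer below illustrates what a
routing network jointly learns: a router that scores candidate modules for each
input, and the modules themselves. The hard argmax selection, numpy parameters,
and the training notes in the closing comment are illustrative assumptions, not
the specific routing-network algorithms examined in the paper.

```python
# A toy routing layer: a learned router scores each module and the winner processes x.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_modules = 8, 4, 3
router_W = rng.normal(scale=0.1, size=(n_modules, d_in))          # router parameters
modules_W = rng.normal(scale=0.1, size=(n_modules, d_out, d_in))  # module parameters

def route(x):
    scores = router_W @ x            # one routing score per module
    k = int(np.argmax(scores))       # hard composition decision for this input
    return modules_W[k] @ x, k, scores

x = rng.normal(size=d_in)
y, chosen, scores = route(x)
# Training must learn modules_W[chosen] (e.g. by backpropagating the task loss
# through the chosen module) and router_W (e.g. by REINFORCE-style feedback,
# since the argmax is not differentiable); this is the kind of joint
# module/composition learning whose pitfalls the paper analyzes.
```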
Policy Evaluation in Continuous MDPs with Efficient Kernelized Gradient Temporal Difference
We consider policy evaluation in infinite-horizon discounted Markov decision
problems (MDPs) with infinite spaces. We reformulate this task as a compositional
stochastic program with a function-valued decision variable that belongs to a
reproducing kernel Hilbert space (RKHS). We approach this problem via a new
functional generalization of stochastic quasi-gradient methods operating in
tandem with stochastic sparse subspace projections. The result is an extension
of gradient temporal difference learning that yields nonlinearly parameterized
value function estimates of the solution to the Bellman evaluation equation.
Our main contribution is a memory-efficient non-parametric stochastic method
guaranteed to converge exactly to the Bellman fixed point with probability 1
under attenuating step-sizes. Further, with constant step-sizes, we obtain
convergence in mean to a neighborhood and show that the value function
estimates have finite complexity. In the Mountain Car domain, we observe faster
convergence to lower Bellman error solutions than existing approaches, with a
fraction of the required memory.
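The sketch below illustrates, under toy assumptions (a random-walk chain, an RBF
kernel, no dictionary pruning), how a compositional GTD-style update can be run
with kernel expansions: the Bellman error objective
J(V) = E_s[(E[r + gamma*V(s') - V(s) | s])^2] contains an inner conditional
expectation, so an auxiliary kernel function z tracks it on a fast time scale
while V takes quasi-gradient steps on the slow one. It is a caricature of the
approach, not the paper's memory-efficient algorithm.

```python
# Kernelized GTD-style two-time-scale sketch on a toy chain (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
gamma, eta_v, eta_z = 0.9, 0.05, 0.2

def rbf(a, b):
    return np.exp(-5.0 * (a - b) ** 2)

class KernelFunc:
    """Function represented as a kernel expansion sum_i w_i * k(c_i, .)."""
    def __init__(self):
        self.c, self.w = [], []
    def __call__(self, s):
        return sum(w * rbf(c, s) for c, w in zip(self.c, self.w))
    def add(self, center, weight):
        self.c.append(center)
        self.w.append(weight)

V, z = KernelFunc(), KernelFunc()
s = 0.0
for t in range(300):
    s_next = float(np.clip(s + rng.normal(scale=0.3), -1.0, 1.0))  # toy transition
    r = -abs(s_next)                                               # toy reward
    delta = r + gamma * V(s_next) - V(s)        # temporal difference
    z_s = z(s)
    z.add(s, eta_z * (delta - z_s))             # fast scale: z(s) -> E[delta | s]
    V.add(s, eta_v * z_s)                       # slow scale: quasi-gradient step on V,
    V.add(s_next, -eta_v * gamma * z_s)         # chained through the gradient of delta
    s = s_next
```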
Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation
The goal of this work is to investigate what singing voice separation
approaches based on neural networks learn from the data. We examine the mapping
functions of neural networks based on the denoising autoencoder (DAE) model
that are conditioned on the mixture magnitude spectra. To approximate the
mapping functions, we propose an algorithm inspired by knowledge
distillation, denoted the neural couplings algorithm (NCA). The NCA yields a
matrix that expresses the mapping of the mixture to the target source magnitude
information. Using the NCA, we examine the mapping functions of three
fundamental DAE-based models in music source separation: one with a
single-layer encoder and decoder, one with a multi-layer encoder and a
single-layer decoder, and one using skip-filtering connections (SF) with
single-layer encoding and decoding. We first train these models with realistic
data to estimate the
singing voice magnitude spectra from the corresponding mixture. We then use the
optimized models and test spectral data as input to the NCA. Our experimental
findings show that approaches based on the DAE model learn scalar filtering
operators, exhibiting a predominant diagonal structure in their corresponding
mapping functions, limiting the exploitation of inter-frequency structure of
music data. In contrast, skip-filtering connections are shown to assist the DAE
model in learning filtering operators that exploit richer inter-frequency
structures.
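The neural couplings algorithm itself is specific to the paper, but a simple
related diagnostic conveys what a "predominantly diagonal" mapping means:
compute the input-output Jacobian of a DAE at a test spectrum and measure how
much of its energy lies on the diagonal. Everything in this sketch, including
the randomly initialized single-layer stand-in model, is an assumption for
illustration and is not the NCA.

```python
# Probe a DAE-style separator for diagonal (scalar-filter) structure via its Jacobian.
import numpy as np

rng = np.random.default_rng(3)
n_freq = 64
W_enc = rng.normal(scale=0.1, size=(n_freq, n_freq))   # stand-in "trained" weights
W_dec = rng.normal(scale=0.1, size=(n_freq, n_freq))

def dae(x):                       # single-layer encoder/decoder with ReLU code
    h = np.maximum(W_enc @ x, 0.0)
    return np.maximum(W_dec @ h, 0.0)

x = np.abs(rng.normal(size=n_freq))        # a test mixture magnitude spectrum
eps = 1e-4
J = np.stack([(dae(x + eps * e) - dae(x)) / eps          # finite-difference Jacobian
              for e in np.eye(n_freq)], axis=1)
diag_mass = np.abs(np.diag(J)).sum() / np.abs(J).sum()   # share of energy on diagonal
print(f"fraction of mapping energy on the diagonal: {diag_mass:.2f}")
```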
Efficient Lifelong Learning with A-GEM
In lifelong learning, the learner is presented with a sequence of tasks,
incrementally building a data-driven prior which may be leveraged to speed up
learning of a new task. In this work, we investigate the efficiency of current
lifelong learning approaches in terms of sample complexity and computational
and memory cost. Towards this end, we first introduce a new and more realistic
evaluation protocol, whereby learners observe each example only once and
hyper-parameter selection is done on a small and disjoint set of tasks, which
is not used for the actual learning experience and evaluation. Second, we
introduce a new metric measuring how quickly a learner acquires a new skill.
Third, we propose an improved version of GEM (Lopez-Paz & Ranzato, 2017),
dubbed Averaged GEM (A-GEM), which enjoys the same or even better performance
as GEM, while being almost as computationally and memory efficient as EWC
(Kirkpatrick et al., 2016) and other regularization-based methods. Finally, we
show that all algorithms including A-GEM can learn even more quickly if they
are provided with task descriptors specifying the classification tasks under
consideration. Our experiments on several standard lifelong learning benchmarks
demonstrate that A-GEM has the best trade-off between accuracy and efficiency.
Comment: Published as a conference paper at ICLR 201
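The core A-GEM step is simple enough to state directly: compute the gradient on
the current batch, compute a reference gradient on a batch drawn from the
episodic memory of past tasks, and, if the two conflict, project the former so
it no longer increases the average past-task loss. The surrounding model, loss,
and memory management are omitted below and assumed to exist.

```python
# The A-GEM gradient projection; model/optimizer scaffolding is assumed, not shown.
import numpy as np

def a_gem_step(g, g_ref):
    """g: flattened gradient on the current batch; g_ref: gradient on a batch
    sampled from the episodic memory of previous tasks (both numpy arrays)."""
    dot = g @ g_ref
    if dot < 0:                              # update would increase past-task loss
        g = g - (dot / (g_ref @ g_ref)) * g_ref   # project onto the feasible half-space
    return g

# usage sketch: params -= lr * a_gem_step(grad_current_batch, grad_memory_batch)
```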
Stochastic Compositional Gradient Descent: Algorithms for Minimizing Compositions of Expected-Value Functions
Classical stochastic gradient methods are well suited for minimizing
expected-value objective functions. However, they do not apply to the
minimization of a nonlinear function involving expected values or a composition
of two expected-value functions, i.e., problems of the form
$\min_x \mathbb{E}_v[f_v(\mathbb{E}_w[g_w(x)])]$. In order to solve this
stochastic composition problem, we propose a class of stochastic compositional
gradient descent (SCGD) algorithms that can be viewed as stochastic versions of
the quasi-gradient method. The SCGD algorithms update the solution based on
noisy sample gradients of $f_v$ and $g_w$, using an auxiliary variable to track
the unknown quantity $\mathbb{E}_w[g_w(x)]$. We prove that the SCGD algorithms
converge almost surely
to an optimal solution for convex optimization problems, as long as such a
solution exists. The convergence involves the interplay of two iterations with
different time scales. For nonsmooth convex problems, the SCGD algorithms
achieve a convergence rate of $O(k^{-1/4})$ in the general case and
$O(k^{-2/3})$ in the strongly convex case, after taking $k$ samples. For smooth
convex problems, the SCGD algorithms can be accelerated to converge at a rate
of $O(k^{-2/7})$ in the general case and $O(k^{-4/5})$ in the strongly convex
case. For nonconvex problems, we
prove that any limit point generated by SCGD is a stationary point, for which
we also provide the convergence rate analysis. Indeed, the stochastic setting
where one wants to optimize compositions of expected-value functions is very
common in practice. The proposed SCGD methods find wide applications in
learning, estimation, dynamic programming, etc.
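A minimal instance of the basic SCGD update described above, on an assumed toy
problem (random linear inner maps g_w and a quadratic outer function f_v): the
auxiliary variable y tracks the inner expectation with the faster step size
beta_k, while x takes quasi-gradient steps chained through a sampled inner
Jacobian with the slower step size alpha_k. Step-size schedules and the toy
problem are illustrative choices.

```python
# Basic two-time-scale SCGD on a toy composition min_x E_v[f_v(E_w[g_w(x)])].
import numpy as np

rng = np.random.default_rng(4)
d, p = 6, 3
x, y = np.zeros(d), np.zeros(p)

def sample_g(x):                        # noisy inner map g_w and its Jacobian
    A = np.eye(p, d) + 0.1 * rng.normal(size=(p, d))
    return A @ x, A

def sample_grad_f(y):                   # noisy gradient of the outer function f_v
    target = np.ones(p) + 0.1 * rng.normal(size=p)
    return y - target                   # f_v(y) = 0.5 * ||y - target||^2

for k in range(1, 5001):
    alpha, beta = 0.5 / k ** 0.75, 1.0 / k ** 0.5      # slow and fast step sizes
    g_val, J = sample_g(x)
    y = (1 - beta) * y + beta * g_val                  # track E_w[g_w(x)]
    x = x - alpha * J.T @ sample_grad_f(y)             # chained quasi-gradient step
```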
Memorize or generalize? Searching for a compositional RNN in a haystack
Neural networks are very powerful learning systems, but they do not readily
generalize from one task to the other. This is partly due to the fact that they
do not learn in a compositional way, that is, by discovering skills that are
shared by different tasks, and recombining them to solve new problems. In this
paper, we explore the compositional generalization capabilities of recurrent
neural networks (RNNs). We first propose the lookup table composition domain as
a simple setup to test compositional behaviour and show that it is
theoretically possible for a standard RNN to learn to behave compositionally in
this domain when trained with standard gradient descent and provided with
additional supervision. We then remove this additional supervision and perform
a search over a large number of model initializations to investigate the
proportion of RNNs that can still converge to a compositional solution. We
discover that a small but non-negligible proportion of RNNs do reach partial
compositional solutions even without special architectural constraints. This
suggests that a combination of gradient descent and evolutionary strategies
directly favouring the minority models that developed more compositional
approaches might suffice to lead standard RNNs towards compositional solutions.
Comment: AEGAP Workshop (ICML 2018)
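To make the lookup-table composition domain concrete, the snippet below
generates illustrative data: each atomic task is a random bijection over 3-bit
strings and a composed task applies two tables in sequence. The exact task
encoding, string length, and supervision signals used in the paper may differ;
this only sketches the flavour of the domain.

```python
# Generate toy lookup-table composition examples (illustrative assumptions only).
import itertools
import numpy as np

rng = np.random.default_rng(5)
strings = [''.join(bits) for bits in itertools.product('01', repeat=3)]

def random_table():
    perm = rng.permutation(len(strings))            # a random bijection over strings
    return {s: strings[i] for s, i in zip(strings, perm)}

tables = {'t1': random_table(), 't2': random_table()}

def apply_composition(task, s):        # e.g. task = ('t1', 't2') means t2(t1(s))
    for name in task:
        s = tables[name][s]
    return s

print(apply_composition(('t1', 't2'), '101'))
```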
Bandit Structured Prediction for Neural Sequence-to-Sequence Learning
Bandit structured prediction describes a stochastic optimization framework
where learning is performed from partial feedback. This feedback is received in
the form of a task loss evaluated on a predicted output structure, without
having access to gold standard structures. We advance this framework by lifting
linear bandit learning to neural sequence-to-sequence learning problems using
attention-based recurrent neural networks. Furthermore, we show how to
incorporate control variates into our learning algorithms for variance
reduction and improved generalization. We present an evaluation on a neural
machine translation task that shows improvements of up to 5.89 BLEU points for
domain adaptation from simulated bandit feedback.
Comment: ACL 201
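Stripped of the sequence-to-sequence model, the learning signal the abstract
describes looks like the toy below: sample an output, observe only its task
loss (bandit feedback), and update with a score-function estimator whose
variance is reduced by a running-average baseline acting as a control variate.
The categorical "model" and loss are placeholders for the attention-based NMT
system and task loss used in the paper.

```python
# Bandit feedback with a running-average control variate on a toy categorical policy.
import numpy as np

rng = np.random.default_rng(6)
n_outputs, lr = 5, 0.1
logits = np.zeros(n_outputs)           # toy "model" parameters
baseline = 0.0                         # control variate: running mean of the loss

def task_loss(y):                      # bandit feedback for one sampled output
    return 0.0 if y == 3 else 1.0      # output 3 plays the role of the good structure

for t in range(1, 501):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    y = int(rng.choice(n_outputs, p=probs))
    loss = task_loss(y)
    grad_logp = -probs                            # d log p(y) / d logits for softmax:
    grad_logp[y] += 1.0                           # one_hot(y) - probs
    logits -= lr * (loss - baseline) * grad_logp  # descent on the expected task loss
    baseline += (loss - baseline) / t             # update the running-average baseline
```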
Improved Deep Spectral Convolution Network For Hyperspectral Unmixing With Multinomial Mixture Kernel and Endmember Uncertainty
In this study, we propose a novel framework for hyperspectral unmixing by
using an improved deep spectral convolution network (DSCN++) combined with
endmember uncertainty. DSCN++ is used to compute high-level representations
which are further modeled with a Multinomial Mixture Model to estimate abundance
maps. In the reconstruction step, a new trainable uncertainty term based on a
nonlinear neural network model is introduced to provide robustness to endmember
uncertainty. For the optimization of the coefficients of the multinomial model
and the uncertainty term, a Wasserstein Generative Adversarial Network (WGAN) is
exploited to improve stability and to capture uncertainty. Experiments are
performed on both real and synthetic datasets. The results validate that the
proposed method obtains state-of-the-art hyperspectral unmixing performance,
particularly on the real datasets, compared to the baseline techniques.
Comment: Submitted to Journa
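As background for the terms above (the proposed DSCN++/WGAN pipeline itself is
not reproduced here), hyperspectral unmixing builds on the linear mixing model:
each pixel spectrum is approximated by a nonnegative, sum-to-one combination of
endmember spectra, and the combination weights form the abundance map. The
sketch below estimates abundances for one synthetic pixel with a crude
projected-gradient step; all quantities are made up for illustration.

```python
# Linear mixing model background: estimate simplex-constrained abundances per pixel.
import numpy as np

rng = np.random.default_rng(7)
n_bands, n_endmembers = 50, 3
E = np.abs(rng.normal(size=(n_bands, n_endmembers)))    # endmember spectra (columns)
a_true = np.array([0.6, 0.3, 0.1])                      # true abundances (on simplex)
pixel = E @ a_true + 0.01 * rng.normal(size=n_bands)    # observed pixel spectrum

a = np.full(n_endmembers, 1.0 / n_endmembers)
for _ in range(500):
    a -= 0.001 * E.T @ (E @ a - pixel)                  # least-squares gradient step
    a = np.clip(a, 0.0, None)
    a /= a.sum()                                        # crude renormalisation back
                                                        # onto the probability simplex
print(np.round(a, 2))
```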