6,816 research outputs found

    Compositional Stochastic Average Gradient for Machine Learning and Related Applications

    Many machine learning, statistical inference, and portfolio optimization problems require minimization of a composition of expected value functions (CEVF). Of particular interest are the finite-sum versions of such compositional optimization problems (FS-CEVF). Compositional stochastic variance reduced gradient (C-SVRG) methods, which combine stochastic compositional gradient descent (SCGD) and stochastic variance reduced gradient descent (SVRG), are the state-of-the-art methods for FS-CEVF problems. We introduce compositional stochastic average gradient descent (C-SAG), a novel extension of the stochastic average gradient method (SAG) to minimize compositions of finite-sum functions. C-SAG, like SAG, estimates the gradient by incorporating memory of previous gradient information. We present theoretical analyses of C-SAG which show that C-SAG, like SAG and C-SVRG, achieves a linear convergence rate when the objective function is strongly convex; however, C-SAG achieves lower oracle query complexity per iteration than C-SVRG. Finally, we present results of experiments showing that C-SAG converges substantially faster than full gradient (FG), as well as C-SVRG.

    Nonparametric Compositional Stochastic Optimization for Risk-Sensitive Kernel Learning

    In this work, we address optimization problems where the objective function is a nonlinear function of an expected value, i.e., compositional stochastic strongly convex programs. We consider the case where the decision variable is not vector-valued but instead belongs to a reproducing kernel Hilbert space (RKHS), motivated by risk-aware formulations of supervised learning and Markov Decision Processes defined over continuous spaces. We develop the first memory-efficient stochastic algorithm for this setting, which we call Compositional Online Learning with Kernels (COLK). COLK, at its core a two-time-scale stochastic approximation method, addresses the facts that (i) compositions of expected value problems cannot be addressed by classical stochastic gradient methods due to the presence of the inner expectation; and (ii) the RKHS-induced parameterization has complexity proportional to the iteration index, which is mitigated through greedily constructed subspace projections. We establish almost sure convergence of COLK with attenuating step-sizes, and linear convergence in mean to a neighborhood with constant step-sizes, as well as the fact that its complexity is at worst finite. The experiments with robust formulations of supervised learning demonstrate that COLK reliably converges, attains consistent performance across training runs, and thus overcomes overfitting.

    Routing Networks and the Challenges of Modular and Compositional Computation

    Compositionality is a key strategy for addressing combinatorial complexity and the curse of dimensionality. Recent work has shown that compositional solutions can be learned and offer substantial gains across a variety of domains, including multi-task learning, language modeling, visual question answering, machine comprehension, and others. However, such models present unique challenges during training when both the module parameters and their composition must be learned jointly. In this paper, we identify several of these issues and analyze their underlying causes. Our discussion focuses on routing networks, a general approach to this problem, and examines empirically the interplay of these challenges and a variety of design decisions. In particular, we consider the effect of how the algorithm decides on module composition, how the algorithm updates the modules, and whether the algorithm uses regularization.

    Policy Evaluation in Continuous MDPs with Efficient Kernelized Gradient Temporal Difference

    We consider policy evaluation in infinite-horizon discounted Markov decision problems (MDPs) with infinite spaces. We reformulate this task as a compositional stochastic program with a function-valued decision variable that belongs to a reproducing kernel Hilbert space (RKHS). We approach this problem via a new functional generalization of stochastic quasi-gradient methods operating in tandem with stochastic sparse subspace projections. The result is an extension of gradient temporal difference learning that yields nonlinearly parameterized value function estimates of the solution to the Bellman evaluation equation. Our main contribution is a memory-efficient non-parametric stochastic method guaranteed to converge exactly to the Bellman fixed point with probability 1 with attenuating step-sizes. Further, with constant step-sizes, we obtain mean convergence to a neighborhood and show that the value function estimates have finite complexity. In the Mountain Car domain, we observe faster convergence to lower Bellman error solutions than existing approaches with a fraction of the required memory.

    Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation

    The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation: one with a single-layer encoder and decoder, one with a multi-layer encoder and single-layer decoder, and one using skip-filtering connections (SF) with single-layer encoding and decoding. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominantly diagonal structure in their corresponding mapping functions, which limits the exploitation of the inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.

    Efficient Lifelong Learning with A-GEM

    In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task. In this work, we investigate the efficiency of current lifelong approaches in terms of sample complexity and computational and memory cost. Towards this end, we first introduce a new and more realistic evaluation protocol, whereby learners observe each example only once and hyper-parameter selection is done on a small and disjoint set of tasks, which is not used for the actual learning experience and evaluation. Second, we introduce a new metric measuring how quickly a learner acquires a new skill. Third, we propose an improved version of GEM (Lopez-Paz & Ranzato, 2017), dubbed Averaged GEM (A-GEM), which enjoys the same or even better performance as GEM, while being almost as computationally and memory efficient as EWC (Kirkpatrick et al., 2016) and other regularization-based methods. Finally, we show that all algorithms including A-GEM can learn even more quickly if they are provided with task descriptors specifying the classification tasks under consideration. Our experiments on several standard lifelong learning benchmarks demonstrate that A-GEM has the best trade-off between accuracy and efficiency. Comment: Published as a conference paper at ICLR 201
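The abstract above does not spell out the A-GEM update itself. As a rough illustration only, the sketch below shows the kind of gradient projection usually associated with A-GEM: when the current-task gradient conflicts with a reference gradient averaged over an episodic memory, the conflicting component is removed. The function name `agem_project` and the toy vectors are assumptions made for this example, not taken from the listing.

```python
import numpy as np

def agem_project(grad, grad_ref):
    """Project `grad` so it no longer conflicts with `grad_ref` (hypothetical helper).

    grad:     gradient on the current task's mini-batch.
    grad_ref: gradient averaged over a batch drawn from episodic memory.
    """
    grad = np.asarray(grad, dtype=float)
    grad_ref = np.asarray(grad_ref, dtype=float)
    dot = float(grad @ grad_ref)
    if dot >= 0.0:
        # No conflict with the memory gradient: keep the update unchanged.
        return grad
    # Conflict: subtract the component of grad along grad_ref.
    return grad - (dot / float(grad_ref @ grad_ref)) * grad_ref

# Toy example with a conflicting pair of gradients (illustrative values).
g = np.array([1.0, 0.0])
g_ref = np.array([-1.0, 1.0])
g_proj = agem_project(g, g_ref)
print(g_proj, g_proj @ g_ref)
```

The single inner-product test and one projection per step are what would make such an update cheap compared with solving a constrained problem per past task, which is consistent with the efficiency claim in the abstract.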

    Stochastic Compositional Gradient Descent: Algorithms for Minimizing Compositions of Expected-Value Functions

    Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function involving expected values or a composition of two expected-value functions, i.e., problems of the form $\min_x \mathbf{E}_v [f_v\big(\mathbf{E}_w [g_w(x)]\big)]$. In order to solve this stochastic composition problem, we propose a class of stochastic compositional gradient descent (SCGD) algorithms that can be viewed as stochastic versions of the quasi-gradient method. SCGD algorithms update the solutions based on noisy sample gradients of $f_v, g_w$ and use an auxiliary variable to track the unknown quantity $\mathbf{E}_w[g_w(x)]$. We prove that SCGD converges almost surely to an optimal solution for convex optimization problems, as long as such a solution exists. The convergence involves the interplay of two iterations with different time scales. For nonsmooth convex problems, SCGD achieves a convergence rate of $O(k^{-1/4})$ in the general case and $O(k^{-2/3})$ in the strongly convex case, after taking $k$ samples. For smooth convex problems, SCGD can be accelerated to converge at a rate of $O(k^{-2/7})$ in the general case and $O(k^{-4/5})$ in the strongly convex case. For nonconvex problems, we prove that any limit point generated by SCGD is a stationary point, for which we also provide the convergence rate analysis. Indeed, the stochastic setting where one wants to optimize compositions of expected-value functions is very common in practice. The proposed SCGD methods find wide applications in learning, estimation, dynamic programming, etc.
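The two-time-scale idea in this abstract can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation: an auxiliary variable `y` tracks the inner expectation with a faster step size `beta`, while `x` takes slower quasi-gradient steps with step size `alpha`. The toy objective, the noise level, and the step-size constants and exponents are all assumptions chosen for the example.

```python
import numpy as np

def scgd(sample_g, jac_g, sample_grad_f, x0, n_iters, rng):
    """Basic SCGD-style loop: y tracks E_w[g_w(x)], x uses y in place of the inner mean."""
    x = np.asarray(x0, dtype=float).copy()
    y = sample_g(x, rng)  # initialise the tracker with one noisy sample
    for k in range(1, n_iters + 1):
        alpha = 0.2 * k ** -0.75  # slow time scale (illustrative choice)
        beta = k ** -0.5          # fast time scale (illustrative choice)
        # Fast update: running average of noisy inner-function samples.
        y = (1.0 - beta) * y + beta * sample_g(x, rng)
        # Slow update: chain rule with y standing in for E_w[g_w(x)].
        x = x - alpha * jac_g(x).T @ sample_grad_f(y, rng)
    return x

# Toy problem (assumed for illustration): g_w(x) = A x + w and
# f_v(y) = 0.5 * ||y - b||^2 + v . y, so the objective is 0.5 * ||A x - b||^2.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, -1.0])
NOISE = 0.1

def sample_g(x, rng):
    return A @ x + NOISE * rng.standard_normal(2)

def jac_g(x):
    return A  # Jacobian of the inner map (noise-free in this toy setup)

def sample_grad_f(y, rng):
    return (y - b) + NOISE * rng.standard_normal(2)

rng = np.random.default_rng(0)
x_hat = scgd(sample_g, jac_g, sample_grad_f, x0=np.zeros(2), n_iters=20000, rng=rng)
x_star = np.linalg.solve(A, b)  # exact minimiser of the toy objective
print(x_hat, x_star)
```

Keeping `alpha` much smaller than `beta` is what lets `y` settle near the inner expectation before `x` moves appreciably, which is the interplay of time scales the abstract refers to.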

    Memorize or generalize? Searching for a compositional RNN in a haystack

    Neural networks are very powerful learning systems, but they do not readily generalize from one task to the other. This is partly due to the fact that they do not learn in a compositional way, that is, by discovering skills that are shared by different tasks, and recombining them to solve new problems. In this paper, we explore the compositional generalization capabilities of recurrent neural networks (RNNs). We first propose the lookup table composition domain as a simple setup to test compositional behaviour and show that it is theoretically possible for a standard RNN to learn to behave compositionally in this domain when trained with standard gradient descent and provided with additional supervision. We then remove this additional supervision and perform a search over a large number of model initializations to investigate the proportion of RNNs that can still converge to a compositional solution. We discover that a small but non-negligible proportion of RNNs do reach partial compositional solutions even without special architectural constraints. This suggests that a combination of gradient descent and evolutionary strategies directly favouring the minority models that developed more compositional approaches might suffice to lead standard RNNs towards compositional solutions. Comment: AEGAP Workshop (ICML 2018

    Bandit Structured Prediction for Neural Sequence-to-Sequence Learning

    Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback. This feedback is received in the form of a task loss evaluation of a predicted output structure, without having access to gold standard structures. We advance this framework by lifting linear bandit learning to neural sequence-to-sequence learning problems using attention-based recurrent neural networks. Furthermore, we show how to incorporate control variates into our learning algorithms for variance reduction and improved generalization. We present an evaluation on a neural machine translation task that shows improvements of up to 5.89 BLEU points for domain adaptation from simulated bandit feedback. Comment: ACL 201
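To illustrate why a control variate helps when learning from a single loss evaluation per prediction, the sketch below computes a score-function gradient estimate for a softmax over three candidate outputs, with and without a baseline subtracted from the loss. This is a simplified stand-in rather than the paper's method: the three-output setup, the loss values, and the use of the exact expected loss as the baseline are assumptions made for this example. Means and variances are computed exactly by enumerating the outputs instead of by sampling.

```python
import numpy as np

def score_function_grads(logits, losses, baseline):
    """Return softmax probabilities and, for each output y, the estimate
    (loss[y] - baseline) * grad_logits log p(y)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    score = np.eye(len(p)) - p  # row y holds e_y - p, the gradient of log p(y)
    return p, (losses - baseline)[:, None] * score

def mean_and_total_variance(p, grads):
    """Exact mean and summed per-coordinate variance of the estimator under p."""
    mean = p @ grads
    var = float(p @ ((grads - mean) ** 2).sum(axis=1))
    return mean, var

logits = np.zeros(3)                    # uniform policy over three outputs
losses = np.array([10.0, 11.0, 12.0])   # bandit feedback per output (illustrative)

p, g_plain = score_function_grads(logits, losses, baseline=0.0)
_, g_cv = score_function_grads(logits, losses, baseline=float(p @ losses))

mean_plain, var_plain = mean_and_total_variance(p, g_plain)
mean_cv, var_cv = mean_and_total_variance(p, g_cv)
print(mean_plain, mean_cv)
print(var_plain, var_cv)
```

Because the expected score is zero, subtracting a constant baseline leaves the expected gradient unchanged while shrinking its variance, which is the general mechanism behind the variance reduction mentioned in the abstract.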

    Improved Deep Spectral Convolution Network For Hyperspectral Unmixing With Multinomial Mixture Kernel and Endmember Uncertainty

    In this study, we propose a novel framework for hyperspectral unmixing by using an improved deep spectral convolution network (DSCN++) combined with endmember uncertainty. DSCN++ is used to compute high-level representations which are further modeled with a Multinomial Mixture Model to estimate abundance maps. In the reconstruction step, a new trainable uncertainty term based on a nonlinear neural network model is introduced to provide robustness to endmember uncertainty. For the optimization of the coefficients of the multinomial model and the uncertainty term, a Wasserstein Generative Adversarial Network (WGAN) is exploited to improve stability and to capture uncertainty. Experiments are performed on both real and synthetic datasets. The results validate that the proposed method obtains state-of-the-art hyperspectral unmixing performance, particularly on the real datasets, compared to the baseline techniques. Comment: Submitted to Journa