10 research outputs found

    DeepOBS: A Deep Learning Optimizer Benchmark Suite

    Because the choice and tuning of the optimizer affect the speed, and ultimately the performance, of deep learning, there is significant past and recent research in this area. Yet, perhaps surprisingly, there is no generally agreed-upon protocol for the quantitative and reproducible evaluation of optimization strategies for deep learning. We suggest routines and benchmarks for stochastic optimization, with a special focus on the unique aspects of deep learning, such as stochasticity, tunability and generalization. As the primary contribution, we present DeepOBS, a Python package of deep learning optimization benchmarks. The package addresses key challenges in the quantitative assessment of stochastic optimizers and automates most steps of benchmarking. The library includes a wide and extensible set of ready-to-use, realistic optimization problems, such as training Residual Networks for image classification on ImageNet or character-level language prediction models, as well as popular classics like MNIST and CIFAR-10. The package also provides realistic baseline results for the most popular optimizers on these test problems, ensuring a fair comparison when benchmarking new optimizers without having to run costly experiments. It comes with output back-ends that directly produce LaTeX code for inclusion in academic publications. It supports TensorFlow and is available open source. Comment: Accepted at ICLR 2019. 9 pages, 3 figures, 2 tables.
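
    A minimal sketch of how benchmarking with such a suite might look. The module paths, runner class, and argument formats below are assumptions reconstructed from the package description and memory of its documentation, not verified API; check against the actual DeepOBS docs before use.

        # Sketch: benchmarking an optimizer with the DeepOBS TensorFlow back-end.
        # Module paths, runner name, and argument formats are assumptions;
        # verify against the DeepOBS documentation.
        import tensorflow as tf
        from deepobs import tensorflow as tfobs

        # Optimizer to benchmark and the hyperparameters the runner should expose.
        optimizer_class = tf.train.MomentumOptimizer
        hyperparams = {
            "learning_rate": {"type": float},
            "momentum": {"type": float, "default": 0.9},
        }

        # The runner trains the optimizer on one of the ready-made test problems
        # and writes results that can be compared against the shipped baselines.
        runner = tfobs.runners.StandardRunner(optimizer_class, hyperparams)
        runner.run(testproblem="cifar10_3c3d", batch_size=128, num_epochs=100)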

    Limitations of the Empirical Fisher Approximation for Natural Gradient Descent

    Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information. Several highly visible works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We dispute this argument by showing that the empirical Fisher---unlike the Fisher---does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies of the empirical Fisher can have undesirable effects. Comment: V3: Minor corrections (typographic errors).
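
    For context, the two matrices being contrasted can be written as follows (standard definitions for a conditional model; the notation is chosen here for illustration and is not copied from the paper):

        % Fisher information vs. empirical Fisher for a model p_\theta(y | x)
        % fitted to data (x_n, y_n), n = 1, ..., N:
        F(\theta) = \sum_{n=1}^{N} \mathbb{E}_{y \sim p_\theta(y \mid x_n)}
            \left[ \nabla_\theta \log p_\theta(y \mid x_n) \,
                   \nabla_\theta \log p_\theta(y \mid x_n)^\top \right],
        \qquad
        \widetilde{F}(\theta) = \sum_{n=1}^{N}
            \nabla_\theta \log p_\theta(y_n \mid x_n) \,
            \nabla_\theta \log p_\theta(y_n \mid x_n)^\top .

    The only difference is whether the expectation is taken under the model's own predictive distribution (Fisher) or replaced by the single observed label per data point (empirical Fisher); the abstract's argument is that this seemingly small change is what breaks the connection to second-order information.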

    Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation

    We address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions. Our key insight is that these four fundamental vision problems are coupled through geometric constraints. Consequently, learning to solve them together simplifies the problem because the solutions can reinforce each other. We go beyond previous work by exploiting geometry more explicitly and segmenting the scene into static and moving regions. To that end, we introduce Competitive Collaboration, a framework that facilitates the coordinated training of multiple specialized neural networks to solve complex problems. Competitive Collaboration works much like expectation-maximization, but with neural networks that act as both competitors to explain pixels that correspond to static or moving regions, and as collaborators through a moderator that assigns pixels to be either static or independently moving. Our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects. Our model is trained without any supervision and achieves state-of-the-art performance among joint unsupervised methods on all sub-problems. Comment: CVPR 2019.
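
    The abstract describes an alternating, expectation-maximization-like scheme in which competitors explain the data and a moderator assigns responsibilities. The toy NumPy example below illustrates only that alternating assign/refit structure on 1-D data; the two scalar "competitors" and the soft assignment are placeholders and have nothing to do with the paper's actual vision networks or losses.

        # Toy illustration of an alternating competition/collaboration scheme.
        # Two "competitors" (scalar means) explain 1-D observations; a soft
        # assignment plays the role of the moderator, alternating between
        # reassigning points and refitting the competitors (EM-like structure).
        import numpy as np

        rng = np.random.default_rng(0)
        data = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(3.0, 0.5, 100)])

        mu_static, mu_moving = -1.0, 1.0      # the two competing "explanations"
        for _ in range(20):
            # Collaboration step: assign each point to whichever competitor
            # currently explains it better (soft assignment via squared error).
            err_static = (data - mu_static) ** 2
            err_moving = (data - mu_moving) ** 2
            resp = np.exp(-err_static) / (np.exp(-err_static) + np.exp(-err_moving))

            # Competition step: each competitor refits itself on the points it "won".
            mu_static = np.sum(resp * data) / np.sum(resp)
            mu_moving = np.sum((1 - resp) * data) / np.sum(1 - resp)

        print(mu_static, mu_moving)   # approaches the two cluster means (-2 and 3)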

    Continual Learning with Low Rank Adaptation

    Recent work using pretrained transformers has shown impressive performance when fine-tuned with data from the downstream problem of interest. However, they struggle to retain that performance when the data characteristics changes. In this paper, we focus on continual learning, where a pre-trained transformer is updated to perform well on new data, while retaining its performance on data it was previously trained on. Earlier works have tackled this primarily through methods inspired from prompt tuning. We question this choice, and investigate the applicability of Low Rank Adaptation (LoRA) to continual learning. On a range of domain-incremental learning benchmarks, our LoRA-based solution, CoLoR, yields state-of-the-art performance, while still being as parameter efficient as the prompt tuning based methods.Comment: Accepted at Workshop on Distribution Shifts (DistShift), NeurIPS 202
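
    For readers unfamiliar with LoRA, the sketch below shows the generic low-rank parameterization it refers to: a frozen weight matrix plus a trainable low-rank residual. This is the standard LoRA formulation, not CoLoR's specific continual-learning setup; the dimensions and scaling convention are illustrative.

        # Minimal sketch of the LoRA parameterization: the effective weight is
        # W + (alpha / r) * B @ A, with W frozen and only A, B trainable.
        import numpy as np

        d_out, d_in, r, alpha = 64, 128, 4, 8
        rng = np.random.default_rng(0)

        W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
        A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, rank-r factor
        B = np.zeros((d_out, r))                    # trainable, zero-initialized so
                                                    # the adapter starts as a no-op

        def lora_forward(x):
            """Linear layer with a LoRA adapter on top of the frozen weight."""
            return W @ x + (alpha / r) * (B @ (A @ x))

        x = rng.normal(size=d_in)
        print(np.allclose(lora_forward(x), W @ x))  # True: B = 0 means no change yet

    Because only A and B (r * (d_in + d_out) parameters) are trained while W stays frozen, the adapter is small, which is what makes this kind of approach parameter efficient.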

    Noise-Aware Stochastic Optimization

    First-order stochastic optimization algorithms like stochastic gradient descent (SGD) are the workhorse of modern machine learning. With their simplicity and low per-iteration cost, they have powered the immense success of deep artificial neural network models. Surprisingly, these stochastic optimization methods are essentially unaware of stochasticity. Neither do they collect information about the stochastic noise associated with their gradient evaluations, nor do they have explicit mechanisms to adjust their behavior accordingly. This thesis presents approaches to make stochastic optimization methods noise-aware using estimates of the (co-)variance of stochastic gradients. First, we show how such variance estimates can be used to automatically adapt the minibatch size for SGD, i.e., the number of data points sampled in each iteration. This can replace the usual decreasing step size schedule required for convergence, which is much more challenging to automate. We highlight that both approaches can be viewed through the same lens of reducing the mean squared error of the gradient estimate. Next, we identify an implicit variance adaptation mechanism in the ubiquitous Adam method. In particular, we show that it can be seen as a version of sign-SGD with a coordinatewise "damping" based on the stochastic gradient's signal-to-noise ratio. We make this variance adaptation mechanism explicit, formalize it, and transfer it from sign-SGD to SGD. Finally, we critically discuss a family of methods that precondition stochastic gradient descent updates with the so-called "empirical Fisher" matrix, which is closely related to the stochastic gradient covariance matrix. This is usually motivated from information-geometric considerations as an approximation to the Fisher information matrix. We caution against this argument and show that the empirical Fisher approximation has fundamental theoretical flaws. We argue that preconditioning with the empirical Fisher is better understood as a form of variance adaptation.
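
    The Adam-as-damped-sign-SGD view can be illustrated with a short coordinatewise identity. If Adam's exponential moving averages m_t and v_t are read as estimates of a coordinate's gradient mean \mu and second moment \mu^2 + \sigma^2, and bias correction and the small epsilon constant are ignored, then (an illustrative simplification, not the thesis's full derivation):

        % Coordinatewise view of the Adam step as damped sign-SGD (illustrative):
        \frac{m_t}{\sqrt{v_t}} \;\approx\; \frac{\mu}{\sqrt{\mu^2 + \sigma^2}}
            \;=\; \operatorname{sign}(\mu) \cdot \frac{1}{\sqrt{1 + \sigma^2 / \mu^2}} .

    The step approaches plain sign-SGD where the relative noise \sigma^2 / \mu^2 is small and is damped toward zero where it is large, which is the implicit variance-adaptation mechanism the abstract refers to.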