
    On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

    The prevailing thinking is that orthogonal weights are crucial to enforcing dynamical isometry and speeding up training. The increase in learning speed that results from orthogonal initialization in linear networks has been well established. However, while the same is believed to hold for nonlinear networks whenever the dynamical isometry condition is satisfied, the training dynamics behind this contention have not been thoroughly explored. In this work, we study the dynamics of ultra-wide networks across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs) with orthogonal initialization, via the neural tangent kernel (NTK). Through a series of propositions and lemmas, we prove that two NTKs, one corresponding to Gaussian weights and one to orthogonal weights, are equal when the network width is infinite. Further, during training, the NTK of an orthogonally-initialized infinite-width network should theoretically remain constant. This suggests that orthogonal initialization cannot speed up training in the NTK (lazy training) regime, contrary to the prevailing view. To explore the circumstances under which orthogonality can accelerate training, we conduct a thorough empirical investigation outside the NTK regime. We find that when the hyper-parameters are set so that the nonlinear activations operate in their linear regime, orthogonal initialization can improve the learning speed with a large learning rate or large depth.
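
    As a rough illustration of the infinite-width claim above, the following sketch (not the authors' code; the two-layer tanh architecture, the scaling of the orthogonal weights, and the helper name `empirical_ntk` are choices made here) compares the empirical NTK of a finite network under Gaussian versus orthogonal first-layer initialization at increasing widths. The relative gap between the two kernels should shrink as the width grows, consistent with the equality of the two limiting NTKs.

```python
# Minimal sketch (not the authors' code): empirical NTK of a two-layer tanh
# network f(x) = w2 . tanh(W1 x) / sqrt(n), comparing Gaussian vs. orthogonal
# initialization of W1 as the width n grows.
import numpy as np

def empirical_ntk(X, W1, w2, n):
    """Theta(x, x') = <grad_theta f(x), grad_theta f(x')> summed over all parameters."""
    act = np.tanh(X @ W1.T)                     # (N, n) hidden activations
    g2 = act / np.sqrt(n)                       # gradients w.r.t. w2
    g1 = ((1.0 - act**2) * w2) / np.sqrt(n)     # per-unit factor of gradients w.r.t. W1
    # <g1_i(x) x, g1_i(x') x'> summed over units i  =  (g1 g1^T) * (X X^T)
    return g2 @ g2.T + (g1 @ g1.T) * (X @ X.T)

rng = np.random.default_rng(0)
d, N = 16, 8
X = rng.normal(size=(N, d)) / np.sqrt(d)        # a handful of fixed inputs

for n in (64, 256, 1024, 4096):
    w2 = rng.normal(size=n)
    W_gauss = rng.normal(size=(n, d))                   # i.i.d. N(0, 1) entries
    Q, _ = np.linalg.qr(rng.normal(size=(n, d)))        # n x d matrix with orthonormal columns
    W_orth = Q * np.sqrt(n)                             # rescaled so ||W_orth x||^2 matches the Gaussian case in expectation
    K_g = empirical_ntk(X, W_gauss, w2, n)
    K_o = empirical_ntk(X, W_orth, w2, n)
    gap = np.linalg.norm(K_g - K_o) / np.linalg.norm(K_g)
    print(f"width {n:5d}: relative NTK gap {gap:.3f}")
```

    At small widths the two kernels differ noticeably; the gap is expected to decay roughly like one over the square root of the width, the usual finite-width fluctuation scale of the empirical NTK.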

    Understanding deep learning through ultra-wide neural networks

    University of Technology Sydney, Faculty of Engineering and Information Technology. Deep learning has been responsible for a step-change in performance across machine learning, setting new benchmarks in a large number of applications. There is an urgent need for deep learning theory, driven by the demand to understand the principles underlying deep learning. One promising theoretical tool is the infinitely wide neural network. This thesis focuses on the expressive power and optimization properties of deep neural networks by investigating ultra-wide networks, and makes four main contributions. First, we use mean-field theory to study the expressivity of deep dropout networks. The traditional mean-field analysis adopts the gradient independence assumption, namely that the weights used during the feed-forward pass are drawn independently from those used in backpropagation. By breaking this independence assumption, we perform theoretical computations on linear dropout networks and a series of experiments on dropout networks. Furthermore, we investigate the maximum trainable length of deep dropout networks through a series of experiments and provide an empirical formula for the trainable length that is more precise than that of the original work. Second, we study the dynamics of fully-connected, wide, and nonlinear networks with orthogonal initialization via the neural tangent kernel (NTK). We prove that two NTKs, one corresponding to Gaussian weights and one to orthogonal weights, are equal when the network width is infinite, which suggests that orthogonal initialization cannot speed up training in the NTK regime. A thorough empirical investigation further shows that orthogonal initialization increases learning speed in scenarios with a large learning rate or large depth. The third contribution characterizes the implicit bias of deep linear networks for binary classification with the logistic loss and a large learning rate. We claim that, depending on the separation conditions of the data, the loss finds a flatter minimum when the learning rate is large. We rigorously prove this claim under the assumption of degenerate data by overcoming the difficulty of the non-constant Hessian of the logistic loss, and further characterize the behavior of the loss and Hessian for non-separable data. Finally, we study the trainability of deep Graph Convolutional Networks (GCNs) through the Gaussian Process Kernel (GPK) and Graph Neural Tangent Kernel (GNTK) of an infinitely wide GCN, corresponding to analyses of expressivity and trainability, respectively. We formulate the asymptotic behavior of the GNTK at large depth, which reveals that the trainability of wide and deep GCNs drops at an exponential rate.
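
    The mean-field estimate of a maximum trainable depth mentioned above can be made concrete with a small numerical experiment. The sketch below is an illustration only, not the thesis code: it iterates the standard variance and correlation maps of an infinitely wide tanh network, omits the dropout correction to the variance map, and uses arbitrary hyper-parameters. It counts how many layers it takes for the correlation between two inputs to saturate at its fixed point, the depth scale that the signal-propagation literature ties to the maximum trainable depth.

```python
# Hedged illustration (not the thesis code): mean-field variance and
# correlation maps of an infinitely wide tanh network, used to estimate the
# depth over which input correlations survive.
import numpy as np

# Probabilists' Gauss-Hermite nodes/weights: integrates f against exp(-z^2/2).
Z, W = np.polynomial.hermite_e.hermegauss(101)
W = W / np.sqrt(2.0 * np.pi)                 # normalize to the N(0, 1) density

def variance_map(q, sw2, sb2):
    """q_{l+1} = sw2 * E[tanh(sqrt(q) z)^2] + sb2."""
    return sw2 * np.sum(W * np.tanh(np.sqrt(q) * Z) ** 2) + sb2

def correlation_map(c, q, sw2, sb2):
    """Correlation after one layer, for two inputs sitting at the variance fixed point q."""
    u1 = np.sqrt(q) * Z[:, None]
    u2 = np.sqrt(q) * (c * Z[:, None] + np.sqrt(1.0 - c**2) * Z[None, :])
    e = np.sum(W[:, None] * W[None, :] * np.tanh(u1) * np.tanh(u2))
    return (sw2 * e + sb2) / q

sw2, sb2 = 1.5, 0.05                         # illustrative hyper-parameters
q = 1.0
for _ in range(200):                         # converge to the variance fixed point q*
    q = variance_map(q, sw2, sb2)

c = 0.6
for depth in range(1, 500):
    c_next = min(correlation_map(c, q, sw2, sb2), 1.0)   # guard against tiny quadrature overshoot
    if abs(c_next - c) < 1e-6:               # correlations have effectively saturated
        break
    c = c_next
print(f"q* = {q:.3f}, correlation fixed point ~ {c:.3f}, reached after ~{depth} layers")
```

    The printed layer count is the kind of quantity the thesis refines empirically for dropout networks; with dropout, the variance map acquires an extra keep-probability factor, which shortens this depth scale.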

    On Exact Computation with an Infinitely Wide Neural Net

    How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its width (namely, the number of channels in convolutional layers and the number of nodes in fully-connected internal layers) is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK), which captures the behavior of fully-connected deep nets in the infinite-width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for the performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization, etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained, sufficiently wide net is indeed equivalent to the kernel regression predictor using the NTK. Comment: In NeurIPS 2019. Code available: https://github.com/ruosongwang/cnt
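
    For the fully-connected case, the infinite-width NTK has a simple closed-form recursion for ReLU, and kernel regression with it is a few lines of linear algebra. The sketch below is a hedged NumPy illustration of that recursion on a toy task, not the authors' CNTK GPU implementation; the depth, the synthetic data, and the small ridge term added for numerical stability are choices made here.

```python
# Hedged sketch (not the authors' CNTK code): closed-form NTK recursion for an
# infinitely wide fully-connected ReLU network, then NTK kernel regression.
import numpy as np

def relu_ntk(X1, X2, depth):
    """Theta^(depth)(x, x') for a fully-connected ReLU net in NTK parameterization (c_sigma = 2)."""
    S = X1 @ X2.T                               # Sigma^(0)(x, x')
    d1 = np.sum(X1 * X1, axis=1)                # Sigma(x, x); preserved layer to layer when c_sigma = 2
    d2 = np.sum(X2 * X2, axis=1)
    norm = np.sqrt(np.outer(d1, d2))
    theta = S.copy()                            # Theta^(0)
    for _ in range(depth):
        rho = np.clip(S / norm, -1.0, 1.0)
        angle = np.arccos(rho)
        S_dot = (np.pi - angle) / np.pi         # c_sigma * E[relu'(u) relu'(v)]
        S = (norm / np.pi) * (np.sqrt(1.0 - rho**2) + (np.pi - angle) * rho)
        theta = theta * S_dot + S               # Theta^(h) = Theta^(h-1) * Sigma_dot^(h) + Sigma^(h)
    return theta

rng = np.random.default_rng(0)
d, n_train, n_test = 10, 40, 20
w_star = rng.normal(size=d)                     # toy ground-truth direction
def make(n):
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X, np.sign(X @ w_star)
X_tr, y_tr = make(n_train)
X_te, y_te = make(n_test)

K_tt = relu_ntk(X_tr, X_tr, depth=3)
K_st = relu_ntk(X_te, X_tr, depth=3)
pred = K_st @ np.linalg.solve(K_tt + 1e-6 * np.eye(n_train), y_tr)   # small ridge for stability
print("toy test accuracy:", np.mean(np.sign(pred) == y_te))
```

    Roughly speaking, the convolutional extension in the paper replaces this scalar recursion with one over covariances between pixel locations, which is the computation the GPU implementation accelerates.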

    On the Inductive Bias of Neural Tangent Kernels

    State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel. We study the inductive bias of learning in such a regime by analyzing this kernel and the corresponding function space (RKHS). In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compare to other known kernels for similar architectures. Comment: NeurIPS 2019.
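
    The notion of "functions with finite norm" can be made concrete: for the minimum-norm kernel interpolant f = sum_i alpha_i k(., x_i) with alpha = K^{-1} y, the RKHS norm satisfies ||f||_H^2 = y^T K^{-1} y. The sketch below is an illustration under assumptions made here (it uses the closed-form NTK of a two-layer ReLU network rather than the convolutional kernels analyzed in the paper, and the data are synthetic): it compares the norm required to interpolate a smooth linear target with the norm required for random labels, and the latter is typically far larger, one way the kernel's smoothness bias shows up.

```python
# Hedged illustration: RKHS norm of the minimum-norm interpolant under the
# closed-form NTK of a two-layer ReLU network (both layers trained).
import numpy as np

def two_layer_relu_ntk(X1, X2):
    """k(x, x') = arc-cosine-1 term + (x . x') * arc-cosine-0 term."""
    n1 = np.linalg.norm(X1, axis=1); n2 = np.linalg.norm(X2, axis=1)
    cos = np.clip((X1 @ X2.T) / np.outer(n1, n2), -1.0, 1.0)
    theta = np.arccos(cos)
    acos0 = (np.pi - theta) / (2.0 * np.pi)                          # E[relu'(u) relu'(v)]
    acos1 = np.outer(n1, n2) * (np.sin(theta) + (np.pi - theta) * cos) / (2.0 * np.pi)
    return acos1 + (X1 @ X2.T) * acos0

def rkhs_norm_sq(K, y, ridge=1e-8):
    """||f||_H^2 = y^T K^{-1} y for the minimum-norm interpolant of (X, y)."""
    return float(y @ np.linalg.solve(K + ridge * np.eye(len(y)), y))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # put inputs on the sphere
K = two_layer_relu_ntk(X, X)

y_smooth = X @ rng.normal(size=5)                   # a smooth (linear) target
y_rough = rng.choice([-1.0, 1.0], size=60)          # random labels
print("RKHS norm^2, smooth target:", rkhs_norm_sq(K, y_smooth))
print("RKHS norm^2, random labels:", rkhs_norm_sq(K, y_rough))
```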