Convergence Analysis of Two-layer Neural Networks with ReLU Activation
In recent years, stochastic gradient descent (SGD) based techniques have
become the standard tool for training neural networks. However, a formal
theoretical understanding of why SGD can train neural networks in practice is
largely missing.
In this paper, we make progress on understanding this mystery by providing a
convergence analysis for SGD on a rich subset of two-layer feedforward networks
with ReLU activations. This subset is characterized by a special structure
called "identity mapping". We prove that, if input follows from Gaussian
distribution, with standard initialization of the weights, SGD
converges to the global minimum in polynomial number of steps. Unlike normal
vanilla networks, the "identity mapping" makes our network asymmetric and thus
the global minimum is unique. To complement our theory, we are also able to
show experimentally that multi-layer networks with this mapping have better
performance compared with normal vanilla networks.
Our convergence theorem differs from traditional non-convex optimization
techniques. We show that SGD converges to the optimum in two phases: in phase I,
the gradient points in the wrong direction; however, a potential function
gradually decreases. Then, in phase II, SGD enters a one-point convex
region and converges. We also show that the identity mapping is necessary for
convergence, as it moves the initial point to a better place for optimization.
Experiments verify our claims.
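As a rough illustration of the setting described above, the sketch below (plain NumPy, not the authors' code) trains a tiny two-layer ReLU network whose first layer computes ReLU((I + W)x), one way to read the "identity mapping", on Gaussian inputs labelled by a teacher of the same form; the architecture details, loss, and step size are assumptions for illustration only.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def forward(W, x):
        # "identity mapping": the input is added back before the nonlinearity
        d = x.shape[0]
        return relu((np.eye(d) + W) @ x).mean()

    def sgd_step(W, x, y, lr=0.05):
        # one SGD step on the squared loss 0.5 * (forward(W, x) - y)^2
        d = x.shape[0]
        pre = (np.eye(d) + W) @ x
        err = relu(pre).mean() - y
        grad = err * np.outer((pre > 0).astype(float), x) / d
        return W - lr * grad

    rng = np.random.default_rng(0)
    d = 10
    W_star = 0.1 * rng.standard_normal((d, d))   # teacher perturbation
    W = 0.1 * rng.standard_normal((d, d))        # standard small initialization
    for _ in range(5000):
        x = rng.standard_normal(d)               # Gaussian input, as in the abstract
        W = sgd_step(W, x, forward(W_star, x))

    test = [(forward(W, x) - forward(W_star, x)) ** 2
            for x in rng.standard_normal((200, d))]
    print("mean squared test error:", np.mean(test))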
The Local Dimension of Deep Manifold
Based on our observation that there is a dramatic drop in the singular
values of the fully connected layers, or of a single feature map of a
convolutional layer, and that the dimension of the concatenated feature vector
almost equals the sum of the dimensions of the individual feature maps, we
propose a singular value decomposition (SVD) based approach to estimate the
dimension of the deep manifolds of a typical convolutional neural network,
VGG19. We choose three categories from ImageNet, namely Persian Cat, Container
Ship and Volcano, and determine the local dimension of the deep manifolds of
the deep layers through the tangent space of a target image. Through several
augmentation methods, we found that the Gaussian noise method comes closest to
the intrinsic dimension: adding random noise to an image moves it in an
arbitrary direction, so once the rank of the feature matrix of the augmented
images stops increasing, we are very close to the local dimension of the
manifold. We also estimate the dimension of the deep manifold based on the
tangent space for each of the maxpooling layers. Our results show that the
dimensions of different categories are close to each other and decline quickly
along the convolutional layers and fully connected layers. Furthermore, we show
that the dimensions decline quickly inside the Conv5 layer. Our work provides
new insights into the intrinsic structure of deep neural networks and helps
unveil the inner organization of the black box of deep neural networks.
Comment: 11 pages, 11 figures
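The sketch below illustrates the flavour of this SVD-based estimate: augment one sample with small Gaussian noise, stack the resulting feature vectors, and count the singular values above a cutoff. The random feature extractor and the 99%-energy cutoff are stand-ins of mine; the paper instead uses VGG19 feature maps and locates the dramatic drop in the singular values.

    import numpy as np

    def local_dimension(features, energy=0.99):
        """features: (n_augmented, feature_dim) matrix for one target image."""
        centred = features - features.mean(axis=0, keepdims=True)
        s = np.linalg.svd(centred, compute_uv=False)
        ratios = np.cumsum(s ** 2) / np.sum(s ** 2)
        return int(np.searchsorted(ratios, energy) + 1)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(512)                 # stand-in for an image
    proj = rng.standard_normal((512, 4096))      # stand-in for a network layer
    feats = np.stack([np.maximum((x + 1e-2 * rng.standard_normal(512)) @ proj, 0)
                      for _ in range(200)])      # Gaussian-noise augmentation
    print("estimated local dimension:", local_dimension(feats))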
On Expected Accuracy
We empirically investigate the (negative) expected accuracy as an alternative
loss function to cross entropy (negative log likelihood) for classification
tasks. Coupled with softmax activation, it has small derivatives over most of
its domain, and is therefore hard to optimize. A modified, leaky version is
evaluated on a variety of classification tasks, including digit recognition,
image classification, sequence tagging and tree tagging, using a variety of
neural architectures such as logistic regression, multilayer perceptron, CNN,
LSTM and Tree-LSTM. We show that it yields accuracy comparable to or better
than cross entropy. Furthermore, the proposed objective is shown to be more
robust to label noise.
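A minimal sketch of the objective being compared, assuming the loss is minus the softmax probability of the true class; the "leaky" variant below simply mixes in a small cross-entropy term so that gradients do not vanish on the flat parts of the domain, which may differ from the exact leaky form evaluated in the paper.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def neg_expected_accuracy(logits, labels):
        # expected accuracy = probability assigned to the correct class
        p = softmax(logits)
        return -p[np.arange(len(labels)), labels].mean()

    def leaky_neg_expected_accuracy(logits, labels, leak=0.1):
        # assumed leaky variant: add a small cross-entropy term to keep gradients alive
        p = softmax(logits)
        p_true = p[np.arange(len(labels)), labels]
        return -p_true.mean() + leak * (-np.log(p_true + 1e-12)).mean()

    logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
    labels = np.array([0, 2])
    print(neg_expected_accuracy(logits, labels),
          leaky_neg_expected_accuracy(logits, labels))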
Does Adam optimizer keep close to the optimal point?
Adaptive optimizers for training neural networks have continually evolved to
overcome the limitations of previously proposed adaptive methods. Recent
studies have found rare counterexamples on which Adam cannot converge to the
optimal point. Those counterexamples reveal a distortion of Adam caused by a
small second-moment estimate arising from small gradients. Unlike previous
studies, we show that Adam cannot stay close to the optimal point, not only on
those counterexamples but also over a general convex region, whenever the
effective learning rate exceeds a certain bound. Subsequently, we propose an
algorithm that overcomes Adam's limitation and ensures that it can reach and
remain in a region around the optimal point.
Comment: Accepted as a workshop paper at the 33rd Conference on Neural
Information Processing Systems (NeurIPS 2019), Vancouver, Canada
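For reference, the sketch below writes out the standard Adam update and the "effective learning rate" alpha / (sqrt(v_hat) + eps) that the abstract refers to: after a run of very small gradients the second-moment estimate is tiny, so the effective step size becomes very large. The specific numbers are illustrative, not taken from the paper.

    import numpy as np

    def adam_step(theta, grad, m, v, t, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # standard Adam with bias correction
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        effective_lr = alpha / (np.sqrt(v_hat) + eps)
        return theta - effective_lr * m_hat, m, v, effective_lr

    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, 6):
        grad = 1e-4                       # a run of tiny gradients ...
        theta, m, v, eff = adam_step(theta, grad, m, v, t)
    # ... leaves a tiny v_hat, so the effective step is orders of magnitude
    # larger than alpha
    print("effective learning rate after tiny gradients:", eff)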
Concavifiability and convergence: necessary and sufficient conditions for gradient descent analysis
Convergence of the gradient descent algorithm has been attracting renewed
interest due to its utility in deep learning applications. Even as multiple
variants of gradient descent were proposed, the assumption that the gradient of
the objective is Lipschitz continuous remained an integral part of the analysis
until recently. In this work, we approach convergence analysis through a
property we term concavifiability, instead of Lipschitz continuity of
gradients. We show that concavifiability is a necessary and sufficient
condition for the upper quadratic approximation, which is key in proving
that the objective function decreases after every gradient descent update. We
also show that any gradient-Lipschitz function is concavifiable. We derive a
constant, the concavifier, analogous to the gradient Lipschitz constant, which
is indicative of the optimal step size. As an application, we demonstrate the
utility of finding the concavifier in the convergence analysis of gradient
descent through an example inspired by neural networks. We derive bounds on
the concavifier to obtain a fixed step size for a single-hidden-layer ReLU
network.
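For readers who want the key inequality spelled out, the LaTeX fragment below states the upper quadratic approximation and the per-step decrease it implies, assuming "concavifiability with concavifier rho" means that (rho/2)||x||^2 - f(x) is convex; the exact definitions in the paper may be phrased differently.

    \[
      f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\rho}{2}\,\|y - x\|^2
      \quad \text{for all } x, y
      \qquad \text{(upper quadratic approximation)}
    \]
    \[
      \text{With step size } \eta = \tfrac{1}{\rho}:\qquad
      f\!\left(x - \tfrac{1}{\rho}\nabla f(x)\right) \;\le\; f(x) - \tfrac{1}{2\rho}\,\|\nabla f(x)\|^2 .
    \]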
Multi-level Residual Networks from Dynamical Systems View
Deep residual networks (ResNets) and their variants are widely used in many
computer vision applications and natural language processing tasks. However,
the theoretical principles for designing and training ResNets are still not
fully understood. Recently, several points of view have emerged that try to
interpret ResNets theoretically, such as the unraveled view, unrolled iterative
estimation, and the dynamical systems view. In this paper, we adopt the
dynamical systems point of view and analyze the lesioning properties of ResNets
both theoretically and experimentally. Based on these analyses, we additionally
propose a novel method for accelerating ResNet training. We apply the proposed
method to train ResNets and Wide ResNets for three image classification
benchmarks, reducing training time by more than 40% with superior or on-par
accuracy.
Comment: Published as a conference paper at ICLR 2018
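To make the dynamical-systems reading concrete, the sketch below treats each residual block as one forward-Euler step x_{t+1} = x_t + h*F(x_t) of an ODE, and sketches a coarse-to-fine refinement in which a shallow stack of blocks is interpolated to initialize a deeper one with a smaller step size. The tanh block and the linear interpolation are assumptions for illustration, not the paper's exact training procedure.

    import numpy as np

    def residual_forward(x, weights, h=1.0):
        # one residual block per "time step": forward-Euler discretization
        for W in weights:
            x = x + h * np.tanh(W @ x)        # F(x) = tanh(Wx) as a stand-in block
        return x

    def refine(weights):
        """Double the number of blocks by linear interpolation in 'time'."""
        fine = []
        for W0, W1 in zip(weights[:-1], weights[1:]):
            fine += [W0, 0.5 * (W0 + W1)]
        fine += [weights[-1], weights[-1]]
        return fine

    rng = np.random.default_rng(0)
    coarse = [0.1 * rng.standard_normal((8, 8)) for _ in range(4)]
    fine = refine(coarse)                      # 8 blocks initialized from 4
    x = rng.standard_normal(8)
    # in the Euler reading, halving the step size compensates for doubling the depth
    print(residual_forward(x, coarse, h=1.0)[:3])
    print(residual_forward(x, fine, h=0.5)[:3])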
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
Neural networks exhibit good generalization behavior in the
over-parameterized regime, where the number of network parameters exceeds the
number of observations. Nonetheless, current generalization bounds for neural
networks fail to explain this phenomenon. In an attempt to bridge this gap, we
study the problem of learning a two-layer over-parameterized neural network,
when the data is generated by a linearly separable function. In the case where
the network has Leaky ReLU activations, we provide both optimization and
generalization guarantees for over-parameterized networks. Specifically, we
prove convergence rates of SGD to a global minimum and provide generalization
guarantees for this global minimum that are independent of the network size.
Therefore, our result clearly shows that the use of SGD for optimization both
finds a global minimum and avoids overfitting despite the high capacity of the
model. This is the first theoretical demonstration that SGD can avoid
overfitting when learning over-specified neural network classifiers.
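A toy version of this setting is sketched below: linearly separable labels, a two-layer Leaky ReLU network with many more hidden units than samples, and plain SGD on the hinge loss with the output layer frozen at +/-1 signs. The frozen output layer, step size, and loss are assumptions made to keep the sketch short; the paper's exact setup may differ.

    import numpy as np

    def leaky_relu(z, a=0.1):
        return np.where(z > 0, z, a * z)

    rng = np.random.default_rng(0)
    n, d, k = 50, 10, 1000                     # 1000 hidden units >> 50 samples
    w_star = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_star)                    # linearly separable labels

    W = 0.01 * rng.standard_normal((k, d))     # trained first layer
    v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])  # frozen output signs

    def predict(Z):
        return leaky_relu(Z @ W.T) @ v / k

    for epoch in range(200):
        for i in rng.permutation(n):
            if y[i] * predict(X[i:i + 1])[0] < 1.0:      # hinge loss is active
                z = X[i] @ W.T
                slope = np.where(z > 0, 1.0, 0.1)
                # SGD step on the hinge loss with respect to the first layer
                W += 0.1 * y[i] * (v * slope / k)[:, None] * X[i][None, :]

    print("training accuracy:", np.mean(np.sign(predict(X)) == y))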
On the Learnability of Deep Random Networks
In this paper we study the learnability of deep random networks from both
theoretical and practical points of view. On the theoretical front, we show
that the learnability of random deep networks with sign activation drops
exponentially with depth. On the practical front, we find that the
learnability drops sharply with depth even with state-of-the-art training
methods, suggesting that our stylized theoretical results are close to
reality.
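The teacher class studied above can be written down compactly; the sketch below only shows how labels are generated from a random, frozen network with sign activations at a given depth (the quantity the abstract varies) and omits the learning experiments themselves.

    import numpy as np

    def random_sign_network(depth, d, rng):
        # random, frozen weights; sign activations at every layer
        Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]
        w_out = rng.standard_normal(d) / np.sqrt(d)
        def label(x):
            for W in Ws:
                x = np.sign(W @ x)
            return np.sign(w_out @ x)
        return label

    rng = np.random.default_rng(0)
    teacher = random_sign_network(depth=5, d=32, rng=rng)
    X = rng.standard_normal((10, 32))
    print([int(teacher(x)) for x in X])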
Machine Learning Based on Natural Language Processing to Detect Cardiac Failure in Clinical Narratives
The purpose of the study presented herein is to develop a machine learning
algorithm based on natural language processing that automatically detects
whether a patient has cardiac failure or is in a healthy condition, using
physician notes from the Research Data Warehouse at CHU Sainte Justine
Hospital. First, word representations were learned using bag-of-words (BoW),
term frequency-inverse document frequency (TFIDF), and neural word embeddings
(word2vec). Each representation technique aims to retain the semantic and
syntactic information of words in critical care data, enriching the mutual
information captured by the word representation and benefiting the subsequent
analysis steps. Second, a machine learning classifier was used to detect
whether a patient has cardiac failure or is stable, based on the word
representation vector space created in the previous step. This step uses
supervised binary classification algorithms, namely logistic regression (LR),
Gaussian Naive Bayes (GaussianNB), and a multilayer perceptron neural network
(MLPNN), each trained by minimizing the empirical loss. Performance is reported
in terms of accuracy (acc), precision (pre), recall (rec), and F1 score (f1).
The results show that the combination of TFIDF and MLPNN consistently
outperformed the other combinations on all performance measures. Without any
feature selection, the proposed framework yielded an overall classification
performance with acc, pre, rec, and f1 of 84%, 82%, 85%, and 83%,
respectively. Significantly, when feature selection was applied, the overall
performance improved by up to 4% on each metric.
Comment: Submitted to the 2021 34th IEEE International Symposium on
Computer-Based Medical Systems (CBMS)
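As an illustration of the best-performing combination reported above (TFIDF features fed to a multilayer perceptron), the scikit-learn sketch below builds that pipeline on a few invented toy notes; the real study uses physician notes from the Research Data Warehouse at CHU Sainte Justine Hospital and reports the metrics listed in the abstract.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    # toy stand-in notes and labels (1 = cardiac failure, 0 = stable)
    notes = [
        "shortness of breath, elevated BNP, reduced ejection fraction",
        "routine follow-up, no cardiac symptoms, normal echocardiogram",
        "pulmonary edema and orthopnea consistent with heart failure",
        "healthy child, normal vitals, no complaints",
    ]
    labels = [1, 0, 1, 0]

    # TFIDF features followed by an MLP classifier, as in the reported combination
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    )
    model.fit(notes, labels)
    print(model.predict(["new onset heart failure with low ejection fraction"]))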
The global optimum of shallow neural network is attained by ridgelet transform
We prove that the global minimum of the backpropagation (BP) training problem
of neural networks with an arbitrary nonlinear activation is given by the
ridgelet transform. A series of computational experiments shows that there is
an interesting similarity between the scatter plot of the hidden parameters of
a shallow neural network after BP training and the spectrum of the ridgelet
transform. By introducing a continuous model of neural networks, we reduce the
training problem to a convex optimization in an infinite-dimensional Hilbert
space and obtain an explicit expression for the global optimizer via the
ridgelet transform.
Comment: under review
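For orientation, the LaTeX fragment below writes one common schematic form of the continuous ("integral representation") model of a shallow network and of the ridgelet transform of f with respect to a ridgelet function psi; normalizations and admissibility conditions vary between formulations and are omitted here.

    \[
      f(x) \;=\; \int_{\mathbb{R}^d \times \mathbb{R}} T(a, b)\, \eta(a \cdot x - b)\, \mathrm{d}a\, \mathrm{d}b ,
      \qquad
      (\mathcal{R}_{\psi} f)(a, b) \;=\; \int_{\mathbb{R}^d} f(x)\, \psi(a \cdot x - b)\, \mathrm{d}x ,
    \]
    where \(\eta\) is the activation; the abstract's claim is that the global minimizer of the
    continuous BP training problem is \(T = \mathcal{R}_{\psi} f\) for an appropriately chosen \(\psi\).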