Slope and generalization properties of neural networks
Neural networks are highly successful tools in, for example, advanced
classification. From a statistical point of view, fitting a neural network may
be seen as a kind of regression, where we seek a function from the input space
to a space of classification probabilities that follows the "general" shape of
the data, but avoids overfitting by avoiding memorization of individual data
points. In statistics, this can be done by controlling the geometric complexity
of the regression function. We propose to do something similar when fitting
neural networks by controlling the slope of the network.
After defining the slope and discussing some of its theoretical properties,
we then show empirically, in examples using ReLU networks, that the
distribution of the slope of a well-trained neural network classifier is
generally independent of the width of the layers in a fully connected network,
and that the mean of the distribution only has a weak dependence on the model
architecture in general. The slope is of similar size throughout the relevant
volume, and varies smoothly. It also behaves as predicted in rescaling
examples. We discuss possible applications of the slope concept, such as using
it as a part of the loss function or stopping criterion during network
training, or ranking data sets in terms of their complexity
Counting the learnable functions of geometrically structured data
Cover's function counting theorem is a milestone in the theory of artificial neural networks. It provides an answer to the fundamental question of determining how many binary assignments (dichotomies) of p points in n dimensions can be linearly realized. Regrettably, it has proved hard to extend the same approach to more advanced problems than the classification of points. In particular, an emerging necessity is to find methods to deal with geometrically structured data, and specifically with non-point-like patterns. A prominent case is that of invariant recognition, whereby identification of a stimulus is insensitive to irrelevant transformations on the inputs (such as rotations or changes in perspective in an image). An object is thus represented by an extended perceptual manifold, consisting of inputs that are classified similarly. Here, we develop a function counting theory for structured data of this kind, by extending Cover's combinatorial technique, and we derive analytical expressions for the average number of dichotomies of generically correlated sets of patterns. As an application, we obtain a closed formula for the capacity of a binary classifier trained to distinguish general polytopes of any dimension. These results extend our theoretical understanding of the role of data structure in machine learning, and provide useful quantitative tools for the analysis of generalization, feature extraction, and invariant object recognition by neural networks.
Optimal Learning with Excitatory and Inhibitory Synapses
Characterizing the relation between weight structure and input/output
statistics is fundamental for understanding the computational capabilities of
neural circuits. In this work, I study the problem of storing associations
between analog signals in the presence of correlations, using methods from
statistical mechanics. I characterize the typical learning performance in terms
of the power spectrum of random input and output processes. I show that optimal
synaptic weight configurations reach a capacity of 0.5 for any fraction of
excitatory to inhibitory weights and have a peculiar synaptic distribution with
a finite fraction of silent synapses. I further provide a link between typical
learning performance and principal components analysis in single cases. These
results may shed light on the synaptic profile of brain circuits, such as
cerebellar structures, that are thought to engage in processing time-dependent
signals and performing on-line prediction.
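The silent-synapse phenomenon described above can be illustrated without the paper's statistical-mechanics machinery: projecting gradient descent onto excitatory/inhibitory sign constraints pins a finite fraction of weights exactly to zero at the constrained optimum. The sizes, the 80/20 E/I split, and the load T/N = 0.4 (below the stated capacity of 0.5) are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 100, 40                      # N synapses, T samples; load T/N = 0.4
X = rng.standard_normal((T, N))     # analog input signals
y = rng.standard_normal(T)          # analog target output

# Sign constraints: first 80% excitatory (w >= 0), rest inhibitory (w <= 0)
sign = np.ones(N)
sign[int(0.8 * N):] = -1.0

# Projected gradient descent on the squared error
w = np.zeros(N)
lr = 1e-3
for _ in range(50_000):
    w -= lr * (X.T @ (X @ w - y))
    w = sign * np.maximum(sign * w, 0.0)   # project onto the sign constraints

err = np.mean((X @ w - y) ** 2)
silent = np.mean(np.abs(w) < 1e-8)         # fraction of exactly-zero weights
print(f"training error {err:.2e}, silent fraction {silent:.2f}")
```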
Fractional Deep Neural Network via Constrained Optimization
This paper introduces a novel algorithmic framework for a deep neural network
(DNN), which in a mathematically rigorous manner, allows us to incorporate
history (or memory) into the network -- it ensures all layers are connected to
one another. This DNN, called Fractional-DNN, can be viewed as a
time discretization of a nonlinear ordinary differential equation (ODE) that
is fractional in time. The learning problem is then a minimization problem with
the fractional ODE as a constraint. We emphasize that the analogy between
standard DNNs and ODEs with the usual time derivative is by now well known. The
focus of our work is the Fractional-DNN. Using the Lagrangian approach, we
provide a derivation of the backward propagation and the design equations. We
test our network on several datasets for classification problems.
Fractional-DNN offers several advantages over standard DNNs. The key benefits
are a significant mitigation of the vanishing-gradient issue, owing to the
memory effect, and better handling of nonsmooth data, owing to the network's
ability to approximate nonsmooth functions.
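The "all layers connected" memory structure can be seen in a generic forward pass: discretizing a Caputo fractional derivative with the standard L1 scheme makes each layer update depend on the full history of earlier layers. This is a sketch of that discretization, not the paper's exact scheme; the fractional order, step size, and layer map below are assumed for illustration:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(2)

gam = 0.7        # fractional order in (0, 1); gam -> 1 recovers a ResNet step
tau = 0.5        # layer "time step"
L, d = 8, 4      # number of layers, feature dimension

Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
f = lambda W, u: np.tanh(W @ u)          # per-layer velocity field (assumed)

# L1-scheme weights for the Caputo derivative: a_j = (j+1)^(1-gam) - j^(1-gam).
# For gam -> 1 these vanish for j >= 1 and the history term disappears.
a = np.array([(j + 1) ** (1 - gam) - j ** (1 - gam) for j in range(L)])

u = [rng.standard_normal(d)]             # u[0]: input features
for k in range(L):
    # History term couples layer k+1 to ALL earlier layers (the memory effect)
    hist = sum(a[j] * (u[k + 1 - j] - u[k - j]) for j in range(1, k + 1))
    u.append(u[k] - hist + tau ** gam * gamma(2 - gam) * f(Ws[k], u[k]))

print(u[-1])                             # network output features
```

Setting gam = 1 collapses the update to u[k+1] = u[k] + tau * f(Ws[k], u[k]), the familiar ResNet/forward-Euler step, which is the "standard time derivative" analogy the abstract refers to.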