The loss surface of deep linear networks viewed through the algebraic geometry lens
By using the viewpoint of modern computational algebraic geometry, we explore
properties of the optimization landscapes of the deep linear neural network
models. After clarifying the various definitions of "flat" minima, we show
that the geometrically flat minima, which are merely artifacts of residual
continuous symmetries of the deep linear networks, can be straightforwardly
removed by a generalized regularization. Then, we establish upper bounds
on the number of isolated stationary points of these networks with the help of
algebraic geometry. Using these upper bounds and utilizing a numerical
algebraic geometry method, we find all stationary points of modest depth and
matrix size. We show that in the presence of non-zero regularization, deep
linear networks indeed possess local minima that are not global minima. Our
computational results clarify certain aspects of the loss surfaces of deep
linear networks and provide novel insights.
Comment: 16 pages (2-column), 5 figures
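As a concrete illustration of the rescaling symmetry and how a penalty removes the resulting flat minima, here is a minimal numpy sketch (not the authors' code; the toy sizes and the plain L2 penalty standing in for the generalized regularization are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy depth-3 linear network f(X) = W3 W2 W1 X under squared loss, with a
    # plain L2 penalty standing in for the paper's generalized regularization.
    d, n = 4, 50
    X = rng.normal(size=(d, n))
    Y = rng.normal(size=(d, n))
    Ws = [rng.normal(size=(d, d)) for _ in range(3)]

    def loss(Ws, lam):
        P = np.eye(d)
        for W in Ws:
            P = W @ P                    # end-to-end product W3 W2 W1
        fit = 0.5 * np.sum((P @ X - Y) ** 2)
        reg = 0.5 * lam * sum(np.sum(W ** 2) for W in Ws)
        return fit + reg

    # Rescaling (W1/c, c*W2, W3) leaves the product unchanged: a continuous
    # symmetry whose orbit is a geometrically flat direction of the loss.
    c = 2.0
    Ws_rescaled = [Ws[0] / c, c * Ws[1], Ws[2]]
    print(loss(Ws, 0.0), loss(Ws_rescaled, 0.0))   # equal: flat direction
    print(loss(Ws, 0.1), loss(Ws_rescaled, 0.1))   # differ: symmetry removed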
Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks
Efforts to reduce the numerical precision of computations in deep learning
training have yielded systems that aggressively quantize weights and
activations, yet employ wide high-precision accumulators for partial sums in
inner-product operations to preserve the quality of convergence. The absence of
any framework to analyze the precision requirements of partial sum
accumulations results in conservative design choices. This imposes an upper
bound on the reduction in complexity of multiply-accumulate units. We
present a statistical approach to analyze the impact of reduced accumulation
precision on deep learning training. Observing that a bad choice for
accumulation precision results in loss of information that manifests itself as
a reduction in variance in an ensemble of partial sums, we derive a set of
equations that relate this variance to the length of accumulation and the
minimum number of bits needed for accumulation. We apply our analysis to three
benchmark networks: CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet
AlexNet. In each case, with accumulation precision set in accordance with our
proposed equations, the networks successfully converge to the single precision
floating-point baseline. We also show that reducing accumulation precision
below these values degrades the quality of the trained network, demonstrating
that our equations produce tight bounds. Overall, this analysis enables precise
tailoring of computation hardware to the application, yielding area- and
power-optimal systems.
Comment: Published as a conference paper at ICLR 2019
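The variance-based diagnostic can be illustrated with a toy simulation. The fixed-point accumulator below is a simplification of the paper's floating-point analysis, and all sizes are made up:

    import numpy as np

    rng = np.random.default_rng(1)

    def quantize(x, frac_bits):
        # Round to a fixed number of fractional bits -- a toy stand-in for
        # the reduced-precision accumulator analyzed in the paper.
        s = 2.0 ** frac_bits
        return np.round(x * s) / s

    def accumulate(terms, frac_bits):
        acc = 0.0
        for t in terms:
            acc = quantize(acc + t, frac_bits)   # quantize after every add
        return acc

    n_terms, n_trials = 1024, 100
    for frac_bits in (4, 8, 16):
        sums = [accumulate(rng.normal(scale=0.01, size=n_terms), frac_bits)
                for _ in range(n_trials)]
        print(f"{frac_bits:2d} fractional bits -> ensemble variance "
              f"{np.var(sums):.5f}")
    # Exact accumulation would give variance ~ n_terms * 0.01^2 ~ 0.10; a
    # too-coarse accumulator swallows the small addends, and the variance of
    # the ensemble of partial sums collapses -- the paper's diagnostic signal.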
Numerically Recovering the Critical Points of a Deep Linear Autoencoder
Numerically locating the critical points of non-convex surfaces is a
long-standing problem central to many fields. Recently, the loss surfaces of
deep neural networks have been explored to gain insight into outstanding
questions in optimization, generalization, and network architecture design.
However, the degree to which recently-proposed methods for numerically
recovering critical points actually do so has not been thoroughly evaluated. In
this paper, we examine this issue in a case for which the ground truth is
known: the deep linear autoencoder. We investigate two sub-problems associated
with numerical critical point identification: first, because of large parameter
counts, it is infeasible to find all of the critical points for contemporary
neural networks, necessitating sampling approaches whose characteristics are
poorly understood; second, the numerical tolerance for accurately identifying a
critical point is unknown, and conservative tolerances are difficult to
satisfy. We first identify connections between recently-proposed methods and
well-understood methods in other fields, including chemical physics, economics,
and algebraic geometry. We find that several methods work well at recovering
certain information about loss surfaces, but fail to take an unbiased sample of
critical points. Furthermore, numerical tolerance must be very strict to ensure
that numerically-identified critical points have similar properties to true
analytical critical points. We also identify a recently-published Newton method
for optimization that outperforms previous methods as a critical point-finding
algorithm. We expect our results will guide future attempts to numerically
study critical points in large nonlinear neural networks.
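A minimal sketch of Newton-based critical point finding on a tiny linear autoencoder (illustrative only; the finite-difference derivatives and damping scheme are assumptions, not the paper's exact algorithm):

    import numpy as np

    rng = np.random.default_rng(2)

    # Tiny linear autoencoder with a rank-1 bottleneck:
    #   L(w1, w2) = 0.5 * || w2 w1^T X - X ||_F^2, params packed in one vector.
    d, n = 3, 20
    X = rng.normal(size=(d, n))

    def loss(w):
        w1, w2 = w[:d], w[d:]
        return 0.5 * np.sum((np.outer(w2, w1) @ X - X) ** 2)

    def grad(w, eps=1e-6):
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = eps
            g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
        return g

    def hess(w, eps=1e-4):
        H = np.zeros((len(w), len(w)))
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = eps
            H[:, i] = (grad(w + e) - grad(w - e)) / (2 * eps)
        return 0.5 * (H + H.T)

    # Damped Newton iteration on the *gradient*: it solves grad(w) = 0, so it
    # is drawn to saddles and maxima as readily as to minima -- the behavior a
    # critical-point finder (as opposed to an optimizer) needs.
    w = 0.5 * rng.normal(size=2 * d)
    for _ in range(100):
        g = grad(w)
        step = np.linalg.solve(hess(w) + 1e-9 * np.eye(2 * d), g)
        t = 1.0
        while t > 1e-8 and np.linalg.norm(grad(w - t * step)) > np.linalg.norm(g):
            t *= 0.5                     # backtrack on the gradient norm
        w = w - t * step

    print("||grad|| at candidate:", np.linalg.norm(grad(w)))
    print("Hessian eigenvalues:  ", np.round(np.linalg.eigvalsh(hess(w)), 3))

The eigenvalue printout is the point: negative eigenvalues at a converged candidate flag a saddle rather than a minimum, which is the kind of information a tolerance-checked critical point sample should expose.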
Exploiting Errors for Efficiency: A Survey from Circuits to Algorithms
When a computational task tolerates a relaxation of its specification or when
an algorithm tolerates the effects of noise in its execution, hardware,
programming languages, and system software can trade deviations from correct
behavior for lower resource usage. We present, for the first time, a synthesis
of research results on computing systems that only make as many errors as their
users can tolerate, from across the disciplines of computer aided design of
circuits, digital system design, computer architecture, programming languages,
operating systems, and information theory.
Rather than over-provisioning resources at each layer to avoid errors, it can
be more efficient to exploit the masking of errors occurring at one layer,
which prevents them from propagating to a higher layer. We survey tradeoffs for
individual layers of computing systems from the circuit level to the operating
system level and illustrate the potential benefits of end-to-end approaches
using two illustrative examples. To tie together the survey, we present a
consistent formalization of terminology, across the layers, which does not
significantly deviate from the terminology traditionally used by research
communities in their layer of focus.
Comment: 35 pages
Understanding the Energy and Precision Requirements for Online Learning
It is well-known that the precision of data, hyperparameters, and internal
representations employed in learning systems directly impacts their energy,
throughput, and latency. The precision requirements for the training algorithm
are also important for systems that learn on-the-fly. Prior work has shown that
the data and hyperparameters can be quantized heavily without incurring much
penalty in classification accuracy when compared to floating point
implementations. These works suffer from two key limitations. First, they
assume uniform precision for the classifier and for the training algorithm and
thus miss out on the opportunity to further reduce precision. Second, prior
works are empirical studies. In this article, we overcome both these
limitations by deriving analytical lower bounds on the precision requirements
of the commonly employed stochastic gradient descent (SGD) on-line learning
algorithm in the specific context of a support vector machine (SVM). Lower
bounds on the data precision are derived in terms of the desired
classification accuracy and precision of the hyperparameters used in the
classifier. Additionally, lower bounds on the hyperparameter precision in the
SGD training algorithm are obtained. These bounds are validated using both
synthetic data and the UCI breast cancer dataset. Additionally, the impact of these
precisions on the energy consumption of a fixed-point SVM with on-line training
is studied.
Comment: 14 pages, 5 figures, 4 of which have 2 subfigures
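The flavor of the precision trade-off can be sketched with a fixed-point SGD-trained SVM on synthetic data (an illustrative simulation, not the paper's analytical bounds; the bit-widths, clipping ranges, and hyperparameters are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)

    def quantize(x, bits, clip=1.0):
        # Uniform fixed-point quantizer on [-clip, clip) with the given bit-width.
        levels = 2.0 ** (bits - 1)
        return np.clip(np.round(x / clip * levels), -levels, levels - 1) / levels * clip

    n, d = 1000, 8
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))
    Xn = X / np.abs(X).max()                 # scale inputs into quantizer range

    def train_svm(x_bits, w_bits, lr=0.1, lam=0.01, epochs=5):
        w = np.zeros(d)
        Xq = quantize(Xn, x_bits)            # quantized data
        for _ in range(epochs):
            for i in rng.permutation(n):
                g = lam * w                  # hinge-loss SGD (Pegasos-style)
                if y[i] * (Xq[i] @ w) < 1:
                    g = g - y[i] * Xq[i]
                w = quantize(w - lr * g, w_bits, clip=2.0)   # quantized weights
        return w

    for x_bits, w_bits in [(24, 24), (8, 12), (4, 6)]:
        w = train_svm(x_bits, w_bits)
        acc = np.mean(np.sign(Xn @ w) == y)
        print(f"x_bits={x_bits:2d} w_bits={w_bits:2d} -> train accuracy {acc:.3f}")
    # Below some bit-width the SGD updates fall under the weight quantization
    # step, the weights never move, and learning stalls -- the regime the
    # derived lower bounds are meant to exclude.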
Adaptive Task Allocation for Mobile Edge Learning
This paper aims to establish a new optimization paradigm for implementing
realistic distributed learning algorithms, with performance guarantees, on
wireless edge nodes with heterogeneous computing and communication capacities.
We refer to this new paradigm as 'Mobile Edge Learning (MEL)'. The problem
of dynamic task allocation for MEL is considered in this paper with the aim of
maximizing learning accuracy, while guaranteeing that the total times of data
distribution/aggregation over heterogeneous channels, and local computing
iterations at the heterogeneous nodes, are bounded by a preset duration. The
problem is first formulated as a quadratically-constrained integer linear
problem. Since this problem is NP-hard, we relax it into a non-convex problem
over real variables. We then propose two solutions: one derives analytical
upper bounds on the optimal solution of the relaxed problem using Lagrangian
analysis and KKT conditions, and the other applies a suggest-and-improve
method starting from equal batch allocation. The merits of these
proposed solutions are exhibited by comparing their performances to both
numerical approaches and the equal task allocation approach.
Comment: 8 pages, 2 figures, submitted to IEEE WCNC Workshop 2019, Morocco
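A toy sketch of the allocation idea follows. The linear time model and the names c, m, T below are illustrative assumptions; the paper's actual formulation is a quadratically-constrained integer program solved via the Lagrangian/KKT bounds and suggest-and-improve described above:

    import numpy as np

    # K = 3 edge nodes: node k computes c[k] samples/sec and its channel
    # carries m[k] samples/sec; one learning round must fit in T seconds.
    c = np.array([100.0, 400.0, 250.0])   # local compute rates (samples/sec)
    m = np.array([50.0, 10.0, 30.0])      # channel rates (samples/sec)
    T = 2.0                               # round deadline (seconds)
    total = 120                           # samples to distribute per round

    def round_time(b):
        return b / m + b / c              # transfer time + compute time

    b_equal = np.full(3, total / 3.0)

    # Capacity-proportional heuristic: each node's share is proportional to
    # the batch it could finish alone within T, rescaled to the total.
    cap = T / (1.0 / m + 1.0 / c)
    b_prop = cap / cap.sum() * total

    for name, b in (("equal", b_equal), ("proportional", b_prop)):
        print(f"{name:12s} batches={np.round(b, 1)} "
              f"slowest node finishes at {round_time(b).max():.2f}s")
    # Equal allocation lets the weakest link blow the deadline; balancing by
    # capacity equalizes finish times -- the intuition behind starting
    # suggest-and-improve from a feasible allocation.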
Binary Classification from Positive-Confidence Data
Can we learn a binary classifier from only positive data, without any
negative data or unlabeled data? We show that if one can equip positive data
with confidence (positive-confidence), one can successfully learn a binary
classifier, which we name positive-confidence (Pconf) classification. Our work
is related to one-class classification, which aims at "describing" the
positive class by clustering-related methods; however, one-class
classification offers no principled way to tune hyper-parameters, and its aim
is not to "discriminate" between positive and negative classes. For the Pconf classification
problem, we provide a simple empirical risk minimization framework that is
model-independent and optimization-independent. We theoretically establish the
consistency and an estimation error bound, and demonstrate the usefulness of
the proposed method for training deep neural networks through experiments.
Comment: NeurIPS 2018 camera-ready version (this paper was selected for spotlight presentation)
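A minimal numpy sketch of the Pconf idea, following the positive-confidence risk proposed in the paper, E_+[ l(g(x)) + ((1-r(x))/r(x)) l(-g(x)) ] up to the class-prior constant; the toy Gaussian data, linear model, and finite-difference optimizer are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(4)

    # Toy setup: two unit-variance Gaussian classes with equal priors, so the
    # positive-confidence r(x) = P(y=+1 | x) is available in closed form.
    d, n_pos = 2, 500
    mu_p, mu_n = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
    Xp = rng.normal(size=(n_pos, d)) + mu_p          # positive samples only

    log_p = -0.5 * np.sum((Xp - mu_p) ** 2, axis=1)
    log_n = -0.5 * np.sum((Xp - mu_n) ** 2, axis=1)
    r = 1.0 / (1.0 + np.exp(log_n - log_p))          # confidence of each sample

    def logistic(z):
        return np.logaddexp(0.0, -z)                 # stable log(1 + e^{-z})

    def pconf_risk(w):
        g = Xp @ w[:d] + w[d]                        # linear classifier
        # Pconf empirical risk (class prior dropped as a constant factor):
        #   mean over positives of  l(g(x)) + (1 - r(x))/r(x) * l(-g(x))
        return np.mean(logistic(g) + (1.0 - r) / r * logistic(-g))

    w = np.zeros(d + 1)
    for _ in range(2000):                            # crude finite-diff descent
        grad = np.array([(pconf_risk(w + 1e-5 * e) - pconf_risk(w - 1e-5 * e))
                         / 2e-5 for e in np.eye(d + 1)])
        w -= 0.1 * grad
    print("weights, bias:", np.round(w, 2))   # roughly along mu_p - mu_n, bias ~ 0

Note that only positive samples Xp and their confidences r appear in the risk; the negative class enters solely through the (1-r)/r weighting, which is the whole point of the method.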
The Newton Scheme for Deep Learning
We introduce a neural network (NN) strictly governed by Newton's Law, with the
required basis functions derived from fundamental classical mechanics. By
recasting training as a quick procedure of 'force pattern' recognition, we
develop the physics-based Newton Scheme (NS). Once the force pattern is
confirmed, the network simply checks the 'pattern stability' instead of
performing continuous fitting through resource-consuming, big-data-driven
processing. Within a given system of physical laws, once the field is
confirmed, the mathematical bases describing the force field are not unbounded
but denumerable, which spares the function representation from searching over
all available mathematical bases. In this work, we embed Newton's Law into
deep learning technology and propose the Newton Scheme (NS). Under NS, the
user first identifies the path pattern, such as constant-acceleration motion.
The object recognition technology first loads the mass information; the NS
then finds the matching physical pattern and describes and predicts the
trajectory of the movement with nearly zero error. We compare the major
contributions of NS with TCN, GRU, and other physics-inspired 'FIND-PDE'
methods to demonstrate fundamental and extended applications of how NS works
for free-falling, pendulum, and curved soccer-ball trajectories. The NS
methodology provides further opportunities for future advances in deep
learning.
Comment: 7 pages, 10 figures
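A toy sketch of the pattern idea for free fall: once the constant-acceleration pattern is identified, three coefficients over the basis {1, t, t^2} describe the whole trajectory. The setup below is an illustration of that idea, not the paper's pipeline:

    import numpy as np

    g = 9.81
    t_obs = np.linspace(0.0, 1.0, 20)                  # one second of observations
    y_obs = 10.0 + 2.0 * t_obs - 0.5 * g * t_obs ** 2  # y0 + v0*t - (g/2) t^2

    # Fit three coefficients over the classical-mechanics basis {1, t, t^2}.
    basis = np.stack([np.ones_like(t_obs), t_obs, t_obs ** 2], axis=1)
    coef, *_ = np.linalg.lstsq(basis, y_obs, rcond=None)
    print("y0, v0, a/2 =", np.round(coef, 3))          # recovers 10, 2, -4.905

    # Once the force pattern is identified, extrapolation needs no further
    # fitting and is essentially exact, even far outside the observed window.
    t_new = 3.0
    print("predicted y(3):", coef @ np.array([1.0, t_new, t_new ** 2]))
    print("true y(3):     ", 10.0 + 2.0 * t_new - 0.5 * g * t_new ** 2)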
Generalizing the Convolution Operator in Convolutional Neural Networks
Convolutional neural networks have become a main tool for solving many
machine vision and machine learning problems. A major element of these networks
is the convolution operator, which essentially computes the inner product
between a weight vector and the vectorized image patches extracted by sliding a
window in the image planes of the previous layer. In this paper, we propose two
classes of surrogate functions for the inner product operation inherent in the
convolution operator and so attain two generalizations of the convolution
operator. The first is the class of positive definite kernel functions, whose
application is justified by the kernel trick. The second is the
class of similarity measures defined based on a distance function. We justify
this by tracing back to the basic idea behind the neocognitron, the ancestor
of CNNs. Both methods are then further generalized by allowing a
monotonically increasing function to be applied subsequently. Like any
trainable parameter in a neural network, the template pattern and the
parameters of the kernel/distance function are trained with the
back-propagation algorithm. As an aside, we use the proposed framework to
justify the use of the sine activation function in CNNs. Our experiments on
the MNIST dataset show that the performance of ordinary CNNs can be matched by
generalized CNNs based on weighted L1/L2 distances, demonstrating the
applicability of the proposed generalization of convolutional neural networks.
Comment: Neural Processing Letters (2019)
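A minimal sketch contrasting the ordinary inner-product convolution with a distance-based surrogate (unweighted L2 here for brevity; the paper's version uses trainable weighted L1/L2 distances, optionally composed with a monotonically increasing function):

    import numpy as np

    rng = np.random.default_rng(5)

    def conv2d_generalized(img, w, mode="inner"):
        # 'Valid' 2D correlation where each patch response is either the usual
        # inner product or a negated L2 distance to the template w.
        kh, kw = w.shape
        H, W = img.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = img[i:i + kh, j:j + kw]
                if mode == "inner":
                    out[i, j] = np.sum(patch * w)           # ordinary convolution
                else:
                    out[i, j] = -np.sum((patch - w) ** 2)   # distance-based response
        return out

    img = rng.normal(size=(6, 6))
    w = rng.normal(size=(2, 2))
    print(conv2d_generalized(img, w, "inner")[0, 0])
    print(conv2d_generalized(img, w, "l2")[0, 0])
    # The distance response peaks (at 0) exactly when the patch matches the
    # template -- the neocognitron-style template-matching view; a trainable
    # per-pixel weight inside the distance gives the paper's weighted L1/L2.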
Spurious Local Minima are Common in Two-Layer ReLU Neural Networks
We consider the optimization problem associated with training simple ReLU
neural networks of the form $\mathbf{x} \mapsto \sum_{i=1}^{k} \max\{0,
\mathbf{w}_i^\top \mathbf{x}\}$ with respect to the squared loss. We provide a
computer-assisted proof that even if the input distribution is standard
Gaussian, even if the dimension is arbitrarily large, and even if the target
values are generated by such a network, with orthonormal parameter vectors,
the problem can still have spurious local minima once $6 \le k \le 20$. By a
concentration of measure argument, this implies that in high
input dimensions, nearly all target networks of the relevant sizes lead
to spurious local minima. Moreover, we conduct experiments which show that the
probability of hitting such local minima is quite high, and increasing with the
network size. On the positive side, mild over-parameterization appears to
drastically reduce such local minima, indicating that an over-parameterization
assumption is necessary to get a positive result in this setting.
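The experimental claim can be reproduced in miniature: train width-k students on an orthonormal teacher from several random seeds and inspect the final losses (a rough sketch; the sizes, initialization, and SGD hyperparameters are arbitrary choices, not the paper's protocol):

    import numpy as np

    rng = np.random.default_rng(6)

    def net(W, X):
        # x -> sum_i max(0, w_i . x)
        return np.maximum(0.0, X @ W.T).sum(axis=1)

    d = k = 8
    W_target = np.eye(k, d)            # orthonormal teacher rows
    X = rng.normal(size=(20000, d))    # standard Gaussian inputs
    y = net(W_target, X)

    def train(seed, steps=3000, lr=0.05, batch=256):
        r = np.random.default_rng(seed)
        W = 0.5 * r.normal(size=(k, d))
        for _ in range(steps):
            idx = r.integers(0, len(X), size=batch)
            Xb, yb = X[idx], y[idx]
            act = Xb @ W.T
            err = np.maximum(0.0, act).sum(axis=1) - yb
            W -= lr * ((act > 0) * err[:, None]).T @ Xb / batch
        return 0.5 * np.mean((net(W, X) - y) ** 2)

    losses = np.array([train(s) for s in range(10)])
    print(np.round(losses, 4))
    # Runs that stall at a clearly non-zero loss are candidates for the
    # spurious local minima the paper proves exist; widening the student
    # (more than k rows) tends to make such runs rarer, matching the
    # over-parameterization observation above.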