Biologically plausible deep learning -- but how far can we go with shallow networks?
Training deep neural networks with the error backpropagation algorithm is
considered implausible from a biological perspective. Numerous recent
publications suggest elaborate models for biologically plausible variants of
deep learning, typically defining success as reaching around 98% test accuracy
on the MNIST data set. Here, we investigate how far we can go on digit (MNIST)
and object (CIFAR10) classification with biologically plausible, local learning
rules in a network with one hidden layer and a single readout layer. The hidden
layer weights are either fixed (random or random Gabor filters) or trained with
unsupervised methods (PCA, ICA or Sparse Coding) that can be implemented by
local learning rules. The readout layer is trained with a supervised, local
learning rule. We first implement these models with rate neurons. This
comparison reveals, first, that unsupervised learning does not lead to better
performance than fixed random projections or Gabor filters for large hidden
layers. Second, networks with localized receptive fields perform significantly
better than networks with all-to-all connectivity and can reach backpropagation
performance on MNIST. We then implement two of the networks (fixed localized
random and random Gabor filters in the hidden layer) with spiking leaky
integrate-and-fire neurons and spike timing dependent plasticity to train the
readout layer. These spiking models achieve > 98.2% test accuracy on MNIST,
which is close to the performance of rate networks with one hidden layer
trained with backpropagation. The performance of our shallow network models is
comparable to most current biologically plausible models of deep learning.
Furthermore, our results with a shallow spiking network provide an important
reference and suggest the use of datasets other than MNIST for testing the
performance of future models of biologically plausible deep learning.
Comment: 14 pages, 4 figures
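As a rough illustration of the setup above, here is a minimal sketch (not the authors' code) of a one-hidden-layer rate network: the hidden weights are a fixed random projection (random Gabor filters would be the other fixed variant) and only the readout is trained, with a local delta rule. Synthetic arrays stand in for MNIST, and the layer sizes, nonlinearity, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 784, 500, 10
X = rng.random((1000, n_in))               # stand-in for MNIST pixels in [0, 1]
y = rng.integers(0, n_classes, size=1000)  # stand-in labels
T = np.eye(n_classes)[y]                   # one-hot targets

# Fixed random projection: the hidden weights are never trained.
W_hidden = rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in)
W_out = np.zeros((n_hidden, n_classes))
lr = 0.01

for epoch in range(10):
    H = np.maximum(X @ W_hidden, 0.0)  # hidden rates (ReLU as a stand-in nonlinearity)
    Y = H @ W_out                      # linear readout
    # Local delta rule: each weight update uses only the presynaptic rate and
    # the postsynaptic error; no gradient is propagated through the hidden layer.
    W_out += lr * H.T @ (T - Y) / len(X)
```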
Markov Chain Monte Carlo Bayesian Learning for Neural Networks
Conventional training methods for neural networks involve starting at a random location in the solution space of the network weights, navigating an error hypersurface to reach a minimum, and sometimes using stochastic techniques (e.g., genetic algorithms) to avoid entrapment in a local minimum. It is further typically necessary to preprocess the data (e.g., normalization) to keep the training algorithm on course. Conversely, Bayesian learning is an epistemological approach concerned with formally updating the plausibility of competing candidate hypotheses, thereby obtaining a posterior distribution for the network weights conditioned on the available data and a prior distribution. In this paper, we develop a powerful methodology for estimating the full residual uncertainty in network weights, and therefore in network predictions, by using a modified Jeffreys prior combined with a Metropolis Markov Chain Monte Carlo method.
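The sketch below illustrates the general Metropolis idea on a toy linear model standing in for a network: propose a random-walk step in weight space and accept it with probability given by the posterior ratio, yielding samples that quantify weight (and hence prediction) uncertainty. The Gaussian prior and noise level here are simplifying assumptions, not the paper's modified Jeffreys prior.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.5, -0.7]) + 0.1 * rng.standard_normal(50)  # toy regression data

def log_posterior(w, sigma=0.1):
    resid = y - X @ w
    log_lik = -0.5 * np.sum(resid**2) / sigma**2
    log_prior = -0.5 * np.sum(w**2)  # Gaussian prior as a simple stand-in
    return log_lik + log_prior

w = np.zeros(2)
samples = []
for _ in range(5000):
    proposal = w + 0.05 * rng.standard_normal(2)  # random-walk proposal
    # Metropolis acceptance: keep the proposal with probability min(1, ratio).
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(w):
        w = proposal
    samples.append(w.copy())
samples = np.array(samples[1000:])  # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))  # posterior mean and spread
```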
Learning with Local Gradients at the Edge
To enable learning on edge devices with fast convergence and low memory, we
present a novel backpropagation-free optimization algorithm dubbed Target
Projection Stochastic Gradient Descent (tpSGD). tpSGD generalizes direct random
target projection to work with arbitrary loss functions and extends target
projection for training recurrent neural networks (RNNs) in addition to
feedforward networks. tpSGD uses layer-wise stochastic gradient descent (SGD)
and local targets generated via random projections of the labels to train the
network layer-by-layer with only forward passes. tpSGD doesn't require
retaining gradients during optimization, greatly reducing memory allocation
compared to SGD backpropagation (BP) methods that require multiple instances of
the entire neural network weights, input/output, and intermediate results. Our
method performs within 5% of the accuracy of BP gradient descent on
relatively shallow networks of fully connected layers, convolutional layers,
and recurrent layers. tpSGD also outperforms other state-of-the-art
gradient-free algorithms in shallow models consisting of multi-layer
perceptrons, convolutional neural networks (CNNs), and RNNs, achieving
competitive accuracy with less memory and time. We evaluate the performance of tpSGD in
training deep neural networks (e.g., VGG) and extend the approach to multi-layer
RNNs. These experiments highlight new research directions related to optimized
layer-based adaptor training for domain shift using tpSGD at the edge.
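A minimal sketch of the core mechanism, under the assumption that each hidden layer minimizes a local squared error against a fixed random projection of the one-hot label, while the output layer uses the label itself: every update is layer-local and forward-only, so no gradients are retained across layers. Layer sizes, nonlinearity, and learning rate are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [784, 256, 128, 10]
W = [rng.standard_normal((a, b)) / np.sqrt(a) for a, b in zip(sizes[:-1], sizes[1:])]
# Fixed random projections mapping one-hot labels into each hidden layer's space.
B = [rng.standard_normal((sizes[-1], n)) for n in sizes[1:-1]]

X = rng.random((256, 784))                     # stand-in batch
T = np.eye(10)[rng.integers(0, 10, size=256)]  # one-hot labels
lr = 0.01

for step in range(100):
    h = X
    for i in range(len(W)):
        pre = h
        last = (i == len(W) - 1)
        h = pre @ W[i] if last else np.tanh(pre @ W[i])
        # Local target: a random projection of the label for hidden layers
        # (squashed into the layer's output range), the label itself at the output.
        target = T if last else np.tanh(T @ B[i])
        err = (target - h) if last else (target - h) * (1.0 - h**2)
        W[i] += lr * pre.T @ err / len(X)      # layer-wise SGD, forward passes only
```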
Decoding the Encoding of Functional Brain Networks: an fMRI Classification Comparison of Non-negative Matrix Factorization (NMF), Independent Component Analysis (ICA), and Sparse Coding Algorithms
Brain networks in fMRI are typically identified using spatial independent
component analysis (ICA), yet mathematical constraints such as sparse coding
and positivity both provide alternate biologically-plausible frameworks for
generating brain networks. Non-negative Matrix Factorization (NMF) would
suppress negative BOLD signal by enforcing positivity. Spatial sparse coding
algorithms (L1 Regularized Learning and K-SVD) would impose local
specialization and a discouragement of multitasking, where the total observed
activity in a single voxel originates from a restricted number of possible
brain networks.
The assumptions of independence, positivity, and sparsity to encode
task-related brain networks are compared; the resulting brain networks for
different constraints are used as basis functions to encode the observed
functional activity at a given time point. These encodings are decoded using
machine learning to compare both the algorithms and their assumptions, using
the time series weights to predict whether a subject is viewing a video,
listening to an audio cue, or at rest, in 304 fMRI scans from 51 subjects.
For classifying cognitive activity, the sparse coding algorithm of L1
Regularized Learning consistently outperformed 4 variations of ICA across
different numbers of networks and noise levels (p < 0.001). The NMF algorithms,
which suppressed negative BOLD signal, had the poorest accuracy. Within each
algorithm, encodings using sparser spatial networks (containing more
zero-valued voxels) had higher classification accuracy (p < 0.001). The success
of sparse coding algorithms suggests that algorithms which enforce sparse
coding, discourage multitasking, and promote local specialization may capture
the underlying source processes better than those which allow inexhaustible
local processes, such as ICA.
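The comparison could be prototyped along these lines with scikit-learn stand-ins for the three factorization families (FastICA for independence, NMF for positivity, DictionaryLearning for L1-regularized sparse coding), then classifying the resulting time-series encodings. The synthetic data, component count, and classifier below are assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, FastICA, NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = np.abs(rng.standard_normal((300, 400)))  # stand-in: time points x voxels (non-negative for NMF)
labels = rng.integers(0, 3, size=300)        # stand-in: video / audio / rest

for name, model in [("ICA", FastICA(n_components=20, max_iter=500)),
                    ("NMF", NMF(n_components=20, max_iter=500)),
                    ("SparseCoding", DictionaryLearning(n_components=20, alpha=1.0))]:
    encodings = model.fit_transform(X)       # time-series weights per network
    acc = cross_val_score(LogisticRegression(max_iter=1000), encodings, labels).mean()
    print(f"{name}: mean CV accuracy {acc:.3f}")
```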
Deep learning with asymmetric connections and Hebbian updates
We show that deep networks can be trained using Hebbian updates yielding
similar performance to ordinary back-propagation on challenging image datasets.
To overcome the unrealistic symmetry in connections between layers, implicit in
back-propagation, the feedback weights are separate from the feedforward
weights. The feedback weights are also updated with a local rule, the same as
the feedforward weights - a weight is updated solely based on the product of
activity of the units it connects. With fixed feedback weights, as proposed in
Lillicrap et al. (2016), performance degrades quickly as the depth of the
network increases. If the feedforward and feedback weights are initialized with
the same values, as proposed in Zipser and Rumelhart (1990), they remain the
same throughout training, thus precisely implementing back-propagation. We show
that even when the weights are initialized differently and at random, and the
algorithm is no longer performing back-propagation, performance is comparable
on challenging datasets. We also propose a cost function whose derivative can
be represented as a local Hebbian update on the last layer. Convolutional
layers are updated with tied weights across space, which is not biologically
plausible. We show that similar performance is achieved with untied layers,
also known as locally connected layers, corresponding to the connectivity
implied by the convolutional layers, but where weights are untied and updated
separately. In the linear case we show theoretically that the convergence of
the error to zero is accelerated by the update of the feedback weights.
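A hedged sketch of the mechanism on a two-layer network: the feedback matrix B is separate from W2, carries the output error to the hidden layer in place of W2's transpose, and is updated with the same local rule as the feedforward weights (the product of the activities of the units it connects). Data, sizes, and learning rate are placeholders rather than the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hid, n_out = 784, 256, 10
W1 = rng.standard_normal((n_in, n_hid)) / np.sqrt(n_in)
W2 = rng.standard_normal((n_hid, n_out)) / np.sqrt(n_hid)
B = rng.standard_normal((n_out, n_hid)) / np.sqrt(n_out)  # feedback, initialized at random, not as W2.T

X = rng.random((128, n_in))                     # stand-in batch
T = np.eye(n_out)[rng.integers(0, n_out, 128)]  # one-hot labels
lr = 0.01

for step in range(200):
    H = np.tanh(X @ W1)
    Y = H @ W2
    e = T - Y                       # output error, local to the last layer
    d_hid = (e @ B) * (1.0 - H**2)  # error fed back through B, not through W2.T
    W2 += lr * H.T @ e / len(X)     # local: presynaptic activity x postsynaptic error
    B += lr * e.T @ H / len(X)      # same local rule updates the feedback weights
    W1 += lr * X.T @ d_hid / len(X)
```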