Intelligence Processing Units Accelerate Neuromorphic Learning
Spiking neural networks (SNNs) have achieved orders of magnitude improvement
in terms of energy consumption and latency when performing inference with deep
learning workloads. Error backpropagation is presently regarded as the most
effective method for training SNNs, but in a twist of irony, training SNNs on
modern graphics processing units (GPUs) is more expensive than training
non-spiking networks. The emergence of Graphcore's Intelligence Processing
Units (IPUs) balances the parallelized nature of deep learning workloads with
the sequential, reusable, and sparsified nature of operations prevalent when
training SNNs. IPUs adopt multi-instruction multi-data (MIMD) parallelism by
running individual processing threads on smaller data blocks, which is a
natural fit for the sequential, non-vectorized steps required to solve spiking
neuron dynamical state equations. We present an IPU-optimized release of our
custom SNN Python package, snnTorch, which exploits fine-grained parallelism by
utilizing low-level, pre-compiled custom operations to accelerate irregular and
sparse data access patterns that are characteristic of training SNN workloads.
We provide a rigorous performance assessment across a suite of commonly used
spiking neuron models, and propose methods to further reduce training run-time
via half-precision training. By amortizing the cost of sequential processing
into vectorizable population codes, we ultimately demonstrate the potential for
integrating domain-specific accelerators with the next generation of neural
networks.
Comment: 10 pages, 9 figures, journa
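The sequential state update that the abstract above describes can be illustrated with a minimal sketch of a leaky integrate-and-fire neuron. This is a generic textbook formulation, not the snnTorch implementation or the IPU custom operations; the function name `lif_simulate`, the decay factor `beta`, the `threshold` value, and the subtractive reset are all assumptions made for illustration.

```python
def lif_simulate(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron (illustrative sketch).

    inputs: list of input currents, one per time step.
    beta: assumed membrane decay factor per step (hypothetical value).
    threshold: assumed firing threshold (hypothetical value).
    Returns (spikes, membrane_trace).
    """
    mem = 0.0
    spikes = []
    trace = []
    # Each step depends on the previous membrane state, so the loop over
    # time is inherently sequential and cannot be vectorized across steps.
    for current in inputs:
        mem = beta * mem + current            # leaky integration
        spike = 1 if mem >= threshold else 0  # fire when threshold is reached
        if spike:
            mem -= threshold                  # subtractive (soft) reset
        spikes.append(spike)
        trace.append(mem)
    return spikes, trace


if __name__ == "__main__":
    spikes, trace = lif_simulate([0.5, 0.5, 0.5, 0.5])
    print(spikes, trace)
```

The loop makes the dependency between consecutive time steps explicit, which is the kind of step-by-step computation the abstract contrasts with vectorized deep learning workloads.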
Building high-level features using large scale unsupervised learning
We consider the problem of building high-level, class-specific feature
detectors from only unlabeled data. For example, is it possible to learn a face
detector using only unlabeled images? To answer this, we train a 9-layered
locally connected sparse autoencoder with pooling and local contrast
normalization on a large dataset of images (the model has 1 billion
connections, the dataset has 10 million 200x200 pixel images downloaded from
the Internet). We train this network using model parallelism and asynchronous
SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to
what appears to be a widely-held intuition, our experimental results reveal
that it is possible to train a face detector without having to label images as
containing a face or not. Control experiments show that this feature detector
is robust not only to translation but also to scaling and out-of-plane
rotation. We also find that the same network is sensitive to other high-level
concepts such as cat faces and human bodies. Starting with these learned
features, we trained our network to obtain 15.8% accuracy in recognizing 20,000
object categories from ImageNet, a leap of 70% relative improvement over the
previous state-of-the-art.
Scalable and Sustainable Deep Learning via Randomized Hashing
Current deep learning architectures are growing larger in order to learn from
complex datasets. These architectures require giant matrix multiplication
operations to train millions of parameters. Conversely, there is another
growing trend to bring deep learning to low-power, embedded devices. The matrix
operations, associated with both training and testing of deep networks, are
very expensive from a computational and energy standpoint. We present a novel
hashing-based technique to drastically reduce the amount of computation needed
to train and test deep networks. Our approach combines recent ideas from
adaptive dropouts and randomized hashing for maximum inner product search to
select the nodes with the highest activation efficiently. Our new algorithm for
deep learning reduces the overall computational cost of forward and
back-propagation by operating on significantly fewer (sparse) nodes. As a
consequence, our algorithm uses only 5% of the total multiplications, while
keeping on average within 1% of the accuracy of the original model. A unique
property of the proposed hashing-based back-propagation is that the updates are
always sparse. Due to the sparse gradient updates, our algorithm is ideally
suited for asynchronous and parallel training, leading to near-linear speedup
with increasing number of cores. We demonstrate the scalability and
sustainability (energy efficiency) of our proposed algorithm via rigorous
experimental evaluations on several real datasets.
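The idea of using hashing to pick a small set of likely high-activation nodes can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it swaps in plain signed random projections (SimHash) with a single hash table, whereas the abstract refers to hashing for maximum inner product search. All function names and parameters here are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_hyperplanes(dim, n_bits, rng):
    # Random Gaussian hyperplanes; each contributes one signature bit.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    # Signed random projection: one bit per hyperplane.
    return tuple(1 if dot(vec, p) >= 0 else 0 for p in planes)


def build_table(weights, planes):
    # Bucket each neuron index by the signature of its weight vector.
    table = {}
    for idx, w in enumerate(weights):
        table.setdefault(signature(w, planes), []).append(idx)
    return table


def sparse_forward(x, weights, table, planes):
    # Only neurons sharing the input's bucket are evaluated (ReLU applied);
    # all other neurons are skipped and treated as inactive.
    active = table.get(signature(x, planes), [])
    return {i: max(0.0, dot(weights[i], x)) for i in active}


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
    planes = make_hyperplanes(dim=2, n_bits=4, rng=rng)
    table = build_table(weights, planes)
    print(sparse_forward([2.0, 0.0], weights, table, planes))
```

Vectors pointing in similar directions tend to share a bucket, so the lookup returns candidates with large inner products while the remaining multiplications are never performed. A practical system would likely use several tables and rebuild them as the weights change; both are omitted here for brevity.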
Theano: new features and speed improvements
Theano is a linear algebra compiler that optimizes a user's
symbolically-specified mathematical computations to produce efficient low-level
implementations. In this paper, we present new features and efficiency
improvements to Theano, and benchmarks demonstrating Theano's performance
relative to Torch7, a recently introduced machine learning library, and to
RNNLM, a C++ library targeted at recurrent neural networks.
Comment: Presented at the Deep Learning Workshop, NIPS 201