Sparsely Activated Networks
Previous literature on unsupervised learning focused on designing structural
priors with the aim of learning meaningful features. However, this was done
without considering the description length of the learned representations,
which is a direct and unbiased measure of model complexity. In this paper, we
first introduce a metric that evaluates unsupervised models based on their
reconstruction accuracy and the degree of compression of their internal
representations. We then define two activation functions (Identity, ReLU) as
reference baselines and three sparse activation functions (top-k absolutes,
Extrema-Pool indices, Extrema) as candidate structures that minimize the
previously defined metric. We lastly present Sparsely Activated Networks
(SANs) that consist of kernels with shared weights that, during encoding, are
convolved with the input and then passed through a sparse activation function.
During decoding, the same weights are convolved with the sparse activation map
and subsequently the partial reconstructions from each weight are summed to
reconstruct the input. We compare SANs using the five previously defined
activation functions on a variety of datasets (Physionet, UCI-epilepsy, MNIST,
FMNIST) and show that models selected using this metric have a small
description length of their internal representations and consist of
interpretable kernels.
Comment: 10 pages, 5 figures, 4 algorithms, 4 tables; submitted to IEEE
Transactions on Neural Networks and Learning Systems
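The encode/sparsify/decode pipeline described in the abstract is simple enough to sketch. Below is a minimal, untrained NumPy illustration of the data flow using the top-k absolutes activation; the kernel sizes, k, and synthetic signal are illustrative choices, not values from the paper.

```python
import numpy as np

def top_k_absolutes(x, k):
    """Keep the k entries of x with the largest absolute value, zero the rest
    (one of the sparse activation functions named in the abstract)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def san_autoencode(signal, kernels, k=4):
    """Encoding: convolve each shared-weight kernel with the input and apply
    the sparse activation. Decoding: convolve the same kernel with its sparse
    activation map and sum the partial reconstructions."""
    reconstruction = np.zeros_like(signal)
    for w in kernels:
        activation = np.convolve(signal, w, mode="same")           # encode
        sparse_map = top_k_absolutes(activation, k)                # sparsify
        reconstruction += np.convolve(sparse_map, w, mode="same")  # decode
    return reconstruction

# Toy usage with random (untrained) kernels on a synthetic 1D signal.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 8 * np.pi, 256))
kernels = [rng.normal(size=9) for _ in range(2)]
recon = san_autoencode(signal, kernels)
print("reconstruction MSE:", np.mean((signal - recon) ** 2))
```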
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Training large, deep neural networks to convergence can be prohibitively
expensive. As a result, often only a small selection of popular, dense models
are reused across different contexts and tasks. Increasingly, sparsely
activated models, which seek to decouple model size from computation costs, are
becoming an attractive alternative to dense models. Although more efficient in
terms of quality and computation cost, sparse models remain data-hungry and
costly to train from scratch in the large scale regime. In this work, we
propose sparse upcycling -- a simple way to reuse sunk training costs by
initializing a sparsely activated Mixture-of-Experts model from a dense
checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language
models and Vision Transformer Base and Large models, respectively,
significantly outperform their dense counterparts on SuperGLUE and ImageNet,
using only ~50% of the initial dense pretraining sunk cost. The upcycled models
also outperform sparse models trained from scratch on 100% of the initial dense
pretraining computation budget.
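The core mechanism, copying a dense feed-forward block's weights into every expert of a new Mixture-of-Experts layer and attaching a freshly initialized router, can be sketched in a few lines. This is a schematic of the general idea under an assumed dictionary-of-arrays parameter format; it is not the authors' released recipe, and details such as which layers are converted or how the router is initialized are omitted.

```python
import copy
import numpy as np

def upcycle_ffn(dense_ffn, num_experts, d_model, rng):
    """Build an MoE block from one dense feed-forward block by copying the
    dense weights into every expert and adding a new router (a sketch of the
    core idea, not the paper's full recipe)."""
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    router = rng.normal(scale=0.02, size=(d_model, num_experts))  # assumed init
    return {"experts": experts, "router": router}

# Toy usage: an 8-expert block initialized from a dense FFN with W_in / W_out.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
dense_ffn = {"W_in": rng.normal(size=(d_model, d_ff)),
             "W_out": rng.normal(size=(d_ff, d_model))}
moe_block = upcycle_ffn(dense_ffn, num_experts=8, d_model=d_model, rng=rng)
print(len(moe_block["experts"]), "experts; router shape", moe_block["router"].shape)
```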
Soft Merging of Experts with Adaptive Routing
Sparsely activated neural networks with conditional computation learn to
route their inputs through different "expert" subnetworks, providing a form of
modularity that densely activated models lack. Despite their possible benefits,
models with learned routing often underperform their parameter-matched densely
activated counterparts as well as models that use non-learned heuristic routing
strategies. In this paper, we hypothesize that these shortcomings stem from the
gradient estimation techniques used to train sparsely activated models that
make discrete, non-differentiable routing decisions. To address this issue, we
introduce Soft Merging of Experts with Adaptive Routing (SMEAR), which avoids
discrete routing by using a single "merged" expert constructed via a weighted
average of all of the experts' parameters. By routing activations through a
single merged expert, SMEAR does not incur a significant increase in
computational costs and enables standard gradient-based training. We
empirically validate that models using SMEAR outperform models that route based
on metadata or learn sparse routing through gradient estimation. Furthermore,
we provide qualitative analysis demonstrating that the experts learned via
SMEAR exhibit a significant amount of specialization. All of the code used in
our experiments is publicly available.
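The merging step is easy to illustrate: instead of dispatching each input to one discrete expert, SMEAR-style routing averages the experts' parameters using the routing probabilities and applies the single merged expert. The sketch below uses toy linear experts and per-example routing purely for illustration; the experts and routing features in the paper differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def smear_layer(x, expert_weights, router_w):
    """SMEAR-style soft merging (sketch): compute routing probabilities, form a
    single merged expert as the probability-weighted average of all expert
    parameters, and pass the activation through that one expert. The merge is a
    differentiable average, so no discrete-routing gradient estimator is needed."""
    probs = softmax(x @ router_w)                           # (num_experts,)
    merged = sum(p * w for p, w in zip(probs, expert_weights))
    return x @ merged

# Toy usage with four linear "experts"; real experts are full FFN blocks.
rng = np.random.default_rng(0)
d, num_experts = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]
router_w = rng.normal(size=(d, num_experts))
x = rng.normal(size=d)
print(smear_layer(x, experts, router_w).shape)
```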
Density-dependence of functional development in spiking cortical networks grown in vitro
During development, the mammalian brain differentiates into specialized
regions with distinct functional abilities. While many factors contribute to
functional specialization, we explore the effect of neuronal density on the
development of neuronal interactions in vitro. Two types of cortical networks,
dense and sparse, with 50,000 and 12,000 total cells respectively, are studied.
Activation graphs that represent pairwise neuronal interactions are constructed
using a competitive first response model. These graphs reveal that, during
development in vitro, dense networks form activation connections earlier than
sparse networks. Link entropy analysis of dense network activation graphs
suggests that the majority of connections between electrodes are reciprocal in
nature. Information theoretic measures reveal that early functional information
interactions (among 3 cells) are synergetic in both dense and sparse networks.
However, during later stages of development, previously synergetic
relationships become primarily redundant in dense, but not in sparse networks.
Large link entropy values in the activation graph are related to the domination
of redundant ensembles in late stages of development in dense networks. Results
demonstrate differences between dense and sparse networks in terms of
informational groups, pairwise relationships, and activation graphs. These
differences suggest that variations in cell density may result in different
functional specialization of nervous system tissue in vivo.
Comment: 10 pages, 7 figures
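The abstract does not spell out which information-theoretic measure is used, but the synergy-versus-redundancy distinction among three cells can be illustrated with the three-variable interaction information, a common choice. The sketch below is an assumption made for illustration (sign convention: positive means synergy), not the paper's exact analysis.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def interaction_information(joint):
    """Interaction information II = I(X;Y|Z) - I(X;Y) from a joint p(x,y,z)
    given as a 2x2x2 array. With this sign convention II > 0 indicates synergy
    and II < 0 redundancy; conventions vary, so treat the sign as illustrative."""
    hx = entropy(joint.sum(axis=(1, 2)))
    hy = entropy(joint.sum(axis=(0, 2)))
    hz = entropy(joint.sum(axis=(0, 1)))
    hxy = entropy(joint.sum(axis=2))
    hxz = entropy(joint.sum(axis=1))
    hyz = entropy(joint.sum(axis=0))
    hxyz = entropy(joint)
    return hxy + hxz + hyz - hx - hy - hz - hxyz

# Synergy: Z = X XOR Y with independent fair bits X, Y  ->  +1 bit.
xor = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        xor[x, y, x ^ y] = 0.25
print("XOR triplet:", interaction_information(xor))

# Redundancy: X = Y = Z, one shared fair bit  ->  -1 bit.
cpy = np.zeros((2, 2, 2))
cpy[0, 0, 0] = cpy[1, 1, 1] = 0.5
print("copy triplet:", interaction_information(cpy))
```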
Neural Distributed Autoassociative Memories: A Survey
Introduction. Neural network models of autoassociative, distributed memory
allow storage and retrieval of many items (vectors) where the number of stored
items can exceed the vector dimension (the number of neurons in the network).
This opens the possibility of a sublinear time search (in the number of stored
items) for approximate nearest neighbors among vectors of high dimension. The
purpose of this paper is to review models of autoassociative, distributed
memory that can be naturally implemented by neural networks (mainly with local
learning rules and iterative dynamics based on information locally available to
neurons). Scope. The survey is focused mainly on the networks of Hopfield,
Willshaw and Potts, that have connections between pairs of neurons and operate
on sparse binary vectors. We discuss not only autoassociative memory, but also
the generalization properties of these networks. We also consider neural
networks with higher-order connections and networks with a bipartite graph
structure for non-binary data with linear constraints. Conclusions. In
conclusion we discuss the relations to similarity search, advantages and
drawbacks of these techniques, and topics for further research. An interesting
and still not completely resolved question is whether neural autoassociative
memories can search for approximate nearest neighbors faster than other index
structures for similarity search, in particular for the case of very high
dimensional vectors.
Comment: 31 pages
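As a concrete reference point for the class of models surveyed, the classic Hopfield case with a local Hebbian (outer-product) rule and iterative retrieval dynamics fits in a few lines of NumPy; the network size and corruption level below are arbitrary toy choices.

```python
import numpy as np

def hopfield_train(patterns):
    """Hebbian (outer-product) learning for a Hopfield network storing bipolar
    (+1/-1) patterns; a local learning rule of the kind the survey discusses."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)  # no self-connections
    return W / n

def hopfield_recall(W, probe, steps=20):
    """Iterative (synchronous) retrieval: repeatedly threshold the local fields
    until the state stops changing."""
    s = probe.copy()
    for _ in range(steps):
        nxt = np.where(W @ s >= 0, 1, -1)
        if np.array_equal(nxt, s):
            break
        s = nxt
    return s

# Toy usage: store two random patterns, retrieve from a corrupted cue.
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(2, 64))
W = hopfield_train(patterns)
cue = patterns[0].copy()
cue[:10] *= -1  # flip 10 of 64 bits
recalled = hopfield_recall(W, cue)
print("bits matching stored pattern:", int(np.sum(recalled == patterns[0])))
```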