Sparsely Activated Networks
Previous literature on unsupervised learning focused on designing structural
priors with the aim of learning meaningful features. However, this was done
without considering the description length of the learned representations,
which is a direct and unbiased measure of model complexity. In this paper, we
first introduce a metric that evaluates unsupervised models based on their
reconstruction accuracy and the degree of compression of their internal
representations. We then define two activation functions (Identity, ReLU) as
reference baselines and three sparse activation functions (top-k absolutes,
Extrema-Pool indices, Extrema) as candidate structures that minimize the
previously defined metric. We lastly present Sparsely Activated Networks
(SANs) that consist of kernels with shared weights that, during encoding, are
convolved with the input and then passed through a sparse activation function.
During decoding, the same weights are convolved with the sparse activation map
and subsequently the partial reconstructions from each weight are summed to
reconstruct the input. We compare SANs using the five previously defined
activation functions on a variety of datasets (Physionet, UCI-epilepsy, MNIST,
FMNIST) and show that models selected using this metric have a small
description length of their internal representations and consist of
interpretable kernels.
Comment: 10 pages, 5 figures, 4 algorithms, 4 tables; submitted to IEEE
Transactions on Neural Networks and Learning Systems
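The encode/sparsify/decode pipeline described in the abstract is simple enough to sketch. Below is a minimal, untrained NumPy illustration of the data flow using the top-k absolutes activation; the kernel sizes, k, and synthetic signal are illustrative choices, not values from the paper.

```python
import numpy as np

def top_k_absolutes(x, k):
    """Keep the k entries of x with the largest absolute value, zero the rest
    (one of the sparse activation functions named in the abstract)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def san_autoencode(signal, kernels, k=4):
    """Encoding: convolve each shared-weight kernel with the input and apply
    the sparse activation. Decoding: convolve the same kernel with its sparse
    activation map and sum the partial reconstructions."""
    reconstruction = np.zeros_like(signal)
    for w in kernels:
        activation = np.convolve(signal, w, mode="same")           # encode
        sparse_map = top_k_absolutes(activation, k)                # sparsify
        reconstruction += np.convolve(sparse_map, w, mode="same")  # decode
    return reconstruction

# Toy usage with random (untrained) kernels on a synthetic 1D signal.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 8 * np.pi, 256))
kernels = [rng.normal(size=9) for _ in range(2)]
recon = san_autoencode(signal, kernels)
print("reconstruction MSE:", np.mean((signal - recon) ** 2))
```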
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Training large, deep neural networks to convergence can be prohibitively
expensive. As a result, often only a small selection of popular, dense models
are reused across different contexts and tasks. Increasingly, sparsely
activated models, which seek to decouple model size from computation costs, are
becoming an attractive alternative to dense models. Although more efficient in
terms of quality and computation cost, sparse models remain data-hungry and
costly to train from scratch in the large scale regime. In this work, we
propose sparse upcycling -- a simple way to reuse sunk training costs by
initializing a sparsely activated Mixture-of-Experts model from a dense
checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language
models and Vision Transformer Base and Large models, respectively,
significantly outperform their dense counterparts on SuperGLUE and ImageNet,
using only ~50% of the initial dense pretraining sunk cost. The upcycled models
also outperform sparse models trained from scratch on 100% of the initial dense
pretraining computation budget.
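The core mechanism, copying a dense feed-forward block's weights into every expert of a new Mixture-of-Experts layer and attaching a freshly initialized router, can be sketched in a few lines. This is a schematic of the general idea under an assumed dictionary-of-arrays parameter format; it is not the authors' released recipe, and details such as which layers are converted or how the router is initialized are omitted.

```python
import copy
import numpy as np

def upcycle_ffn(dense_ffn, num_experts, d_model, rng):
    """Build an MoE block from one dense feed-forward block by copying the
    dense weights into every expert and adding a new router (a sketch of the
    core idea, not the paper's full recipe)."""
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    router = rng.normal(scale=0.02, size=(d_model, num_experts))  # assumed init
    return {"experts": experts, "router": router}

# Toy usage: an 8-expert block initialized from a dense FFN with W_in / W_out.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
dense_ffn = {"W_in": rng.normal(size=(d_model, d_ff)),
             "W_out": rng.normal(size=(d_ff, d_model))}
moe_block = upcycle_ffn(dense_ffn, num_experts=8, d_model=d_model, rng=rng)
print(len(moe_block["experts"]), "experts; router shape", moe_block["router"].shape)
```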
Soft Merging of Experts with Adaptive Routing
Sparsely activated neural networks with conditional computation learn to
route their inputs through different "expert" subnetworks, providing a form of
modularity that densely activated models lack. Despite their possible benefits,
models with learned routing often underperform their parameter-matched densely
activated counterparts as well as models that use non-learned heuristic routing
strategies. In this paper, we hypothesize that these shortcomings stem from the
gradient estimation techniques used to train sparsely activated models that
make discrete, non-differentiable routing decisions. To address this issue, we
introduce Soft Merging of Experts with Adaptive Routing (SMEAR), which avoids
discrete routing by using a single "merged" expert constructed via a weighted
average of all of the experts' parameters. By routing activations through a
single merged expert, SMEAR does not incur a significant increase in
computational costs and enables standard gradient-based training. We
empirically validate that models using SMEAR outperform models that route based
on metadata or learn sparse routing through gradient estimation. Furthermore,
we provide qualitative analysis demonstrating that the experts learned via
SMEAR exhibit a significant amount of specialization. All of the code used in
our experiments is publicly available.
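The merging step is easy to illustrate: instead of dispatching each input to one discrete expert, SMEAR-style routing averages the experts' parameters using the routing probabilities and applies the single merged expert. The sketch below uses toy linear experts and per-example routing purely for illustration; the experts and routing features in the paper differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def smear_layer(x, expert_weights, router_w):
    """SMEAR-style soft merging (sketch): compute routing probabilities, form a
    single merged expert as the probability-weighted average of all expert
    parameters, and pass the activation through that one expert. The merge is a
    differentiable average, so no discrete-routing gradient estimator is needed."""
    probs = softmax(x @ router_w)                           # (num_experts,)
    merged = sum(p * w for p, w in zip(probs, expert_weights))
    return x @ merged

# Toy usage with four linear "experts"; real experts are full FFN blocks.
rng = np.random.default_rng(0)
d, num_experts = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]
router_w = rng.normal(size=(d, num_experts))
x = rng.normal(size=d)
print(smear_layer(x, experts, router_w).shape)
```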
Density-dependence of functional development in spiking cortical networks grown in vitro
During development, the mammalian brain differentiates into specialized
regions with distinct functional abilities. While many factors contribute to
functional specialization, we explore the effect of neuronal density on the
development of neuronal interactions in vitro. Two types of cortical networks,
dense and sparse, with 50,000 and 12,000 total cells respectively, are studied.
Activation graphs that represent pairwise neuronal interactions are constructed
using a competitive first response model. These graphs reveal that, during
development in vitro, dense networks form activation connections earlier than
sparse networks. Link entropy analysis of dense network activation graphs
suggests that the majority of connections between electrodes are reciprocal in
nature. Information theoretic measures reveal that early functional information
interactions (among 3 cells) are synergetic in both dense and sparse networks.
However, during later stages of development, previously synergetic
relationships become primarily redundant in dense, but not in sparse networks.
Large link entropy values in the activation graph are related to the domination
of redundant ensembles in late stages of development in dense networks. Results
demonstrate differences between dense and sparse networks in terms of
informational groups, pairwise relationships, and activation graphs. These
differences suggest that variations in cell density may result in different
functional specialization of nervous system tissue in vivo.
Comment: 10 pages, 7 figures
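The abstract does not spell out which information-theoretic measure is used, but the synergy-versus-redundancy distinction among three cells can be illustrated with the three-variable interaction information, a common choice. The sketch below is an assumption made for illustration (sign convention: positive means synergy), not the paper's exact analysis.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def interaction_information(joint):
    """Interaction information II = I(X;Y|Z) - I(X;Y) from a joint p(x,y,z)
    given as a 2x2x2 array. With this sign convention II > 0 indicates synergy
    and II < 0 redundancy; conventions vary, so treat the sign as illustrative."""
    hx = entropy(joint.sum(axis=(1, 2)))
    hy = entropy(joint.sum(axis=(0, 2)))
    hz = entropy(joint.sum(axis=(0, 1)))
    hxy = entropy(joint.sum(axis=2))
    hxz = entropy(joint.sum(axis=1))
    hyz = entropy(joint.sum(axis=0))
    hxyz = entropy(joint)
    return hxy + hxz + hyz - hx - hy - hz - hxyz

# Synergy: Z = X XOR Y with independent fair bits X, Y  ->  +1 bit.
xor = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        xor[x, y, x ^ y] = 0.25
print("XOR triplet:", interaction_information(xor))

# Redundancy: X = Y = Z, one shared fair bit  ->  -1 bit.
cpy = np.zeros((2, 2, 2))
cpy[0, 0, 0] = cpy[1, 1, 1] = 0.5
print("copy triplet:", interaction_information(cpy))
```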
Neural Distributed Autoassociative Memories: A Survey
Introduction. Neural network models of autoassociative, distributed memory
allow storage and retrieval of many items (vectors) where the number of stored
items can exceed the vector dimension (the number of neurons in the network).
This opens the possibility of a sublinear time search (in the number of stored
items) for approximate nearest neighbors among vectors of high dimension. The
purpose of this paper is to review models of autoassociative, distributed
memory that can be naturally implemented by neural networks (mainly with local
learning rules and iterative dynamics based on information locally available to
neurons). Scope. The survey is focused mainly on the networks of Hopfield,
Willshaw and Potts, that have connections between pairs of neurons and operate
on sparse binary vectors. We discuss not only autoassociative memory, but also
the generalization properties of these networks. We also consider neural
networks with higher-order connections and networks with a bipartite graph
structure for non-binary data with linear constraints. Conclusions. In
conclusion we discuss the relations to similarity search, advantages and
drawbacks of these techniques, and topics for further research. An interesting
and still not completely resolved question is whether neural autoassociative
memories can search for approximate nearest neighbors faster than other index
structures for similarity search, in particular for the case of very high
dimensional vectors.
Comment: 31 pages
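As a concrete reference point for the class of models surveyed, the classic Hopfield case with a local Hebbian (outer-product) rule and iterative retrieval dynamics fits in a few lines of NumPy; the network size and corruption level below are arbitrary toy choices.

```python
import numpy as np

def hopfield_train(patterns):
    """Hebbian (outer-product) learning for a Hopfield network storing bipolar
    (+1/-1) patterns; a local learning rule of the kind the survey discusses."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)  # no self-connections
    return W / n

def hopfield_recall(W, probe, steps=20):
    """Iterative (synchronous) retrieval: repeatedly threshold the local fields
    until the state stops changing."""
    s = probe.copy()
    for _ in range(steps):
        nxt = np.where(W @ s >= 0, 1, -1)
        if np.array_equal(nxt, s):
            break
        s = nxt
    return s

# Toy usage: store two random patterns, retrieve from a corrupted cue.
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(2, 64))
W = hopfield_train(patterns)
cue = patterns[0].copy()
cue[:10] *= -1  # flip 10 of 64 bits
recalled = hopfield_recall(W, cue)
print("bits matching stored pattern:", int(np.sum(recalled == patterns[0])))
```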