Gated Linear Networks
This paper presents a new family of backpropagation-free neural
architectures, Gated Linear Networks (GLNs). What distinguishes GLNs from
contemporary neural networks is the distributed and local nature of their
credit assignment mechanism; each neuron directly predicts the target, forgoing
the ability to learn feature representations in favor of rapid online learning.
Individual neurons can model nonlinear functions via the use of data-dependent
gating in conjunction with online convex optimization. We show that this
architecture gives rise to universal learning capabilities in the limit, with
effective model capacity increasing as a function of network size in a manner
comparable with deep ReLU networks. Furthermore, we demonstrate that the GLN
learning mechanism possesses extraordinary resilience to catastrophic
forgetting, performing comparably to an MLP with dropout and Elastic Weight
Consolidation on standard benchmarks. These desirable theoretical and empirical
properties position GLNs as a complementary technique to contemporary offline
deep learning methods.
Comment: arXiv admin note: substantial text overlap with arXiv:1712.0189
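As an illustration of the local, distributed credit assignment described above, the sketch below shows one way a single halfspace-gated GLN neuron can be implemented: fixed random hyperplanes over the side information select a context-specific weight vector, which geometrically mixes the previous layer's probabilities and is updated by online gradient descent on the log loss. The hyperparameters (number of hyperplanes, learning rate) and names are illustrative, not the paper's exact configuration.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def logit(p):
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return np.log(p / (1 - p))

    class HalfspaceGatedNeuron:
        """One GLN-style neuron: halfspace gating over the side information
        selects a weight vector that geometrically mixes input probabilities."""

        def __init__(self, n_inputs, side_dim, n_hyperplanes=4, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            # Fixed, unlearned random hyperplanes define 2**n_hyperplanes contexts.
            self.hyperplanes = rng.normal(size=(n_hyperplanes, side_dim))
            self.biases = rng.normal(size=n_hyperplanes)
            self.weights = np.zeros((2 ** n_hyperplanes, n_inputs))
            self.lr = lr

        def context(self, side_info):
            bits = (self.hyperplanes @ side_info > self.biases).astype(int)
            return int(bits @ (2 ** np.arange(len(bits))))

        def predict(self, prev_probs, side_info):
            c = self.context(side_info)
            return sigmoid(self.weights[c] @ logit(prev_probs)), c

        def update(self, prev_probs, side_info, target):
            # Local online convex update: gradient of the log loss w.r.t. the
            # active context's weights; no backpropagation across neurons.
            p, c = self.predict(prev_probs, side_info)
            self.weights[c] -= self.lr * (p - target) * logit(prev_probs)
            return p

Because every neuron predicts the target directly, the whole network can be trained one example at a time with purely local updates of this form.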
Globally Gated Deep Linear Networks
Recently proposed Gated Linear Networks present a tractable nonlinear network
architecture, and exhibit interesting capabilities such as learning with local
error signals and reduced forgetting in sequential learning. In this work, we
introduce a novel gating architecture, named Globally Gated Deep Linear
Networks (GGDLNs) where gating units are shared among all processing units in
each layer, thereby decoupling the architectures of the nonlinear but unlearned
gatings and the learned linear processing motifs. We derive exact equations for
the generalization properties in these networks in the finite-width
thermodynamic limit, defined by $P, N \rightarrow \infty$ with $P/N \sim O(1)$, where P
and N are the training sample size and the network width respectively. We find
that the statistics of the network predictor can be expressed in terms of
kernels that undergo shape renormalization through a data-dependent matrix
compared to the GP kernels. Our theory accurately captures the behavior of
finite width GGDLNs trained with gradient descent dynamics. We show that kernel
shape renormalization gives rise to rich generalization properties w.r.t.
network width, depth and L2 regularization amplitude. Interestingly, networks
with sufficient gating units behave similarly to standard ReLU networks.
Although gatings in the model do not participate in supervised learning, we
show the utility of unsupervised learning of the gating parameters.
Additionally, our theory allows the evaluation of the network's ability for
learning multiple tasks by incorporating task-relevant information into the
gating units. In summary, our work is the first exact theoretical solution of
learning in a family of nonlinear networks with finite width. The rich and
diverse behavior of the GGDLNs suggests that they are useful, analytically
tractable models of learning single and multiple tasks in finite-width
nonlinear deep networks.
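The defining difference from per-neuron gating can be sketched in a few lines. The snippet below is a minimal reading of a globally gated layer, assuming a bank of fixed, unlearned nonlinear gates computed from the input and shared by every processing unit in the layer, each gate modulating its own learned linear map; the exact parameterization in the paper may differ.

    import numpy as np

    def ggdln_layer(h_prev, x, linear_maps, gate_planes):
        """Sketch of a globally gated layer: gates g_m(x) are shared across all
        units in the layer and multiply learned linear maps W_m (assumed form)."""
        gates = (gate_planes @ x > 0).astype(float)   # fixed, unlearned gating units
        return sum(g * (W_m @ h_prev) for g, W_m in zip(gates, linear_maps))

Keeping the learned part linear while pushing all nonlinearity into shared, unlearned gates is what makes the exact generalization theory above tractable, and with enough gating units the behavior approaches that of standard ReLU networks.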
Linear Memory Networks
Recurrent neural networks can learn complex transduction problems that
require maintaining and actively exploiting a memory of their inputs. Such
models traditionally consider memory and input-output functionalities
indissolubly entangled. We introduce a novel recurrent architecture based on
the conceptual separation between the functional input-output transformation
and the memory mechanism, showing how they can be implemented through different
neural components. By building on such conceptualization, we introduce the
Linear Memory Network, a recurrent model comprising a feedforward neural
network, realizing the non-linear functional transformation, and a linear
autoencoder for sequences, implementing the memory component. The resulting
architecture can be efficiently trained by building on closed-form solutions to
linear optimization problems. Further, by exploiting equivalence results
between feedforward and recurrent neural networks we devise a pretraining
schema for the proposed architecture. Experiments on polyphonic music datasets
show competitive results against gated recurrent networks and other
state-of-the-art models.
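A single time step of the separation described above can be written compactly. The sketch below assumes tanh for the nonlinear functional component and a plain linear recurrence for the memory; the parameter names and the linear readout are illustrative.

    import numpy as np

    def lmn_step(x_t, m_prev, W_xh, W_mh, W_hm, W_mm, W_o):
        """One Linear Memory Network step (sketch): a feedforward nonlinear map
        computes the functional transformation, a linear recurrence stores it."""
        h_t = np.tanh(W_xh @ x_t + W_mh @ m_prev)   # nonlinear input-output component
        m_t = W_hm @ h_t + W_mm @ m_prev            # linear autoencoder-style memory
        y_t = W_o @ m_t                             # illustrative linear readout
        return y_t, m_t

Because the memory update is linear, its parameters can be obtained from closed-form solutions to linear problems, which is what enables the efficient training and pretraining schema mentioned above.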
Extending Gated Linear Networks for Continual Learning
Incrementally learning multiple tasks from an indefinitely long stream of data
is a real challenge for traditional machine learning models. If not carefully
controlled, the learning of new knowledge strongly impacts a model's learned
abilities, causing it to forget how to solve past tasks.
Continual learning addresses this problem, called catastrophic forgetting, by
developing models able to continually learn new tasks and adapt to changes in
the data distribution.
In this dissertation, we consider the recently proposed family of continual
learning models, called Gated Linear Networks (GLNs), and study two crucial
aspects that impact the amount of catastrophic forgetting in gated linear
networks, namely data standardization and the gating mechanism.
Data standardization is particularly challenging in the online/continual learning
setting because data from future tasks is not available beforehand. The
results obtained using an online standardization method show a considerably
higher amount of forgetting compared to an offline (static) standardization.
Interestingly, with the latter standardization, we observe that GLNs show almost
no forgetting on the considered benchmark datasets.
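The dissertation's specific online standardization method is not spelled out here; as a concrete point of reference, the sketch below shows one common choice, running (Welford-style) mean and variance estimates updated per sample, whose statistics drift as the task distribution changes.

    import numpy as np

    class OnlineStandardizer:
        """Running-statistics standardization (one common choice, shown only as
        an illustration; the method evaluated in the dissertation may differ)."""

        def __init__(self, dim, eps=1e-8):
            self.n = 0
            self.mean = np.zeros(dim)
            self.m2 = np.zeros(dim)
            self.eps = eps

        def transform(self, x):
            # Welford update of the running mean/variance, then standardize x.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            var = self.m2 / max(self.n - 1, 1)
            return (x - self.mean) / np.sqrt(var + self.eps)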
Secondly, for an effective GLN, it is essential to tailor the hyperparameters
of the gating mechanism to the data distribution. In this dissertation, we propose
a gating strategy based on a set of prototypes and the resulting Voronoi
tessellation. The experimental assessment shows that, in an ideal setting where
the data distribution is known, the proposed approach is more robust to different
data standardizations compared to the original one, based on a halfspace
gating mechanism, and shows improved predictive performance.
Finally, we propose an adaptive mechanism for the choice of prototypes,
which expands and shrinks the set of prototypes in an online fashion, making the
model suitable for practical continual learning applications. The experimental
results show that the adaptive model's performance is close to the ideal
scenario where prototypes are directly sampled from the data distribution.
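The prototype-based gating can be sketched as follows, assuming that a unit's context is simply the index of the nearest prototype (i.e. the Voronoi cell containing the input); the adaptive variant would additionally add or remove prototypes online.

    import numpy as np

    def prototype_context(x, prototypes):
        """Prototype-based gating (sketch): the active context is the Voronoi
        cell of x, i.e. the index of the closest prototype."""
        distances = np.linalg.norm(prototypes - x, axis=1)
        return int(np.argmin(distances))

The context index then selects a separate weight vector exactly as in halfspace gating, so the two mechanisms can be swapped without changing the rest of the network.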
Toward Abstraction from Multi-modal Data: Empirical Studies on Multiple Time-scale Recurrent Models
Abstraction tasks are challenging for multi-modal sequences, as they
require a deeper semantic understanding and novel text generation for the
data. Although recurrent neural networks (RNNs) can be used to model the
context of time-sequences, in most cases the long-term dependencies of
multi-modal data make the gradients in back-propagation-through-time training
of RNNs vanish in the time domain. Recently, inspired by the Multiple Time-scale
Recurrent Neural Network (MTRNN), an extension of Gated Recurrent Unit (GRU),
called Multiple Time-scale Gated Recurrent Unit (MTGRU), has been proposed to
learn the long-term dependencies in natural language processing. In particular,
it is also able to accomplish the abstraction task for paragraphs, given that
the time constants are well defined. In this paper, we compare the MTRNN and
MTGRU in terms of their learning performance as well as their abstraction
representation at a higher level (with slower neural activation). This was done
by conducting two studies based on a smaller dataset (two-dimensional time
sequences from non-linear functions) and a relatively large dataset
(43-dimensional time sequences from iCub manipulation tasks with multi-modal
data). We conclude that gated recurrent mechanisms may be necessary for
learning long-term dependencies in high-dimensional multi-modal datasets (e.g.
learning of robot manipulation), even when natural language commands were not
involved. But for smaller learning tasks with simple time-sequences, a generic
recurrent model such as the MTRNN was sufficient to accomplish the
abstraction task.
Comment: Accepted by IJCNN 201
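For reference, the sketch below shows one way an MTRNN-style time constant can be attached to a GRU step: the usual GRU update is low-pass filtered with a fixed per-layer tau so that slower layers integrate information over longer horizons. The exact MTGRU formulation in the cited work may differ in detail.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mtgru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h, tau=4.0):
        """Sketch of a multiple-time-scale GRU step: a standard GRU update
        blended with the previous state through a fixed time constant tau."""
        z = sigmoid(W_z @ x_t + U_z @ h_prev)                  # update gate
        r = sigmoid(W_r @ x_t + U_r @ h_prev)                  # reset gate
        h_cand = np.tanh(W_h @ x_t + U_h @ (r * h_prev))       # candidate state
        h_gru = (1.0 - z) * h_prev + z * h_cand                # plain GRU update
        return (1.0 - 1.0 / tau) * h_prev + (1.0 / tau) * h_gru  # slow-time-scale blend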