Advancing Data-Efficiency in Reinforcement Learning
In many real-world applications, including traffic control, robotics and web system
configurations, we are confronted with real-time decision-making problems where
data is limited. Reinforcement Learning (RL) allows us to construct a mathematical
framework to solve sequential decision-making problems under uncertainty. Under
low-data constraints, RL agents must be able to quickly identify relevant information in the observations, and to quickly learn how to act in order to attain their long-term objective. While recent advancements in RL have demonstrated impressive
achievements, the end-to-end approach they take favours autonomy and flexibility
at the expense of fast learning. To be of practical use, there is an undeniable need
to improve the data-efficiency of existing systems.
Ideal RL agents would possess an optimal way of representing their environment, combined with an efficient mechanism for propagating reward signals across
the state space. This thesis investigates the problem of data-efficiency in RL from
these two perspectives. An in-depth overview of the representation learning methods used in RL is provided. The aim of this overview is to
categorise the different representation learning approaches and highlight the impact
of the representation on data-efficiency. Then, this framing is used to develop two
main research directions. The first problem focuses on learning a representation that
captures the geometry of the problem. An RL mechanism that uses a scalable graph-based feature-learning method to learn such rich representations is introduced, ultimately leading to more efficient value function approximation. Secondly, ET(λ),
an algorithm that improves credit assignment in stochastic environments by propagating reward information counterfactually, is presented. ET(λ) results in faster learning compared to traditional methods that rely solely on temporal credit assignment. Overall, this thesis shows that a structural representation encoding the geometry of the state space and counterfactual credit assignment are key characteristics
of data-efficient RL.
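The baseline that ET(λ) improves upon is temporal credit assignment via eligibility traces. As a point of reference, here is a minimal sketch of tabular TD(λ) with accumulating traces — the classical mechanism the abstract contrasts with, not the ET(λ) algorithm itself; all parameter values are illustrative.

```python
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.

    `episodes` is a list of trajectories, each a list of
    (state, reward, next_state) transitions. Reward information is
    propagated backwards purely along the temporal trace -- the
    limitation that counterfactual credit assignment addresses.
    """
    v = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                      # eligibility trace per state
        for s, r, s_next in episode:
            delta = r + gamma * v[s_next] - v[s]    # TD error
            e[s] += 1.0                             # accumulate trace for current state
            v += alpha * delta * e                  # credit every recently visited state
            e *= gamma * lam                        # decay all traces
    return v

# A two-step chain: state 0 -> state 1 -> terminal state 2 with reward 1.
values = td_lambda([[(0, 0.0, 1), (1, 1.0, 2)]] * 100, n_states=3)
```

With λ > 0, state 0 receives credit for the reward observed two steps later within a single episode, rather than waiting for the value of state 1 to converge first.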
SuperSpike: Supervised learning in multi-layer spiking neural networks
A vast majority of computation in the brain is performed by spiking neural
networks. Despite the ubiquity of such spiking, we currently lack an
understanding of how biological spiking neural circuits learn and compute
in-vivo, as well as how we can instantiate such capabilities in artificial
spiking circuits in-silico. Here we revisit the problem of supervised learning
in temporally coding multi-layer spiking neural networks. First, by using a
surrogate gradient approach, we derive SuperSpike, a nonlinear voltage-based
three factor learning rule capable of training multi-layer networks of
deterministic integrate-and-fire neurons to perform nonlinear computations on
spatiotemporal spike patterns. Second, inspired by recent results on feedback
alignment, we compare the performance of our learning rule under different
credit assignment strategies for propagating output errors to hidden units.
Specifically, we test uniform, symmetric and random feedback, finding that
simpler tasks can be solved with any type of feedback, while more complex tasks
require symmetric feedback. In summary, our results open the door to obtaining
a better scientific understanding of learning and computation in spiking neural
networks by advancing our ability to train them to solve nonlinear problems
involving transformations between different spatiotemporal spike-time patterns.
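The core trick the abstract describes is replacing the non-differentiable spike nonlinearity with a smooth surrogate during the backward pass. A minimal sketch of a SuperSpike-style surrogate derivative follows; the threshold and the sharpness parameter `beta` are illustrative choices, not values from the paper.

```python
import numpy as np

def surrogate_spike_grad(v_mem, theta=1.0, beta=10.0):
    """SuperSpike-style surrogate gradient of the spike nonlinearity.

    The true derivative of the step function S(v) = (v >= theta) is zero
    almost everywhere, so no gradient flows through spikes. The surrogate
    approach substitutes the derivative of a fast sigmoid,
        1 / (1 + beta * |v - theta|)**2,
    which peaks at the firing threshold and decays away from it, letting
    errors propagate to hidden units in multi-layer spiking networks.
    """
    return 1.0 / (1.0 + beta * np.abs(v_mem - theta)) ** 2
```

Combined with an error signal delivered by feedback weights (uniform, symmetric, or random, as compared in the abstract), this yields the voltage-based three-factor update.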
PCM-trace: Scalable Synaptic Eligibility Traces with Resistivity Drift of Phase-Change Materials
Dedicated hardware implementations of spiking neural networks that combine
the advantages of mixed-signal neuromorphic circuits with those of emerging
memory technologies have the potential of enabling ultra-low power pervasive
sensory processing. To endow these systems with additional flexibility and the
ability to learn to solve specific tasks, it is important to develop
appropriate on-chip learning mechanisms. Recently, a new class of three-factor
spike-based learning rules has been proposed that can solve the temporal
credit assignment problem and approximate the error back-propagation algorithm
on complex tasks. However, the efficient implementation of these rules on
hybrid CMOS/memristive architectures is still an open challenge. Here we
present a new neuromorphic building block, called PCM-trace, which exploits the
drift behavior of phase-change materials to implement long-lasting eligibility
traces, a critical ingredient of three-factor learning rules. We demonstrate
how the proposed approach improves the area efficiency by >10X compared to
existing solutions and demonstrates a technologically plausible learning
algorithm supported by experimental data from device measurements.
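The physical effect being exploited is resistivity drift: after programming, a phase-change device's resistance grows as a power law in time, so its conductance decays slowly and passively — exactly the shape of a long eligibility trace. A sketch of that read-out model, with illustrative (not measured) parameter values:

```python
def pcm_conductance(t, g0=1.0, t0=1e-6, nu=0.05):
    """Conductance of a phase-change device, time t seconds after programming.

    PCM resistance drifts upward as a power law, R(t) = R0 * (t/t0)**nu,
    so the conductance decays as G(t) = G0 * (t/t0)**(-nu). The PCM-trace
    idea is to use this slow, passive decay as a synaptic eligibility
    trace, instead of spending circuit area on an active decay mechanism.
    All parameter values here are illustrative, not device measurements.
    """
    return g0 * (t / t0) ** (-nu)
```

Because the decay costs no active circuitry, one device per synapse suffices, which is where the claimed area advantage over capacitor- or counter-based trace implementations comes from.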
Adaptive Value Decomposition with Greedy Marginal Contribution Computation for Cooperative Multi-Agent Reinforcement Learning
Real-world cooperation often requires intensive coordination among agents
simultaneously. This task has been extensively studied within the framework of
cooperative multi-agent reinforcement learning (MARL), and value decomposition
methods are among those cutting-edge solutions. However, traditional methods
that learn the value function as a monotonic mixing of per-agent utilities
cannot solve the tasks with non-monotonic returns. This hinders their
application in generic scenarios. Recent methods tackle this problem from the
perspective of implicit credit assignment by learning value functions with
complete expressiveness or using additional structures to improve cooperation.
However, they are either difficult to learn due to large joint action spaces or
insufficient to capture the complicated interactions among agents which are
essential to solving tasks with non-monotonic returns. To address these
problems, we propose a novel explicit credit assignment method to address the
non-monotonic problem. Our method, Adaptive Value decomposition with Greedy
Marginal contribution (AVGM), is based on an adaptive value decomposition that
learns the cooperative value of a group of dynamically changing agents. We
first illustrate that the proposed value decomposition can consider the
complicated interactions among agents and is feasible to learn in large-scale
scenarios. Then, our method uses a greedy marginal contribution computed from
the value decomposition as an individual credit to incentivize agents to learn
the optimal cooperative policy. We further extend the module with an action
encoder to guarantee the linear time complexity for computing the greedy
marginal contribution. Experimental results demonstrate that our method
achieves significant performance improvements in several non-monotonic domains.
Comment: This paper is accepted by AAMAS 202
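The greedy marginal contribution can be sketched independently of the learned networks: given some coalition value function, agents join the group one at a time in order of largest marginal gain, and each agent's credit is the gain it contributed. In the sketch below, `coalition_value` is a plain Python function standing in for AVGM's learned adaptive value decomposition; the example is a simplification, not the paper's implementation (which adds an action encoder to reach linear time complexity).

```python
def greedy_marginal_contributions(agents, coalition_value):
    """Greedy marginal-contribution credits, in the spirit of AVGM.

    At each step the agent whose addition raises the coalition value the
    most joins the group, and its individual credit is that marginal gain.
    `coalition_value` maps a list of agents to the value of that group.
    """
    group, credits = [], {}
    remaining = list(agents)
    while remaining:
        base = coalition_value(group)
        best = max(remaining, key=lambda a: coalition_value(group + [a]) - base)
        credits[best] = coalition_value(group + [best]) - base
        group.append(best)
        remaining.remove(best)
    return credits

# Toy value function: the group saturates once two agents cooperate,
# so a third agent contributes nothing at the margin.
credits = greedy_marginal_contributions([0, 1, 2], lambda g: min(len(g), 2))
```

Using the marginal gain as each agent's individual credit gives agents a direct incentive to fill roles that actually raise the group value, which is how the method handles non-monotonic returns that defeat monotonic mixing.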
A Learning Theory for Reward-Modulated Spike-Timing-Dependent Plasticity with Application to Biofeedback
Reward-modulated spike-timing-dependent plasticity (STDP) has recently emerged as
a candidate for a learning rule that could explain how behaviorally relevant
adaptive changes in complex networks of spiking neurons could be achieved in a
self-organizing manner through local synaptic plasticity. However, the
capabilities and limitations of this learning rule could so far only be tested
through computer simulations. This article provides tools for an analytic
treatment of reward-modulated STDP, which allows us to predict under which
conditions reward-modulated STDP will achieve a desired learning effect. These
analytical results imply that neurons can learn through reward-modulated STDP to
classify not only spatial but also temporal firing patterns of presynaptic
neurons. They also can learn to respond to specific presynaptic firing patterns
with particular spike patterns. Finally, the resulting learning theory predicts
that even difficult credit-assignment problems, where it is very hard to tell
which synaptic weights should be modified in order to increase the global reward
for the system, can be solved in a self-organizing manner through
reward-modulated STDP. This yields an explanation for a fundamental experimental
result on biofeedback in monkeys by Fetz and Baker. In this experiment monkeys
were rewarded for increasing the firing rate of a particular neuron in the
cortex and were able to solve this extremely difficult credit assignment
problem. Our model for this experiment relies on a combination of
reward-modulated STDP with variable spontaneous firing activity. Hence it also
provides a possible functional explanation for trial-to-trial variability, which
is characteristic for cortical networks of neurons but has no analogue in
currently existing artificial computing systems. In addition our model
demonstrates that reward-modulated STDP can be applied to all synapses in a
large recurrent neural network without endangering the stability of the network
dynamics.
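The structure of such a rule can be sketched compactly: a local STDP increment is written into a decaying eligibility trace at each synapse, and the weight only changes when a global reward signal arrives and gates that trace. The following is an illustrative scalar sketch of this three-factor scheme, not the article's analytical model; all constants are placeholders.

```python
def rstdp_update(w, pre_trace, post_trace, pre_spike, post_spike,
                 reward, elig, eta=0.01, tau_e=0.9, a_plus=1.0, a_minus=1.0):
    """One discrete time-step of a reward-modulated STDP rule (sketch).

    pre_trace/post_trace are low-pass filtered spike trains;
    pre_spike/post_spike are 0/1 spike indicators at this step.
    Pre-before-post pairings potentiate, post-before-pre depress, but
    the change is only committed when the global reward is nonzero.
    """
    stdp = a_plus * pre_trace * post_spike - a_minus * post_trace * pre_spike
    elig = tau_e * elig + stdp       # synapse-local, slowly decaying eligibility
    w = w + eta * reward * elig      # third factor: reward gates the actual change
    return w, elig

# With zero reward the weight is untouched, however much the synapse spikes.
w0, e0 = rstdp_update(0.5, pre_trace=1.0, post_trace=0.0,
                      pre_spike=0, post_spike=1, reward=0.0, elig=0.0)
```

Because the eligibility trace outlives the spike pairing, a delayed reward can still credit the synapses that caused it — the property that lets the rule tackle the biofeedback-style credit-assignment problem described above.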
Deep Innovation Protection: Confronting the Credit Assignment Problem in Training Heterogeneous Neural Architectures
Deep reinforcement learning approaches have shown impressive results in a
variety of different domains, however, more complex heterogeneous architectures
such as world models require the different neural components to be trained
separately instead of end-to-end. While a simple genetic algorithm recently
showed end-to-end training is possible, it failed to solve a more complex 3D
task. This paper presents a method called Deep Innovation Protection (DIP) that
addresses the credit assignment problem in training complex heterogenous neural
network models end-to-end for such environments. The main idea behind the
approach is to employ multiobjective optimization to temporally reduce the
selection pressure on specific components in a multi-component network, allowing
other components to adapt. We investigate the emergent representations of these
evolved networks, which learn to predict properties important for the survival
of the agent, without the need for a specific forward-prediction loss.
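One way to reduce selection pressure on a freshly mutated component is to make it a second objective: each individual carries its task fitness plus an "age" since its sensitive component last changed, and selection uses Pareto dominance over the pair. The dominance check below is an illustrative sketch of that idea, not the paper's full evolutionary loop.

```python
def dominates(ind_a, ind_b):
    """Pareto dominance over (task fitness, protection age).

    Each individual is a pair (fitness, age): fitness is maximised, age is
    minimised. Resetting age to zero whenever a component is mutated
    temporarily shields the individual from pure fitness-based selection,
    giving the other components of the network time to adapt to the change.
    """
    fit_a, age_a = ind_a
    fit_b, age_b = ind_b
    return (fit_a >= fit_b and age_a <= age_b) and (fit_a > fit_b or age_a < age_b)
```

A freshly mutated individual with mediocre fitness but age 0 is non-dominated against older, fitter rivals, so it survives long enough for its downstream components to catch up — the innovation-protection effect the abstract describes.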