Hashing over Predicted Future Frames for Informed Exploration of Deep Reinforcement Learning
In deep reinforcement learning (RL) tasks, an efficient exploration mechanism
should encourage an agent to take actions that lead to less frequently visited
states, which may yield a higher cumulative future return. However, both
anticipating the future and evaluating the visitation frequency of states are
non-trivial tasks, especially in deep RL domains where a state is represented
by high-dimensional image frames. In this paper, we propose a novel informed
exploration framework for deep RL, in which an RL agent can predict future
transitions and evaluate the visitation frequency of the predicted future
frames in a meaningful manner. To this end, we train a deep prediction model
to predict future frames given a state-action pair, and a convolutional
autoencoder model to hash the frames seen so far. In addition, to use the
counts derived from the seen frames to evaluate the frequency of the predicted
frames, we tackle the challenge of matching the predicted future frames with
their corresponding seen frames at the latent feature level. In this way, we
derive a reliable metric for evaluating the novelty of the future direction
pointed to by each action, and hence inform the agent to explore the least
frequented one.
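The count-and-bonus idea in this abstract can be sketched with a SimHash-style random projection standing in for the paper's convolutional-autoencoder hashing (a deliberate simplification; the class and parameter names below are illustrative, not from the paper):

```python
import numpy as np

class SimHashCounter:
    """Counts visits to hashed observations and turns rarity into an
    exploration bonus (random projection stands in for a learned encoder)."""
    def __init__(self, obs_dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((n_bits, obs_dim))  # fixed random projection
        self.counts = {}

    def hash(self, obs):
        bits = (self.proj @ obs) > 0   # sign pattern of the projection
        return bits.tobytes()          # hashable bucket key

    def update(self, obs):
        key = self.hash(obs)
        self.counts[key] = self.counts.get(key, 0) + 1

    def bonus(self, obs):
        # Rarely seen hash buckets yield a larger novelty bonus.
        n = self.counts.get(self.hash(obs), 0)
        return 1.0 / np.sqrt(n + 1)

counter = SimHashCounter(obs_dim=64)
frame = np.random.default_rng(1).standard_normal(64)
counter.update(frame)
print(counter.bonus(frame))  # ≈ 0.707 after one visit (1 / sqrt(2))
```

In the paper's setting, the bonus would be evaluated on *predicted* future frames (matched to seen frames in latent space) and the agent would steer toward the action whose predicted frame scores highest.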
Behavioural correlate of choice confidence in a discrete trial paradigm
How animals make choices in a changing and often uncertain environment is a central theme in the behavioural sciences. There is a substantial literature on how animals make choices in various experimental paradigms, but less is known about the way they assess a choice after it has been made, in terms of the expected outcome. Here, we used a discrete trial paradigm to characterise how reward history shaped behaviour on a trial-by-trial basis. Rats initiated each trial, which consisted of a choice between two drinking spouts that differed in their probability of delivering a sucrose solution. Critically, sucrose was delivered after a delay from the first lick at the spouts; this allowed us to characterise the behavioural profile during the window between the time of choice and its outcome. Rats' behaviour converged to the optimal choice, both during the acquisition phase and after the reversal of contingencies. We monitored the post-choice behaviour at a temporal precision of 1 millisecond; lick-response profiles revealed that rats spent more time at the spout with the higher reward probability and exhibited a sparser lick pattern. This was the case even when we exclusively examined the unrewarded trials, where the outcome was identical. The differential licking profiles preceded the differential choice ratios and could thus predict the changes in choice behaviour. This research was supported by the Australian Research Council Discovery Project Grant DP0987133 to EA.
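As a rough illustration of how reward history can drive trial-by-trial choice in a two-spout task with reversal, the toy simulation below uses a simple incremental value update (an assumption for illustration only; this is not the analysis or model used in the study, and all names are hypothetical):

```python
import random

def simulate(p_left, p_right, n_trials, alpha=0.1, eps=0.1, seed=0):
    """Toy two-spout task: an agent tracks a running value estimate for
    each spout and mostly picks the one currently valued higher."""
    rng = random.Random(seed)
    q = {"left": 0.5, "right": 0.5}            # initial value estimates
    probs = {"left": p_left, "right": p_right}  # true reward probabilities
    choices = []
    for _ in range(n_trials):
        if rng.random() < eps:                 # occasional exploration
            choice = rng.choice(["left", "right"])
        else:                                  # otherwise pick higher-valued spout
            choice = max(q, key=q.get)
        reward = 1.0 if rng.random() < probs[choice] else 0.0
        q[choice] += alpha * (reward - q[choice])  # trial-by-trial update
        choices.append(choice)
    return choices, q

# Acquisition phase: left spout is richer; the choice ratio drifts left.
choices, q = simulate(p_left=0.8, p_right=0.2, n_trials=500)
print(sum(c == "left" for c in choices[-100:]) / 100)
```

Swapping `p_left` and `p_right` for a second run mimics the reversal of contingencies described in the abstract.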
Decentralized Learning for Multi-player Multi-armed Bandits
We consider the problem of distributed online learning with multiple players
in multi-armed bandit (MAB) models. Each player can pick among multiple arms.
When a player picks an arm, it gets a reward. We consider both an i.i.d.
reward model and a Markovian reward model. In the i.i.d. model, each arm is
modelled as an i.i.d. process with an unknown distribution and an unknown
mean. In the Markovian model, each arm is modelled as a finite, irreducible,
aperiodic and reversible Markov chain with an unknown probability transition
matrix and stationary distribution. The arms give different rewards to
different players. If two players pick the same arm, there is a "collision",
and neither of them gets any reward. There is no dedicated control channel
for coordination or communication among the players. Any other communication
between the players is costly and will add to the regret.
We propose an online index-based distributed learning policy that trades off
exploration versus exploitation appropriately, and achieves an expected
regret with a near-optimal growth rate. The motivation comes from
opportunistic spectrum access by multiple secondary users in cognitive radio
networks, wherein they must pick among various wireless channels that look
different to different users. To the best of our knowledge, this is the first
distributed learning algorithm for multi-player MABs.
Comment: 33 pages, 3 figures. Submitted to IEEE Transactions on Information Theory.
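A minimal sketch of the collision model described in this abstract, with each player running an independent UCB index over i.i.d. arms (a simplification for illustration; the paper's index policy is more elaborate, and all names below are hypothetical):

```python
import math
import random

def ucb_index(mean, count, t):
    """Standard UCB1 index: empirical mean plus an exploration term."""
    return mean + math.sqrt(2 * math.log(t) / count)

def run(mu, n_players, horizon, seed=0):
    """Each player independently runs UCB1; if two players pick the
    same arm they 'collide' and neither receives any reward."""
    rng = random.Random(seed)
    K = len(mu)
    means = [[0.0] * K for _ in range(n_players)]
    counts = [[0] * K for _ in range(n_players)]
    total_reward = 0.0
    for t in range(1, horizon + 1):
        picks = []
        for p in range(n_players):
            if t <= K:                    # staggered initial pulls of every arm
                arm = (t - 1 + p) % K
            else:                         # then pick the arm with the highest index
                arm = max(range(K),
                          key=lambda a: ucb_index(means[p][a], counts[p][a], t))
            picks.append(arm)
        for p, arm in enumerate(picks):
            collided = picks.count(arm) > 1
            reward = 0.0 if collided else (1.0 if rng.random() < mu[arm] else 0.0)
            counts[p][arm] += 1
            means[p][arm] += (reward - means[p][arm]) / counts[p][arm]
            total_reward += reward
    return total_reward

print(run(mu=[0.9, 0.8, 0.3, 0.2], n_players=2, horizon=2000))
```

Because both naive UCB players chase the best arm and collide, they earn less than a coordinated assignment to the top two arms would; the paper's contribution is an index policy that resolves exactly this without a dedicated control channel.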