442 research outputs found
Perseus: Randomized Point-based Value Iteration for POMDPs
Partially observable Markov decision processes (POMDPs) form an attractive
and principled framework for agent planning under uncertainty. Point-based
approximate techniques for POMDPs compute a policy based on a finite set of
points collected in advance from the agents belief space. We present a
randomized point-based value iteration algorithm called Perseus. The algorithm
performs approximate value backup stages, ensuring that in each backup stage
the value of each point in the belief set is improved; the key observation is
that a single backup may improve the value of many belief points. Contrary to
other point-based methods, Perseus backs up only a (randomly selected) subset
of points in the belief set, sufficient for improving the value of each belief
point in the set. We show how the same idea can be extended to dealing with
continuous action spaces. Experimental results show the potential of Perseus in
large scale POMDP problems
Hardware-Efficient Scalable Reinforcement Learning Systems
Reinforcement Learning (RL) is a machine learning discipline in which an agent learns by interacting with its environment. In this paradigm, the agent is required to perceive its state and take actions accordingly. Upon taking each action, a numerical reward is provided by the environment. The goal of the agent is thus to maximize the aggregate rewards it receives over time. Over the past two decades, a large variety of algorithms have been proposed to select actions in order to explore the environment and gradually construct an e¤ective strategy that maximizes the rewards. These RL techniques have been successfully applied to numerous real-world, complex applications including board games and motor control tasks.
Almost all RL algorithms involve the estimation of a value function, which indicates how good it is for the agent to be in a given state, in terms of the total expected reward in the long run. Alternatively, the value function may re‡ect on the impact of taking a particular action at a given state. The most fundamental approach for constructing such a value function consists of updating a table that contains a value for each state (or each state-action pair). However, this approach is impractical for large scale problems, in which the state and/or action spaces are large. In order to deal with such problems, it is necessary to exploit the generalization capabilities of non-linear function approximators, such as arti…cial neural networks.
This dissertation focuses on practical methodologies for solving reinforcement learning problems with large state and/or action spaces. In particular, the work addresses scenarios in which an agent does not have full knowledge of its state, but rather receives partial information about its environment via sensory-based observations. In order to address such intricate problems, novel solutions for both tabular and function-approximation based RL frameworks are proposed. A resource-efficient recurrent neural network algorithm is presented, which exploits adaptive step-size techniques to improve learning characteristics. Moreover, a consolidated actor-critic network is introduced, which omits the modeling redundancy found in typical actor-critic systems. Pivotal concerns are the scalability and speed of the learning algorithms, for which we devise architectures that map efficiently to hardware. As a result, a high degree of parallelism can be achieved. Simulation results that correspond to relevant testbench problems clearly demonstrate the solid performance attributes of the proposed solutions
Deep Variational Reinforcement Learning for POMDPs
Many real-world sequential decision making problems are partially observable
by nature, and the environment model is typically unknown. Consequently, there
is great need for reinforcement learning methods that can tackle such problems
given only a stream of incomplete and noisy observations. In this paper, we
propose deep variational reinforcement learning (DVRL), which introduces an
inductive bias that allows an agent to learn a generative model of the
environment and perform inference in that model to effectively aggregate the
available information. We develop an n-step approximation to the evidence lower
bound (ELBO), allowing the model to be trained jointly with the policy. This
ensures that the latent state representation is suitable for the control task.
In experiments on Mountain Hike and flickering Atari we show that our method
outperforms previous approaches relying on recurrent neural networks to encode
the past
Crossmodal Attentive Skill Learner
This paper presents the Crossmodal Attentive Skill Learner (CASL), integrated
with the recently-introduced Asynchronous Advantage Option-Critic (A2OC)
architecture [Harb et al., 2017] to enable hierarchical reinforcement learning
across multiple sensory inputs. We provide concrete examples where the approach
not only improves performance in a single task, but accelerates transfer to new
tasks. We demonstrate the attention mechanism anticipates and identifies useful
latent features, while filtering irrelevant sensor modalities during execution.
We modify the Arcade Learning Environment [Bellemare et al., 2013] to support
audio queries, and conduct evaluations of crossmodal learning in the Atari 2600
game Amidar. Finally, building on the recent work of Babaeizadeh et al. [2017],
we open-source a fast hybrid CPU-GPU implementation of CASL.Comment: International Conference on Autonomous Agents and Multiagent Systems
(AAMAS) 2018, NIPS 2017 Deep Reinforcement Learning Symposiu
Stochastic Shortest Path with Energy Constraints in POMDPs
We consider partially observable Markov decision processes (POMDPs) with a
set of target states and positive integer costs associated with every
transition. The traditional optimization objective (stochastic shortest path)
asks to minimize the expected total cost until the target set is reached. We
extend the traditional framework of POMDPs to model energy consumption, which
represents a hard constraint. The energy levels may increase and decrease with
transitions, and the hard constraint requires that the energy level must remain
positive in all steps till the target is reached. First, we present a novel
algorithm for solving POMDPs with energy levels, developing on existing POMDP
solvers and using RTDP as its main method. Our second contribution is related
to policy representation. For larger POMDP instances the policies computed by
existing solvers are too large to be understandable. We present an automated
procedure based on machine learning techniques that automatically extracts
important decisions of the policy allowing us to compute succinct human
readable policies. Finally, we show experimentally that our algorithm performs
well and computes succinct policies on a number of POMDP instances from the
literature that were naturally enhanced with energy levels.Comment: Technical report accompanying a paper published in proceedings of
AAMAS 201
- …