118,543 research outputs found
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.Comment: Entrop
Anytime Point-Based Approximations for Large POMDPs
The Partially Observable Markov Decision Process has long been recognized as
a rich framework for real-world planning and control problems, especially in
robotics. However exact solutions in this framework are typically
computationally intractable for all but the smallest problems. A well-known
technique for speeding up POMDP solving involves performing value backups at
specific belief points, rather than over the entire belief simplex. The
efficiency of this approach, however, depends greatly on the selection of
points. This paper presents a set of novel techniques for selecting informative
belief points which work well in practice. The point selection procedure is
combined with point-based value backups to form an effective anytime POMDP
algorithm called Point-Based Value Iteration (PBVI). The first aim of this
paper is to introduce this algorithm and present a theoretical analysis
justifying the choice of belief selection technique. The second aim of this
paper is to provide a thorough empirical comparison between PBVI and other
state-of-the-art POMDP methods, in particular the Perseus algorithm, in an
effort to highlight their similarities and differences. Evaluation is performed
using both standard POMDP domains and realistic robotic tasks
Recommended from our members
Learning to Act with RVRL Agents
The use of reinforcement learning to guide action selection of cognitive agents has been shown to be a powerful technique for stochastic environments. Standard Reinforcement learning techniques used to provide decision theoretic policies rely, however, on explicit state-based computations of value for each state-action pair. This requires the computation of a number of values exponential to the number of state variables and actions in the system. This research extends existing work with an acquired probabilistic rule representation of an agent environment by developing an algorithm to apply reinforcement learning to values attached to the rules themselves. Structure captured by the rules is then used to learn a policy directly. The resulting value attached to each rule represents the utility of taking an action if the conditions of the rule are present in the agent’s current set of percepts. This has several advantages for planning purposes: generalization over many states and over unseen states; effective decisions can therefore be made with less training data than state based modelling systems (e.g. Dyna Q-Learning); and the problem of computation in an exponential state-action space is alleviated. The results of application of this algorithm to rules in a specific environment are presented, with comparison to standard reinforcement learning policies developed from related work
Non-stationary Stochastic Optimization
We consider a non-stationary variant of a sequential stochastic optimization
problem, in which the underlying cost functions may change along the horizon.
We propose a measure, termed variation budget, that controls the extent of said
change, and study how restrictions on this budget impact achievable
performance. We identify sharp conditions under which it is possible to achieve
long-run-average optimality and more refined performance measures such as rate
optimality that fully characterize the complexity of such problems. In doing
so, we also establish a strong connection between two rather disparate strands
of literature: adversarial online convex optimization; and the more traditional
stochastic approximation paradigm (couched in a non-stationary setting). This
connection is the key to deriving well performing policies in the latter, by
leveraging structure of optimal policies in the former. Finally, tight bounds
on the minimax regret allow us to quantify the "price of non-stationarity,"
which mathematically captures the added complexity embedded in a temporally
changing environment versus a stationary one
- …