256 research outputs found
Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?
We study learning algorithms for the classical Markovian bandit problem with
discount. We explain how to adapt PSRL [24] and UCRL2 [2] to exploit the
problem structure. These variants are called MB-PSRL and MB-UCRL2. While the
regret bound and runtime of vanilla implementations of PSRL and UCRL2 are
exponential in the number of bandits, we show that the episodic regret of
MB-PSRL and MB-UCRL2 is where is the number of
episodes, is the number of bandits and is the number of states of each
bandit (the exact bound in S, n and K is given in the paper). Up to a factor
, this matches the lower bound of that we also
derive in the paper. MB-PSRL is also computationally efficient: its runtime is
linear in the number of bandits. We further show that this linear runtime
cannot be achieved by adapting classical non-Bayesian algorithms such as UCRL2
or UCBVI to Markovian bandit problems. Finally, we perform numerical
experiments that confirm that MB-PSRL outperforms other existing algorithms in
practice, both in terms of regret and of computation time
Feature Markov Decision Processes
General purpose intelligent learning agents cycle through (complex,non-MDP)
sequences of observations, actions, and rewards. On the other hand,
reinforcement learning is well-developed for small finite state Markov Decision
Processes (MDPs). So far it is an art performed by human designers to extract
the right state representation out of the bare observations, i.e. to reduce the
agent setup to the MDP framework. Before we can think of mechanizing this
search for suitable MDPs, we need a formal objective criterion. The main
contribution of this article is to develop such a criterion. I also integrate
the various parts into one learning algorithm. Extensions to more realistic
dynamic Bayesian networks are developed in a companion article.Comment: 7 page
Efficiently Finding Approximately-Optimal Queries for Improving Policies and Guaranteeing Safety
When a computational agent (called the “robot”) takes actions on behalf of a human user, it may be uncertain about the human’s preferences. The human may initially specify her preferences incompletely or inaccurately. In this case, the robot’s performance may be unsatisfactory or even cause negative side effects to the environment. There are approaches in the literature that may solve this problem. For example, the human can provide some demonstrations which clarify the robot’s uncertainty. The human may give real-time feedback to the robot’s behavior, or monitor the robot and stop the robot when it may perform anything dangerous. However, these methods typically require much of the human’s attention. Alternatively, the robot may estimate the human’s true preferences using the specified preferences, but this is error-prone and requires making assumptions on how the human specifies her preferences.
In this thesis, I consider a querying approach. Before taking any actions, the robot has a chance to query the human about her preferences. For example, the robot may query the human about which trajectory in a set of trajectories she likes the most, or whether the human cares about some side effects to the domain. After the human responds to the query, the robot expects to improve its performance and/or guarantee that its behavior is considered safe by the human.
If we do not impose any constraint on the number of queries the robot can pose, the robot may keep posing queries until it is absolutely certain about the human’s preferences. This may consume too much of the human’s cognitive load. The information obtained in the responses to some of the queries may only marginally improve the robot’s performance, which is not worth the human’s attention at all. So in the problems considered in this thesis, I constrain the number of queries that the robot can pose, or associate each query with a cost. The research question is how to efficiently find the most useful query under such constraints.
Finding a provably optimal query can be challenging since it is usually a combinatorial optimization problem. In this thesis, I contribute to providing efficient query selection algorithms under uncertainty. I first formulate the robot’s uncertainty as reward uncertainty and safety-constraint uncertainty. Under only reward uncertainty, I provide a query selection algorithm that finds approximately-optimal k-response queries. Under only safety-constraint uncertainty, I provide a query selection algorithm that finds an optimal k-element query to improve a known safe policy, and an algorithm that uses a set-cover-based query selection strategy to find an initial safe policy. Under both types of uncertainty simultaneously, I provide a batch-query-based querying method that empirically outperforms other baseline querying methods.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163125/1/shunzh_1.pd
- …