Efficient exploration with Double Uncertain Value Networks
This paper studies directed exploration for reinforcement learning agents by
tracking uncertainty about the value of each available action. We identify two
sources of uncertainty that are relevant for exploration. The first originates
from limited data (parametric uncertainty), while the second originates from
the distribution of the returns (return uncertainty). We describe methods to
learn these distributions with deep neural networks, estimating parametric
uncertainty with Bayesian drop-out while propagating return uncertainty
through the Bellman equation as a Gaussian distribution. We then show that
both can be jointly estimated in one network, which we call the
Double Uncertain Value Network. The policy is directly derived from the learned
distributions based on Thompson sampling. Experimental results show that both
types of uncertainty may vastly improve learning in domains with a strong
exploration challenge.
Comment: Deep Reinforcement Learning Symposium @ Conference on Neural Information Processing Systems (NIPS) 201
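As an illustration of the action-selection rule this abstract describes, here is a minimal sketch (not the authors' code) of Thompson sampling that combines both uncertainties: one dropout mask is drawn per decision for the parametric uncertainty, and a return is then sampled from each action's predicted Gaussian. The tiny network and its weights are hypothetical stand-ins.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_actions = 4, 32, 2
    W1 = rng.normal(size=(n_in, n_hidden)) * 0.1
    W_mu = rng.normal(size=(n_hidden, n_actions)) * 0.1
    W_sigma = rng.normal(size=(n_hidden, n_actions)) * 0.1

    def q_net(state, dropout_mask):
        # One stochastic forward pass: the dropout mask carries the parametric
        # uncertainty; (mu, sigma) parameterize the Gaussian return distribution.
        hidden = np.maximum(0.0, state @ W1) * dropout_mask
        mu = hidden @ W_mu                              # mean return per action
        sigma = np.log1p(np.exp(hidden @ W_sigma))      # softplus keeps sigma > 0
        return mu, sigma

    def select_action(state, drop_p=0.1):
        # Parametric uncertainty: sample one dropout mask (one "function draw").
        mask = rng.binomial(1, 1.0 - drop_p, size=n_hidden) / (1.0 - drop_p)
        mu, sigma = q_net(state, mask)
        # Return uncertainty: sample a return from each action's Gaussian.
        sampled = rng.normal(mu, sigma)
        # Thompson sampling: act greedily with respect to the joint sample.
        return int(np.argmax(sampled))

    print(select_action(rng.normal(size=n_in)))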
Automatic Curriculum Learning For Deep RL: A Short Survey
Automatic Curriculum Learning (ACL) has become a cornerstone of recent
successes in Deep Reinforcement Learning (DRL). These methods shape the learning
trajectories of agents by challenging them with tasks adapted to their
capacities. In recent years, they have been used to improve sample efficiency
and asymptotic performance, to organize exploration, to encourage
generalization, or to solve sparse-reward problems, among others. The ambition
of this work is twofold: 1) to present a compact and accessible introduction to
the Automatic Curriculum Learning literature, and 2) to draw a bigger picture of
the current state of the art in ACL that encourages the cross-breeding of
existing concepts and the emergence of new ideas.
Comment: Accepted at IJCAI202
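To make the core ACL idea concrete, here is a minimal illustrative sketch (not from the survey) of one common recipe: sample tasks in proportion to the agent's recent learning progress on them, so that training concentrates on tasks matched to the agent's current capacities.

    import numpy as np

    rng = np.random.default_rng(0)
    n_tasks = 4
    history = [[0.0] for _ in range(n_tasks)]      # recent returns per task

    def sample_task():
        # Learning-progress proxy: absolute change in return over the window.
        progress = np.array([abs(h[-1] - h[0]) for h in history]) + 1e-3
        return int(rng.choice(n_tasks, p=progress / progress.sum()))

    def record(task, ret, window=10):
        history[task].append(ret)
        del history[task][:-window]                # keep only the recent window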
Online Decision Making in Crowdsourcing Markets: Theoretical Challenges (Position Paper)
Over the past decade, crowdsourcing has emerged as a cheap and efficient
method of obtaining solutions to simple tasks that are difficult for computers
to solve but possible for humans. The popularity and promise of crowdsourcing
markets has led to both empirical and theoretical research on the design of
algorithms to optimize various aspects of these markets, such as the pricing
and assignment of tasks. Much of the existing theoretical work on crowdsourcing
markets has focused on problems that fall into the broad category of online
decision making; task requesters or the crowdsourcing platform itself make
repeated decisions about prices to set, workers to filter out, problems to
assign to specific workers, or other things. Often these decisions are complex,
requiring algorithms that learn about the distribution of available tasks or
workers over time and take into account the strategic (or sometimes irrational)
behavior of workers.
As human computation grows into its own field, the time is ripe to address
these challenges in a principled way. However, it appears very difficult to
capture all pertinent aspects of crowdsourcing markets in a single coherent
model. In this paper, we reflect on the modeling issues that inhibit
theoretical research on online decision making for crowdsourcing, and identify
some steps forward. This paper grew out of the authors' own frustration with
these issues, and we hope it will encourage the community to attempt to
understand, debate, and ultimately address them.
The authors welcome feedback for future revisions of this paper.
When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms
Efficient exploration is one of the key challenges for reinforcement learning
(RL) algorithms. Most traditional sample-efficiency bounds require strategic
exploration. Recently, many deep RL algorithms with simple heuristic
exploration strategies that carry few formal guarantees have achieved
surprising success in many domains. These results raise an important question:
how should we understand exploration strategies such as ε-greedy, and what
characterizes the difficulty of exploration in MDPs? In this work we propose
problem-specific sample complexity bounds for learning with random-walk
exploration that rely on several structural properties. We also link our
theoretical results to empirical benchmark domains, illustrating whether our
bound gives polynomial sample complexity in these domains and how that relates
to the empirical performance.
Comment: Appeared in The 14th European Workshop on Reinforcement Learning (EWRL), 201
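As a minimal sketch of the setting the bounds address (not the paper's code), here is tabular Q-learning with purely random exploration on a chain MDP, the kind of structured domain where random walks may or may not reach the reward in polynomial time:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma, alpha = 10, 2, 0.95, 0.1
    Q = np.zeros((n_states, n_actions))

    def step(s, a):
        # Chain MDP: action 1 moves right, action 0 resets to the start;
        # only the final state pays reward, which makes exploration hard.
        s2 = min(s + 1, n_states - 1) if a == 1 else 0
        return s2, (1.0 if s2 == n_states - 1 else 0.0)

    s = 0
    for _ in range(50_000):
        a = int(rng.integers(n_actions))       # purely random exploration
        s2, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = 0 if r > 0 else s2                 # restart the episode at the goal

    print(Q.argmax(axis=1))                    # greedy policy after training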
Reinforcement Learning with Probabilistic Guarantees for Autonomous Driving
Designing reliable decision strategies for autonomous urban driving is
challenging. Reinforcement learning (RL) has been used to automatically derive
suitable behavior in uncertain environments, but it does not provide any
guarantee on the performance of the resulting policy. We propose a generic
approach to enforce probabilistic guarantees on an RL agent. An exploration
strategy is derived prior to training that constrains the agent to choose among
actions that satisfy a desired probabilistic specification expressed with
linear temporal logic (LTL). Reducing the search space to policies satisfying
the LTL formula helps training and simplifies reward design. This paper
outlines a case study of an intersection scenario involving multiple traffic
participants. The resulting policy outperforms a rule-based heuristic approach
in terms of efficiency while exhibiting strong guarantees on safety.
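The mechanism the abstract describes amounts to restricting exploration to a pre-certified safe action set. Below is a minimal sketch, not the paper's implementation: `satisfies_spec` is a hypothetical stand-in for the LTL-derived check computed before training, and the agent explores and exploits only within the actions it admits.

    import numpy as np

    rng = np.random.default_rng(1)
    n_states, n_actions = 5, 3
    Q = np.zeros((n_states, n_actions))

    def satisfies_spec(s, a, p_min=0.9):
        # Hypothetical placeholder: true when the probability of satisfying
        # the LTL specification from (s, a) is at least p_min. Here: a toy
        # hand-coded rule in place of the precomputed certification.
        return not (s == n_states - 1 and a == 2)

    def select_action(s, eps=0.1):
        allowed = [a for a in range(n_actions) if satisfies_spec(s, a)]
        if rng.random() < eps:
            return int(rng.choice(allowed))         # explore within the safe set
        return max(allowed, key=lambda a: Q[s, a])  # exploit within the safe set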
Active Contextual Entropy Search
Contextual policy search allows adapting robotic movement primitives to
different situations. For instance, a locomotion primitive might be adapted to
different terrain inclinations or desired walking speeds. Such an adaptation is
often achievable by modifying a small number of hyperparameters. However,
learning, when performed on real robotic systems, is typically restricted to a
small number of trials. Bayesian optimization has recently been proposed as a
sample-efficient means for contextual policy search that is well suited under
these conditions. In this work, we extend entropy search, a variant of Bayesian
optimization, such that it can be used for active contextual policy search
where the agent selects those tasks during training in which it expects to
learn the most. Empirical results in simulation suggest that this allows
learning successful behavior with fewer trials.
Comment: Corrected title of reference #1
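Entropy search itself reasons about information gain over the location of the optimum and is too involved for a short sketch; the stand-in below keeps the active-contextual structure while swapping in simpler proxies, which should be read as an assumption rather than the paper's method. The agent picks the context (task) with the largest average posterior variance and a parameter via a UCB acquisition; it assumes scikit-learn, and `rollout` is a hypothetical noisy trial.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    rng = np.random.default_rng(0)
    contexts = np.linspace(0, 1, 5)           # e.g. terrain inclinations
    params = np.linspace(-1, 1, 25)           # policy hyperparameter grid

    def rollout(c, p):                        # hypothetical noisy return of one trial
        return -(p - (c - 0.5)) ** 2 + 0.05 * rng.normal()

    X, y = [], []
    gp = GaussianProcessRegressor(alpha=1e-3)
    for trial in range(20):
        if X:
            gp.fit(np.array(X), np.array(y))
            grid = np.array([[c, p] for c in contexts for p in params])
            mu, sd = gp.predict(grid, return_std=True)
            # Active task choice: the context we are most uncertain about.
            ctx_var = sd.reshape(len(contexts), len(params)).mean(axis=1)
            c = contexts[ctx_var.argmax()]
            # Parameter choice inside that context: UCB acquisition.
            mask = grid[:, 0] == c
            p = grid[mask][np.argmax(mu[mask] + 2.0 * sd[mask]), 1]
        else:
            c, p = rng.choice(contexts), rng.choice(params)
        X.append([c, p]); y.append(rollout(c, p))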
Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
Performance of machine learning algorithms depends critically on identifying
a good set of hyperparameters. While recent approaches use Bayesian
optimization to adaptively select configurations, we focus on speeding up
random search through adaptive resource allocation and early-stopping. We
formulate hyperparameter optimization as a pure-exploration non-stochastic
infinite-armed bandit problem where a predefined resource like iterations, data
samples, or features is allocated to randomly sampled configurations. We
introduce a novel algorithm, Hyperband, for this framework and analyze its
theoretical properties, providing several desirable guarantees. Furthermore, we
compare Hyperband with popular Bayesian optimization methods on a suite of
hyperparameter optimization problems. We observe that Hyperband can provide
over an order-of-magnitude speedup over our competitor set on a variety of
deep-learning and kernel-based learning problems.
Comment: Changes: updated to JMLR version
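For concreteness, here is a compact rendering of Hyperband's bracket structure, following the published pseudocode: each bracket trades off the number of configurations against the resource given to each, and successive halving keeps the top 1/eta at every rung. `get_config` and `run_then_return_loss` are hypothetical stand-ins for the user-supplied sampler and evaluator.

    import math
    import numpy as np

    rng = np.random.default_rng(0)

    def get_config():
        return {"lr": 10 ** rng.uniform(-4, -1)}          # hypothetical search space

    def run_then_return_loss(cfg, resource):
        return (cfg["lr"] - 0.01) ** 2 + 0.1 / resource   # hypothetical evaluation

    def hyperband(max_resource=81, eta=3):
        s_max = int(math.log(max_resource, eta))
        best = (float("inf"), None)
        for s in range(s_max, -1, -1):                    # one bracket per s
            n = int(math.ceil((s_max + 1) * eta ** s / (s + 1)))
            r = max_resource * eta ** (-s)
            configs = [get_config() for _ in range(n)]
            for i in range(s + 1):                        # successive halving
                n_i = int(n * eta ** (-i))
                r_i = r * eta ** i
                losses = [run_then_return_loss(c, r_i) for c in configs]
                order = np.argsort(losses)
                best = min(best, (losses[order[0]], configs[order[0]]),
                           key=lambda t: t[0])
                configs = [configs[j] for j in order[: max(1, n_i // eta)]]
        return best

    print(hyperband())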
Langevin Dynamics with Continuous Tempering for Training Deep Neural Networks
Minimizing non-convex and high-dimensional objective functions is
challenging, especially when training modern deep neural networks. In this
paper, a novel approach is proposed which divides the training process into two
consecutive phases to obtain better generalization performance: Bayesian
sampling and stochastic optimization. The first phase is to explore the energy
landscape and capture the "fat" modes; the second is to fine-tune the
parameters learned in the first phase. In the Bayesian learning phase, we
apply continuous tempering and stochastic approximation to the Langevin
dynamics to create an efficient and effective sampler, in which the
temperature is adjusted automatically according to the designed "temperature
dynamics". These strategies overcome the challenge of becoming trapped early
in bad local minima and achieve remarkable improvements in various types of
neural networks, as shown in our theoretical analysis and empirical
experiments.
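A minimal sketch of the two-phase pattern on a toy objective, not the paper's sampler: a hand-rolled annealing schedule stands in for the learned "temperature dynamics", with Langevin noise in phase one and plain gradient descent in phase two.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(size=2)
    step = 1e-2

    def grad_loss(th):
        # Gradient of a toy non-convex objective: sin(||th||^2) + 0.05 ||th||^2.
        return 2.0 * th * np.cos(th @ th) + 0.1 * th

    # Phase 1: Bayesian sampling; a simple annealed temperature stands in
    # for the paper's learned temperature dynamics.
    for t in range(2000):
        temp = max(0.01, 1.0 - t / 2000)
        noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * step * temp)
        theta = theta - step * grad_loss(theta) + noise

    # Phase 2: stochastic optimization to fine-tune the sampled parameters.
    for _ in range(500):
        theta = theta - step * grad_loss(theta)

    print(theta)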
Batch Policy Learning under Constraints
When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including in a challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
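The meta-algorithm the abstract describes alternates a batch-RL best response with an online update on the constraint multiplier. Here is a minimal Lagrangian-style sketch of that pattern; the per-policy reward and cost numbers are hypothetical stand-ins for the paper's off-policy evaluation estimates, not its actual instantiation.

    import numpy as np

    r_hat = np.array([1.0, 0.8, 0.5])     # OPE reward estimate per candidate policy
    c_hat = np.array([0.9, 0.4, 0.1])     # OPE cost estimate per candidate policy
    cost_limit, lr, iters = 0.5, 0.2, 200

    lmbda, counts = 0.0, np.zeros(len(r_hat))
    for _ in range(iters):
        # Batch-RL subroutine (best response): maximize reward - lambda * cost.
        i = int(np.argmax(r_hat - lmbda * c_hat))
        counts[i] += 1
        # Online-learning subroutine: raise lambda while the constraint is violated.
        lmbda = max(0.0, lmbda + lr * (c_hat[i] - cost_limit))

    mixture = counts / counts.sum()        # returned policy: mixture over iterates
    print(mixture, mixture @ c_hat)        # expected cost of the mixture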
Taming Non-stationary Bandits: A Bayesian Approach
We consider the multi-armed bandit problem in non-stationary environments.
Based on the Bayesian method, we propose a variant of Thompson Sampling which
can be used in both rested and restless bandit scenarios. By applying
discounting to the parameters of the prior distribution, we describe a way to
systematically reduce the effect of past observations. Further, we derive the exact expression
for the probability of picking sub-optimal arms. By increasing the exploitative
value of Bayes' samples, we also provide an optimistic version of the
algorithm. Extensive empirical analysis is conducted under various scenarios to
validate the utility of the proposed algorithms. A comparison study with various
state-of-the-art algorithms is also included.
Comment: Submitted to NIPS 201
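A minimal sketch of the discounting idea on Bernoulli arms, not the authors' exact algorithm: the Beta-posterior parameters are decayed each round so old evidence fades, and actions are chosen by standard Thompson sampling; the drifting arm means are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    n_arms, gamma = 3, 0.99                # gamma < 1 forgets stale evidence
    alpha = np.ones(n_arms)                # Beta posterior: successes + 1
    beta = np.ones(n_arms)                 # Beta posterior: failures + 1

    def true_p(t):                         # hypothetical non-stationary arm means
        return np.array([0.5, 0.5 + 0.4 * np.sin(t / 500), 0.3])

    for t in range(5000):
        arm = int(np.argmax(rng.beta(alpha, beta)))   # Thompson sample per arm
        reward = rng.random() < true_p(t)[arm]
        alpha *= gamma; beta *= gamma                 # discount all past evidence
        alpha = np.maximum(alpha, 1.0)                # never decay below the
        beta = np.maximum(beta, 1.0)                  # uniform prior
        alpha[arm] += reward
        beta[arm] += 1 - reward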