Q-CP: Learning Action Values for Cooperative Planning
Research on multi-robot systems has demonstrated promising results in manifold applications and domains. Still, efficiently learning effective robot behaviors is very difficult, due to unstructured scenarios, high uncertainties, and large state dimensionality (e.g., hyper-redundant robots and groups of robots). To alleviate this problem, we present Q-CP, a cooperative model-based reinforcement learning algorithm that exploits action values to both (1) guide the exploration of the state space and (2) generate effective policies. Specifically, we exploit Q-learning to attack the curse of dimensionality in the iterations of a Monte-Carlo Tree Search. We implement and evaluate Q-CP on different stochastic cooperative (general-sum) games: (1) a simple cooperative navigation problem among three robots, (2) a cooperation scenario between a pair of KUKA YouBots performing hand-overs, and (3) a coordination task between two mobile robots entering a door. The obtained results show the effectiveness of Q-CP in the chosen applications, where action values drive the exploration and reduce the computational demand of the planning process while achieving good performance.
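To make the core idea concrete, here is a minimal, illustrative sketch (not the authors' implementation) of using learned action values to bias action selection inside Monte-Carlo Tree Search: the Q-table seeds the UCB score, so simulations concentrate on actions the value function already considers promising. The toy chain environment, constants, and names (`step`, `ucb_with_q`, `simulate`) are assumptions made for illustration.

```python
import math
from collections import defaultdict

def step(state, action):
    """Toy 1-D chain: move left or right on states 0..5; reward at 5."""
    nxt = max(0, min(5, state + action))
    return nxt, 1.0 if nxt == 5 else 0.0

ACTIONS = [-1, 1]
Q = defaultdict(float)    # learned action values (refined online here)
N_s = defaultdict(int)    # state visit counts
N_sa = defaultdict(int)   # state-action visit counts

def ucb_with_q(state, c=1.4):
    """UCB action score seeded with the learned Q-value as a prior,
    biasing the search toward actions the value function favors."""
    def score(a):
        bonus = c * math.sqrt(math.log(N_s[state] + 1) / (N_sa[(state, a)] + 1))
        return Q[(state, a)] + bonus
    return max(ACTIONS, key=score)

def simulate(state, depth=10, gamma=0.95):
    """One tree-search simulation: Q-guided selection, recursive rollout,
    then an incremental Monte-Carlo backup into the Q-table."""
    if depth == 0:
        return 0.0
    a = ucb_with_q(state)
    nxt, r = step(state, a)
    ret = r + gamma * simulate(nxt, depth - 1, gamma)
    N_s[state] += 1
    N_sa[(state, a)] += 1
    Q[(state, a)] += (ret - Q[(state, a)]) / N_sa[(state, a)]
    return ret

for _ in range(300):
    simulate(0)
print(max(ACTIONS, key=lambda a: Q[(0, a)]))   # best first action from state 0
```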
Simultaneous Perturbation Algorithms for Batch Off-Policy Search
We propose novel policy search algorithms in the context of off-policy, batch
mode reinforcement learning (RL) with continuous state and action spaces. Given
a batch collection of trajectories, we perform off-line policy evaluation using
an algorithm similar to that of [Fonteneau et al., 2010]. Using this Monte-Carlo-like policy evaluator, we perform policy search in a class of parameterized policies. We propose both first-order policy gradient and second-order policy Newton algorithms. All our algorithms incorporate simultaneous perturbation estimates for the gradient as well as the Hessian of the cost-to-go vector, since the latter is unknown and only biased estimates are available. We demonstrate their practicality on a simple one-dimensional continuous state space problem.
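For intuition, the sketch below shows the simultaneous-perturbation (SPSA) gradient estimate at the heart of such algorithms: two evaluations of the objective along a single random Rademacher perturbation yield an estimate of the full gradient, whatever the parameter dimension. The quadratic `J_hat` is a hypothetical stand-in for the paper's batch off-policy evaluator, not the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def J_hat(theta):
    """Placeholder cost-to-go estimate (assumed noisy and biased);
    a stand-in for a batch off-policy Monte-Carlo evaluator."""
    return -np.sum((theta - 1.0) ** 2) + 0.01 * rng.normal()

def spsa_gradient(theta, c=0.1):
    """Two-sided SPSA estimate: one random Rademacher perturbation and
    two objective evaluations estimate the whole gradient vector."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    g = (J_hat(theta + c * delta) - J_hat(theta - c * delta)) / (2 * c)
    return g / delta   # elementwise; delta is +/-1, so this equals g*delta

theta = np.zeros(3)
for k in range(1, 501):
    theta += (0.05 / k ** 0.602) * spsa_gradient(theta)  # gradient ascent
print(theta)   # approaches the maximizer [1, 1, 1]
```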
Bayesian Optimization for Adaptive MCMC
This paper proposes a new randomized strategy for adaptive MCMC using
Bayesian optimization. This approach applies to non-differentiable objective
functions and trades off exploration and exploitation to reduce the number of
potentially costly objective function evaluations. We demonstrate the strategy
in the complex setting of sampling from constrained, discrete and densely
connected probabilistic graphical models where, for each variation of the
problem, one needs to adjust the parameters of the proposal mechanism
automatically to ensure efficient mixing of the Markov chains.
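A minimal sketch of the general recipe (not the paper's exact algorithm): Bayesian optimization with a Gaussian-process surrogate and expected improvement tunes the scale of a random-walk Metropolis proposal, scoring each candidate by expected squared jump distance. The 1-D Gaussian target and all constants below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def esjd(sigma, n=2000):
    """Run a short random-walk Metropolis chain on a 1-D Gaussian target;
    return the expected squared jump distance (higher = better mixing)."""
    logp = lambda x: -0.5 * x ** 2
    x, jumps = 0.0, []
    for _ in range(n):
        prop = x + sigma * rng.normal()
        if np.log(rng.random()) < logp(prop) - logp(x):
            jumps.append((prop - x) ** 2)
            x = prop
        else:
            jumps.append(0.0)
    return float(np.mean(jumps))

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel with unit prior variance."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# Bayesian optimization loop: GP posterior + expected improvement.
X = np.array([0.1, 1.0, 5.0])            # initial proposal scales
y = np.array([esjd(s) for s in X])
grid = np.linspace(0.05, 6.0, 200)
for _ in range(10):
    K = rbf(X, X) + 1e-4 * np.eye(len(X))
    Ks = rbf(grid, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    sd = np.sqrt(np.clip(var, 1e-12, None))
    imp = mu - y.max()
    ei = imp * norm.cdf(imp / sd) + sd * norm.pdf(imp / sd)
    x_next = grid[np.argmax(ei)]
    X = np.append(X, x_next)
    y = np.append(y, esjd(x_next))
print("tuned proposal scale:", X[np.argmax(y)])
```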
Expected Policy Gradients
We propose expected policy gradients (EPG), which unify stochastic policy
gradients (SPG) and deterministic policy gradients (DPG) for reinforcement
learning. Inspired by expected sarsa, EPG integrates across the action when
estimating the gradient, instead of relying only on the action in the sampled
trajectory. We establish a new general policy gradient theorem, of which the
stochastic and deterministic policy gradient theorems are special cases. We
also prove that EPG reduces the variance of the gradient estimates without
requiring deterministic policies and, for the Gaussian case, with no
computational overhead. Finally, we show that it is optimal in a certain sense
to explore with a Gaussian policy such that the covariance is proportional to
the exponential of the scaled Hessian of the critic with respect to the
actions. We present empirical results confirming that this new form of
exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic
in four challenging MuJoCo domains.
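The following sketch illustrates the central idea for a one-dimensional Gaussian policy: the gradient is computed by integrating the critic across the whole action distribution (here via Gauss-Hermite quadrature) rather than from the single sampled action, as in stochastic policy gradients. The quadratic critic `Q` is a toy assumption, not a learned critic.

```python
import numpy as np

def Q(state, action):
    """Toy critic: the best action is 2*state."""
    return -(action - 2.0 * state) ** 2

def epg_grad_mu(state, mu, sigma, n_points=21):
    """d/dmu of E_{a~N(mu,sigma^2)}[Q(s,a)], i.e. the score-function
    form E[Q(a) * (a - mu) / sigma^2], integrated across actions with
    probabilists' Gauss-Hermite quadrature instead of sampled."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    actions = mu + sigma * nodes
    w = weights / np.sqrt(2 * np.pi)   # normalize to an N(0,1) expectation
    return np.sum(w * Q(state, actions) * (actions - mu) / sigma ** 2)

mu, sigma, state = 0.0, 0.5, 1.0
for _ in range(200):
    mu += 0.1 * epg_grad_mu(state, mu, sigma)   # gradient ascent on the mean
print(mu)   # converges to the critic's maximizer 2*state = 2.0
```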
MQLV: Optimal Policy of Money Management in Retail Banking with Q-Learning
Reinforcement learning has become one of the best approaches to train a computer game emulator capable of human-level performance. In a reinforcement
learning approach, an optimal value function is learned across a set of
actions, or decisions, that leads to a set of states giving different rewards,
with the objective of maximizing the overall reward. A policy assigns to each state-action pair an expected return. We call a policy optimal when its value function is optimal. QLBS, the Q-Learner in the Black-Scholes(-Merton) Worlds, applies reinforcement learning concepts, notably the popular Q-learning algorithm, to the financial stochastic model of Black, Scholes and Merton. It is, however, specifically optimized for geometric Brownian motion and vanilla options, so its range of application is limited to vanilla option pricing within financial markets. We
propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement
learning approach that determines the optimal policy of money management based
on the aggregated financial transactions of the clients. It unlocks new
frontiers to establish personalized credit card limits or to fulfill bank loan
applications, targeting the retail banking industry. MQLV extends the simulation to mean-reverting stochastic diffusion processes and uses a
digital function, a Heaviside step function expressed in its discrete form, to
estimate the probability of a future event such as a payment default. In our
experiments, we first show the similarities between a set of historical
financial transactions and Vasicek-generated transactions, and then we
underline the potential of MQLV on generated Monte Carlo simulations. Finally,
MQLV is the first Q-learning Vasicek-based methodology addressing transparent decision-making processes in retail banking.
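Two ingredients named in the abstract can be sketched as follows, under illustrative parameters rather than the paper's calibration: Euler-Maruyama simulation of Vasicek mean-reverting paths, and a discrete Heaviside digital payoff used to estimate, by Monte Carlo, the probability of an event such as a threshold being breached.

```python
import numpy as np

rng = np.random.default_rng(7)

def vasicek_paths(x0, kappa, theta, sigma, T=1.0, steps=250, n_paths=10000):
    """dX = kappa*(theta - X) dt + sigma dW, Euler-Maruyama scheme."""
    dt = T / steps
    x = np.full(n_paths, x0, dtype=float)
    paths = [x.copy()]
    for _ in range(steps):
        x = x + kappa * (theta - x) * dt \
              + sigma * np.sqrt(dt) * rng.normal(size=n_paths)
        paths.append(x.copy())
    return np.array(paths)          # shape (steps+1, n_paths)

def heaviside(z):
    """Discrete Heaviside step: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

# Illustrative parameters and threshold, not the paper's calibration.
paths = vasicek_paths(x0=1.0, kappa=2.0, theta=1.0, sigma=0.4)
threshold = 0.5
# Digital payoff per path: did the process ever breach the threshold?
breached = heaviside(threshold - paths.min(axis=0))
print("estimated event probability:", breached.mean())
```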