State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning
In the MDP framework, the general reward function takes three arguments (the
current state, the action, and the successor state), but it is often simplified
to a function of two arguments (the current state and the action). The former is
called a transition-based reward function, whereas the latter is called a
state-based reward function. When the objective involves only the expected
cumulative reward, this simplification works perfectly. However, when the
objective is risk-sensitive, this simplification leads to an incorrect value. We present
state-augmentation transformations (SATs), which preserve the reward sequences
as well as the reward distributions and the optimal policy in risk-sensitive
reinforcement learning. In risk-sensitive scenarios, we first prove that, for
every MDP with a stochastic transition-based reward function, there exists an
MDP with a deterministic state-based reward function such that, for any given
(randomized) policy on the first MDP, there is a corresponding policy on the
second MDP under which both induced Markov reward processes share the same
reward sequence.
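As a reading aid (not the paper's formal construction), the sketch below illustrates one natural state augmentation of this kind: the augmented state records the previous state, the action taken, and the realized reward, so the reward of the wrapped process becomes a deterministic function of the augmented state while the reward sequence of the original MDP is preserved. The toy environment, its reset/step interface, and the class names are assumptions for illustration only.

```python
import random

class ToyMDP:
    """A 2-state toy MDP with a stochastic transition-based reward r(s, a, s')."""
    states = (0, 1)
    actions = (0, 1)

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        s = self.s
        s_next = random.choice(self.states)
        # Stochastic reward depending on the whole transition (s, a, s').
        r = random.gauss(s + a + s_next, 0.5)
        self.s = s_next
        return s_next, r


class StateAugmentedMDP:
    """Wraps an MDP so the reward is a deterministic function of the
    augmented state (s_prev, a_prev, realized_reward, s)."""

    def __init__(self, mdp):
        self.mdp = mdp

    def reset(self):
        s = self.mdp.reset()
        self.x = (None, None, 0.0, s)  # no transition has happened yet
        return self.x

    def step(self, a):
        s_prev = self.x[-1]
        s_next, r = self.mdp.step(a)
        # The realized reward is stored inside the augmented state, so the
        # state-based reward below is deterministic.
        self.x = (s_prev, a, r, s_next)
        return self.x, self.state_reward(self.x)

    @staticmethod
    def state_reward(x):
        return x[2]  # deterministic: read the reward off the augmented state


if __name__ == "__main__":
    env = StateAugmentedMDP(ToyMDP())
    env.reset()
    for _ in range(3):
        x, r = env.step(random.choice(ToyMDP.actions))
        print(x, r)  # reward sequence matches the wrapped MDP's rewards
```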
Second, using an inventory control problem, we illustrate two situations that
require the proposed SATs: applying Q-learning (or other learning methods) to
MDPs with transition-based reward functions, and applying methods designed for
Markov processes with deterministic state-based reward functions to Markov
processes with general reward functions. We demonstrate the advantage of the
SATs using Value-at-Risk (VaR) as an example, a risk measure defined on the
reward distribution itself rather than on summary statistics (such as the mean
and variance) of that distribution. We illustrate the error in the estimated
reward distribution that arises from the direct use of Q-learning, and show how
the SATs enable a variance formula to work on Markov processes with general
reward functions.
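Since VaR is a quantile of the reward distribution rather than a moment of it, the following is a minimal sketch of an empirical VaR estimate from simulated returns, assuming the common convention that VaR at level alpha is the alpha-quantile of the cumulative-reward distribution (lower returns are worse). The sampled returns and the `value_at_risk` helper are hypothetical, not from the paper.

```python
import random

def value_at_risk(returns, alpha=0.05):
    """Empirical VaR at level alpha: the alpha-quantile of the sampled
    cumulative-reward (return) distribution, where lower returns are worse."""
    ordered = sorted(returns)
    k = min(len(ordered) - 1, max(0, int(alpha * len(ordered))))
    return ordered[k]

# Hypothetical returns collected by rolling out a fixed policy many times.
returns = [random.gauss(10.0, 3.0) for _ in range(10_000)]
print("Estimated VaR at the 5% level:", value_at_risk(returns, alpha=0.05))
```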