40,933 research outputs found
An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
In this paper we introduce the idea of improving the performance of
parametric temporal-difference (TD) learning algorithms by selectively
emphasizing or de-emphasizing their updates on different time steps. In
particular, we show that varying the emphasis of linear TD()'s updates
in a particular way causes its expected update to become stable under
off-policy training. The only prior model-free TD methods to achieve this with
per-step computation linear in the number of function approximation parameters
are the gradient-TD family of methods including TDC, GTD(), and
GQ(). Compared to these methods, our _emphatic TD()_ is
simpler and easier to use; it has only one learned parameter vector and one
step-size parameter. Our treatment includes general state-dependent discounting
and bootstrapping functions, and a way of specifying varying degrees of
interest in accurately valuing different states.Comment: 29 pages This is a significant revision based on the first set of
reviews. The most important change was to signal early that the main result
is about stability, not convergenc
Reinforcement Learning of Speech Recognition System Based on Policy Gradient and Hypothesis Selection
Speech recognition systems have achieved high recognition performance for
several tasks. However, the performance of such systems is dependent on the
tremendously costly development work of preparing vast amounts of task-matched
transcribed speech data for supervised training. The key problem here is the
cost of transcribing speech data. The cost is repeatedly required to support
new languages and new tasks. Assuming broad network services for transcribing
speech data for many users, a system would become more self-sufficient and more
useful if it possessed the ability to learn from very light feedback from the
users without annoying them. In this paper, we propose a general reinforcement
learning framework for speech recognition systems based on the policy gradient
method. As a particular instance of the framework, we also propose a hypothesis
selection-based reinforcement learning method. The proposed framework provides
a new view for several existing training and adaptation methods. The
experimental results show that the proposed method improves the recognition
performance compared to unsupervised adaptation.Comment: 5 pages, 6 figure
- …