What's a Good Prediction? Issues in Evaluating General Value Functions Through Error
Constructing and maintaining knowledge of the world is a central problem for
artificial intelligence research. Approaches to constructing an agent's
knowledge using predictions have received increasing interest in recent
years. A particularly promising collection of research centres
around architectures that formulate predictions as General Value Functions
(GVFs), an approach commonly referred to as "predictive knowledge". A
pernicious challenge for predictive knowledge architectures is determining what
to predict. In this paper, we argue that existing evaluation methods (i.e.,
return error and RUPEE) are not well suited to the challenge of determining what to
predict. As a primary contribution, we provide extended examples that evaluate
predictions in terms of how they are used in further prediction tasks: a key
motivation of predictive knowledge systems. We demonstrate that simply because
a GVF's error is low, it does not necessarily follow that the prediction is
useful as a cumulant. We suggest evaluating 1) the relevance of a GVF's
features to the prediction task at hand, and 2) GVFs by how their predictions are
used. To determine feature relevance, we generalize AutoStep to GTD, producing
a step-size learning method suited to the life-long continual learning settings
in which predictive knowledge architectures are commonly deployed. This paper
contributes a first look into evaluation of predictions through their use, an
integral component of predictive knowledge which is as of yet unexplored.
Comment: Submitted to AAMAS
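
To make the "prediction as cumulant" usage concrete, the following minimal sketch (our illustration, not the paper's code) trains one GVF on an observable signal and a second GVF whose cumulant is the first GVF's prediction; the random-feature environment, the signal, and the step sizes are all assumptions.

```python
import numpy as np

# Sketch: GVF 2's cumulant is GVF 1's prediction. Everything here
# (features, environment, step sizes) is an illustrative assumption.

n_features = 8
rng = np.random.default_rng(0)

w1 = np.zeros(n_features)   # GVF 1: predicts an observable signal
w2 = np.zeros(n_features)   # GVF 2: its cumulant is GVF 1's prediction
alpha, gamma = 0.1, 0.9

def env_step():
    """Hypothetical environment: next feature vector and a raw signal."""
    x_next = rng.random(n_features)
    return x_next, x_next[0]  # x_next[0] stands in for an observable cumulant

x = rng.random(n_features)
for t in range(1000):
    x_next, signal = env_step()

    # TD(0) update for GVF 1 with the raw signal as its cumulant.
    delta1 = signal + gamma * (w1 @ x_next) - w1 @ x
    w1 += alpha * delta1 * x

    # GVF 2 uses GVF 1's current prediction as its cumulant, so its quality
    # depends on what w1 @ x_next actually conveys, not just on GVF 1's error.
    cumulant2 = w1 @ x_next
    delta2 = cumulant2 + gamma * (w2 @ x_next) - w2 @ x
    w2 += alpha * delta2 * x

    x = x_next
```

In this composition, a low return error for GVF 1 alone says little about whether w2 learns anything useful, which is the evaluation gap the abstract highlights.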
Preferential Temporal Difference Learning
Temporal-Difference (TD) learning is a general and very useful tool for
estimating the value function of a given policy, which in turn is required to
find good policies. Generally speaking, TD learning updates states whenever
they are visited. When the agent lands in a state, its value can be used to
compute the TD-error, which is then propagated to other states. However, it may
be interesting, when computing updates, to take into account other information
than whether a state is visited or not. For example, some states might be more
important than others (such as states which are frequently seen in a successful
trajectory). Or, some states might have unreliable value estimates (for
example, due to partial observability or lack of data), making their values
less desirable as targets. We propose an approach to re-weighting states used
in TD updates, both when they are the input and when they provide the target
for the update. We prove that our approach converges with linear function
approximation and illustrate its desirable empirical behaviour compared to
other TD-style methods.
Comment: Accepted at the 38th International Conference on Machine Learning
(ICML 2021).
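
As a concrete illustration of this re-weighting, here is a minimal forward-view sketch in which a preference in [0, 1] scales both how strongly a state is updated and how much its value is trusted as a bootstrap target. The tabular features and hand-picked preferences are assumptions for illustration; this shows the general scheme, not necessarily the paper's exact update.

```python
import numpy as np

# Assumed setup: tabular one-hot features and a hand-chosen preference
# vector beta; low beta marks a state whose value estimate is unreliable.

n_states = 5
w = np.zeros(n_states)
phi = np.eye(n_states)                       # one-hot features: tabular case
alpha, gamma = 0.1, 0.95
beta = np.array([1.0, 0.1, 1.0, 0.1, 1.0])   # low beta: unreliable state

def update_episode(w, states, rewards):
    """states: s_0 .. s_T (including the last state); rewards: r_1 .. r_T."""
    G = 0.0
    for t in reversed(range(len(rewards))):
        s, s_next, r = states[t], states[t + 1], rewards[t]
        if t == len(rewards) - 1:
            G = r                            # terminal step: no bootstrapping
        else:
            # A state's value is trusted as a target in proportion to its
            # preference; otherwise the target looks further ahead.
            G = r + gamma * (beta[s_next] * (phi[s_next] @ w)
                             + (1 - beta[s_next]) * G)
        # States are also re-weighted as inputs: unreliable states
        # receive smaller updates.
        w += alpha * beta[s] * (G - phi[s] @ w) * phi[s]

# Example: episode 0 -> 1 -> 2 -> end. State 1 is mostly bypassed as a
# target and is itself barely updated.
update_episode(w, states=[0, 1, 2, 2], rewards=[0.0, 0.0, 1.0])
```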
An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task
Off-policy prediction -- learning the value function for one policy from data
generated while following another policy -- is one of the most challenging
subproblems in reinforcement learning. This paper presents empirical results
with eleven prominent off-policy learning algorithms that use linear function
approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy
TD(λ), Vtrace, and versions of Tree Backup and ABQ modified to apply to
a prediction setting. Our experiments used the Collision task, a small
idealized off-policy problem analogous to that of an autonomous car trying to
predict whether it will collide with an obstacle. We assessed the performance
of the algorithms according to their learning rate, asymptotic error level, and
sensitivity to step-size and bootstrapping parameters. By these measures, the
eleven algorithms can be partially ordered on the Collision task. In the top
tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest
errors, and were robust to parameter settings. In the middle tier, the five
Gradient-TD algorithms and Off-policy TD(λ) were more sensitive to the
bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and
ABQ; these algorithms were no faster and had higher asymptotic error than the
others. Our results are definitive for this task, though of course experiments
with more tasks are needed before an overall assessment of the algorithms'
merits can be made.
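
For a sense of what the compared methods look like, below is a minimal sketch of the simplest of them, off-policy TD(λ) with per-step importance sampling and linear function approximation, run on a toy corridor that loosely mirrors the Collision setup. The environment, policies, and parameter values are assumptions, not the paper's experimental setup.

```python
import numpy as np

# Toy corridor standing in for the Collision task; the policies, features,
# and parameters below are illustrative assumptions.

rng = np.random.default_rng(1)
n_states = 8
phi = np.eye(n_states)            # one-hot features for clarity
w = np.zeros(n_states)
z = np.zeros(n_states)            # eligibility trace
alpha, gamma, lam = 0.05, 0.9, 0.5

pi_forward = 0.9                  # target policy: mostly move forward
b_forward = 0.5                   # behaviour policy: uniform over actions

s = 0
for t in range(5000):
    forward = rng.random() < b_forward
    rho = ((pi_forward if forward else 1 - pi_forward)
           / (b_forward if forward else 1 - b_forward))
    s_next = min(s + 1, n_states - 1) if forward else max(s - 1, 0)
    done = s_next == n_states - 1
    r = 1.0 if done else 0.0      # "collision" signal at the far end

    # Off-policy TD(lambda): importance-weighted accumulating trace.
    delta = r + (0.0 if done else gamma * (phi[s_next] @ w)) - phi[s] @ w
    z = rho * (gamma * lam * z + phi[s])
    w += alpha * delta * z

    if done:
        s, z = 0, np.zeros(n_states)   # restart the episode
    else:
        s = s_next
```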