Active Inverse Reward Design
Designers of AI agents often iterate on the reward function in a
trial-and-error process until they get the desired behavior, but this only
guarantees good behavior in the training environment. We propose structuring
this process as a series of queries asking the designer to compare
different reward functions. We can thus actively select the queries that are
maximally informative about the true reward. In contrast to approaches that
ask the designer for optimal behavior, this allows us to gather additional
information by eliciting preferences between suboptimal behaviors. After each
query, we need to update the posterior over the true reward function from
observing the proxy reward function chosen by the designer; the recently
proposed Inverse Reward Design (IRD) enables this. Our approach substantially
outperforms IRD in test environments. In particular, it can query the designer
about interpretable, linear reward functions and still infer non-linear ones.
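A minimal sketch of the active query-selection loop described above, assuming a discrete set of reward hypotheses and an IRD-style model of how the designer answers; the names (entropy, expected_info_gain, select_query) and the observation model are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_info_gain(posterior, answer_lik):
    """Expected entropy reduction over the true reward from one query.

    answer_lik[a, w] is the assumed probability that the designer picks
    proxy reward a from the query when hypothesis w is the true reward
    (an IRD-style observation model).
    """
    gain = 0.0
    for lik_a in answer_lik:               # loop over possible answers
        p_a = lik_a @ posterior            # marginal probability of answer a
        if p_a <= 0.0:
            continue
        updated = lik_a * posterior / p_a  # Bayes update given answer a
        gain += p_a * (entropy(posterior) - entropy(updated))
    return gain

def select_query(posterior, query_liks):
    """Pick the index of the most informative candidate query."""
    gains = [expected_info_gain(posterior, lik) for lik in query_liks]
    return int(np.argmax(gains))
```

Here each candidate query is represented only by its answer-likelihood matrix; greedy information gain is one standard acquisition rule for this kind of active learning, and the paper's actual selection criterion may differ.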
Risk-averse Batch Active Inverse Reward Design
Designing a perfect reward function that captures every aspect of the
intended behavior is nearly impossible, especially one that generalizes
beyond the training environments. Active Inverse Reward Design (AIRD)
proposed using a series of queries that compare possible reward functions in
a single training environment. This allows the human to give the agent
information about suboptimal behaviors, from which a probability distribution
over the intended reward function is computed. However, it ignores the
possibility of unknown features appearing in real-world environments, and the
safety measures needed until the agent has fully learned the reward function.
I improved this method and created Risk-averse Batch Active Inverse Reward
Design (RBAIRD), which constructs batches, i.e. sets of environments the
agent encounters when used in the real world, processes them sequentially,
and, for a predetermined number of iterations, asks queries that the human
answers for each environment of the batch. Once a batch is processed, the
improved probabilities are transferred to the next batch. This makes the
agent capable of adapting to real-world scenarios and of learning how to
treat unknown features it encounters for the first time.
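A minimal sketch of this batch loop, with toy stand-ins (make_query, ask_designer, bayes_update) for the query generator, human interface, and posterior update; all of these names and the noise model are hypothetical placeholders, not the author's code:

```python
import numpy as np

def make_query(posterior, env, rng):
    """Propose two hypothesis indices to compare; a real query generator
    would use env, e.g. to cover features first seen in it."""
    return rng.choice(len(posterior), size=2, replace=False)

def ask_designer(query):
    """Placeholder for the human's answer: index of the preferred option."""
    return 0

def bayes_update(posterior, query, answer, noise=0.1):
    """Toy IRD-style update: boost the hypothesis behind the chosen proxy
    reward, keeping some mass elsewhere to model designer noise."""
    lik = np.full_like(posterior, noise)
    lik[query[answer]] = 1.0
    post = lik * posterior
    return post / post.sum()

def rbaird(posterior, batches, num_iterations, seed=0):
    """Process batches of environments sequentially; the posterior refined
    on one batch is transferred to the next."""
    rng = np.random.default_rng(seed)
    for batch in batches:                  # batch = set of environments
        for _ in range(num_iterations):    # fixed number of query rounds
            for env in batch:
                query = make_query(posterior, env, rng)
                answer = ask_designer(query)
                posterior = bayes_update(posterior, query, answer)
    return posterior
```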
I also integrated a risk-averse planner, similar to the one in Inverse
Reward Design (IRD), which samples a set of reward functions from the
probability distribution and computes a trajectory that collects the most
certain rewards possible. This ensures safety while the agent is still
learning the reward function, and enables the use of this approach in
situations where caution is vital.
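A minimal sketch of one way to realize such risk-averse planning, using a maximin rule over rewards sampled from the posterior and assuming linear rewards over trajectory features; the function name and array shapes are assumptions for illustration:

```python
import numpy as np

def risk_averse_plan(hypotheses, posterior, traj_features,
                     n_samples=50, seed=0):
    """Maximin planning sketch: sample reward functions from the posterior
    and pick the trajectory whose worst sampled return is highest.

    hypotheses:    (H, d) array, each row a candidate reward weight vector
    posterior:     (H,) probabilities over the hypotheses
    traj_features: (T, d) array, feature counts of each candidate trajectory
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(posterior), size=n_samples, p=posterior)
    sampled_w = hypotheses[idx]            # (n_samples, d) sampled rewards
    returns = traj_features @ sampled_w.T  # (T, n_samples) return per pair
    worst_case = returns.min(axis=1)       # worst return of each trajectory
    return int(np.argmax(worst_case))      # index of the safest trajectory
```

Maximizing the worst sampled return makes the agent prefer trajectories whose value all plausible reward hypotheses agree on, which matches the cautious behavior described above.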
RBAIRD outperformed the previous approaches in terms of efficiency, accuracy,
and action certainty, demonstrated quick adaptation to new, unknown features,
and is more broadly applicable to the alignment of critical, powerful AI
models.
Comment: 14 pages, 12 figures