There is an increasing interest in learning reward functions that model human
intent and human preferences. However, many frameworks use black-box learning
methods that, while expressive, are difficult to interpret. We propose and
evaluate a novel approach for learning expressive and interpretable reward
functions from preferences using Differentiable Decision Trees (DDTs) for both
low- and high-dimensional state inputs. We explore and discuss the viability of
learning interpretable reward functions using DDTs by evaluating our algorithm
on Cartpole, Visual Gridworld environments, and Atari games. We provide
evidence that the tree structure of our learned reward function is useful
in determining the extent to which a reward function is aligned with human
preferences. We visualize the learned reward DDTs and find that they are
capable of learning interpretable reward functions but that the discrete nature
of the trees hurts the performance of reinforcement learning at test time.
However, we also show evidence that using soft outputs (averaged over all leaf
nodes) results in competitive performance compared with larger-capacity
deep neural network reward functions.
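To make the "soft output" notion concrete, the following is a minimal sketch (not the authors' code; the class name SoftRewardDDT, the tree depth, and the sigmoid gating are illustrative assumptions) of a differentiable decision tree reward whose output is the average of all leaf values weighted by the probability of reaching each leaf:

```python
# Minimal sketch of a soft differentiable decision tree (DDT) reward.
# Assumptions: each internal node routes the input with a sigmoid gate over a
# linear function of the state, and the soft reward is the expectation of the
# leaf values under the resulting routing distribution.
import torch
import torch.nn as nn


class SoftRewardDDT(nn.Module):
    def __init__(self, input_dim: int, depth: int = 2):
        super().__init__()
        self.depth = depth
        n_internal = 2 ** depth - 1                     # internal (decision) nodes
        n_leaves = 2 ** depth                           # leaves holding scalar rewards
        self.gates = nn.Linear(input_dim, n_internal)   # one sigmoid gate per node
        self.leaf_values = nn.Parameter(torch.zeros(n_leaves))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # p_right[:, i] = probability of branching right at internal node i
        p_right = torch.sigmoid(self.gates(x))          # (batch, n_internal)
        leaf_prob = torch.ones(x.shape[0], 1, device=x.device)
        # Walk the tree level by level, splitting each path's probability mass.
        for d in range(self.depth):
            start = 2 ** d - 1                          # first node index at depth d
            p = p_right[:, start:start + 2 ** d]        # gates at this depth
            leaf_prob = torch.stack(
                [leaf_prob * (1 - p), leaf_prob * p], dim=-1
            ).reshape(x.shape[0], -1)
        # Soft reward: leaf values averaged over all leaves, weighted by path probability.
        return leaf_prob @ self.leaf_values             # (batch,)
```

A hard (discrete) reward would instead return the value of the single most probable leaf, which is the distinction the abstract draws between interpretable discrete outputs and the soft outputs used for competitive reinforcement learning performance.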