Deep reinforcement learning (DRL) has proven extremely useful in a wide
variety of application domains. However, even successful DRL-based software can
exhibit highly undesirable behavior. This is due to DRL training being based on
maximizing a reward function, which typically captures general trends but
cannot precisely capture, or rule out, certain behaviors of the system. In this
paper, we propose a novel framework aimed at drastically reducing the
undesirable behavior of DRL-based software, while maintaining its excellent
performance. In addition, our framework can provide engineers with a
comprehensible characterization of such undesirable behavior. Under the hood,
our approach is based on extracting decision tree classifiers from erroneous
state-action pairs, and then integrating these trees into the DRL training
loop, penalizing the system whenever it commits an error. We provide a
proof-of-concept implementation of our approach, and use it to evaluate the
technique on three significant case studies. We find that our approach can
extend existing DRL frameworks in a straightforward manner, and incurs only a
slight overhead in training time. Furthermore, it causes only a very slight
reduction in performance, and in some cases even improves it, while
significantly reducing the frequency of undesirable behavior.
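To make the mechanism concrete, the following is a minimal sketch of the scheme described above: a decision tree is fitted to labeled erroneous state-action pairs, and a reward-shaping function then penalizes the agent whenever the tree classifies the current state-action pair as an error. The environment features, the training data, and the PENALTY constant are hypothetical placeholders, not the paper's actual implementation.

```python
# Illustrative sketch: extract a decision tree from erroneous state-action
# pairs, then penalize the agent during training when the tree flags an error.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

PENALTY = 10.0  # assumed magnitude of the reward penalty (hypothetical)

# Labeled (state, action) samples: 1 = erroneous, 0 = acceptable (placeholder data).
X = np.array([[0.1, 0.9, 2],
              [0.8, 0.2, 0],
              [0.5, 0.5, 1],
              [0.9, 0.1, 2]])
y = np.array([1, 0, 0, 1])

# A shallow tree doubles as a comprehensible characterization of the errors.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

def shaped_reward(state, action, reward):
    """Subtract a penalty whenever the tree classifies (state, action) as an error."""
    sample = np.append(state, action).reshape(1, -1)
    if tree.predict(sample)[0] == 1:
        return reward - PENALTY
    return reward
```

In this sketch, shaped_reward would replace the raw environment reward inside the DRL training loop, leaving the rest of the training pipeline unchanged.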