RL-LIM: Reinforcement Learning-based Locally Interpretable Modeling
Understanding black-box machine learning models is important for their
widespread adoption. However, developing globally interpretable models that
explain the behavior of the entire model is challenging. An alternative
approach is to explain black-box models by explaining individual
predictions using a locally interpretable model. In this paper, we propose a
novel method for locally interpretable modeling: Reinforcement Learning-based
Locally Interpretable Modeling (RL-LIM). RL-LIM employs reinforcement learning
to select a small number of samples and distill the black-box model prediction
into a low-capacity locally interpretable model. Training is guided with a
reward that is obtained directly by measuring agreement of the predictions from
the locally interpretable model with those of the black-box model. RL-LIM nearly matches
the overall prediction performance of black-box models while yielding
human-like interpretability, and significantly outperforms state-of-the-art
locally interpretable models in terms of overall prediction performance and
fidelity.
Comment: 18 pages, 7 figures, 7 tables
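
The agreement-based reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the black-box function, the one-dimensional weighted line fit, and the names `black_box`, `fit_weighted_line`, and `fidelity_reward` are all hypothetical stand-ins, and the two sample selections are fixed by hand rather than learned with reinforcement learning. The sketch only shows how a reward could score a selection by how closely the distilled local model agrees with the black box at the instance being explained.

```python
def black_box(x):
    # Hypothetical stand-in for an opaque model: a simple nonlinear function.
    return x * x


def fit_weighted_line(xs, ys, ws):
    """Weighted least squares for y ~ a*x + b, with per-sample weights ws."""
    sw = sum(ws)
    if sw == 0:
        return 0.0, 0.0  # nothing selected: fall back to a constant-zero model
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = cov / var if var > 0 else 0.0
    return a, my - a * mx


def fidelity_reward(selection, xs, probe):
    """Negative absolute disagreement between local line and black box at probe."""
    ys = [black_box(x) for x in xs]  # distillation targets come from the black box
    a, b = fit_weighted_line(xs, ys, selection)
    return -abs((a * probe + b) - black_box(probe))


# Training inputs on a grid, and the instance whose prediction we want to explain.
xs = [i / 10 for i in range(-20, 21)]
probe = 1.0

# Two hand-picked candidate selections (0/1 per sample); an RL agent would learn these.
select_all = [1.0] * len(xs)
select_near = [1.0 if abs(x - probe) < 0.25 else 0.0 for x in xs]

global_reward = fidelity_reward(select_all, xs, probe)
local_reward = fidelity_reward(select_near, xs, probe)
print(global_reward, local_reward)
```

In this toy setting, the selection restricted to a few samples near the probe earns a reward closer to zero than selecting every sample, which is the kind of signal a selection policy could be trained to maximize.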