The increasing size and complexity of modern ML systems have improved their
predictive capabilities but made their behavior harder to explain. Many
techniques for model explanation have been developed in response, but we lack
clear criteria for assessing these techniques. In this paper, we cast model
explanation as the causal inference problem of estimating causal effects of
real-world concepts on the output behavior of ML models given actual input
data. We introduce CEBaB, a new benchmark dataset for assessing concept-based
explanation methods in Natural Language Processing (NLP). CEBaB consists of
short restaurant reviews with human-generated counterfactual reviews in which
an aspect (food, noise, ambiance, service) of the dining experience was
modified. Original and counterfactual reviews are annotated with
multiply-validated sentiment ratings at the aspect and review levels. The
rich structure of CEBaB allows us to go beyond input features to study the
effects of abstract, real-world concepts on model behavior. We use CEBaB to
compare the quality of a range of concept-based explanation methods covering
different assumptions and conceptions of the problem, and we seek to establish
natural metrics for the comparative assessment of these methods.
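As an illustrative formalization of this causal framing (the notation here is ours, a sketch rather than the paper's official definitions): let $\mathcal{N}$ be the model under study, $x$ an input review in which a concept $C$ (e.g., food sentiment) takes value $c$, and $x_{c \to c'}$ a human-generated counterfactual of $x$ in which $C$ is edited to $c'$. The individual-level causal effect of the concept change on the model's output is then

$$\mathcal{N}(x_{c \to c'}) - \mathcal{N}(x),$$

and averaging this difference over inputs yields an aggregate effect $\mathbb{E}_{x}\big[\mathcal{N}(x_{c \to c'}) - \mathcal{N}(x)\big]$. Because CEBaB pairs each original review with edited counterfactuals, such effects can be estimated directly from the data, providing a reference point against which concept-based explanation methods can be compared.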