To explain the "black-box" properties of AI models, many approaches, such as post
hoc and intrinsically interpretable models, have been proposed to provide
plausible explanations that identify the human-understandable features or
concepts a trained model uses to make predictions. Attention mechanisms have
also been widely used to aid model interpretability by visualizing which of
those features a model attends to. However, the problem of configuring an
interpretable model that effectively communicates and coordinates among its
computational modules has
received less attention. A recently proposed shared global workspace theory
demonstrated that networks of distributed modules can benefit from sharing
information with a bandwidth-limited working memory because the communication
constraints encourage specialization, compositionality, and synchronization
among the modules. Inspired by this, we consider how such shared working
memories can be realized to build intrinsically interpretable models with
better interpretability and performance. Toward this end, we propose
Concept-Centric Transformers, a simple yet effective configuration of the
shared global workspace for interpretability that consists of: i) an
object-centric architecture for extracting semantic concepts from input
features, ii) a cross-attention mechanism between the learned concept
embeddings and the input embeddings, and iii) standard classification and
additional explanation losses
to allow human analysts to directly assess an explanation for the model's
classification reasoning. We test our approach against existing
concept-based methods on classification tasks for various datasets, including
CIFAR100 (super-classes), CUB-200-2011 (bird species), and ImageNet, and we
show that our model not only achieves better classification accuracy than all
selected methods across all problems but also generates more consistent
concept-based
explanations of the classification output.
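
For illustration, the cross-attention between the learned concept embeddings and the input embeddings (component ii above) can be pictured roughly as follows. This is a minimal PyTorch sketch under our own simplifying assumptions; the module name ConceptCrossAttention and parameters such as num_concepts and dim are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ConceptCrossAttention(nn.Module):
    """Hypothetical sketch: learned concept slots query input patch embeddings."""

    def __init__(self, num_concepts: int = 10, dim: int = 128):
        super().__init__()
        # Learned concept embeddings act as queries into the shared workspace.
        self.concept_slots = nn.Parameter(torch.randn(num_concepts, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, patch_embeddings: torch.Tensor):
        # patch_embeddings: (batch, num_patches, dim) input feature embeddings
        b = patch_embeddings.size(0)
        q = self.to_q(self.concept_slots).expand(b, -1, -1)  # (b, num_concepts, dim)
        k = self.to_k(patch_embeddings)                       # (b, num_patches, dim)
        v = self.to_v(patch_embeddings)
        # Scaled dot-product attention: each concept slot attends over all inputs.
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        concepts = attn @ v    # per-concept representations, (b, num_concepts, dim)
        return concepts, attn  # attn doubles as a concept-to-input explanation map
```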