Flexible Model Interpretability through Natural Language Model Editing
Model interpretability and model editing are crucial goals in the age of
large language models. Interestingly, there exists a link between these two
goals: if a method is able to systematically edit model behavior with regard to
a human concept of interest, this editor method can help make internal
representations more interpretable by pointing towards relevant representations
and systematically manipulating them.
Comment: Extended Abstract -- work in progress. BlackboxNLP202
Model Interpretability through the Lens of Computational Complexity
In spite of several claims stating that some models are more interpretable
than others -- e.g., "linear models are more interpretable than deep neural
networks" -- we still lack a principled notion of interpretability to formally
compare among different classes of models. We make a step towards such a notion
by studying whether folklore interpretability claims have a correlate in terms
of computational complexity theory. We focus on local post-hoc explainability
queries that, intuitively, attempt to answer why individual inputs are
classified in a certain way by a given model. In a nutshell, we say that a
class $\mathcal{C}_1$ of models is more interpretable than another class
$\mathcal{C}_2$ if the computational complexity of answering post-hoc queries
for models in $\mathcal{C}_2$ is higher than for those in $\mathcal{C}_1$. We
prove that this notion provides a good theoretical counterpart to current
beliefs on the interpretability of models; in particular, we show that under
our definition and assuming standard complexity-theoretical assumptions (such
as $\mathrm{P} \neq \mathrm{NP}$), both linear and tree-based models are strictly more
interpretable than neural networks. Our complexity analysis, however, does not
provide a clear-cut difference between linear and tree-based models, as we
obtain different results depending on the particular post-hoc explanations
considered. Finally, by applying a finer complexity analysis based on
parameterized complexity, we are able to prove a theoretical result suggesting
that shallow neural networks are more interpretable than deeper ones.
Comment: 36 pages, including 9 pages of main text. This is the arXiv version
of the NeurIPS'2020 paper. Except for minor differences that could be
introduced by the publisher, the only difference should be the addition of
the appendix, which contains all the proofs that do not appear in the main
tex