Adversarial Infidelity Learning for Model Interpretation
Model interpretation is essential in data mining and knowledge discovery. It
can help understand the intrinsic model working mechanism and check if the
model has undesired characteristics. A popular approach to model
interpretation is Instance-wise Feature Selection (IFS), which assigns an
importance score to each feature of a data sample to explain how the model
generates its output for that sample. In this paper, we propose a
Model-agnostic Effective Efficient Direct (MEED) IFS framework for model
interpretation, mitigating concerns about sanity, combinatorial shortcuts,
model identifiability, and information transmission. We also focus on the
following setting: using the selected features to directly predict the output
of the given model, which serves as a primary evaluation metric for
model-interpretation methods. In addition to the features, we feed the output
of the given model to the explainer as an extra input, so that the explainer
learns from more accurate information. To learn the explainer, besides a
fidelity objective, we propose an Adversarial Infidelity Learning (AIL)
mechanism that boosts explanation learning by screening out relatively
unimportant features. Through theoretical and
experimental analysis, we show that our AIL mechanism can help learn the
desired conditional distribution between selected features and targets.
Moreover, we extend our framework by integrating efficient interpretation
methods as proper priors to provide a warm start. Comprehensive empirical
evaluation results are provided by quantitative metrics and human evaluation to
demonstrate the effectiveness and superiority of our proposed method. Our code
is publicly available online at https://github.com/langlrsw/MEED.
Comment: 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23--27, 2020, Virtual Event, US
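
Below is a minimal, self-contained sketch of the IFS setting described in this abstract: an explainer scores features per instance, the top-k features are kept, and an approximator predicts the given model's output from the masked input (the fidelity objective). This is not the authors' MEED implementation; the network sizes, the straight-through top-k mask, and the toy data are illustrative assumptions.

import torch
import torch.nn as nn

d, k, n_classes = 20, 5, 3          # feature dim, features to keep, classes

# Stand-in for the given black-box model to be interpreted.
black_box = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, n_classes))
# Explainer: per-instance feature importance scores.
explainer = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d))
# Approximator: predicts the black-box output from the selected features.
approximator = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_classes))

opt = torch.optim.Adam(list(explainer.parameters()) + list(approximator.parameters()), lr=1e-3)

x = torch.randn(256, d)                       # toy data
with torch.no_grad():
    y_model = black_box(x).argmax(dim=1)      # outputs produced by the given model

for _ in range(100):
    scores = explainer(x)                     # importance score for each feature
    topk = scores.topk(k, dim=1).indices
    mask = torch.zeros_like(x).scatter(1, topk, 1.0)
    # Straight-through trick: hard top-k mask in the forward pass,
    # gradients flow through the soft scores in the backward pass.
    mask = mask + torch.sigmoid(scores) - torch.sigmoid(scores).detach()
    logits = approximator(x * mask)           # predict model output from selected features
    loss = nn.functional.cross_entropy(logits, y_model)   # fidelity objective
    opt.zero_grad(); loss.backward(); opt.step()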
Representation Engineering: A Top-Down Approach to AI Transparency
In this paper, we identify and characterize the emerging area of
representation engineering (RepE), an approach to enhancing the transparency of
AI systems that draws on insights from cognitive neuroscience. RepE places
population-level representations, rather than neurons or circuits, at the
center of analysis, equipping us with novel methods for monitoring and
manipulating high-level cognitive phenomena in deep neural networks (DNNs). We
provide baselines and an initial analysis of RepE techniques, showing that they
offer simple yet effective solutions for improving our understanding and
control of large language models. We showcase how these methods can provide
traction on a wide range of safety-relevant problems, including honesty,
harmlessness, power-seeking, and more, demonstrating the promise of top-down
transparency research. We hope that this work catalyzes further exploration of
RepE and fosters advancements in the transparency and safety of AI systems.
Comment: Code is available at https://github.com/andyzoujm/representation-engineering
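
Below is a minimal numpy sketch of the population-level representation reading and steering idea described in this abstract, not the released RepE code: the placeholder activation matrices (standing in for hidden states extracted from a DNN at some layer), the difference-of-means reading vector, and the steering coefficient are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_hidden = 768

# Placeholder hidden states for paired stimuli (e.g., honest vs. dishonest
# completions of the same prompts); in practice these come from the model.
acts_pos = rng.normal(size=(100, d_hidden))
acts_neg = rng.normal(size=(100, d_hidden))

# Difference-of-means reading vector for the concept; PCA over the pairwise
# differences is a common alternative.
direction = (acts_pos - acts_neg).mean(axis=0)
direction /= np.linalg.norm(direction)

# Monitoring: project new activations onto the direction; larger scores
# indicate stronger expression of the concept at this layer.
new_acts = rng.normal(size=(10, d_hidden))
scores = new_acts @ direction

# Control: shift activations along the direction before the next layer.
alpha = 4.0
steered_acts = new_acts + alpha * direction
print(scores.round(3))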