Causal Interpretation of Self-Attention in Pre-Trained Transformers
We propose a causal interpretation of self-attention in the Transformer
neural network architecture. We interpret self-attention as a mechanism that
estimates a structural equation model for a given input sequence of symbols
(tokens). The structural equation model can be interpreted, in turn, as a
causal structure over the input symbols under the specific context of the input
sequence. Importantly, this interpretation remains valid in the presence of
latent confounders. Following this interpretation, we estimate conditional
independence relations between input symbols by calculating partial
correlations between their corresponding representations in the deepest
attention layer. This enables learning the causal structure over an input
sequence using existing constraint-based algorithms. In this sense, existing
pre-trained Transformers can be used for zero-shot causal discovery. We
demonstrate this method by providing causal explanations for the outcomes of
Transformers in two tasks: sentiment classification (NLP) and recommendation.
Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023). arXiv admin note: text overlap with arXiv:2210.1062
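The conditional-independence step described in the abstract above can be illustrated with a minimal sketch. It assumes a covariance matrix over the input tokens is already available; how that matrix is derived from the attention layer is not specified in the abstract and is not reproduced here. The function names `partial_correlation` and `fisher_z_independent`, the threshold, and the toy covariance are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def partial_correlation(cov, i, j, cond):
    """Partial correlation of variables i and j given the variables in cond.

    Computed from the precision (inverse covariance) matrix of the
    sub-covariance restricted to {i, j} and the conditioning set.
    """
    idx = [i, j] + list(cond)
    sub = cov[np.ix_(idx, idx)]
    prec = np.linalg.pinv(sub)
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])


def fisher_z_independent(r, n_samples, n_cond, threshold=1.96):
    """Fisher z-test: True if the partial correlation r is not
    significantly different from zero (threshold is an assumed choice)."""
    stat = np.sqrt(n_samples - n_cond - 3) * abs(np.arctanh(r))
    return stat < threshold


# Toy covariance for a chain X -> Y -> Z with unit-variance noise terms
# (hypothetical data, not taken from the paper).
cov = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 2.0],
    [1.0, 2.0, 3.0],
])

r_marginal = partial_correlation(cov, 0, 2, [])    # X and Z are correlated
r_given_y = partial_correlation(cov, 0, 2, [1])    # vanishes once Y is given
```

In this toy chain, X and Z are marginally correlated but their partial correlation given Y is zero, which is the kind of relation a constraint-based algorithm uses to remove the X-Z edge.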
From Temporal to Contemporaneous Iterative Causal Discovery in the Presence of Latent Confounders
We present a constraint-based algorithm for learning causal structures from
observational time-series data, in the presence of latent confounders. We
assume a discrete-time, stationary structural vector autoregressive process,
with both temporal and contemporaneous causal relations. One may ask if
temporal and contemporaneous relations should be treated differently. The
presented algorithm gradually refines a causal graph by learning long-term
temporal relations before short-term ones, where contemporaneous relations are
learned last. This ordering of causal relations to be learnt leads to a
reduction in the required number of statistical tests. We validate this
reduction empirically and demonstrate that it leads to higher accuracy for
synthetic data and more plausible causal graphs for real-world data compared to
state-of-the-art algorithms.
Comment: Proceedings of the 40th International Conference on Machine Learning (ICML), 202
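The ordering described in the abstract above (long-term temporal relations first, contemporaneous relations last) can be sketched as a simple enumeration of candidate relations. This only illustrates the ordering; the statistical tests and graph refinement of the actual algorithm are not shown. The function name `ordered_candidate_edges`, the `(source, target, lag)` encoding, and the variable names are hypothetical.

```python
from itertools import product


def ordered_candidate_edges(variables, max_lag):
    """Yield candidate relations as (source, target, lag) tuples.

    Longest lags come first and lag 0 (contemporaneous) comes last,
    mirroring the ordering stated in the abstract.
    """
    for lag in range(max_lag, -1, -1):
        for src, dst in product(variables, repeat=2):
            if lag == 0 and src == dst:
                continue  # a variable has no contemporaneous relation with itself
            yield (src, dst, lag)


# Hypothetical two-variable process with a maximum lag of 2.
edges = list(ordered_candidate_edges(["x", "y"], max_lag=2))
```

Under the stationarity assumption stated in the abstract, a relation from `x` at time t-2 to `y` at time t is the same at every t, so a single `(x, y, 2)` entry stands for all of its time-shifted copies.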