With the rapid growth of machine learning (ML) usage, particularly of emerging
Large Language Models (LLMs), understanding the semantics encoded in their
internal workings is crucial. Causal analysis focuses on defining and
quantifying semantics, while gradient-based approaches are central to
explainable AI (XAI), which tackles the interpretation of the black box.
Combining the two, the study of how a model's internal mechanisms shed light on
its causal effect has become integral to evidence-based decision-making. A parallel line
of research has shown that intersectionality - the combined impact of an
individual's multiple demographic attributes - can be formulated as an
Average Treatment Effect (ATE). First, this study shows that the
hateful meme detection problem can be formulated as an ATE, guided by the
principles of intersectionality, and that a modality-wise summarization of
gradient-based attention attribution scores can delineate the distinct
behaviors of three Transformer-based models with respect to the ATE. Subsequently, we
show that LLaMA2, a recent LLM, can disentangle the intersectional nature of
meme detection in an in-context learning setting, with its mechanistic
properties elucidated via meta-gradient, a secondary form of gradient. In
conclusion, this research contributes to the ongoing dialogue surrounding XAI
and the multifaceted nature of ML models.
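
For reference, the ATE invoked above is conventionally written as the expected
difference between potential outcomes. The form below is the standard textbook
definition and not necessarily the notation adopted in this study. Reading T as
a binary indicator of the treatment (here, hypothetically, the intersectional
image-text combination) and Y as the model's hatefulness output is an
assumption made only for illustration.

```latex
% Standard ATE definition; T and Y are illustrative symbols, not the paper's.
\[
  \mathrm{ATE}
  = \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr]
  = \mathbb{E}\bigl[\,Y \mid do(T = 1)\,\bigr]
  - \mathbb{E}\bigl[\,Y \mid do(T = 0)\,\bigr]
\]
```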