Transformer-based pretrained large language models (PLM) such as BERT and GPT
have achieved remarkable success in NLP tasks. However, PLMs are prone to
encoding stereotypical biases. Although a burgeoning literature has emerged on
stereotypical bias mitigation in PLMs, such as work on debiasing gender and
racial stereotyping, how such biases manifest and behave internally within PLMs
remains largely unknown. Understanding the internal stereotyping mechanisms may
allow better assessment of model fairness and guide the development of
effective mitigation strategies. In this work, we focus on attention heads, a
major component of the Transformer architecture, and propose a bias analysis
framework to explore and identify a small set of biased heads that are found to
contribute to a PLM's stereotypical bias. We conduct extensive experiments to
validate the existence of these biased heads and to better understand how they
behave. We investigate gender and racial bias in the English language in two
types of Transformer-based PLMs: the encoder-based BERT model and the
decoder-based autoregressive GPT model. Overall, the results shed light on
understanding the bias behavior in pretrained language models