Pre-trained language models (PLMs) have outperformed other NLP models on a
wide range of tasks. Seeking a more thorough understanding of their
capabilities and inner workings, researchers have established the extent to
which they capture lower-level knowledge like grammaticality, and mid-level
semantic knowledge like factual understanding. However, there is still little
understanding of their knowledge of higher-level aspects of language. In
particular, despite the importance of sociodemographic aspects in shaping our
language, the question of whether, where, and how PLMs encode these aspects,
e.g., gender or age, remains unexplored. We address this research gap by
probing the sociodemographic knowledge of different single-GPU PLMs on multiple
English data sets via traditional classifier probing and information-theoretic
minimum description length probing. Our results show that PLMs do encode these
sociodemographics, and that this knowledge is sometimes spread across the
layers of some of the tested PLMs. We further conduct a multilingual analysis
and investigate the effect of supplementary training to explore to what
extent, where, and with what amount of pre-training data the knowledge is
encoded. Our overall results indicate that sociodemographic knowledge is still
a major challenge for NLP. PLMs require large amounts of pre-training data to
acquire this knowledge, and models that excel in general language understanding
do not seem to possess more knowledge about these aspects.

Comment: Accepted for publication at EMNLP 202
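To make the probing setup concrete, below is a minimal sketch of traditional classifier probing as described in the abstract: a frozen PLM's per-layer representations are fed to a lightweight classifier that predicts a sociodemographic label, with per-layer accuracy indicating where the knowledge is encoded. The model name (bert-base-uncased), the toy texts, and the binary labels are illustrative placeholders, not the paper's actual models or data sets.

    # Sketch of classifier probing: freeze a PLM, pool per-layer token
    # embeddings into sentence vectors, and train a logistic-regression
    # probe per layer to predict a sociodemographic label (e.g., age group).
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased",
                                      output_hidden_states=True)
    model.eval()

    # Placeholder data: in practice these would be labeled utterances.
    texts = ["example utterance one", "example utterance two"] * 50
    labels = [0, 1] * 50  # placeholder sociodemographic labels

    @torch.no_grad()
    def all_layer_embeddings(texts):
        """Mean-pooled sentence vectors from every frozen PLM layer."""
        enc = tokenizer(texts, padding=True, truncation=True,
                        return_tensors="pt")
        hidden_states = model(**enc).hidden_states      # (layers+1) tensors
        mask = enc["attention_mask"].unsqueeze(-1).float()  # mask padding
        return [((h * mask).sum(1) / mask.sum(1)).numpy()
                for h in hidden_states]

    # Probe each layer separately to see where the knowledge resides.
    for layer, X in enumerate(all_layer_embeddings(texts)):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.3, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        print(f"layer {layer:2d}: probe accuracy = "
              f"{probe.score(X_te, y_te):.2f}")

The paper's second method, minimum description length probing, replaces the plain accuracy score above with an information-theoretic codelength, so that a layer is credited only if the label can be compressed cheaply from its representations; the training loop structure is otherwise similar.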