Today, using Large-scale generative Language Models (LLMs), it is possible to
simulate free responses to interview questions like those traditionally
analyzed using qualitative research methods. Qualitative methodology
encompasses a broad family of techniques involving manual analysis of
open-ended interviews or conversations conducted freely in natural language.
Here we consider whether artificial "silicon participants" generated by LLMs
may be productively studied using qualitative methods aiming to produce
insights that could generalize to real human populations. The key concept in
our analysis is algorithmic fidelity, a term introduced by Argyle et al. (2023)
capturing the degree to which LLM-generated outputs mirror human
sub-populations' beliefs and attitudes. By definition, high algorithmic
fidelity suggests latent beliefs elicited from LLMs may generalize to real
humans, whereas low algorithmic fidelity renders such research invalid. Here we
used an LLM to generate interviews with silicon participants matching specific
demographic characteristics one-for-one with a set of human participants. Using
framework-based qualitative analysis, we showed the key themes obtained from
both human and silicon participants were strikingly similar. However, when we
analyzed the structure and tone of the interviews, we found even more striking
differences. We also found evidence of the hyper-accuracy distortion described
by Aher et al. (2023). We conclude that the LLM we tested (GPT-3.5) does not
have sufficient algorithmic fidelity for research on it to be expected to
generalize to human populations. However, the rapid pace of LLM research makes
it plausible that this could change in the future. Thus we stress the need to
establish epistemic
norms now around how to assess validity of LLM-based qualitative research,
especially concerning the need to ensure representation of heterogeneous lived
experiences.

Comment: 46 pages, 5 tables, 5 figures