Evaluation of large language models within GenAI in qualitative research

Awiti, Enid; Bhaumik, Runa; Mason, Linda; Mehta, Supriya D.; Otieno, Fredrick O.; Paul, Souvik; Phillips-Howard, Penelope A.; Young, Sophie; Zulaika, Garazi

research article

Publication date: 7 October 2025

Abstract

Large language models (LLMs) perform tasks such as summarizing information and analyzing sentiment to generate meaningful and natural responses. The application of GenAI incorporating LLMs has potential utility for conducting qualitative research. Using a qualitative study that assessed the impact of the COVID-19 pandemic on the sexual and reproductive health of adolescent girls and young women (AGYW) in rural western Kenya, our objective was to compare thematic analyses conducted by GenAI using an LLM with qualitative analysis conducted by humans, with regard to major themes identified, selection of supportive quotes, and quality of quotes; and secondarily to explore quantitative and qualitative sentiment analysis conducted by the GenAI. We interfaced with GPT-4o through Google Colaboratory. After inputting the transcripts and pre-processing, we constructed a standardized task prompt. Two investigators independently reviewed the GenAI product using a rubric based on qualitative research standards. When compared to human-derived themes, we did not find disagreement with the sub-themes raised by GenAI, but did not consider some to rise to the level of a theme. Performance was low and variable with regard to selection of quotes that were consistent with and strongly supportive of thematic and sentiment analysis. Hallucinations ranged from a single word or phrase change to truncation or combinations of text that led to modified meaning. GenAI identified numerous and relevant biases, primarily related to the underlying training data and its lack of cultural understanding. Few prior studies have directly compared LLM-driven thematic coding with human coding in qualitative analysis, and our study, grounded in qualitative study rigor, allowed for a thorough evaluation. GenAI implemented in GPT-4o was unable to provide a thematic analysis that is indistinguishable from a human analysis.
We recommend that it currently be used as an aid in identifying themes, keywords, and basic narrative, and potentially as a check for human error or bias. However, until it can eliminate hallucinations, provide better contextual understanding of quotes, and undertake deeper scrutiny of data, it is not reliable or sophisticated enough to undertake a rigorous thematic analysis equal in quality to that of experienced qualitative researchers.
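
The abstract reports hallucinated quotes ranging from a single changed word to truncated or recombined text. As an illustration only, and not the procedure used in the study, the following minimal Python sketch shows one way a reviewer might screen GenAI-selected quotes against the source transcript. The function names, the 0.8 similarity threshold, and the sample text are all assumptions made for this example; the sample sentence is invented placeholder text, not study data.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_quote(quote: str, transcript: str, threshold: float = 0.8) -> str:
    """Classify a quote as 'verbatim', 'altered', or 'not_found'.

    Hypothetical helper: the threshold is an arbitrary assumption and would
    need tuning against manually reviewed quotes.
    """
    q, t = normalize(quote), normalize(transcript)
    if not q:
        return "not_found"
    if q in t:
        return "verbatim"

    # Slide a window of the quote's word length over the transcript and keep
    # the best character-level similarity found.
    q_words, t_words = q.split(), t.split()
    n = len(q_words)
    best = 0.0
    for i in range(max(1, len(t_words) - n + 1)):
        window = " ".join(t_words[i:i + n])
        ratio = difflib.SequenceMatcher(None, q, window).ratio()
        best = max(best, ratio)
    return "altered" if best >= threshold else "not_found"


if __name__ == "__main__":
    # Invented placeholder transcript, not taken from the study.
    transcript = "The clinic was closed for many weeks and we could not travel to town"
    for quote in (
        "the clinic was closed for many weeks",
        "the clinic was closed for several weeks",
        "my mother sold vegetables at the market",
    ):
        print(quote, "->", check_quote(quote, transcript))
```

A screen like this can only flag wording changes. Whether a quote is consistent with and strongly supportive of a theme still requires human judgment, as the rubric-based review in the study implies.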
