Many online services, such as search engines, social media platforms, and
digital marketplaces, are advertised as being available to any user, regardless
of their age, gender, or other demographic factors. However, there are growing
concerns that these services may systematically underserve some groups of
users. In this paper, we present a framework for internally auditing such
services for differences in user satisfaction across demographic groups, using
search engines as a case study. We first explain the pitfalls of naïvely
comparing the behavioral metrics that are commonly used to evaluate search
engines. We then propose three methods for measuring latent differences in user
satisfaction from observed differences in evaluation metrics. To develop these
methods, we draw on ideas from the causal inference literature and the
multilevel modeling literature. Our framework is broadly applicable to other
online services and provides general insight into interpreting their
evaluation metrics.

Comment: 8 pages. Accepted at WWW 201