Recent work in vision-and-language demonstrates that large-scale pretraining
can learn generalizable models that are efficiently transferable to downstream
tasks. While this may improve dataset-scale aggregate metrics, analyzing
performance around hand-crafted subgroups targeting specific bias dimensions
reveals systemic undesirable behaviors. However, this subgroup analysis is
frequently stalled by annotation efforts, which require extensive time and
resources to collect the necessary data. Prior art attempts to automatically
discover subgroups to circumvent these constraints, but such methods typically
rely on model behavior over existing task-specific annotations, rapidly degrade
on inputs more complex than "tabular" data, and do not study
vision-and-language models. This paper presents VLSlice, an interactive system
vision-and-language models. This paper presents VLSlice, an interactive system
enabling user-guided discovery of coherent representation-level subgroups with
consistent visiolinguistic behavior, denoted as vision-and-language slices,
from unlabeled image sets. We show that VLSlice enables users to quickly
generate diverse high-coherency slices in a user study (n=22) and release the
tool publicly.

Comment: Conference paper at ICCV 2023. 17 pages, 11 figures.
https://ericslyman.com/vlslice
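
To make the idea of representation-level slices concrete, below is a minimal, hypothetical sketch of how one might group unlabeled images into candidate slices and score their consistency against a text query using CLIP embeddings and k-means. The choice of CLIP, k-means, and the mean image-text similarity score are illustrative assumptions, not the pipeline VLSlice actually implements.

```python
# Illustrative sketch only; not the paper's method.
# Assumes OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git),
# PyTorch, Pillow, and scikit-learn are installed.
import clip
import torch
from PIL import Image
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_images(paths):
    """Embed unlabeled images into the joint vision-language space (unit-normalized)."""
    with torch.no_grad():
        batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
        feats = model.encode_image(batch)
    return (feats / feats.norm(dim=-1, keepdim=True)).float().cpu().numpy()

def candidate_slices(paths, query_text, k=10):
    """Cluster images into k candidate slices and report mean similarity to a text query.

    A slice whose members all score similarly against the query is a rough proxy
    for the "consistent visiolinguistic behavior" the abstract describes.
    """
    img_feats = embed_images(paths)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(img_feats)
    with torch.no_grad():
        txt = model.encode_text(clip.tokenize([query_text]).to(device))
        txt = (txt / txt.norm(dim=-1, keepdim=True)).float().cpu().numpy()
    sims = img_feats @ txt.T  # (N, 1) image-text similarities
    return {c: float(sims[labels == c].mean()) for c in range(k)}
```

In VLSlice itself this grouping and scoring is interactive and user-guided rather than a one-shot clustering pass; the sketch only illustrates the underlying notion of forming and scoring subgroups directly in representation space, without task-specific labels.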