Commonsense norms are defeasible by context: reading books is usually great,
but not when driving a car. While contexts can be explicitly described in
language, in embodied scenarios, contexts are often provided visually. This
type of visually grounded reasoning about defeasible commonsense norms is
generally easy for humans, but (as we show) poses a challenge for machines, as
it necessitates both visual understanding and reasoning about commonsense
norms. We construct a new multimodal benchmark for studying visually grounded
commonsense norms: NORMLENS. NORMLENS consists of 10K human judgments
accompanied by free-form explanations covering 2K multimodal situations, and
serves as a probe to address two questions: (1) to what extent can models align
with average human judgment? and (2) how well can models explain their
predicted judgments? We find that the judgments and explanations of
state-of-the-art models are not well aligned with human annotations. Additionally, we
present a new approach to better align models with humans by distilling social
commonsense knowledge from large language models. The data and code are
released at https://seungjuhan.me/normlens.

Comment: Published as a conference paper at EMNLP 2023 (long