This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale
dataset for humour generation and understanding. Humour is an abstract,
subjective, and context-dependent cognitive construct involving several
cognitive factors, making it a challenging task to generate and interpret.
Hence, humour generation and understanding can serve as a new task for
evaluating the ability of deep-learning methods to process abstract and
subjective information. Due to the scarcity of data, humour-related generation
tasks such as captioning remain under-explored. To address this gap,
OxfordTVG-HIC offers approximately 2.9M image-text pairs with humour scores to
train a generalizable humour captioning model. Contrary to existing captioning
datasets, OxfordTVG-HIC features a wide range of emotional and semantic
diversity resulting in out-of-context examples that are particularly conducive
to generating humour. Moreover, OxfordTVG-HIC is curated devoid of offensive
content. We also show how OxfordTVG-HIC can be leveraged for evaluating the
humour of a generated text. Through explainability analysis of the trained
models, we identify the visual and linguistic cues influential for evoking
humour prediction (and generation). We observe qualitatively that these cues
are aligned with the benign violation theory of humour in cognitive psychology.Comment: Accepted by ICCV 202