Wit is a form of rich interaction that is often grounded in a specific
situation (e.g., a comment in response to an event). In this work, we attempt
to build computational models that can produce witty descriptions for a given
image. Inspired by a cognitive account of humor appreciation, we employ
linguistic wordplay, specifically puns, in image descriptions. We develop two
approaches which involve retrieving witty descriptions for a given image from a
large corpus of sentences, or generating them via an encoder-decoder neural
network architecture. We compare our approaches against meaningful baselines
via human studies and show substantial improvements. We find that
when a human is subject to constraints similar to the model's regarding word
usage and style, people vote the image descriptions generated by our model to
be slightly wittier than human-written witty descriptions. Unsurprisingly,
humans are almost always wittier than the model when they are free to choose
the vocabulary, style, etc.

Comment: NAACL 2018 (11 pages