If you ask a human to describe an image, they might do so in a thousand
different ways. Traditionally, image captioning models are trained to generate
a single "best" (most like a reference) image caption. Unfortunately, doing so
encourages captions that are "informationally impoverished," focusing on only
a subset of the possible details while ignoring other potentially useful
information in the scene. In this work, we introduce a simple yet novel
method, "Image Captioning by Committee Consensus" (IC3), designed to generate a
single caption that captures high-level details from several annotator
viewpoints. Humans rate captions produced by IC3 as at least as helpful as
those produced by baseline SOTA models more than two-thirds of the time, and
IC3 can improve the
performance of SOTA automated recall systems by up to 84%, outperforming single
human-generated reference captions, and indicating significant improvements
over SOTA approaches for visual description. Code is available at
https://davidmchan.github.io/caption-by-committee/
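
The abstract does not spell out the mechanism, so the following is only a
minimal sketch of one plausible committee-consensus pipeline: sample several
candidate captions from a single pretrained captioner (standing in for the
multiple annotator viewpoints) and build a prompt asking a language model to
merge them into one caption. The model checkpoint, sampling settings, prompt
wording, and the helper names `committee_captions` and `consensus_prompt` are
all assumptions for illustration, not the authors' released implementation
(see the URL above for that).

```python
# Hedged sketch of a committee-consensus captioning pipeline; an illustration
# of the general idea, not the authors' released code. Checkpoint, sampling
# settings, and prompt wording are assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def committee_captions(image: Image.Image, k: int = 10) -> list[str]:
    """Sample k diverse captions (the "committee") from one captioner."""
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,          # temperature sampling for caption diversity
        temperature=1.0,
        num_return_sequences=k,
        max_new_tokens=30,
    )
    return [processor.decode(ids, skip_special_tokens=True) for ids in out]

def consensus_prompt(captions: list[str]) -> str:
    """Build a summarization prompt; pass its output to any LLM to obtain a
    single caption combining the committee's details."""
    listing = "\n".join(f"- {c}" for c in captions)
    return (
        "The following captions all describe the same image:\n"
        f"{listing}\n"
        "Write a single caption that combines the details above."
    )

# Usage: candidates = committee_captions(Image.open("photo.jpg"));
# then feed consensus_prompt(candidates) to a language model of your choice.
```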