Image-Text Matching (ITM) is a common task for evaluating the quality of
Vision and Language (VL) models. However, existing ITM benchmarks have a
significant limitation: many correspondences are missing, an artifact of
the data construction process itself. For example, a caption is matched
with only one image even though it can also describe other similar images,
and vice versa. To correct these massive false negatives, we construct
the Extended COCO Validation (ECCV) Caption dataset by supplying the missing
associations with machine and human annotators. We employ five state-of-the-art
ITM models with diverse properties for our annotation process. Our dataset
provides 3.6× more positive image-to-caption associations and 8.5× more
caption-to-image associations than the original MS-COCO. We also propose to
use an informative ranking-based metric rather than the popular Recall@K (R@K).
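To make the contrast concrete, the sketch below compares the two evaluation styles: Recall@K only checks whether any positive lands in the top K, while an average-precision-style ranking metric accounts for every positive. The IDs and ranking are hypothetical, and this plain average precision is only an illustration of the general idea, not the paper's exact metric.

```python
def recall_at_k(ranked_ids, positive_ids, k):
    """Recall@K: 1.0 if any positive appears among the top-K retrieved items."""
    return float(any(i in positive_ids for i in ranked_ids[:k]))

def average_precision(ranked_ids, positive_ids):
    """Average precision over the full ranked list: rewards retrieving
    *all* positives early, not just the first one."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_ids, start=1):
        if item in positive_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(len(positive_ids), 1)

# Hypothetical query with three true positives ranked at 1, 4, and 9.
ranked = [7, 2, 5, 3, 8, 1, 0, 6, 4, 9]
positives = {7, 3, 4}
print(recall_at_k(ranked, positives, 1))               # 1.0 -> R@1 looks perfect
print(round(average_precision(ranked, positives), 3))  # 0.611 -> AP penalizes late positives
```

With many valid matches per query, as in ECCV Caption, R@1 saturates as soon as one positive is ranked first, whereas a ranking-based score still reflects how the remaining positives are ordered.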
We re-evaluate 25 existing VL models on existing and proposed benchmarks.
We find that existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, and
CxC R@1, are highly correlated with each other, while model rankings change
when we shift to ECCV mAP.
Lastly, we delve into the effect of the bias introduced by the choice of
machine annotator. Source code and dataset are available at
https://github.com/naver-ai/eccv-caption