This paper addresses Single-Positive Multi-label Learning.
In general multi-label learning, a model learns to predict multiple labels or
categories for a single input image. This contrasts with standard
multi-class image classification, where the task is to predict a single label
from many possible labels for an image. Single-Positive Multi-label Learning
(SPML) specifically considers learning to predict multiple labels when there is
only a single annotation per image in the training data. Multi-label learning
is in many ways a more realistic task than single-label learning, as real-world
data often involves instances belonging to multiple categories simultaneously;
however, most common computer vision datasets predominantly contain single
labels due to the inherent complexity and cost of collecting multiple
high-quality annotations for each instance. We propose a novel approach called
Vision-Language Pseudo-Labeling (VLPL), which uses a vision-language model to
suggest strong positive and negative pseudo-labels, and outperforms the current
state-of-the-art methods by 5.5% on Pascal VOC, 18.4% on MS-COCO, 15.2% on
NUS-WIDE, and
8.4% on CUB-Birds. Our code and data are available at
https://github.com/mvrl/VLPL
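
The pseudo-labeling idea summarized above can be illustrated with a minimal sketch. It assumes that a vision-language model yields one embedding per image and one per label name, and that unobserved labels are assigned by thresholding cosine similarity. The function names, the thresholds (0.8 and 0.2), and the toy two-dimensional embeddings are hypothetical and chosen only for illustration; the abstract does not specify how VLPL selects its pseudo-labels.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pseudo_label(image_emb, label_embs, observed_idx,
                 pos_thresh=0.8, neg_thresh=0.2):
    """Return one entry per label: 1 (positive), 0 (negative), None (unknown).

    Hypothetical rule, not the paper's: the single annotated label stays
    positive; every other label is decided by its similarity to the image.
    """
    labels = []
    for i, label_emb in enumerate(label_embs):
        if i == observed_idx:
            labels.append(1)       # the one annotated positive is always kept
            continue
        score = cosine(image_emb, label_emb)
        if score >= pos_thresh:
            labels.append(1)       # confident pseudo-positive
        elif score <= neg_thresh:
            labels.append(0)       # confident pseudo-negative
        else:
            labels.append(None)    # too uncertain, left unlabeled
    return labels


# Toy embeddings standing in for vision-language model outputs.
image = [1.0, 0.0]
label_texts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
result = pseudo_label(image, label_texts, observed_idx=0)
print(result)
```

In this sketch, labels whose similarity falls between the two thresholds stay unlabeled, so a training loss would need to skip them; that choice is also an assumption rather than something stated in the abstract.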