Incorporating granularity bias as the margin into contrastive loss for video captioning
Video captioning models often suffer from the long-tail distribution of phrases,
which makes them prone to generating vague sentences instead of
accurate ones. However, existing debiasing strategies tend to draw on external
knowledge to build dependency trees of words or refine frequency distribution
by complex losses and extra input features, which lack interpretability and are
hard to train. To mitigate the impact of granularity bias on the model, we
introduce a statistics-based bias extractor. This extractor quantifies the
information content within sentences and videos, providing an estimate of the
likelihood that a video-sentence pair is affected by granularity bias.
Furthermore, with the growing trend of integrating contrastive learning methods
into video captioning tasks, we use a bidirectional triplet loss to get more
negative samples in a batch. Subsequently, we incorporate the margin score into
the contrastive learning loss, establishing distinct training objectives for
head and tail sentences. This approach makes training on tail samples
more effective. Our simple yet effective loss, incorporating
Granularity bias, is referred to as the Margin-Contrastive Loss (GMC Loss). The
proposed model demonstrates state-of-the-art performance, reaching a CIDEr of
57.17 on MSRVTT and up to 138.68 on MSVD.
Comment: 6 pages, 2 figures
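The abstract does not give the formula for the GMC Loss, so the following is only a minimal sketch of the general idea it describes: a bidirectional (video-to-sentence and sentence-to-video) triplet loss in which each matched pair carries its own margin. The function name, the hinge form, the averaging, and the use of a per-pair bias score directly as the margin are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def bidirectional_margin_triplet_loss(
    sim: Sequence[Sequence[float]],
    margins: Sequence[float],
) -> float:
    """Hinge-style triplet loss over an n x n batch similarity matrix.

    sim[i][j] is the similarity between video i and sentence j; matched
    pairs lie on the diagonal. margins[i] is a per-pair margin
    (hypothetical: it stands in for a granularity-bias score of pair i).
    """
    n = len(sim)
    if n == 0 or any(len(row) != n for row in sim) or len(margins) != n:
        raise ValueError("sim must be n x n and margins must have length n")
    if n == 1:
        return 0.0  # no in-batch negatives available

    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # video -> sentence: sentence j is a negative for video i
            total += max(0.0, margins[i] + sim[i][j] - pos)
            # sentence -> video: video j is a negative for sentence i
            total += max(0.0, margins[i] + sim[j][i] - pos)

    # Average over the 2 * n * (n - 1) negatives used in both directions.
    return total / (2 * n * (n - 1))


if __name__ == "__main__":
    sim = [[1.0, 0.0], [0.0, 1.0]]
    # A larger margin on pair 0 demands a wider gap from its negatives.
    print(bidirectional_margin_triplet_loss(sim, [1.5, 0.5]))  # 0.25
```

Using both directions yields two negatives per off-diagonal entry, which is one way to read "more negative samples in a batch". Varying the margin per pair is what gives pairs with different bias scores different training objectives in this sketch.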