TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

Abstract

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a single tag (word) while neglecting other pertinent tags, and stems from CLIP embeddings prioritizing one specific tag in image-text relationships. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. We first extract all image-relevant tags from the text based on their similarity to the nearest pixels in the image. We then distill a combined mask containing the extracted tags' content into a text-derived mask. This approach ensures unbiased image-text alignment of CLIP-based models using only image-text pairs, without requiring additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code and data are available at https://github.com/shjo-april/TTD.
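To make the two-step recipe concrete, below is a minimal PyTorch sketch under our own assumptions about tensor shapes and scoring. The threshold value, the max-aggregation of per-tag similarity maps, and the MSE distillation objective are illustrative stand-ins rather than the paper's exact choices; the authors' implementation is at the repository linked above.

```python
import torch
import torch.nn.functional as F

def extract_relevant_tags(pixel_embeds, tag_embeds, threshold):
    """Step 1: keep tags whose embedding matches at least one pixel.

    pixel_embeds: (N, D) dense patch/pixel embeddings from a CLIP-like encoder
    tag_embeds:   (T, D) text embeddings, one per candidate tag in the caption
    Returns a boolean mask over the T tags.
    """
    sim = F.normalize(tag_embeds, dim=-1) @ F.normalize(pixel_embeds, dim=-1).T
    # Score each tag by its *nearest* (most similar) pixel, as in the abstract.
    return sim.max(dim=-1).values >= threshold

def distillation_loss(pixel_embeds, tag_embeds, text_embed, keep):
    """Step 2: align the text-derived mask with the combined tag mask.

    The combined mask aggregates per-tag similarity maps over the kept tags
    (max-aggregation is an assumption here); the text-derived mask is the
    similarity map of the whole caption. MSE is a stand-in for the paper's
    distillation objective. Assumes at least one tag survived step 1.
    """
    pixels = F.normalize(pixel_embeds, dim=-1)
    tag_maps = F.normalize(tag_embeds[keep], dim=-1) @ pixels.T        # (T_kept, N)
    combined_mask = tag_maps.max(dim=0).values                         # (N,)
    text_mask = (F.normalize(text_embed, dim=-1) @ pixels.T).squeeze(0)  # (N,)
    # The combined tag mask acts as the (detached) teacher signal.
    return F.mse_loss(text_mask, combined_mask.detach())

# Toy usage with random tensors standing in for real CLIP features.
torch.manual_seed(0)
pixels = torch.randn(14 * 14, 512)   # e.g. a 14x14 patch grid, D = 512
tags = torch.randn(3, 512)           # e.g. embeddings for ["dog", "frisbee", "grass"]
caption = torch.randn(1, 512)        # embedding of the full caption
keep = extract_relevant_tags(pixels, tags, threshold=0.0)
print(distillation_loss(pixels, tags, caption, keep))
```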
