Design of the topology for contrastive visual-textual alignment
Cosine similarity is the common choice for measuring the distance between the
feature representations in contrastive visual-textual alignment learning.
However, in practice a learnable softmax temperature parameter is required when
learning on large-scale noisy training data. In this work, we first discuss the
role of the softmax temperature in terms of the embedding space's topological properties.
We argue that the softmax temperature is the key mechanism for contrastive
learning on noisy training data. It acts as a scaling factor of the distance
range (e.g. [-1, 1] for the cosine similarity), and its learned value indicates
the level of noise in the training data. Then, we propose an alternative design
of the topology for the embedding alignment. We use multiple class
tokens in the transformer architecture, then map the feature representations
onto an oblique manifold endowed with the negative inner product as the
distance function. With this configuration, we substantially improve the zero-shot
classification performance of baseline CLIP models pre-trained on large-scale
datasets by an average of 6.1%.
Comment: https://github.com/minogame/clip-mto
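
The two ideas in the abstract, the temperature as a scaling factor on a bounded distance range and the oblique-manifold distance, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the oblique manifold is read as a product of m unit spheres, so an embedding is split into m chunks that are each scaled to unit length, and the helper names (to_oblique, oblique_similarity, softmax) and the toy vectors are hypothetical. Under that assumption the similarity lies in [-m, m] rather than [-1, 1], and m = 1 recovers ordinary cosine similarity.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def to_oblique(v, m):
    """Split v into m equal chunks and normalize each chunk separately.

    Assumption: this models a point on an oblique manifold, viewed as a
    product of m unit spheres.
    """
    if len(v) % m != 0:
        raise ValueError("embedding length must be divisible by m")
    d = len(v) // m
    return [normalize(v[i * d:(i + 1) * d]) for i in range(m)]


def oblique_similarity(a, b):
    """Sum of per-chunk inner products, bounded by [-m, m].

    The distance described in the abstract is the negative of this value.
    """
    return sum(sum(x * y for x, y in zip(ca, cb)) for ca, cb in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy image embedding and three toy text embeddings (hypothetical values).
image = to_oblique([1.0, 0.0, 0.0, 2.0], m=2)
texts = [
    to_oblique(t, m=2)
    for t in ([3.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [-1.0, 0.0, 0.0, -1.0])
]
sims = [oblique_similarity(image, t) for t in texts]

# The temperature acts as a scale on the bounded similarity range:
# a larger scale (lower temperature) gives a sharper softmax.
probs_scale_1 = softmax([1.0 * s for s in sims])
probs_scale_10 = softmax([10.0 * s for s in sims])
```

With m = 2 the three toy similarities span the full range from 2 down to -2, so the logits already cover a wider interval than cosine similarity allows before any temperature is applied.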