Barycenters of Natural Images -- Constrained Wasserstein Barycenters for Image Morphing
Image interpolation, or image morphing, refers to a visual transition between
two (or more) input images. For such a transition to look visually appealing,
its desirable properties are (i) to be smooth; (ii) to apply the minimal
required change in the image; and (iii) to seem "real", avoiding unnatural
artifacts in each image in the transition. To obtain a smooth and
straightforward transition, one may adopt the well-known Wasserstein Barycenter
Problem (WBP). While this approach guarantees minimal changes under the
Wasserstein metric, the resulting images might seem unnatural. In this work, we
propose a novel approach for image morphing that possesses all three desired
properties. To this end, we define a constrained variant of the WBP that
requires the intermediate images to satisfy an image prior. We describe an
algorithm that solves this problem and demonstrate it using the sparse prior
and generative adversarial networks.
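
To make the objective concrete, here is a sketch of the formulation in notation of my choosing (not taken from the paper): the classical WBP seeks the measure minimizing a weighted sum of squared 2-Wasserstein distances to the inputs, and the constrained variant restricts the search to measures satisfying an image prior, e.g. the range of a pretrained generator G.

    % Classical Wasserstein Barycenter Problem for inputs \mu_1, ..., \mu_N
    % with weights \lambda_i >= 0, \sum_i \lambda_i = 1:
    \mu^\star \in \operatorname*{arg\,min}_{\mu} \sum_{i=1}^{N} \lambda_i \, W_2^2(\mu, \mu_i)

    % Constrained variant (illustrative): restrict to images obeying a prior,
    % e.g. the range \mathcal{M} of a generator G:
    \mu^\star \in \operatorname*{arg\,min}_{\mu \in \mathcal{M}} \sum_{i=1}^{N} \lambda_i \, W_2^2(\mu, \mu_i),
    \qquad \mathcal{M} = \{\, G(z) : z \in \mathcal{Z} \,\}

Sweeping the weights (e.g. (1 - t, t) for two inputs as t goes from 0 to 1) then traces the morphing path from one image to the other.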
Sequence-to-Sequence Contrastive Learning for Text Recognition
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast at a sub-word level, where from each image we extract several positive pairs and multiple negative examples. To yield effective visual representations for text recognition, we further suggest novel augmentation heuristics, different encoder architectures, and custom projection heads. Experiments on handwritten text and on scene text show that when a text decoder is trained on the learned representations, our method outperforms non-sequential contrastive methods. In addition, when the amount of supervision is reduced, SeqCLR significantly improves performance compared with supervised training, and when fine-tuned with 100% of the labels, our method achieves state-of-the-art results on standard handwritten text recognition benchmarks.
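
As a rough illustration of the sub-word contrast described above, the PyTorch sketch below pools consecutive frames of each sequential feature map into a fixed number of instances and applies an InfoNCE-style loss between two augmented views; the function names, the fixed-chunk instance mapping, and the loss details are my assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    def to_instances(feat, num_instances):
        # feat: (B, W, C) frame-level features from a recognizer encoder.
        # Split the W frames into consecutive chunks and average-pool each
        # chunk into one instance vector (assumes W % num_instances == 0).
        B, W, C = feat.shape
        return feat.view(B, num_instances, W // num_instances, C).mean(dim=2)

    def seq_contrastive_loss(feat_a, feat_b, num_instances=5, tau=0.1):
        # feat_a, feat_b: encoder outputs for two augmentations of the same
        # batch of text images; matching instances form the positive pairs,
        # every other instance in the batch serves as a negative.
        za = F.normalize(to_instances(feat_a, num_instances).flatten(0, 1), dim=1)
        zb = F.normalize(to_instances(feat_b, num_instances).flatten(0, 1), dim=1)
        logits = za @ zb.t() / tau
        targets = torch.arange(za.size(0), device=za.device)
        return F.cross_entropy(logits, targets)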
CLIPTER: Looking at the Bigger Picture in Scene Text Recognition
Reading text in real-world scenarios often requires understanding the context
surrounding it, especially when dealing with poor-quality text. However,
current scene text recognizers are unaware of the bigger picture as they
operate on cropped text images. In this study, we harness the representative
capabilities of modern vision-language models, such as CLIP, to provide
scene-level information to the crop-based recognizer. We achieve this by fusing
a rich representation of the entire image, obtained from the vision-language
model, with the recognizer word-level features via a gated cross-attention
mechanism. This component gradually shifts to the context-enhanced
representation, allowing for stable fine-tuning of a pretrained recognizer. We
demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP
TExt Recognition), on leading text recognition architectures and achieve
state-of-the-art results across multiple benchmarks. Furthermore, our analysis
highlights improved robustness to out-of-vocabulary words and enhanced
generalization in low-data regimes.Comment: Accepted for publication by ICCV 202