Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition
In this paper, we explore the potential of the Contrastive Language-Image
Pretraining (CLIP) model in scene text recognition (STR), and establish a novel
Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to
leverage both visual and linguistic knowledge in CLIP. Unlike previous
CLIP-based methods, which mainly consider feature generalization in visual
encoding, we propose a symmetrical distillation strategy (SDS) that further
captures the linguistic knowledge in the CLIP text encoder. By cascading the
CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure
is built with an image-to-text feature flow that covers not only visual but
also linguistic information for distillation. Benefiting from the natural
alignment in CLIP, this guidance flow provides a progressive optimization
objective from vision to language, which supervises the STR feature
forwarding process layer by layer. In addition, a new Linguistic Consistency Loss
(LCL) is proposed to enhance linguistic capability by considering
second-order statistics during optimization. Overall, CLIP-OCR is the first
to design a smooth transition between image and text for the STR task. Extensive
experiments demonstrate the effectiveness of CLIP-OCR, with 93.8% average
accuracy on six popular STR benchmarks. Code will be available at
https://github.com/wzx99/CLIPOCR.
Comment: Accepted by ACM MM 202
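To make the second-order idea behind the Linguistic Consistency Loss concrete, the sketch below matches token-to-token Gram matrices between student STR features and (reversed) CLIP text-encoder features, alongside a plain first-order matching term. This is a minimal illustration under assumed tensor shapes and names; the exact LCL formulation may differ from what is in the paper.

```python
# Rough sketch of a second-order (Gram-matrix style) consistency loss,
# in the spirit of the Linguistic Consistency Loss described above.
# Shapes and names are illustrative, not the authors' implementation.
import torch
import torch.nn.functional as F


def linguistic_consistency_loss(student_feat, teacher_feat):
    """student_feat, teacher_feat: (B, T, C) token features from the STR model
    and the (reversed) CLIP text encoder, respectively."""
    # First-order term: plain feature matching after L2 normalization.
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    first_order = F.mse_loss(s, t)

    # Second-order term: match token-to-token correlation (Gram) matrices,
    # which capture co-occurrence structure within the character sequence.
    gram_s = torch.bmm(s, s.transpose(1, 2))   # (B, T, T)
    gram_t = torch.bmm(t, t.transpose(1, 2))   # (B, T, T)
    second_order = F.mse_loss(gram_s, gram_t)

    return first_order + second_order
```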
MomentDiff: Generative Video Moment Retrieval from Random to Real
Video moment retrieval pursues an efficient and generalized solution to
identify the specific temporal segments within an untrimmed video that
correspond to a given language description. To achieve this goal, we provide a
generative diffusion-based framework called MomentDiff, which simulates a
typical human retrieval process from random browsing to gradual localization.
Specifically, we first diffuse the real span to random noise, and then learn to
denoise the random noise back to the original span, guided by the similarity
between text and video. This allows the model to learn a mapping from arbitrary
random locations to real moments, enabling it to locate segments from
random initialization. Once trained, MomentDiff can sample random temporal
segments as initial guesses and iteratively refine them into an accurate
temporal boundary. Unlike discriminative methods (e.g., those based on
learnable proposals or queries), MomentDiff with randomly initialized spans can
resist the temporal location biases in datasets. To evaluate the influence of
the temporal location biases, we propose two anti-bias datasets with location
distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The
experimental results demonstrate that our efficient framework consistently
outperforms state-of-the-art methods on three public benchmarks, and exhibits
better generalization and robustness on the proposed anti-bias datasets. The
code, model, and anti-bias evaluation datasets are available at
https://github.com/IMCCretrieval/MomentDiff.
Comment: 12 pages, 5 figures
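As a rough illustration of the random-to-real process described above, the sketch below diffuses a ground-truth (center, width) span to noise for training and iteratively refines a random span at inference. The text-video conditioned denoiser is left abstract, and the DDPM-style schedule, function names, and signatures are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: diffuse a ground-truth span (center, width) to noise during
# training, then iteratively denoise a random span at inference.
# All names here are hypothetical, not MomentDiff's actual code.
import torch


def diffuse_span(span, t, alphas_cumprod):
    """Forward process q(span_t | span_0) for a batch of (center, width) spans."""
    noise = torch.randn_like(span)
    a = alphas_cumprod[t].sqrt().unsqueeze(-1)          # signal scale
    s = (1 - alphas_cumprod[t]).sqrt().unsqueeze(-1)    # noise scale
    return a * span + s * noise, noise


@torch.no_grad()
def retrieve_moment(denoiser, video_feat, text_feat, steps=50):
    """Inference: start from a random span and refine it step by step."""
    span = torch.randn(video_feat.size(0), 2)           # random initial guess
    for t in reversed(range(steps)):
        # The denoiser predicts a cleaner span conditioned on text-video similarity.
        span = denoiser(span, t, video_feat, text_feat)
    return span.clamp(0, 1)                             # normalized (center, width)
```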
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
Diffusion-based text-to-image models harbor immense potential for
transferring reference styles. However, current encoder-based approaches
significantly impair the text controllability of text-to-image models while
transferring styles. In this paper, we introduce DEADiff to address this issue
using the following two strategies: 1) a mechanism to decouple the style and
semantics of reference images. The decoupled feature representations are first
extracted by Q-Formers that are instructed by different text descriptions, and
then injected into mutually exclusive subsets of cross-attention layers for
better disentanglement. 2) A non-reconstructive learning method: the Q-Formers
are trained on paired images rather than the identical target, where the
reference image and the ground-truth image share the same style or
semantics. We show that DEADiff attains the best visual stylization results and
an optimal balance between the text controllability inherent in the text-to-image
model and style similarity to the reference image, as demonstrated both
quantitatively and qualitatively. Our project page is
https://tianhao-qi.github.io/DEADiff/.
Comment: Accepted by CVPR 202
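As a sketch of strategy 1), the snippet below routes decoupled style and semantic features into mutually exclusive subsets of cross-attention layers. The module interface, layer split, and names are assumptions for illustration and do not reflect DEADiff's actual architecture.

```python
# Illustrative routing of decoupled style/semantic tokens to disjoint
# cross-attention layers; names and interfaces are hypothetical.
import torch.nn as nn


class DisentangledInjection(nn.Module):
    def __init__(self, cross_attn_layers, style_layer_ids):
        super().__init__()
        self.layers = nn.ModuleList(cross_attn_layers)
        # Mutually exclusive subsets: these indices receive style tokens,
        # all remaining layers receive semantic tokens.
        self.style_ids = set(style_layer_ids)

    def forward(self, hidden, style_tokens, semantic_tokens):
        for i, layer in enumerate(self.layers):
            context = style_tokens if i in self.style_ids else semantic_tokens
            # Cross-attention: query = hidden states, key/value = chosen context.
            hidden = layer(hidden, context)
        return hidden
```

Keeping the two token streams on disjoint layers is one way to prevent style cues from overwriting semantic guidance, which is the disentanglement effect the abstract describes.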
- …