In this paper, we explore the potential of the Contrastive Language-Image
Pretraining (CLIP) model in scene text recognition (STR), and establish a novel
Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to
leverage both visual and linguistic knowledge in CLIP. Unlike previous
CLIP-based methods, which mainly consider feature generalization in visual
encoding, we propose a symmetrical distillation strategy (SDS) that further
captures the linguistic knowledge in the CLIP text encoder. By cascading the
CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure
is built with an image-to-text feature flow that covers not only visual but
also linguistic information for distillation. Benefiting from the natural
alignment in CLIP, this guidance flow provides a progressive optimization
objective from vision to language, which can supervise the STR feature
forwarding process layer by layer, as sketched below.
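The following minimal PyTorch sketch illustrates such layer-wise feature distillation; it is an assumption-based illustration rather than the authors' released implementation, and the names (sds_distillation_loss, student_feats, teacher_feats, projections) are hypothetical.

```python
import torch.nn.functional as F

def sds_distillation_loss(student_feats, teacher_feats, projections):
    """Illustrative layer-wise distillation along the image-to-text flow.

    student_feats: list of STR model features, one per supervised layer.
    teacher_feats: list of guidance features from the cascaded CLIP image
                   encoder and reversed CLIP text encoder (frozen teacher).
    projections:   per-layer linear maps aligning student and teacher dims.
    """
    loss = 0.0
    for s, t, proj in zip(student_feats, teacher_feats, projections):
        s = proj(s)  # project student features to the teacher dimension
        # Pull each student layer toward the corresponding point on the
        # vision-to-language guidance flow (teacher is not updated).
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_feats)
```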
Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the
linguistic capability by considering second-order statistics during
optimization, illustrated below.
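One way to realize a second-order consistency term is to match Gram (feature-correlation) matrices between student and teacher features; the sketch below is an assumption under that reading, not the paper's exact formulation.

```python
import torch.nn.functional as F

def gram(x):
    """Second-order statistics of token features.
    x: (batch, seq_len, dim) -> (batch, seq_len, seq_len)."""
    x = F.normalize(x, dim=-1)
    return x @ x.transpose(1, 2)

def linguistic_consistency_loss(student_feat, teacher_feat):
    """Match second-order statistics against the frozen teacher."""
    return F.mse_loss(gram(student_feat), gram(teacher_feat.detach()))
```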
Overall, CLIP-OCR is the first to design a smooth transition between image and
text for the STR task. Extensive experiments demonstrate the effectiveness of
CLIP-OCR, which achieves 93.8% average accuracy on six popular STR benchmarks.
Code will be available at
https://github.com/wzx99/CLIPOCR.