Scene text recognition has been studied for decades due to its broad
applications. However, although Chinese characters differ from Latin
characters in having complex inner structures and a large number of
categories, few methods have been proposed for Chinese Text Recognition
(CTR). In particular, the large number of categories makes it challenging to
handle zero-shot and few-shot Chinese characters. In this paper, inspired
by the way humans recognize Chinese texts, we propose a two-stage framework for
CTR. First, we pre-train a CLIP-like model by aligning printed character
images with Ideographic Description Sequences (IDS). This pre-training stage
simulates how humans recognize Chinese characters and yields a canonical
representation of each character. The learned representations are then
used to supervise the CTR model, so that traditional single-character
recognition can be extended to text-line recognition through image-IDS
matching. To evaluate the effectiveness of the proposed method, we conduct
extensive experiments on both Chinese character recognition (CCR) and CTR. The
experimental results demonstrate that the proposed method performs best in CCR
and outperforms previous methods in most scenarios of the CTR benchmark. It is
worth noting that the proposed method can recognize zero-shot Chinese
characters in text images without fine-tuning, whereas previous methods require
fine-tuning when new classes appear. The code is available at
https://github.com/FudanVI/FudanOCR/tree/main/image-ids-CTR.
Comment: ICCV 202
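
The image-IDS matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each character already has a canonical representation vector (in the paper these come from the pre-trained CLIP-like model; here they are hand-written toy values) and that recognition picks the character whose representation is most similar to the image feature. The names `cosine`, `match`, and `canonical` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match(image_feature, canonical):
    """Return the character whose canonical representation is closest."""
    return max(canonical, key=lambda ch: cosine(image_feature, canonical[ch]))

# Toy canonical representations (assumed values, not learned).
# The IDS of each character is noted in the comment for reference.
canonical = {
    "好": [0.9, 0.1, 0.0],  # IDS: ⿰女子
    "明": [0.1, 0.9, 0.0],  # IDS: ⿰日月
}

# Hypothetical feature extracted from one character position in a text image.
feature = [0.0, 0.2, 0.9]

# With only the two known characters, the closest (wrong) class is returned.
before = match(feature, canonical)

# A new class is supported by adding its representation; nothing is retrained.
canonical["林"] = [0.0, 0.1, 0.95]  # IDS: ⿰木木
after = match(feature, canonical)

print(before, after)
```

Because the classifier is only a similarity lookup over a table of representations, extending the character set changes the table rather than the model weights, which is the property the abstract refers to as zero-shot recognition without fine-tuning.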