End-to-end text spotting is a vital computer vision task that aims to
integrate scene text detection and recognition into a unified framework.
Typical methods heavily rely on Region-of-Interest (RoI) operations to extract
local features and complex post-processing steps to produce final predictions.
To address these limitations, we propose TextFormer, a query-based end-to-end
text spotter with a Transformer architecture. Specifically, using one query
embedding per text instance, TextFormer builds upon an image encoder and a text
decoder to learn a joint semantic understanding for multi-task modeling. It
allows for
mutual training and optimization of classification, segmentation, and
recognition branches, resulting in deeper feature sharing without sacrificing
flexibility or simplicity. Additionally, we design an Adaptive Global
aGgregation (AGG) module to transfer global features into sequential features
for reading arbitrarily-shaped texts, overcoming the sub-optimal feature
extraction of RoI operations. Furthermore, mixed supervision exploits potential
corpus information ranging from weak annotations to full labels, further
improving both text detection and end-to-end text spotting results.
Extensive experiments on various bilingual (i.e., English and Chinese)
benchmarks demonstrate the superiority of our method. In particular, on the
TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method by 13.2%
in terms of 1-NED.
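
To make the AGG idea concrete, below is a minimal PyTorch sketch of one way
such global-to-sequential aggregation could be realized: shared learnable
sequence queries, conditioned on a per-instance query embedding, cross-attend
over the flattened global feature map instead of cropping RoI features. The
class name, tensor sizes, and conditioning scheme here are illustrative
assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AdaptiveGlobalAggregation(nn.Module):
    """Sketch of an AGG-style module: learnable sequence queries,
    conditioned on a text-instance embedding, attend over the global
    feature map to produce sequential features for recognition.
    Names and dimensions are illustrative assumptions."""

    def __init__(self, dim=256, max_len=25, num_heads=8):
        super().__init__()
        # One learnable query per character position in the output sequence.
        self.seq_queries = nn.Parameter(torch.randn(max_len, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, inst_embed):
        # feats: (B, H*W, C) flattened global features from the encoder.
        # inst_embed: (B, C) query embedding for one text instance.
        # Condition the shared sequence queries on the instance embedding.
        q = self.seq_queries.unsqueeze(0) + inst_embed.unsqueeze(1)  # (B, T, C)
        # Aggregate global features adaptively rather than RoI-cropping.
        seq_feats, _ = self.attn(q, feats, feats)                    # (B, T, C)
        return self.norm(seq_feats)

# Usage: batch of 2 images, a 32x32 feature map with C=256 channels.
agg = AdaptiveGlobalAggregation()
feats = torch.randn(2, 32 * 32, 256)
inst = torch.randn(2, 256)
print(agg(feats, inst).shape)  # torch.Size([2, 25, 256])
```

Because the queries attend over the entire feature map, the aggregation is not
tied to an axis-aligned crop, which is what lets this style of module handle
arbitrarily-shaped text that RoI operations truncate or distort.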