TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision
End-to-end text spotting is a vital computer vision task that aims to
integrate scene text detection and recognition into a unified framework.
Typical methods heavily rely on Region-of-Interest (RoI) operations to extract
local features and complex post-processing steps to produce final predictions.
To address these limitations, we propose TextFormer, a query-based end-to-end
text spotter with a Transformer architecture. Specifically, using a query embedding
per text instance, TextFormer builds upon an image encoder and a text decoder
to learn a joint semantic understanding for multi-task modeling. It allows for
mutual training and optimization of classification, segmentation, and
recognition branches, resulting in deeper feature sharing without sacrificing
flexibility or simplicity. Additionally, we design an Adaptive Global
aGgregation (AGG) module to transfer global features into sequential features
for reading arbitrarily-shaped texts, which overcomes the sub-optimization
problem of RoI operations. Furthermore, potential corpus information is
utilized from weak annotations to full labels through mixed supervision,
further improving text detection and end-to-end text spotting results.
Extensive experiments on various bilingual (i.e., English and Chinese)
benchmarks demonstrate the superiority of our method. In particular, on the
TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method in terms of
1-NED by 13.2%.
Comment: MIR 2023, 15 pages.
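For reference, 1-NED is one minus the normalized edit distance between predicted and ground-truth transcriptions, averaged over text instances. A minimal sketch of the standard metric (not TextFormer's own evaluation code):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def one_minus_ned(preds: list[str], gts: list[str]) -> float:
    """1-NED: mean of 1 - edit_distance / max(len(pred), len(gt))."""
    scores = []
    for p, g in zip(preds, gts):
        denom = max(len(p), len(g)) or 1    # guard against two empty strings
        scores.append(1.0 - edit_distance(p, g) / denom)
    return sum(scores) / len(scores)
```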
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study
Deep learning has driven rapid progress in text recognition in recent years.
However, existing text recognition methods are mainly designed for English
text. Chinese, another widely spoken language, presents an extensive
application market for Chinese text recognition (CTR). Based on our
observations, we attribute the scarce attention paid to CTR to
the lack of reasonable dataset construction standards, unified evaluation
protocols, and results of the existing baselines. To fill this gap, we manually
collect CTR datasets from publicly available competitions, projects, and
papers. According to application scenarios, we divide the collected datasets
into four categories: scene, web, document, and handwriting datasets.
Besides, we standardize the evaluation protocols in CTR. With unified
evaluation protocols, we evaluate a series of representative text recognition
methods on the collected datasets to provide baselines. The experimental
results indicate that the performance of baselines on CTR datasets is not as
good as that on English datasets, owing to characteristics of Chinese text
that differ markedly from those of the Latin alphabet. Moreover, we observe that by
introducing radical-level supervision as an auxiliary task, the performance of
baselines can be further boosted. The code and datasets are made publicly
available at https://github.com/FudanVI/benchmarking-chinese-text-recognition.
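The radical-level auxiliary supervision mentioned above can be pictured as a second classification head trained jointly with the character head. A hedged PyTorch sketch; the head design, the per-timestep radical targets, and the weight lambda_rad are illustrative assumptions, not the benchmark's exact setup:

```python
import torch
import torch.nn as nn

class CharRadicalHead(nn.Module):
    """Shared recognizer features feed two classifiers: characters and radicals."""
    def __init__(self, feat_dim: int, num_chars: int, num_radicals: int):
        super().__init__()
        self.char_fc = nn.Linear(feat_dim, num_chars)
        self.radical_fc = nn.Linear(feat_dim, num_radicals)

    def forward(self, feats):                   # feats: (B, T, feat_dim)
        return self.char_fc(feats), self.radical_fc(feats)

def multitask_loss(char_logits, radical_logits,
                   char_targets, radical_targets, lambda_rad=0.1):
    """Character cross-entropy plus weighted radical cross-entropy (auxiliary)."""
    ce = nn.CrossEntropyLoss()
    loss_char = ce(char_logits.flatten(0, 1), char_targets.flatten())
    loss_rad = ce(radical_logits.flatten(0, 1), radical_targets.flatten())
    return loss_char + lambda_rad * loss_rad
```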
SPTS: Single-Point Text Spotting
Existing scene text spotting (i.e., end-to-end text detection and
recognition) methods rely on costly bounding box annotations (e.g., text-line,
word-level, or character-level bounding boxes). For the first time, we
demonstrate that scene text spotting models can be trained with an extremely
low-cost single-point annotation for each instance. We propose
an end-to-end scene text spotting method that tackles scene text spotting as a
sequence prediction task. Given an image as input, we formulate the desired
detection and recognition results as a sequence of discrete tokens and use an
auto-regressive Transformer to predict the sequence. The proposed method is
simple yet effective, which can achieve state-of-the-art results on widely used
benchmarks. Most significantly, we show that the performance is not very
sensitive to the position of the point annotation, meaning that points can be
annotated far more easily, or even generated automatically, than bounding
boxes, which require precise positions. We believe that such a pioneering
attempt indicates a significant opportunity for scene text spotting
applications at a much larger scale than previously possible. The code will be
publicly available.
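The sequence formulation can be illustrated by how a single instance might be serialized: quantize the annotated point into discrete coordinate bins, then append the transcription tokens. A sketch under assumed conventions (the bin count, token names, and length cap are placeholders, not SPTS's exact scheme):

```python
NUM_BINS = 1000                      # coordinate quantization bins (assumption)
EOS = "<eos>"

def point_to_tokens(x, y, img_w, img_h):
    """Quantize a point annotation into two discrete coordinate tokens."""
    tx = min(int(x / img_w * NUM_BINS), NUM_BINS - 1)
    ty = min(int(y / img_h * NUM_BINS), NUM_BINS - 1)
    return [f"<coord_{tx}>", f"<coord_{ty}>"]

def instance_to_sequence(x, y, text, img_w, img_h, max_text_len=25):
    """One text instance -> [x-token, y-token, character tokens..., <eos>]."""
    chars = list(text)[:max_text_len]
    return point_to_tokens(x, y, img_w, img_h) + chars + [EOS]

# Example: the word "STOP" annotated at (320, 240) in a 640x480 image.
print(instance_to_sequence(320, 240, "STOP", 640, 480))
# ['<coord_500>', '<coord_500>', 'S', 'T', 'O', 'P', '<eos>']
```

An autoregressive Transformer is then trained to emit such sequences directly from the image, with no RoI extraction or box regression in the loop.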
Seq-UPS: Sequential Uncertainty-aware Pseudo-label Selection for Semi-Supervised Text Recognition
This paper looks at semi-supervised learning (SSL) for image-based text
recognition. One of the most popular SSL approaches is pseudo-labeling (PL). PL
approaches assign labels to unlabeled data before re-training the model with a
combination of labeled and pseudo-labeled data. However, PL methods are
severely degraded by noise and are prone to over-fitting to noisy labels,
owing to the inclusion of erroneous high-confidence pseudo-labels generated
from poorly calibrated models, thus rendering threshold-based selection
ineffective. Moreover, the combinatorial complexity of the hypothesis space
and the error accumulation over multiple incorrect autoregressive steps make
pseudo-labeling challenging for sequence models. To this end, we propose a
pseudo-label generation and an uncertainty-based data selection framework for
semi-supervised text recognition. We first use Beam-Search inference to yield
highly probable hypotheses to assign pseudo-labels to the unlabeled examples.
Then we adopt an ensemble of models, sampled by applying dropout, to obtain a
robust estimate of the uncertainty associated with the prediction, considering
both the character-level and word-level predictive distribution to select good
quality pseudo-labels. Extensive experiments on several benchmark handwriting
and scene-text datasets show that our method outperforms the baseline
approaches and the previous state-of-the-art semi-supervised text-recognition
methods.
Comment: Accepted at WACV 2023.
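The dropout-sampled ensemble can be pictured as Monte Carlo dropout at inference: run the recognizer several times with dropout active and keep a pseudo-label only when the samples agree. A simplified PyTorch sketch; decode_fn, the agreement threshold, and the word-level agreement proxy are assumptions, and the paper's actual selection also considers character-level distributions:

```python
import torch

def enable_dropout(model):
    """Keep dropout layers active at inference to sample an implicit ensemble."""
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def select_pseudo_label(model, image, decode_fn, n_samples=10, min_agreement=0.8):
    """Accept a pseudo-label only if dropout-sampled predictions agree often enough."""
    model.eval()
    enable_dropout(model)
    samples = [decode_fn(model, image) for _ in range(n_samples)]  # decoded strings
    best = max(set(samples), key=samples.count)    # most frequent hypothesis
    agreement = samples.count(best) / n_samples    # crude uncertainty proxy
    return (best, agreement) if agreement >= min_agreement else (None, agreement)
```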
Towards End-to-End Unified Scene Text Detection and Layout Analysis
Scene text detection and document layout analysis have long been treated as
two separate tasks in different image domains. In this paper, we bring them
together and introduce the task of unified scene text detection and layout
analysis. The first hierarchical scene text dataset is introduced to enable
this novel research task. We also propose a novel method that is able to
simultaneously detect scene text and form text clusters in a unified way.
Comprehensive experiments show that our unified model achieves better
performance than multiple well-designed baseline methods. Additionally, this
model achieves state-of-the-art results on multiple scene text detection
datasets without the need for complex post-processing. Dataset and code:
https://github.com/google-research-datasets/hiertext.
Comment: To appear at CVPR 2022.
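The hierarchical annotations behind this unified task can be pictured as words nested in lines nested in paragraphs (text clusters). A schematic sketch of such a structure; the field names below are illustrative, not the exact HierText schema:

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    vertices: list[tuple[int, int]]   # polygon around the word
    text: str

@dataclass
class Line:
    words: list[Word] = field(default_factory=list)

@dataclass
class Paragraph:                      # a text cluster / layout block
    lines: list[Line] = field(default_factory=list)

# A unified model predicts word polygons *and* their grouping into lines and
# paragraphs, so detection and layout analysis come out of a single pass.
page = [Paragraph(lines=[Line(words=[
    Word(vertices=[(10, 10), (60, 10), (60, 30), (10, 30)], text="Unified"),
])])]
```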
Technical Report on Web-based Visual Corpus Construction for Visual Document Understanding
We present a dataset generator engine named Web-based Visual Corpus Builder
(Webvicob). Webvicob can readily construct a large-scale visual corpus (i.e.,
images with text annotations) from a raw Wikipedia HTML dump. In this report,
we validate that Webvicob-generated data can cover a wide range of contexts
and knowledge and can help practitioners build a powerful Visual Document
Understanding (VDU) backbone. The proposed engine is publicly available at
https://github.com/clovaai/webvicob
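The core idea, pairing a rendered page image with the text and bounding boxes recovered from its DOM, can be sketched as follows. Webvicob itself processes raw Wikipedia HTML dumps with its own tooling; here Playwright merely stands in as an assumed renderer for illustration:

```python
# Illustrative only: Playwright (pip install playwright) stands in as the
# renderer; Webvicob's actual pipeline works on raw Wikipedia HTML dumps.
from playwright.sync_api import sync_playwright

def build_visual_sample(url: str, out_png: str):
    """Render a page, then pair the screenshot with text bounding boxes."""
    annotations = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 960})
        page.goto(url)
        page.screenshot(path=out_png, full_page=True)
        for el in page.query_selector_all("p, h1, h2, h3, li"):
            box = el.bounding_box()        # {'x','y','width','height'} or None
            text = el.inner_text().strip()
            if box and text:
                annotations.append({"text": text, "bbox": box})
        browser.close()
    return annotations                     # image + text annotations = one sample
```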