PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer
We present PBFormer, an efficient yet powerful scene text detector that
unifies the transformer with a novel text shape representation, Polynomial Band
(PB). The representation has four polynomial curves to fit a text's top,
bottom, left, and right sides, which can capture a text with a complex shape by
varying polynomial coefficients. PB has appealing features compared with
conventional representations: 1) It can model different curvatures with a fixed
number of parameters, while polygon-points-based methods need to utilize a
different number of points. 2) It can distinguish adjacent or overlapping texts
because they have clearly different curve coefficients, while segmentation-based or
points-based methods struggle when text instances adhere spatially. PBFormer combines
the PB with the transformer, which can directly generate smooth text contours
sampled from predicted curves without interpolation. A parameter-free
cross-scale pixel attention (CPA) module is employed to highlight the feature
map of a suitable scale while suppressing the other feature maps. The simple
operation can help detect small-scale texts and is compatible with the
one-stage DETR framework, which needs no NMS post-processing. Furthermore,
PBFormer is trained with a shape-contained loss, which not only enforces the
piecewise alignment between the ground truth and the predicted curves but also
makes the curves' positions and shapes consistent with each other. Without bells
and whistles such as text pre-training, our method outperforms previous
state-of-the-art text detectors on arbitrary-shaped text datasets.
Comment: 9 pages, 8 figures, accepted by ACM MM 202
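As a rough illustration of how a contour can be sampled directly from polynomial curves with a fixed number of parameters, the following minimal sketch may help. It is not the authors' implementation: it assumes the top and bottom sides are polynomials y = f(x) over a shared x-range and, as a simplification, treats the left and right sides as the straight segments joining their endpoints (the abstract describes four polynomial curves). The names `poly_eval` and `sample_band_contour` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def poly_eval(coeffs: Sequence[float], t: float) -> float:
    """Evaluate c0 + c1*t + c2*t^2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def sample_band_contour(
    top: Sequence[float],
    bottom: Sequence[float],
    x_left: float,
    x_right: float,
    n: int = 8,
) -> List[Point]:
    """Sample a closed contour from two polynomial curves y = f(x).

    Assumption (not from the source): top and bottom are polynomials in x on
    [x_left, x_right]; the left and right sides are implied by connecting the
    curve endpoints rather than modelled as separate polynomials.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    xs = [x_left + (x_right - x_left) * i / (n - 1) for i in range(n)]
    # Walk the top curve left to right, then the bottom curve right to left,
    # so the points trace the outline in a single consistent direction.
    top_pts = [(x, poly_eval(top, x)) for x in xs]
    bottom_pts = [(x, poly_eval(bottom, x)) for x in reversed(xs)]
    return top_pts + bottom_pts
```

Changing the curvature only changes the coefficient values, not the number of parameters, and the density of contour points is chosen at sampling time through `n`.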
Technical Report on Web-based Visual Corpus Construction for Visual Document Understanding
We present a dataset generator engine named Web-based Visual Corpus Builder
(Webvicob). Webvicob can readily construct a large-scale visual corpus (i.e.,
images with text annotations) from a raw Wikipedia HTML dump. In this report,
we validate that Webvicob-generated data covers a wide range of context and
knowledge and helps practitioners build a powerful Visual Document
Understanding (VDU) backbone. The proposed engine is publicly available at
https://github.com/clovaai/webvicob
SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation
Existing techniques for text detection can be broadly classified into two
primary groups: segmentation-based methods and regression-based methods.
Segmentation models offer enhanced robustness to font variations but require
intricate post-processing, leading to high computational overhead.
Regression-based methods undertake instance-aware prediction but face
limitations in robustness and data efficiency due to their reliance on
high-level representations. We propose SRFormer, a
unified DETR-based model that amalgamates Segmentation and Regression, aiming
to combine the inherent robustness of segmentation
representations with the straightforward post-processing of
instance-level regression. Our empirical analysis indicates that favorable
segmentation predictions can be obtained at the initial decoder layers. In
light of this, we constrain the incorporation of segmentation branches to the
first few decoder layers and employ progressive regression refinement in
subsequent layers, achieving performance gains while minimizing additional
computational load from the mask. Furthermore, we propose a Mask-informed Query
Enhancement module. We take the segmentation result as a natural soft-ROI to
pool and extract robust pixel representations, which are then employed to
enhance and diversify instance queries. Extensive experiments across
multiple benchmarks highlight our method's
exceptional robustness, superior training and data efficiency, and
state-of-the-art performance.
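The abstract does not spell out how the segmentation result is used as a soft-ROI. One plausible reading is a mask-weighted average of per-pixel features, sketched below under that assumption. The function name `soft_roi_pool`, the flattened pixel layout, and the `eps` stabiliser are illustrative choices, not details taken from the paper.

```python
from typing import List, Sequence


def soft_roi_pool(
    features: Sequence[Sequence[float]],
    mask: Sequence[float],
    eps: float = 1e-6,
) -> List[float]:
    """Mask-weighted average of per-pixel feature vectors.

    features: one feature vector per pixel (an H*W map, flattened).
    mask: one soft segmentation score in [0, 1] per pixel.
    Assumption: pooling is a weighted mean; the source does not confirm this.
    """
    if not features or len(features) != len(mask):
        raise ValueError("features and mask must be non-empty and equal length")
    dim = len(features[0])
    total = sum(mask) + eps  # eps avoids division by zero for an empty mask
    pooled = [0.0] * dim
    for vec, weight in zip(features, mask):
        for d in range(dim):
            pooled[d] += weight * vec[d]
    return [value / total for value in pooled]
```

Pixels with a high mask score dominate the pooled vector, while background pixels contribute little. Under this reading, the pooled vector would be the pixel representation used to enhance an instance query.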