Towards End-to-End Unified Scene Text Detection and Layout Analysis
Scene text detection and document layout analysis have long been treated as
two separate tasks in different image domains. In this paper, we bring them
together and introduce the task of unified scene text detection and layout
analysis. To enable this novel research task, we present the first
hierarchical scene text dataset. We also propose a method that can
simultaneously detect scene text and form text clusters in a unified way.
Comprehensive experiments show that our unified model achieves better
performance than multiple well-designed baseline methods. Additionally, this
model achieves state-of-the-art results on multiple scene text detection
datasets without the need for complex post-processing. Dataset and code:
https://github.com/google-research-datasets/hiertext
Comment: To appear at CVPR 2022
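To make the hierarchical annotation concrete, here is a minimal sketch of the word -> line -> paragraph nesting; the class and field names are illustrative assumptions, not the dataset's exact schema.

    from dataclasses import dataclass, field

    # Illustrative word -> line -> paragraph nesting; names are
    # hypothetical and do not reproduce the exact HierText schema.
    @dataclass
    class Word:
        vertices: list   # polygon outlining the word, e.g. [(x, y), ...]
        text: str

    @dataclass
    class Line:
        words: list = field(default_factory=list)

    @dataclass
    class Paragraph:
        lines: list = field(default_factory=list)

    def word_count(paragraph):
        # Count words by walking the hierarchy top-down.
        return sum(len(line.words) for line in paragraph.lines)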
PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer
We present PBFormer, an efficient yet powerful scene text detector that
unifies the transformer with a novel text shape representation, the Polynomial
Band (PB). The representation uses four polynomial curves to fit a text's top,
bottom, left, and right sides, and can capture complexly shaped text by
varying polynomial coefficients. PB has appealing features compared with
conventional representations: 1) It can model different curvatures with a fixed
number of parameters, while polygon-points-based methods must use varying
numbers of points. 2) It can distinguish adjacent or overlapping texts, since
their curve coefficients differ markedly, while segmentation-based or
points-based methods suffer from adhesion at close spatial positions. PBFormer combines
the PB with the transformer, which can directly generate smooth text contours
sampled from predicted curves without interpolation. A parameter-free
cross-scale pixel attention (CPA) module is employed to highlight the feature
map of a suitable scale while suppressing the other feature maps. This simple
operation helps detect small-scale text and is compatible with the one-stage
DETR framework, which requires no NMS post-processing. Furthermore,
PBFormer is trained with a shape-contained loss, which not only enforces the
piecewise alignment between the ground truth and the predicted curves but also
makes the curves' positions and shapes consistent with each other. Without
bells and whistles such as text pre-training, our method outperforms the
previous state-of-the-art text detectors on arbitrary-shaped text datasets.
Comment: 9 pages, 8 figures, accepted by ACM MM 2023
CT-Net: Arbitrary-Shaped Text Detection via Contour Transformer
Contour-based scene text detection methods have developed rapidly in recent
years but still suffer from inaccurate front-end contour initialization,
multi-stage error accumulation, or deficient local information aggregation. To tackle these
limitations, we propose a novel arbitrary-shaped scene text detection framework
named CT-Net by progressive contour regression with contour transformers.
Specifically, we first employ a contour initialization module that generates
coarse text contours without any post-processing. Then, we adopt contour
refinement modules to adaptively refine text contours in an iterative manner,
which benefits the capture of contextual information and progressive global
contour deformation. In addition, we propose an adaptive training strategy to
enable the contour transformers to learn more potential deformation paths, and
introduce a re-score mechanism that can effectively suppress false positives.
Extensive experiments are conducted on four challenging datasets, which
demonstrate the superior accuracy and efficiency of our CT-Net over
state-of-the-art methods. In particular, CT-Net achieves an F-measure of 86.1
at 11.2 frames per second (FPS) on CTW1500 and an F-measure of 87.8 at 10.1
FPS on Total-Text.
Comment: This paper has been accepted by IEEE Transactions on Circuits and Systems for Video Technology
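The progressive regression described above can be pictured as a fixed number of refinement steps, each aggregating context over the contour vertices and predicting per-vertex offsets. The sketch below is a schematic stand-in under assumed tensor shapes and module sizes, not CT-Net's actual contour transformer; a real system would also resample features at the updated vertex positions after each step.

    import torch.nn as nn

    # Schematic progressive contour regression: each step runs a
    # transformer over the contour vertices and predicts (dx, dy) offsets.
    class ContourRefiner(nn.Module):
        def __init__(self, feat_dim=64, n_steps=3):
            super().__init__()
            self.n_steps = n_steps
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                               batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=2)
            self.offset_head = nn.Linear(feat_dim, 2)

        def forward(self, vertex_feats, contour):
            # vertex_feats: (B, N, C) features at the current vertices
            # contour:      (B, N, 2) coarse initial vertex coordinates
            for _ in range(self.n_steps):
                ctx = self.transformer(vertex_feats)        # contour context
                contour = contour + self.offset_head(ctx)   # deform iteratively
            return contour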
Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning
Owing to their flexible representation of arbitrary-shaped scene text and
simple pipelines, bottom-up segmentation-based methods have become mainstream
in real-time scene text detection. Despite great progress, these methods show
deficiencies in robustness and still suffer from false positives and instance
adhesion. Different from existing methods which integrate multiple-granularity
features or multiple outputs, we resort to the perspective of representation
learning in which auxiliary tasks are utilized to enable the encoder to jointly
learn robust features with the main task of per-pixel classification during
optimization. For semantic representation learning, we propose global-dense
semantic contrast (GDSC), in which a vector is extracted as a global semantic
representation and then contrasted element-wise with the dense grid
features. To learn instance-aware representation, we propose to combine
top-down modeling (TDM) with the bottom-up framework to provide implicit
instance-level clues for the encoder. With the proposed GDSC and TDM, the
encoder network learns stronger representations without introducing any extra
parameters or computation during inference. Equipped with a very light
decoder, the detector can achieve more robust real-time scene text detection.
Experimental results on four public datasets show that the proposed method can
outperform or be comparable to the state-of-the-art on both accuracy and speed.
Specifically, the proposed method achieves 87.2% F-measure with 48.2 FPS on
Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500 on a single GeForce
RTX 2080 Ti GPU.
Comment: Accepted by ACM MM 2023
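To sketch the flavor of such a global-dense contrast, one can pool a global text vector from the encoder features and contrast it against every grid feature under the text/non-text mask. The pooling scheme and loss form below are assumptions for illustration, not the paper's exact GDSC.

    import torch.nn.functional as F

    # Assumed sketch: pool a global text vector, then contrast it with
    # every grid feature, supervised by the text/non-text mask.
    def global_dense_contrast(feats, text_mask, tau=0.1):
        # feats: (B, C, H, W) encoder features; text_mask: (B, 1, H, W), {0, 1}
        pooled = (feats * text_mask).sum(dim=(2, 3))
        global_vec = pooled / text_mask.sum(dim=(2, 3)).clamp(min=1.0)
        sim = F.cosine_similarity(feats, global_vec[:, :, None, None], dim=1)
        # Text pixels should align with the global vector; background should not.
        return F.binary_cross_entropy_with_logits(sim / tau,
                                                  text_mask.squeeze(1).float())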
Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
The Segment Anything Model (SAM), a powerful vision foundation model
pre-trained on a large-scale dataset, pushes the boundaries of general
segmentation and sparks various downstream applications. This paper introduces
Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation.
Hi-SAM excels in text segmentation across four hierarchies, including stroke,
word, text-line, and paragraph, while realizing layout analysis as well.
Specifically, we first turn SAM into a high-quality text stroke segmentation
(TSS) model through a parameter-efficient fine-tuning approach. We use this TSS
model to iteratively generate text stroke labels in a semi-automatic
manner, unifying labels across the four text hierarchies in the HierText
dataset. Subsequently, with these complete labels, we launch the end-to-end
trainable Hi-SAM based on the TSS architecture with a customized hierarchical
mask decoder. During inference, Hi-SAM offers both automatic mask generation
(AMG) mode and a promptable segmentation mode. In AMG mode, Hi-SAM first
segments text stroke foreground masks, then samples foreground points for
hierarchical text mask generation, achieving layout analysis in passing.
As for the promptable mode, Hi-SAM provides word, text-line, and paragraph
masks with a single point click. Experimental results show the state-of-the-art
performance of our TSS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on
TextSeg for text stroke segmentation. Moreover, compared to the previous
specialist for joint hierarchical detection and layout analysis on HierText,
Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 at the
text-line level, and 5.49% PQ and 7.39% F1 at the paragraph level of layout
analysis, while requiring 20x fewer training epochs. The code is available at
https://github.com/ymy-k/Hi-SAM
Comment: GitHub repository: https://github.com/ymy-k/Hi-SAM
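The AMG flow described above can be summarized as: segment strokes, sample foreground points, then prompt per point for the three hierarchies. In the sketch below, stroke_model and prompt_decoder are hypothetical stand-ins, not Hi-SAM's actual interfaces.

    import numpy as np

    # Schematic AMG-style flow; `stroke_model` and `prompt_decoder` are
    # hypothetical stand-ins for Hi-SAM's components.
    def hierarchical_amg(image, stroke_model, prompt_decoder, n_points=100):
        stroke_mask = stroke_model(image)    # (H, W) boolean text-stroke mask
        ys, xs = np.nonzero(stroke_mask)     # foreground pixel candidates
        idx = np.random.choice(len(xs), min(n_points, len(xs)), replace=False)
        results = []
        for x, y in zip(xs[idx], ys[idx]):
            # One point click yields masks at three hierarchies.
            word, line, paragraph = prompt_decoder(image, point=(x, y))
            results.append({"word": word, "line": line, "paragraph": paragraph})
        return stroke_mask, results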
Tag recognition from panoramic scans of industrial facilities
CAD-based digital twins are commonly used by operators of process industry facilities to combine 3D models with external information and documentation. However, a suitable model often does not exist, and plant operators must instead resort to laser scans with panoramic photos, which provide little to no metadata or information about their contents. Reading equipment tags and other useful text from these scans could greatly increase their value, as that information could be used to connect equipment to its documentation and other data. This thesis studies the feasibility of such extraction as a special case of deep-learning text detection and recognition.
This work contrasts the practical requirements of industry with the theory and research behind text detection and recognition, with experiments conducted to confirm the feasibility of a potential application. The task is found to be feasible from both business-domain and deep-learning perspectives. In practice, off-the-shelf text detection models generalize very well to the problem, but integrating text recognition to build an end-to-end solution is found to require further work. End-to-end text recognition models appear promising in research but remain rather uncommon in practical applications. Recent laser scans that include color imagery are found suitable for the task, and using them for recognition is feasible; however, the usefulness of older scans remains unclear due to their poor quality. Deploying a successful practical solution is thus possible with modern scans, but acquiring such scans may require collaboration with facility operators.
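The two-stage integration discussed in the thesis can be pictured as a detect-then-recognize loop over the panoramic imagery. In the sketch below, detector and recognizer are hypothetical placeholders for off-the-shelf models, not tools named in the work.

    # Hypothetical detect-then-recognize pipeline over a panoramic scan;
    # `detector` returns text-region boxes and `recognizer` transcribes crops.
    def read_tags(panorama, detector, recognizer):
        tags = []
        for box in detector(panorama):   # candidate text regions in the scan
            crop = panorama.crop(box)    # cut out one candidate tag
            text = recognizer(crop)      # transcribe the cropped region
            tags.append((box, text))
        return tags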