Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models
Scene text detection techniques have garnered significant attention due to
their wide-ranging applications. However, existing methods have a high demand
for training data, and obtaining accurate human annotations is labor-intensive
and time-consuming. As a solution, researchers have widely adopted synthetic
text images as a complementary resource to real text images during
pre-training. Yet there is still room for synthetic datasets to enhance the
performance of scene text detectors. We contend that one main limitation of
existing generation methods is the insufficient integration of foreground text
with the background. To alleviate this problem, we present the Diffusion Model
based Text Generator (DiffText), a pipeline that utilizes the diffusion model
to seamlessly blend foreground text regions with the background's intrinsic
features. Additionally, we propose two strategies to generate visually coherent
text with fewer spelling errors. With fewer text instances, our produced text
images consistently surpass other synthetic data in aiding text detectors.
Extensive experiments on detecting horizontal, rotated, curved, and line-level
texts demonstrate the effectiveness of DiffText in producing realistic text
images.
PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network
The reading of arbitrarily-shaped text has received increasing research
attention. However, existing text spotters are mostly built on two-stage
frameworks or character-based methods, which suffer from either Non-Maximum
Suppression (NMS), Region-of-Interest (RoI) operations, or character-level
annotations. In this paper, to address the above problems, we propose a novel
fully convolutional Point Gathering Network (PGNet) for reading
arbitrarily-shaped text in real-time. The PGNet is a single-shot text spotter,
where the pixel-level character classification map is learned with the proposed
PG-CTC loss, avoiding the need for character-level annotations. With a PG-CTC
decoder, we gather high-level character classification vectors from
two-dimensional space and decode them into text symbols without NMS and RoI
operations involved, which guarantees high efficiency. Additionally, a graph
refinement module (GRM) that reasons about the relations between each character
and its neighbors is proposed to optimize the coarse recognition and improve the
end-to-end performance. Experiments show that the proposed method achieves
competitive accuracy while significantly improving the running speed. In
particular, on Total-Text, it runs at 46.7 FPS, surpassing previous
spotters by a large margin.
Comment: 10 pages, 8 figures, AAAI 202
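The decoding idea in the PGNet abstract (gather per-pixel character classification vectors along a text instance, then decode them with CTC rules) can be illustrated with a minimal sketch. This is generic greedy CTC decoding over points picked from a 2D score map, not the authors' PG-CTC decoder; the function names, the blank index, and the toy score map are all assumptions made for illustration.

```python
# Hypothetical sketch: greedy CTC decoding over points gathered from a 2D map.
BLANK = 0  # assumed index of the CTC blank class


def gather_points(score_map, points):
    """Pick the class-score vector at each (row, col) point of a text center line."""
    return [score_map[r][c] for r, c in points]


def greedy_ctc_decode(vectors, alphabet):
    """Take the argmax class per vector, collapse repeats, and drop blanks."""
    out = []
    prev = BLANK
    for vec in vectors:
        k = max(range(len(vec)), key=vec.__getitem__)
        if k != BLANK and k != prev:
            out.append(alphabet[k - 1])  # class k maps to alphabet[k - 1]
        prev = k
    return "".join(out)


# Toy 2x4 map with 3 classes per pixel: blank, 'a', 'b'.
A, B, N = [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.9, 0.05, 0.05]
score_map = [[A, A, N, N],
             [N, N, A, B]]
center_line = [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3)]
text = greedy_ctc_decode(gather_points(score_map, center_line), "ab")
```

Because the characters are read directly from the gathered vectors, this style of decoding needs neither RoI cropping nor NMS, which is the efficiency argument the abstract makes.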
Investigation of Deep Learning Optimization Algorithms in Scene Text Detection
Scene text detection frameworks heavily rely on optimization methods for their successful operation. Choosing an appropriate optimizer is essential to the performance of recent scene text detection models. However, recent deep learning methods often employ various optimization algorithms and loss functions without explicitly explaining their selections. This paper presents a segmentation-based text detection pipeline capable of handling arbitrary-shaped text instances in images in the wild. We explore the effectiveness of well-known deep-learning optimizers to enhance the pipeline's capabilities. Additionally, we introduce a novel Segmentation-based Attention Module (SAM) that enables the model to capture long-range dependencies of multi-scale feature maps and focus more accurately on regions likely to contain text instances. The performance of the proposed architecture is extensively evaluated through ablation experiments, exploring the impact of different optimization algorithms and the introduced SAM block. Furthermore, we compare the final model against state-of-the-art scene text detection techniques on three publicly available benchmark datasets, namely ICDAR15, MSRA-TD500, and Total-Text. Our experimental results demonstrate that the focal loss combined with the Stochastic Gradient Descent (SGD) + Momentum optimizer with a poly learning-rate policy achieves more robust and generalized detection performance than other optimization strategies. Moreover, our architecture, empowered by the proposed SAM block, significantly enhances the overall detection performance, achieving competitive H-mean detection scores while maintaining superior efficiency in terms of Frames Per Second (FPS) compared to recent techniques. Our findings shed light on the importance of selecting appropriate optimization strategies and demonstrate the effectiveness of our proposed Segmentation-based Attention Module in scene text detection tasks.
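The training recipe this abstract singles out (focal loss, SGD with momentum, poly learning-rate policy) can be sketched in a few lines. The formulas below are the commonly used textbook forms, and every hyperparameter value (gamma, alpha, power, momentum) is an assumption, since the abstract reports none; this is not the authors' implementation.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one pixel; p is the predicted text probability, y is 0 or 1."""
    p = min(max(p, 1e-7), 1 - 1e-7)           # clamp to keep log() finite
    p_t = p if y == 1 else 1 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly policy: decay the learning rate from base_lr to 0 over max_steps."""
    return base_lr * (1 - step / max_steps) ** power


def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """One SGD + momentum update for a scalar weight; returns (new_w, new_velocity)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

The focal term `(1 - p_t) ** gamma` shrinks the loss on pixels the model already classifies confidently, which matters in text segmentation because background pixels heavily outnumber text pixels.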
TTS: Hilbert Transform-based Generative Adversarial Network for Tattoo and Scene Text Spotting
Text spotting in natural scenes is of increasing interest and significance due to its critical role in several applications, such as visual question answering, named entity recognition and event rumor detection on social media. One of the newly emerging challenging problems is Tattoo Text Spotting (TTS) in images for assisting forensic teams and for person identification. Unlike the generally simpler scene text addressed by current state-of-the-art methods, tattoo text is typically characterized by the presence of decorative backgrounds, calligraphic handwriting and several distortions due to the deformable nature of the skin. This paper describes the first approach to address TTS in a real-world application context by designing an end-to-end text spotting method employing a Hilbert transform-based Generative Adversarial Network (GAN). To reduce the complexity of the TTS task, the proposed approach first detects fine details in the image using the Hilbert transform and the Optimum Phase Congruency (OPC). To overcome the challenge of having only a relatively small number of training samples, a GAN is then used for generating suitable text samples and descriptors for text spotting (i.e. both detection and recognition). The superior performance of the proposed TTS approach, for both tattoo and general scene text, over the state-of-the-art methods is demonstrated on a new TTS-specific dataset (publicly available) as well as on the existing benchmark natural scene text datasets: Total-Text, CTW1500 and ICDAR 2015.
Topological Deep Learning: Going Beyond Graph Data
Topological deep learning is a rapidly growing field that pertains to the
development of deep learning models for data supported on topological domains
such as simplicial complexes, cell complexes, and hypergraphs, which generalize
many domains encountered in scientific computations. In this paper, we present
a unifying deep learning framework built upon a richer data structure that
includes widely adopted topological domains.
Specifically, we first introduce combinatorial complexes, a novel type of
topological domain. Combinatorial complexes can be seen as generalizations of
graphs that maintain certain desirable properties. Similar to hypergraphs,
combinatorial complexes impose no constraints on the set of relations. In
addition, combinatorial complexes permit the construction of hierarchical
higher-order relations, analogous to those found in simplicial and cell
complexes. Thus, combinatorial complexes generalize and combine useful traits
of both hypergraphs and cell complexes, which have emerged as two promising
abstractions that facilitate the generalization of graph neural networks to
topological spaces.
Second, building upon combinatorial complexes and their rich combinatorial
and algebraic structure, we develop a general class of message-passing
combinatorial complex neural networks (CCNNs), focusing primarily on
attention-based CCNNs. We characterize permutation and orientation
equivariances of CCNNs, and discuss pooling and unpooling operations within
CCNNs in detail.
Third, we evaluate the performance of CCNNs on tasks related to mesh shape
analysis and graph learning. Our experiments demonstrate that CCNNs have
competitive performance as compared to state-of-the-art deep learning models
specifically tailored to the same tasks. Our findings demonstrate the
advantages of incorporating higher-order relations into deep learning models in
different applications.
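The two traits the abstract attributes to combinatorial complexes (arbitrary set-type relations, as in hypergraphs, plus a hierarchy of higher-order relations, as in cell complexes) can be sketched as a small data structure. The sketch assumes a definition the abstract does not spell out: each cell is a nonempty set of nodes with a nonnegative integer rank, every node is itself a rank-0 cell, and rank never decreases when a cell grows by inclusion. The class and method names are hypothetical.

```python
class CombinatorialComplex:
    """Toy combinatorial complex: cells are frozensets of nodes, each with a rank."""

    def __init__(self):
        self.rank = {}  # maps frozenset of nodes -> integer rank

    def add_cell(self, nodes, rank):
        cell = frozenset(nodes)
        if not cell or rank < 0:
            raise ValueError("cells must be nonempty with a nonnegative rank")
        # Assumed condition: rank is monotone under strict inclusion.
        for other, r in self.rank.items():
            if (other < cell and r > rank) or (cell < other and rank > r):
                raise ValueError("rank must not decrease under inclusion")
        self.rank[cell] = rank
        for v in cell:  # every node is also a rank-0 cell unless already ranked
            self.rank.setdefault(frozenset([v]), 0)

    def cells_of_rank(self, k):
        return {c for c, r in self.rank.items() if r == k}

    def cofaces(self, nodes, k):
        """Rank-k cells that strictly contain the given cell (a message-passing neighborhood)."""
        cell = frozenset(nodes)
        return {c for c, r in self.rank.items() if r == k and cell < c}


cc = CombinatorialComplex()
cc.add_cell([1, 2], 1)     # edge-like relation
cc.add_cell([2, 3], 1)
cc.add_cell([1, 2, 3], 2)  # face-like relation; the edge {1, 3} is not required
```

Unlike a simplicial complex, the rank-2 cell here does not need all of its sub-edges to be present, which reflects the abstract's point that no constraints are imposed on the set of relations.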