8 research outputs found

    Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

    Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.

    PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network

    The reading of arbitrarily-shaped text has received increasing research attention. However, existing text spotters are mostly built on two-stage frameworks or character-based methods, which suffer from either Non-Maximum Suppression (NMS), Region-of-Interest (RoI) operations, or character-level annotations. In this paper, to address the above problems, we propose a novel fully convolutional Point Gathering Network (PGNet) for reading arbitrarily-shaped text in real time. PGNet is a single-shot text spotter in which the pixel-level character classification map is learned with the proposed PG-CTC loss, avoiding the use of character-level annotations. With a PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS or RoI operations, which guarantees high efficiency. Additionally, to reason about the relations between each character and its neighbors, a graph refinement module (GRM) is proposed to optimize the coarse recognition and improve the end-to-end performance. Experiments show that the proposed method achieves competitive accuracy while significantly improving the running speed. In particular, on Total-Text, it runs at 46.7 FPS, surpassing previous spotters by a large margin. Comment: 10 pages, 8 figures, AAAI 202
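
The decoding step mentioned in this abstract, turning per-point character classification vectors into text symbols, can be illustrated with a generic CTC best-path decode. This is a minimal sketch of the standard CTC collapse rule only, not the paper's PG-CTC decoder: the function names, the blank index of 0, and the toy alphabet are assumptions made for illustration, and the point-gathering step along the text center line is not modeled.

```python
from itertools import groupby


def argmax_per_point(score_rows):
    """Pick the highest-scoring class index for each gathered point.

    `score_rows` is a list of per-point class-score lists (hypothetical input).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


def ctc_greedy_decode(class_ids, blank=0, alphabet=None):
    """Standard CTC best-path rule: merge repeated labels, then drop blanks.

    `blank=0` is an assumed convention, not taken from the paper.
    """
    collapsed = [label for label, _ in groupby(class_ids)]
    labels = [label for label in collapsed if label != blank]
    if alphabet is None:
        return labels
    return "".join(alphabet[label] for label in labels)


# Toy usage: index 0 is the blank, 1 -> "a", 2 -> "b" (illustrative alphabet).
point_scores = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
decoded = ctc_greedy_decode(argmax_per_point(point_scores), alphabet="-ab")
```

Because the collapse rule needs only a sequence of class indices, no character-level boxes are required at decode time, which is consistent with the abstract's claim of avoiding character-level annotations.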

    Investigation of Deep Learning Optimization Algorithms in Scene Text Detection

    Scene text detection frameworks rely heavily on optimization methods for their successful operation. Choosing an appropriate optimizer is essential to the performance of recent scene text detection models. However, recent deep learning methods often employ various optimization algorithms and loss functions without explicitly explaining their selections. This paper presents a segmentation-based text detection pipeline capable of handling arbitrary-shaped text instances in wild images. We explore the effectiveness of well-known deep-learning optimizers to enhance the pipeline's capabilities. Additionally, we introduce a novel Segmentation-based Attention Module (SAM) that enables the model to capture long-range dependencies of multi-scale feature maps and focus more accurately on regions likely to contain text instances. The performance of the proposed architecture is extensively evaluated through ablation experiments, exploring the impact of different optimization algorithms and the introduced SAM block. Furthermore, we compare the final model against state-of-the-art scene text detection techniques on three publicly available benchmark datasets, namely ICDAR15, MSRA-TD500, and Total-Text. Our experimental results demonstrate that the focal loss combined with the Stochastic Gradient Descent (SGD) + Momentum optimizer with a poly learning-rate policy achieves more robust and generalized detection performance than other optimization strategies. Moreover, our architecture, empowered by the proposed SAM block, significantly enhances the overall detection performance, achieving competitive H-mean detection scores while maintaining superior efficiency in terms of Frames Per Second (FPS) compared to recent techniques. Our findings shed light on the importance of selecting appropriate optimization strategies and demonstrate the effectiveness of our proposed Segmentation-based Attention Module in scene text detection tasks.
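
The two training ingredients this abstract singles out, a poly learning-rate policy and the focal loss, can be written down in a few lines. The sketch below uses the commonly cited textbook forms, lr = base_lr * (1 - step/max_steps)^power and FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The default values (power=0.9, gamma=2.0, alpha=0.25) and the function names are assumptions chosen for illustration; the abstract does not state the settings the authors used.

```python
import math


def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly learning-rate policy: decay from base_lr to 0 over max_steps.

    power=0.9 is a common choice, assumed here rather than taken from the paper.
    """
    step = min(step, max_steps)
    return base_lr * (1.0 - step / max_steps) ** power


def binary_focal_loss(p, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one pixel with predicted text probability p.

    gamma and alpha are assumed defaults; target is 1 (text) or 0 (background).
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp to keep log() finite
    p_t = p if target == 1 else 1.0 - p      # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy usage: learning rate halfway through training, and the loss on an
# easy versus a hard positive pixel.
lr_mid = poly_lr(base_lr=0.01, step=500, max_steps=1000)
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
```

The (1 - p_t)^gamma factor shrinks the loss on confidently classified pixels, so the many easy background pixels in a segmentation map contribute less to the gradient than the hard ones.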

    TTS: Hilbert Transform-based Generative Adversarial Network for Tattoo and Scene Text Spotting

    Text spotting in natural scenes is of increasing interest and significance due to its critical role in several applications, such as visual question answering, named entity recognition and event rumor detection on social media. One of the newly emerging challenging problems is Tattoo Text Spotting (TTS) in images for assisting forensic teams and for person identification. Unlike the generally simpler scene text addressed by current state-of-the-art methods, tattoo text is typically characterized by the presence of decorative backgrounds, calligraphic handwriting and several distortions due to the deformable nature of the skin. This paper describes the first approach to address TTS in a real-world application context by designing an end-to-end text spotting method employing a Hilbert transform-based Generative Adversarial Network (GAN). To reduce the complexity of the TTS task, the proposed approach first detects fine details in the image using the Hilbert transform and the Optimum Phase Congruency (OPC). To overcome the challenge of having only a relatively small number of training samples, a GAN is then used for generating suitable text samples and descriptors for text spotting (i.e. both detection and recognition). The superior performance of the proposed TTS approach, for both tattoo and general scene text, over the state-of-the-art methods is demonstrated on a new TTS-specific dataset (publicly available) as well as on the existing benchmark natural scene text datasets: Total-Text, CTW1500 and ICDAR 2015.

    Topological Deep Learning: Going Beyond Graph Data

    Topological deep learning is a rapidly growing field that pertains to the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations. In this paper, we present a unifying deep learning framework built upon a richer data structure that includes widely adopted topological domains. Specifically, we first introduce combinatorial complexes, a novel type of topological domain. Combinatorial complexes can be seen as generalizations of graphs that maintain certain desirable properties. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations. In addition, combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial and cell complexes. Thus, combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces. Second, building upon combinatorial complexes and their rich combinatorial and algebraic structure, we develop a general class of message-passing combinatorial complex neural networks (CCNNs), focusing primarily on attention-based CCNNs. We characterize permutation and orientation equivariances of CCNNs, and discuss pooling and unpooling operations within CCNNs in detail. Third, we evaluate the performance of CCNNs on tasks related to mesh shape analysis and graph learning. Our experiments demonstrate that CCNNs have competitive performance as compared to state-of-the-art deep learning models specifically tailored to the same tasks. Our findings demonstrate the advantages of incorporating higher-order relations into deep learning models in different applications.