
    MANGO: A Mask Attention Guided One-Stage Scene Text Spotter

    Recently, end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region-of-interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such a framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, in this paper we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be recognized directly without RoI operations. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated to different feature-map channels, which are further grouped into a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end to end with only coarse position information (e.g., a rectangular bounding box) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.
    Comment: Accepted to AAAI 2021. Code is available at https://davar-lab.github.io/publication.html or https://github.com/hikopensource/DAVAR-Lab-OC
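    To make the channel-allocation idea concrete, below is a minimal PyTorch sketch of a position-aware mask attention module: each of S×S grid positions owns one attention channel, so text instances falling into different grid cells are routed onto different feature-map channels before a lightweight classifier decodes characters. All shapes, layer choices, and names (grid, max_len, vocab) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of position-aware mask attention (not the official MANGO code).
import torch
import torch.nn as nn

class PositionAwareMaskAttention(nn.Module):
    def __init__(self, in_ch=256, grid=10, max_len=25, vocab=97):
        super().__init__()
        # One attention map per grid cell: instances in different cells
        # end up on different channels.
        self.inst_att = nn.Conv2d(in_ch, grid * grid, kernel_size=1)
        # One attention map per character slot within an instance.
        self.char_att = nn.Conv2d(in_ch, max_len, kernel_size=1)
        self.classifier = nn.Linear(in_ch, vocab)  # stand-in for the light decoder

    def forward(self, feat):                            # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        inst = self.inst_att(feat).flatten(2).softmax(-1)   # (B, S*S, HW)
        char = self.char_att(feat).flatten(2).softmax(-1)   # (B, T, HW)
        # Joint instance-by-character attention over spatial positions.
        att = inst.unsqueeze(2) * char.unsqueeze(1)         # (B, S*S, T, HW)
        # Pool features per (instance, character) slot: -> (B, S*S, T, C).
        pooled = torch.einsum('bnth,bhc->bntc', att,
                              feat.flatten(2).transpose(1, 2))
        return self.classifier(pooled)                      # (B, S*S, T, vocab)

logits = PositionAwareMaskAttention()(torch.randn(2, 256, 32, 32))
print(logits.shape)  # torch.Size([2, 100, 25, 97])
```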

    TTS: Hilbert Transform-based Generative Adversarial Network for Tattoo and Scene Text Spotting

    Text spotting in natural scenes is of increasing interest and significance due to its critical role in several applications, such as visual question answering, named entity recognition, and event rumor detection on social media. One newly emerging and challenging problem is Tattoo Text Spotting (TTS) in images, for assisting forensic teams and for person identification. Unlike the generally simpler scene text addressed by current state-of-the-art methods, tattoo text is typically characterized by the presence of decorative backgrounds, calligraphic handwriting, and several distortions due to the deformable nature of the skin. This paper describes the first approach to address TTS in a real-world application context by designing an end-to-end text spotting method employing a Hilbert transform-based Generative Adversarial Network (GAN). To reduce the complexity of the TTS task, the proposed approach first detects fine details in the image using the Hilbert transform and Optimum Phase Congruency (OPC). To overcome the challenge of having only a relatively small number of training samples, a GAN is then used to generate suitable text samples and descriptors for text spotting (i.e., both detection and recognition). The proposed TTS approach is shown to outperform state-of-the-art methods on both tattoo and general scene text, as demonstrated on a new TTS-specific dataset (publicly available) as well as on the existing benchmark natural scene text datasets: Total-Text, CTW1500, and ICDAR 2015.
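    As a rough illustration of the Hilbert-transform preprocessing step, the sketch below computes the analytic signal along image rows and columns and uses the magnitude of the Hilbert component as a fine-detail map that responds to thin strokes. The paper's actual OPC formulation is more involved; this is an assumed simplification, not the authors' implementation.

```python
# Hypothetical sketch of Hilbert-transform-based fine-detail detection.
import numpy as np
from scipy.signal import hilbert

def hilbert_detail_map(gray: np.ndarray) -> np.ndarray:
    """gray: 2D float image in [0, 1]; returns a normalized detail-energy map."""
    # Analytic signal along rows and columns; the imaginary part is the
    # Hilbert transform of each 1-D slice.
    ax = hilbert(gray, axis=1)   # row-wise analytic signal
    ay = hilbert(gray, axis=0)   # column-wise analytic signal
    # Local energy: the Hilbert component responds to rapid intensity
    # oscillations such as thin calligraphic strokes.
    energy = np.sqrt(np.imag(ax) ** 2 + np.imag(ay) ** 2)
    return energy / (energy.max() + 1e-8)

img = np.zeros((64, 64), dtype=np.float64)
img[20:44, 30:34] = 1.0              # a thin vertical "stroke"
detail = hilbert_detail_map(img)
print(detail.shape, float(detail.max()))
```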

    Computational Efficiency Studies in Computer Vision Tasks

    Computer vision has made massive progress in recent years, thanks to developments in hardware and algorithms. However, most methods are performance-driven and pay little attention to energy efficiency. This dissertation proposes methods for boosting computational efficiency in three vision tasks: ultra-high-resolution image segmentation, optical character recognition (OCR) for Unmanned Aerial Vehicle (UAV) videos, and multiple-object detection for UAV videos.

    The pattern distribution of ultra-high-resolution images is usually unbalanced: while part of an image contains complex and fine-grained patterns such as boundaries, most areas are composed of simple and repeated patterns. In the first chapter, we propose to learn a skip map, which guides a segmentation network to skip simple patterns and hence reduces computational complexity. Specifically, the skip map highlights simple-pattern areas that can be down-sampled and processed at a lower resolution, while the remaining complex parts are still segmented at the original resolution. Applied to GLNet, a state-of-the-art ultra-high-resolution image segmentation network, the proposed skip map saves more than 30% of the computation while maintaining comparable segmentation performance.

    In the second chapter, we propose an end-to-end OCR framework for UAV videos. We first revisit R-CNN's crop-and-resize training strategy and empirically find that it outperforms aligned RoI sampling on a real-world video text dataset captured by UAV. We further propose a multi-stage image processor that takes the videos' redundancy, continuity, and mixed degradation into account to reduce energy consumption. Lastly, the model is pruned and quantized before being deployed on a Raspberry Pi. Our energy-efficient video text spotting solution, dubbed E²VTS, outperforms all previous methods by achieving a competitive trade-off between energy efficiency and performance.

    In the last chapter, we propose an energy-efficient solution for detecting multiple objects in video. Besides designing a fast multiple-object detector, we propose a data synthesis method and a knowledge-transfer-based annotation method to overcome class imbalance and domain gap issues. This solution was implemented for the LPCVC 2021 UAV challenge and judged to be the first-place winner.
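    To illustrate the skip-map mechanism from the first chapter, here is a conceptual PyTorch sketch: a cheap per-patch predictor marks regions as "simple", simple regions are segmented from a half-resolution copy, and complex regions keep full resolution. The patch size, the soft blending, and both networks are assumed placeholders, not the dissertation's actual GLNet integration.

```python
# Hypothetical sketch of skip-map-guided mixed-resolution segmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipMapSegmenter(nn.Module):
    def __init__(self, classes=2, patch=64):
        super().__init__()
        # Cheap predictor: one score per patch; high = simple/skippable.
        self.skip_head = nn.Conv2d(3, 1, kernel_size=patch, stride=patch)
        self.seg = nn.Conv2d(3, classes, 3, padding=1)  # stand-in segmenter

    def forward(self, x):                                # x: (B, 3, H, W)
        skip = torch.sigmoid(self.skip_head(x))          # (B, 1, H/p, W/p)
        skip = F.interpolate(skip, size=x.shape[-2:])    # per-pixel skip map
        # Cheap branch: segment a half-resolution copy, upsample the logits.
        low = F.interpolate(x, scale_factor=0.5, mode='bilinear',
                            align_corners=False)
        cheap = F.interpolate(self.seg(low), size=x.shape[-2:])
        full = self.seg(x)                               # expensive branch
        # Soft blend for differentiable training; at inference the expensive
        # branch would run only on patches the skip map marks as complex.
        return skip * cheap + (1 - skip) * full

out = SkipMapSegmenter()(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 2, 256, 256])
```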