8,222 research outputs found

    MANGO: A Mask Attention Guided One-Stage Scene Text Spotter

    Full text link
    Recently end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region of interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such framework, the recognition part is highly sensitive to the detected results (e.g.), the compactness of text contours). To address this problem, in this paper, we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operation. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated on different feature map channels which are further grouped as a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g.), rectangular bounding box) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.Comment: Accepted to AAAI2021. Code is available at https://davar-lab.github.io/publication.html or https://github.com/hikopensource/DAVAR-Lab-OC

    Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter

    Full text link
    Typical text spotters follow the two-stage spotting strategy: detect the precise boundary for a text instance first and then perform text recognition within the located text region. While such strategy has achieved substantial progress, there are two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, resulting in the potential error propagation from detection to recognition. 2) The RoI cropping which bridges the detection and recognition brings noise from background and leads to information loss when pooling or interpolating from feature maps. In this work we propose the single shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them by the shared positive anchor point. Consequently, our method is able to recognize the text instances correctly even though the precise text boundaries are challenging to detect. Additionally, our method reduces the annotation cost for text detection substantially. Extensive experiments on regular-shaped benchmark and arbitrary-shaped benchmark demonstrate that our SRSTS compares favorably to previous state-of-the-art spotters in terms of both accuracy and efficiency.Comment: To be appeared in the Proceedings of the ACM International Conference on Multimedia (ACM MM), 202

    Turning a CLIP Model into a Scene Text Spotter

    Full text link
    We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.7% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2% and 0.56% in text detection and spotting tasks, along with a 48.5% increase in inference speed. 3) It showcases robust few-shot training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 5.5% for text detection and spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection. The code is available at https://github.com/wenwenyu/TCM.Comment: arXiv admin note: text overlap with arXiv:2302.1433

    Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes

    Full text link
    When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes to establish an important case study. Moreover, we also design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter which achieves comparable or superior performance over existing text spotting architectures for both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance.Comment: 10 image

    Audio-visual modelling in a clinical setting

    Get PDF
    Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals—usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for the model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach is able to help find standard anatomical planes, predict the focusing position of sonographer’s eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset show that the proposed novel representation learning method provides good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions. Being able to learn such medical representations in a self-supervised manner will contribute to several aspects including a better understanding of obstetric imaging, training new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow

    Customized mask region based convolutional neural networks for un-uniformed shape text detection and text recognition

    Get PDF
    In image scene, text contains high-level of important information that helps to analyze and consider the particular environment. In this paper, we adapt image mask and original identification of the mask region based convolutional neural networks (R-CNN) to allow recognition at 3 levels such as sequence, holistic and pixel-level semantics. Particularly, pixel and holistic level semantics can be utilized to recognize the texts and define the text shapes, respectively. Precisely, in mask and detection, we segment and recognize both character and word instances. Furthermore, we implement text detection through the outcome of instance segmentation on 2-D feature-space. Also, to tackle and identify the text issues of smaller and blurry texts, we consider text recognition by attention-based of optical character recognition (OCR) model with the mask R-CNN at sequential level. The OCR module is used to estimate character sequence through feature maps of the word instances in sequence to sequence. Finally, we proposed a fine-grained learning technique that trains a more accurate and robust model by learning models from the annotated datasets at the word level. Our proposed approach is evaluated on popular benchmark dataset ICDAR 2013 and ICDAR 2015

    Audio-visual modelling in a clinical setting

    Get PDF
    Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals—usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for the model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach is able to help find standard anatomical planes, predict the focusing position of sonographer’s eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset show that the proposed novel representation learning method provides good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions. Being able to learn such medical representations in a self-supervised manner will contribute to several aspects including a better understanding of obstetric imaging, training new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow

    SPTS v2: Single-Point Scene Text Spotting

    Full text link
    End-to-end scene text spotting has made significant progress due to its intrinsic synergy between text detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated rectangles, quadrangles, and polygons as a prerequisite, which are much more expensive than using single-point. For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost single-point annotation by the proposed framework, termed SPTS v2. SPTS v2 reserves the advantage of the auto-regressive Transformer with an Instance Assignment Decoder (IAD) through sequentially predicting the center points of all text instances inside the same predicting sequence, while with a Parallel Recognition Decoder (PRD) for text recognition in parallel. These two decoders share the same parameters and are interactively connected with a simple but effective information transmission process to pass the gradient and information. Comprehensive experiments on various existing benchmark datasets demonstrate the SPTS v2 can outperform previous state-of-the-art single-point text spotters with fewer parameters while achieving 19Ă—\times faster inference speed. Most importantly, within the scope of our SPTS v2, extensive experiments further reveal an important phenomenon that single-point serves as the optimal setting for the scene text spotting compared to non-point, rectangular bounding box, and polygonal bounding box. Such an attempt provides a significant opportunity for scene text spotting applications beyond the realms of existing paradigms. Code will be available at https://github.com/bytedance/SPTSv2.Comment: arXiv admin note: text overlap with arXiv:2112.0791

    DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

    Full text link
    End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations and thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel, solving the sub-tasks in text spotting in a unified framework. Besides, we also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations, which require much less annotation cost than polygons. The code will be released.Comment: The code will be available at https://github.com/ViTAE-Transformer/DeepSol
    • …
    corecore