Text Detection and Recognition in the Wild
Text detection and recognition (TDR) in highly structured environments with a clean background and consistent fonts (e.g., office documents, postal addresses, and bank cheques) is a well-understood problem (i.e., OCR); however, this is not the case for unstructured environments.
The main objective for scene text detection is to locate text within images captured in the wild.
For scene text recognition, the techniques map each detected or cropped word image into a character string.
Nowadays, deep learning architectures based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) dominate most recent state-of-the-art (SOTA) scene TDR methods.
Most reported accuracies of current SOTA TDR methods are in the range of 80% to 90% on benchmark datasets with regular and clear text instances. However, detection and recognition results deteriorate drastically, by roughly 10% and 30% in terms of detection F-measure and word recognition accuracy respectively, on irregular or occluded text images.
Transformers and their variants are newer deep learning architectures that mitigate the above-mentioned issues of CNN- and RNN-based pipelines. Unlike RNNs, transformers learn how to encode and decode data by looking not only backward but also forward, extracting relevant information from the whole sequence.
This thesis utilizes the transformer architecture to address the challenges of irregular (multi-oriented and arbitrarily shaped) and occluded text in wild images. Our main contributions are as follows:
(1) We first target solving irregular TDR with two separate architectures, as follows:
In Chapter 4, unlike SOTA text detection frameworks that have complex pipelines and use many hand-designed components and post-processing stages, we design a conceptually simpler, end-to-end trainable transformer-based detector for multi-oriented scene text detection, which directly predicts the set of detections (i.e., text and box regions) of the input image. A central contribution of this work is a loss function tailored to the rotated text detection problem that leverages a rotated version of the generalized intersection-over-union score to adequately capture rotated text instances.
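To make the idea concrete, here is a minimal, non-differentiable sketch of a rotated generalized IoU score for two text boxes given as four-vertex polygons, using shapely for the geometry and the convex hull of the two boxes as the enclosing region; the thesis's actual loss is a differentiable formulation and may differ in its details.

```python
# Illustrative (non-differentiable) rotated GIoU score for two text boxes
# given as 4-vertex polygons. Sketch only; the thesis's loss is differentiable.
from shapely.geometry import Polygon

def rotated_giou(quad_a, quad_b):
    """quad_a, quad_b: lists of four (x, y) vertices of rotated boxes."""
    a, b = Polygon(quad_a), Polygon(quad_b)
    inter = a.intersection(b).area
    union = a.union(b).area
    # smallest convex region enclosing both boxes (stands in for the
    # axis-aligned enclosing box used in the original GIoU)
    hull = a.union(b).convex_hull.area
    iou = inter / union if union > 0 else 0.0
    return iou - (hull - union) / hull   # in (-1, 1], higher is better

# Example: two slightly rotated, overlapping quadrilaterals
print(rotated_giou([(0, 0), (4, 0), (4, 2), (0, 2)],
                   [(1, 0.5), (5, 1.5), (4.5, 3.5), (0.5, 2.5)]))
```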
In Chapter 5, we extend our previous architecture to arbitrarily shaped scene text detection.
We design a new text detection technique that aims to better infer the n vertices of a polygon or the degree of a Bezier curve that represent irregular text instances.
We also propose a loss function that incorporates a generalized split intersection-over-union loss defined over the piece-wise polygons.
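As an illustration of the curve-based representation mentioned above, the sketch below samples a cubic Bezier curve (degree 3, four control points) into a set of vertices that can serve as one boundary of a curved text polygon; the function name and number of samples are illustrative, not the thesis's exact fitting procedure.

```python
import numpy as np

def sample_cubic_bezier(ctrl_pts, n_samples=10):
    """Sample a cubic Bezier curve defined by 4 control points.

    ctrl_pts: array-like of shape (4, 2); returns (n_samples, 2) points that
    can serve as polygon vertices for one boundary of a curved text instance.
    """
    p = np.asarray(ctrl_pts, dtype=float)            # (4, 2)
    t = np.linspace(0.0, 1.0, n_samples)[:, None]    # (n, 1)
    # Bernstein basis of degree 3
    b = np.hstack([(1 - t) ** 3,
                   3 * t * (1 - t) ** 2,
                   3 * t ** 2 * (1 - t),
                   t ** 3])                          # (n, 4)
    return b @ p                                     # (n, 2)

# Example: top boundary of a gently curved word
top_boundary = sample_cubic_bezier([(0, 0), (3, -1), (6, -1), (9, 0)])
```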
In Chapter 6, we show that our transformer-based architecture, without rectifying the input curved text instances, is more suitable for irregular text recognition in wild images than SOTA RNN-based frameworks equipped with rectification modules.
Our main contribution in this chapter is leveraging a 2D Learnable Sinusoidal frequencies Positional Encoding (2LSPE) with a modified feed-forward neural network to better encode the 2D spatial dependencies of characters in irregular text instances.
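For intuition, the following PyTorch sketch shows one way a 2D sinusoidal positional encoding with learnable frequencies could be formed and added to a 2D feature map, with half of the channels encoding the row position and half the column position; this layout is an assumption for illustration and may differ from the exact 2LSPE design.

```python
import torch
import torch.nn as nn

class Learnable2DSinusoidalPE(nn.Module):
    """Illustrative 2D positional encoding with learnable sinusoid frequencies.

    Hypothetical layout: half the channels encode the y (row) position and
    half the x (column) position; frequencies are trainable parameters.
    """
    def __init__(self, d_model: int):
        super().__init__()
        assert d_model % 4 == 0
        quarter = d_model // 4
        self.freq_y = nn.Parameter(torch.rand(quarter))  # one freq per sin/cos pair
        self.freq_x = nn.Parameter(torch.rand(quarter))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature map from the backbone
        b, c, h, w = feat.shape
        ys = torch.arange(h, device=feat.device, dtype=feat.dtype)
        xs = torch.arange(w, device=feat.device, dtype=feat.dtype)
        ang_y = ys[:, None] * self.freq_y[None, :]               # (H, C/4)
        ang_x = xs[:, None] * self.freq_x[None, :]               # (W, C/4)
        pe_y = torch.cat([ang_y.sin(), ang_y.cos()], dim=-1)     # (H, C/2)
        pe_x = torch.cat([ang_x.sin(), ang_x.cos()], dim=-1)     # (W, C/2)
        pe = torch.cat([
            pe_y[:, None, :].expand(h, w, -1),                   # repeat over columns
            pe_x[None, :, :].expand(h, w, -1),                   # repeat over rows
        ], dim=-1)                                               # (H, W, C)
        return feat + pe.permute(2, 0, 1).unsqueeze(0)           # add to features
```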
(2) Since TDR tasks face the same challenging problems (e.g., irregular text, illumination variations, low-resolution text, etc.), we present a new transformer model that detects and recognizes individual characters of text instances in an end-to-end manner. Reading individual characters yields a text spotting model that is robust to occlusion and arbitrarily shaped text, without needing polygon annotations or the multiple detection and recognition stages used in SOTA text spotting architectures.
In Chapter 7, unlike SOTA methods that combine two different pipelines of detection and recognition modules for complete text reading, we build on our text detection framework and leverage a recent transformer-based technique, namely the Deformable Patch-based Transformer (DPT), as a feature-extracting backbone to robustly read the classes and box coordinates of irregular characters in wild images.
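As in DETR-style set prediction, such a reading head matches each predicted character query to at most one ground-truth character via bipartite (Hungarian) matching before the loss is computed. The sketch below illustrates only that matching step, with a hypothetical cost built from class probability and a box L1 term; the actual cost terms used in the thesis may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_characters(pred_probs, pred_boxes, gt_labels, gt_boxes,
                     cls_weight=1.0, box_weight=5.0):
    """Bipartite matching of predicted character queries to ground truth.

    pred_probs: (num_queries, num_classes) softmax class probabilities
    pred_boxes: (num_queries, 4) predicted boxes (cx, cy, w, h)
    gt_labels:  (num_gt,) ground-truth character class indices
    gt_boxes:   (num_gt, 4) ground-truth boxes
    Returns (query_indices, gt_indices) of the optimal one-to-one assignment.
    """
    # classification cost: negative probability of the correct character class
    cost_cls = -pred_probs[:, gt_labels]                                   # (Q, G)
    # box cost: L1 distance between predicted and ground-truth boxes
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cls_weight * cost_cls + box_weight * cost_box
    return linear_sum_assignment(cost)
```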
(3) Finally, we address the occlusion problem using a multi-task end-to-end scene text spotting framework.
In Chapter 8, we leverage a recent transformer-based framework, namely the Masked Autoencoder (MAE), as a backbone for scene text recognition and end-to-end scene text spotting pipelines to overcome the partial-occlusion limitation. We design a new multi-task end-to-end transformer network that directly outputs characters, word instances, and their bounding box representations, saving computational overhead by eliminating multiple processing steps. The proposed unified framework can also detect and recognize arbitrarily shaped text instances without using polygon annotations.
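For reference, the core mechanism of the MAE backbone mentioned above is to randomly drop a large fraction of image patch tokens during pre-training and reconstruct them, which encourages features that tolerate partial occlusion. A minimal sketch of that random masking step (shapes and the 75% ratio are illustrative, not the exact configuration used here):

```python
import torch

def random_mask_patches(tokens, mask_ratio=0.75):
    """Randomly keep a subset of patch tokens, as in MAE pre-training.

    tokens: (B, N, D) patch embeddings; returns the kept tokens (B, N_keep, D)
    and a boolean mask (B, N) that is True where a patch was dropped.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)        # one score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]           # lowest scores are kept
    kept = torch.gather(tokens, 1, keep_idx[:, :, None].expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)                     # False = visible patch
    return kept, mask
```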
Investigation of Deep Learning Optimization Algorithms in Scene Text Detection
Scene text detection frameworks heavily rely on optimization methods for their successful operation. Choosing an appropriate optimizer is essential to the performance of recent scene text detection models. However, recent deep learning methods often employ various optimization algorithms and loss functions without explicitly explaining their selections. This paper presents a segmentation-based text detection pipeline capable of handling arbitrarily shaped text instances in wild images. We explore the effectiveness of well-known deep-learning optimizers in enhancing the pipeline's capabilities. Additionally, we introduce a novel Segmentation-based Attention Module (SAM) that enables the model to capture long-range dependencies of multi-scale feature maps and focus more accurately on regions likely to contain text instances. The performance of the proposed architecture is extensively evaluated through ablation experiments exploring the impact of different optimization algorithms and the introduced SAM block. Furthermore, we compare the final model against state-of-the-art scene text detection techniques on three publicly available benchmark datasets, namely ICDAR15, MSRA-TD500, and Total-Text. Our experimental results demonstrate that the focal loss combined with the Stochastic Gradient Descent (SGD) + Momentum optimizer with a poly learning-rate policy achieves more robust and generalized detection performance than other optimization strategies. Moreover, our architecture, empowered by the proposed SAM block, significantly enhances the overall detection performance, achieving competitive H-mean detection scores while maintaining superior efficiency in terms of Frames Per Second (FPS) compared to recent techniques. Our findings shed light on the importance of selecting appropriate optimization strategies and demonstrate the effectiveness of the proposed Segmentation-based Attention Module in scene text detection tasks.
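For clarity, the "poly" learning-rate policy referred to above decays the learning rate from its base value towards zero as a polynomial of the training iteration; a minimal sketch, assuming the commonly used power of 0.9 (the paper's exact hyper-parameters are not restated here):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate policy: base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / float(max_iter)) ** power

# e.g. with SGD + Momentum in PyTorch, update before each optimizer step:
# for g in optimizer.param_groups:
#     g["lr"] = poly_lr(0.007, cur_iter, max_iter)
```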
Challenges of Deep Learning-based Text Detection in the Wild
The reported accuracy of recent state-of-the-art text detection methods, mostly deep learning approaches, is in the order of 80% to 90% on standard benchmark datasets. These methods have relaxed some of the restrictions on structured text and environments (i.e., they operate "in the wild") that are usually required for classical OCR to function properly. Even with this relaxation, there are still circumstances where these state-of-the-art methods fail. Several remaining challenges in wild images, like in-plane rotation, illumination reflection, partial occlusion, complex font styles, and perspective distortion, cause existing methods to perform poorly. In order to evaluate current approaches in a formal way, we standardize the datasets and metrics for comparison; the lack of such standardization had made comparisons between these methods difficult in the past. We use three benchmark datasets for our evaluations: ICDAR13, ICDAR15, and COCO-Text V2.0. The objective of the paper is to quantify the current shortcomings and to identify the challenges for future text detection research.
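For reference, the detection metrics standardized in this comparison follow the usual IoU-based protocol: a predicted box counts as a true positive if it overlaps an unmatched ground-truth box by at least a threshold (typically 0.5), and precision, recall, and F-measure are computed from those matches. A minimal sketch with axis-aligned boxes (the benchmarks' official protocols include additional rules not shown here):

```python
def iou(a, b):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def detection_f_measure(preds, gts, thr=0.5):
    """Greedy one-to-one matching at IoU >= thr; returns (precision, recall, F)."""
    matched, tp = set(), 0
    for p in preds:
        best = max(((iou(p, g), i) for i, g in enumerate(gts)
                    if i not in matched), default=(0.0, -1))
        if best[0] >= thr:
            matched.add(best[1])
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f
```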
2D Positional Embedding-based Transformer for Scene Text Recognition
Recent state-of-the-art scene text recognition methods are primarily based on Recurrent Neural Networks (RNNs); however, these methods require one-dimensional (1D) features and are not designed for recognizing irregular-text instances, owing to the loss of the spatial information present in the original two-dimensional (2D) images. In this paper, we leverage a Transformer-based architecture for recognizing both regular and irregular text-in-the-wild images. The proposed method takes advantage of a 2D positional encoder with the Transformer architecture to better preserve the spatial information of 2D image features than previous methods. Experiments on popular benchmarks, including the challenging COCO-Text dataset, demonstrate that the proposed scene text recognition method outperforms the state of the art in most cases, especially on irregular-text recognition.
End-to-End Scene Text Spotting at Character Level
This work utilizes a new object detection framework, namely Detection using Transformers (DETR), to spot characters in unconstrained environments (i.e., in the wild), offering a simpler and more robust end-to-end architecture than previous methods. The proposed framework leverages adaptive feature extraction to better focus on the position of character regions and a bounding box loss function that is more precise in spotting characters with different scales and aspect ratios. To evaluate the proposed architecture's effect, we conduct experiments on the ICDAR benchmark designed explicitly for character-level text detection, namely the ICDAR13 dataset. Experimental results show that the proposed method outperforms state-of-the-art detectors when tested on the mentioned benchmark.