Text Detection and Recognition in the Wild

Abstract

Text detection and recognition (TDR) in highly structured environments with clean backgrounds and consistent fonts (e.g., office documents, postal addresses, and bank cheques) is a well-understood problem (i.e., OCR); however, this is not the case for unstructured environments. The main objective of scene text detection is to locate text within images captured in the wild. For scene text recognition, the techniques map each detected or cropped word image into a character string. Nowadays, deep learning architectures based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) dominate most recent state-of-the-art (SOTA) scene TDR methods. Most reported accuracies of current SOTA TDR methods are in the range of 80% to 90% on benchmark datasets with regular and clear text instances. However, detection and recognition results deteriorate drastically, by 10% and 30% in terms of detection F-measure and word recognition accuracy respectively, on irregular or occluded text images. Transformers and their variants are newer deep learning architectures that mitigate the above-mentioned issues of CNN- and RNN-based pipelines. Unlike RNNs, transformers learn to encode and decode data by attending not only backward but also forward, extracting relevant information from the whole sequence.

This thesis utilizes the transformer architecture to address the irregular (multi-oriented and arbitrarily shaped) and occluded text challenges in wild images. Our main contributions are as follows: (1) We first target irregular TDR with two separate architectures. In Chapter 4, unlike SOTA text detection frameworks that have complex pipelines and use many hand-designed components and post-processing stages, we design a conceptually simpler, end-to-end trainable transformer-based detector for multi-oriented scene text detection, which directly predicts the set of detections (i.e., text and box regions) for an input image. A central contribution of this work is a loss function tailored to the rotated text detection problem that leverages a rotated version of the generalized intersection-over-union (GIoU) score to adequately capture rotated text instances. In Chapter 5, we extend this architecture to arbitrarily shaped scene text detection. We design a new text detection technique that better infers the n vertices of a polygon or the degree of a Bézier curve to represent irregular text instances. We also propose a loss function that incorporates a generalized split intersection-over-union loss defined over piece-wise polygons. In Chapter 6, we show that our transformer-based architecture, which does not rectify the input curved text instances, is better suited for irregular text recognition in wild images than SOTA RNN-based frameworks equipped with rectification modules. Our main contribution in this chapter is leveraging a 2D Learnable Sinusoidal frequencies Positional Encoding (2LSPE) together with a modified feed-forward network to better encode the 2D spatial dependencies of characters in irregular text instances.
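The rotated loss in Chapter 4 builds on the standard generalized intersection-over-union score. As a minimal sketch, for a predicted box A, a ground-truth box B, and their smallest enclosing region C (in the rotated setting, C would be the smallest enclosing rotated region; the exact rotated formulation used in the thesis is not reproduced here):

```latex
\mathrm{GIoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} - \frac{|C \setminus (A \cup B)|}{|C|},
\qquad
\mathcal{L}_{\mathrm{GIoU}} = 1 - \mathrm{GIoU}(A, B)
```

Unlike plain IoU, this score remains informative even when the predicted and ground-truth boxes do not overlap, which makes it a natural starting point for a loss over rotated text boxes.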
(2) Since TDR tasks encounter the same challenging problems (e.g., irregular text, illumination variations, and low-resolution text), we present a new transformer model that detects and recognizes individual characters of text instances in an end-to-end manner. Reading individual characters yields a text spotting model that is robust to occlusion and arbitrarily shaped text, without requiring polygon annotations or the multiple detection and recognition stages used in SOTA text spotting architectures. In Chapter 7, unlike SOTA methods that combine two separate detection and recognition pipelines for complete text reading, we build on our text detection framework and leverage a recent transformer-based technique, the Deformable Patch-based Transformer (DPT), as a feature extraction backbone to robustly read the class and box coordinates of irregular characters in wild images. (3) Finally, we address the occlusion problem using a multi-task end-to-end scene text spotting framework. In Chapter 8, we leverage a recent transformer-based framework, the Masked Autoencoder (MAE), as a backbone for scene text recognition and end-to-end scene text spotting pipelines to overcome the partial occlusion limitation. We design a new multi-task end-to-end transformer network that directly outputs characters, word instances, and their bounding box representations, reducing computational overhead by eliminating multiple processing steps. The proposed unified framework can also detect and recognize arbitrarily shaped text instances without using polygon annotations.
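To illustrate the set-prediction principle shared by the detection and spotting networks described above, the following is a minimal sketch in PyTorch. It is not the thesis implementation; the module name, query count, vocabulary size, and the plain transformer decoder (rather than the DPT or MAE backbones) are assumptions made for illustration. Each learned query is decoded directly into a character class and a normalized bounding box, so no separate post-processing stages are needed:

```python
import torch
import torch.nn as nn


class CharSpottingHead(nn.Module):
    """Hypothetical sketch: every decoder query is mapped directly to a
    character class (plus a 'no object' class) and a normalized box
    (cx, cy, w, h), in the spirit of set-prediction detectors."""

    def __init__(self, d_model=256, num_queries=100, num_classes=96):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)         # learned object queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(d_model, num_classes + 1)       # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                     # (cx, cy, w, h) in [0, 1]

    def forward(self, memory):
        # memory: (batch, num_patches, d_model) image features from a backbone
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)                              # (batch, num_queries, d_model)
        return self.cls_head(hs), self.box_head(hs).sigmoid()


# Usage sketch: decode 100 queries from dummy backbone features.
features = torch.randn(2, 196, 256)
class_logits, boxes = CharSpottingHead()(features)
```

During training, such predictions are typically matched one-to-one to ground-truth instances with a bipartite (Hungarian) matching loss, as in DETR-style detectors, so the network learns to emit "no object" for unused queries.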
