12 research outputs found

    PMODE: Prototypical Mask based Object Dimension Estimation

    Full text link
    Can a neural network estimate an object's dimension in the wild? In this paper, we propose a method and deep learning architecture to estimate the dimensions of a quadrilateral object of interest in videos using a monocular camera. The proposed technique does not use camera calibration or handcrafted geometric features; however, features are learned with the help of coefficients of a segmentation neural network during the training process. A real-time instance segmentation-based Deep Neural Network with a ResNet50 backbone is employed, giving the object's prototype mask and thus provides a region of interest to regress its dimensions. The instance segmentation network is trained to look at only the nearest object of interest. The regression is performed using an MLP head which looks only at the mask coefficients of the bounding box detector head and the prototype segmentation mask. We trained the system with three different random cameras achieving 22% MAPE for the test dataset for the dimension estimationComment: 10 page

    Text-detection and -recognition from natural images

    Get PDF
    Text detection and recognition from images could have numerous functional applications for document analysis, such as assistance for visually impaired people; recognition of vehicle license plates; evaluation of articles containing tables, street signs, maps, and diagrams; keyword-based image exploration; document retrieval; recognition of parts within industrial automation; content-based extraction; object recognition; address block location; and text-based video indexing. This research exploited the advantages of artificial intelligence (AI) to detect and recognise text from natural images. Machine learning and deep learning were used to accomplish this task.In this research, we conducted an in-depth literature review on the current detection and recognition methods used by researchers to identify the existing challenges, wherein the differences in text resulting from disparity in alignment, style, size, and orientation combined with low image contrast and a complex background make automatic text extraction a considerably challenging and problematic task. Therefore, the state-of-the-art suggested approaches obtain low detection rates (often less than 80%) and recognition rates (often less than 60%). This has led to the development of new approaches. The aim of the study was to develop a robust text detection and recognition method from natural images with high accuracy and recall, which would be used as the target of the experiments. This method could detect all the text in the scene images, despite certain specific features associated with the text pattern. Furthermore, we aimed to find a solution to the two main problems concerning arbitrarily shaped text (horizontal, multi-oriented, and curved text) detection and recognition in a low-resolution scene and with various scales and of different sizes.In this research, we propose a methodology to handle the problem of text detection by using novel combination and selection features to deal with the classification algorithms of the text/non-text regions. The text-region candidates were extracted from the grey-scale images by using the MSER technique. A machine learning-based method was then applied to refine and validate the initial detection. The effectiveness of the features based on the aspect ratio, GLCM, LBP, and HOG descriptors was investigated. The text-region classifiers of MLP, SVM, and RF were trained using selections of these features and their combinations. The publicly available datasets ICDAR 2003 and ICDAR 2011 were used to evaluate the proposed method. This method achieved the state-of-the-art performance by using machine learning methodologies on both databases, and the improvements were significant in terms of Precision, Recall, and F-measure. The F-measure for ICDAR 2003 and ICDAR 2011 was 81% and 84%, respectively. The results showed that the use of a suitable feature combination and selection approach could significantly increase the accuracy of the algorithms.A new dataset has been proposed to fill the gap of character-level annotation and the availability of text in different orientations and of curved text. The proposed dataset was created particularly for deep learning methods which require a massive completed and varying range of training data. The proposed dataset includes 2,100 images annotated at the character and word levels to obtain 38,500 samples of English characters and 12,500 words. Furthermore, an augmentation tool has been proposed to support the proposed dataset. The missing of object detection augmentation tool encroach to proposed tool which has the ability to update the position of bounding boxes after applying transformations on images. This technique helps to increase the number of samples in the dataset and reduce the time of annotations where no annotation is required. The final part of the thesis presents a novel approach for text spotting, which is a new framework for an end-to-end character detection and recognition system designed using an improved SSD convolutional neural network, wherein layers are added to the SSD networks and the aspect ratio of the characters is considered because it is different from that of the other objects. Compared with the other methods considered, the proposed method could detect and recognise characters by training the end-to-end model completely. The performance of the proposed method was better on the proposed dataset; it was 90.34. Furthermore, the F-measure of the method’s accuracy on ICDAR 2015, ICDAR 2013, and SVT was 84.5, 91.9, and 54.8, respectively. On ICDAR13, the method achieved the second-best accuracy. The proposed method could spot text in arbitrarily shaped (horizontal, oriented, and curved) scene text.</div

    Generic signboard detection in image and video.

    Get PDF
    by Shen Hua.Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.Includes bibliographical references (leaves 67-71).Abstracts in English and Chinese.Abstract --- p.i摘要 --- p.iiiAcknowledgments --- p.vTable of Contents --- p.viiList of Figures --- p.ixChapter Chapter 1 --- Introduction --- p.1Chapter 1.1 --- Object Detection --- p.2Chapter 1.2 --- Signboard Detection --- p.3Chapter Chapter 2 --- System Overview --- p.5Chapter 2.1 --- What is the problem? --- p.5Chapter 2.2 --- Review of previous work --- p.6Chapter 2.3 --- System Outline --- p.8Chapter Chapter 3 --- Preprocessing --- p.10Chapter 3.1 --- Edge Detection --- p.11Chapter 3.1.1 --- Gradient-Based Method --- p.11Chapter 3.1.2 --- Laplacian of Gaussian --- p.14Chapter 3.1.3 --- Canny edge detection --- p.15Chapter 3.2 --- Corner Detection --- p.18Chapter Chapter 4 --- Finding Candidate Lines --- p.22Chapter 4.1 --- Hough Transform --- p.22Chapter 4.1.1 --- What is Hough Transform --- p.22Chapter 4.1.2 --- Parameter Space --- p.22Chapter 4.1.3 --- Accumulator Array --- p.24Chapter 4.2 --- Gradient-based Hough Transform --- p.25Chapter 4.2.1 --- Direction of Gradient --- p.26Chapter 4.2.2 --- Accumulator Array --- p.28Chapter 4.2.3 --- Peaks in the accumulator array --- p.30Chapter 4.2.4 --- Performance of Gradient-based Hough Transform --- p.32Chapter Chapter 5 --- Signboards Locating --- p.35Chapter 5.1 --- Line Verification --- p.35Chapter 5.1.1 --- Line Segmentation --- p.35Chapter 5.1.2 --- Density Checking --- p.37Chapter 5.2 --- Finding Close Circuits --- p.40Chapter 5.3 --- Remove Redundant Segments --- p.47Chapter Chapter 6 --- Post processing --- p.54Chapter Chapter 7 --- Experiments and Conclusion --- p.59Chapter 7.1 --- Experimental Results --- p.59Chapter 7.2 --- Conclusion --- p.66Bibliography --- p.6

    ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

    Full text link
    In recent years, end-to-end scene text spotting approaches are evolving to the Transformer-based framework. While previous studies have shown the crucial importance of the intrinsic synergy between text detection and recognition, recent advances in Transformer-based methods usually adopt an implicit synergy strategy with shared query, which can not fully realize the potential of these two interactive tasks. In this paper, we argue that the explicit synergy considering distinct characteristics of text detection and recognition can significantly improve the performance text spotting. To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder. Specifically, we decompose the conventional shared query into task-aware queries for text polygon and content, respectively. Through the decoder with the proposed vision-language communication module, the queries interact with each other in an explicit manner while preserving discriminative patterns of text detection and recognition, thus improving performance significantly. Additionally, we propose a task-aware query initialization scheme to ensure stable training. Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods. Code is available at https://github.com/mxin262/ESTextSpotter.Comment: Accepted to ICCV 202

    Development of a text reading system on video images

    Get PDF
    Since the early days of computer science researchers sought to devise a machine which could automatically read text to help people with visual impairments. The problem of extracting and recognising text on document images has been largely resolved, but reading text from images of natural scenes remains a challenge. Scene text can present uneven lighting, complex backgrounds or perspective and lens distortion; it usually appears as short sentences or isolated words and shows a very diverse set of typefaces. However, video sequences of natural scenes provide a temporal redundancy that can be exploited to compensate for some of these deficiencies. Here we present a complete end-to-end, real-time scene text reading system on video images based on perspective aware text tracking. The main contribution of this work is a system that automatically detects, recognises and tracks text in videos of natural scenes in real-time. The focus of our method is on large text found in outdoor environments, such as shop signs, street names and billboards. We introduce novel efficient techniques for text detection, text aggregation and text perspective estimation. Furthermore, we propose using a set of Unscented Kalman Filters (UKF) to maintain each text region¿s identity and to continuously track the homography transformation of the text into a fronto-parallel view, thereby being resilient to erratic camera motion and wide baseline changes in orientation. The orientation of each text line is estimated using a method that relies on the geometry of the characters themselves to estimate a rectifying homography. This is done irrespective of the view of the text over a large range of orientations. We also demonstrate a wearable head-mounted device for text reading that encases a camera for image acquisition and a pair of headphones for synthesized speech output. Our system is designed for continuous and unsupervised operation over long periods of time. It is completely automatic and features quick failure recovery and interactive text reading. It is also highly parallelised in order to maximize the usage of available processing power and to achieve real-time operation. We show comparative results that improve the current state-of-the-art when correcting perspective deformation of scene text. The end-to-end system performance is demonstrated on sequences recorded in outdoor scenarios. Finally, we also release a dataset of text tracking videos along with the annotated ground-truth of text regions

    Deep Learning for Scene Text Detection, Recognition, and Understanding

    Get PDF
    Detecting and recognizing texts in images is a long-standing task in computer vision. The goal of this task is to extract textual information from images and videos, such as recognizing license plates. Despite that the great progresses have been made in recent years, it still remains challenging due to the wide range of variations in text appearance. In this thesis, we aim to review the existing issues that hinder current Optical Character Recognition (OCR) development and explore potential solutions. Specifically, we first investigate the phenomenon of unfair comparisons between different OCR algorithms caused due to the lack of a consistent evaluation framework. Such an absence of a unified evaluation protocol leads to inconsistent and unreliable results, making it difficult to compare and improve upon existing methods. To tackle this issue, we design a new evaluation framework from the aspect of datasets, metrics, and models, enabling consistent and fair comparisons between OCR systems. Another issue existing in the field is the imbalanced distribution of training samples. In particular, the sample distribution largely depended on where and how the data was collected, and the resulting data bias may lead to poor performance and low generalizability on under-represented classes. To address this problem, we took the driving license plate recognition task as an example and proposed a text-to-image model that is able to synthesize photo-realistic text samples. By using this model, we synthesized more than one million samples to augment the training dataset, significantly improving the generalization capability of OCR models. Additionally, this thesis also explores the application of text vision question answering, which is a new and emerging research topic among the OCR community. This task challenges the OCR models to understand the relationships between the text and backgrounds and to answer the given questions. In this thesis, we propose to investigate evidence-based text VQA, which involves designing models that can provide reasonable evidence for their predictions, thus improving the generalization ability.Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 202

    Scanning Documents Using Camera

    Get PDF
    Import 03/08/2012Tato bakalářská práce se zabývá kopírováním textu pomocí fotoaparátu a webových kamer. V úvodní části vysvětluje současný stav, možnosti kopírování textu a popisuje základní používané postupy. Následující část práce je zaměřena na teoretické principy, které jsou rozebrány a sepsány. Tyto principy jsou uplatněny v praktické části se samotným řešením problému. Nakonec jsou provedeny experimenty ověřující správnost řešení problematiky a tyto experimenty jsou doplněny o praktické ukázky. V závěru je shrnuta řešená práce a jsou zhodnoceny výsledky.This bachelor thesis deals with text copying using the camera and webcamera. The introduction section explains current situation, possibilities of text copying and describes the basic procedures used. Next section of the work is focused on theoretical principles, that are analyzed and described. These principles are applied in the practical part with the solution of the problem. Verifivation of the solution correctness is demonstrated using practical aplications. The conclusion summarizes this thesis and evaluates results.460 - Katedra informatikyvýborn
    corecore