Recent Trends and Techniques in Text Detection and Text Localization in a Natural Scene: A Survey
Text information extraction from natural scene images is a growing area of research. Because text in natural scene images generally carries valuable details, detecting and recognizing scene text is deemed essential for a variety of advanced computer vision applications. Considerable effort has gone into extracting text regions from scene images effectively and reliably. Since most text recognition applications demand robust algorithms for detecting and localizing text in a given scene image, researchers focus mainly on two key stages: text detection and text localization. This paper provides a review of various techniques for text detection and text localization.
Text Detection in Natural Scenes and Technical Diagrams with Convolutional Feature Learning and Cascaded Classification
An enormous number of digital images are generated and stored every day. Understanding text in these images is an important challenge with large impacts for academic, industrial and domestic applications. Recent studies address the difficulty of separating text targets from noise and background, all of which vary greatly in natural scenes. To tackle this problem, we develop a text detection system that analyzes and utilizes visual information in a data-driven, automatic and intelligent way.
The proposed method incorporates features learned from data, including patch-based coarse-to-fine detection (Text-Conv), connected component extraction using region growing, and graph-based word segmentation (Word-Graph). Text-Conv is a sliding window-based detector, with convolution masks learned using the Convolutional k-means algorithm (Coates et al., 2011). Unlike convolutional neural networks (CNNs), a single vector/layer of convolution mask responses is used to classify patches. An initial coarse detection considers both local and neighboring patch responses, followed by refinement using varying aspect ratios and rotations for a smaller local detection window. Different levels of visual detail from ground truth are utilized in each step, first using constraints on bounding box intersections, and then a combination of bounding box and pixel intersections. Combining masks from different Convolutional k-means initializations, e.g., seeded using random vectors and then support vectors, improves performance. The Word-Graph algorithm uses contextual information to improve word segmentation and prune false character detections based on visual features and spatial context. Our system obtains pixel, character, and word detection f-measures of 93.14%, 90.26%, and 86.77% respectively on the ICDAR 2015 Robust Reading Focused Scene Text dataset, outperforming state-of-the-art systems and producing highly accurate text detection masks at the pixel level.
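For context, the Convolutional k-means procedure cited above (Coates et al., 2011) learns convolution masks by clustering whitened image patches. The following is a minimal Python sketch under assumed settings (8x8 patches, 64 centroids, a fixed iteration count); it illustrates the general technique, not the system's exact implementation.

    # Minimal sketch of Convolutional k-means feature learning
    # (Coates et al., 2011 style). Sizes are illustrative assumptions.
    import numpy as np

    def extract_patches(images, patch_size=8, n_patches=10000, rng=None):
        """Sample random grayscale patches and flatten them to vectors.

        Assumes each image is a 2D array larger than the patch size."""
        if rng is None:
            rng = np.random.default_rng(0)
        patches = []
        for _ in range(n_patches):
            img = images[rng.integers(len(images))]
            y = rng.integers(img.shape[0] - patch_size)
            x = rng.integers(img.shape[1] - patch_size)
            patches.append(img[y:y + patch_size, x:x + patch_size].ravel())
        return np.asarray(patches, dtype=np.float64)

    def whiten(patches, eps=0.1):
        """Contrast-normalize each patch, then ZCA-whiten the patch matrix."""
        patches = patches - patches.mean(axis=1, keepdims=True)
        patches /= patches.std(axis=1, keepdims=True) + 1e-8
        cov = np.cov(patches, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
        return patches @ zca

    def spherical_kmeans(patches, k=64, n_iter=10, rng=None):
        """K-means with unit-norm centroids; each centroid reshapes to a
        patch-sized convolution mask."""
        if rng is None:
            rng = np.random.default_rng(0)
        centroids = patches[rng.choice(len(patches), k, replace=False)].copy()
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8
        for _ in range(n_iter):
            # Assign each patch to the centroid with the largest dot product.
            assign = (patches @ centroids.T).argmax(axis=1)
            for j in range(k):
                members = patches[assign == j]
                if len(members):
                    c = members.sum(axis=0)
                    centroids[j] = c / (np.linalg.norm(c) + 1e-8)
        return centroids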
To investigate the utility of our feature learning approach for other image types, we perform tests on 8-bit greyscale USPTO patent drawing diagram images. An ensemble of AdaBoost classifiers with different convolutional features (MetaBoost) is used to classify patches as text or background. The Tesseract OCR system is used to recognize characters in detected labels and enhance performance. With appropriate pre-processing and post-processing, f-measures of 82% for part label location, and 73% for valid part label locations and strings, are obtained, which are the best obtained to date for the USPTO patent diagram dataset used in our experiments.
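The MetaBoost idea, an ensemble of AdaBoost classifiers each trained on a different convolutional feature set, could look roughly like the following scikit-learn sketch. The feature views and the probability-averaging combination rule are illustrative assumptions, not the thesis's exact design.

    # Hedged sketch of an AdaBoost ensemble over different feature views.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def train_metaboost(feature_views, labels):
        """Train one AdaBoost classifier per feature view (text vs. background)."""
        models = []
        for X in feature_views:  # each X: (n_patches, n_features) for one view
            clf = AdaBoostClassifier(n_estimators=100)
            clf.fit(X, labels)
            models.append(clf)
        return models

    def predict_metaboost(models, feature_views):
        """Average the per-view text probabilities to score each patch."""
        probs = [m.predict_proba(X)[:, 1] for m, X in zip(models, feature_views)]
        return np.mean(probs, axis=0)  # > 0.5 -> patch classified as text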
To sum up, an intelligent refinement of Convolutional k-means-based feature learning and novel automatic classification methods are proposed for text detection, obtaining state-of-the-art results without the need for strong prior knowledge. Different ground truth representations, along with features including edges, color, shape and spatial relationships, are used coherently to improve accuracy. Different variations of feature learning are explored, e.g., support-vector-seeded clustering and MetaBoost, with results suggesting that increased diversity in learned features benefits convolution-based text detectors.
Deep Structured Multi-Task Learning for Computer Vision in Autonomous Driving
The field of computer vision is currently dominated by advances in deep learning. Convolutional Neural Networks (CNNs) have become the predominant tool for solving almost any computer vision task, and state-of-the-art systems are built on their predictive capabilities. Many of those systems use a simple encoder–decoder design, in which an off-the-shelf CNN architecture is combined with a task-specific decoder and loss function to create an end-to-end trainable model. This ultimately raises the question of whether such models are the future of computer vision.

In this thesis we argue that this is not the case. We start by discussing three limitations of simple end-to-end training. We then show how those limitations can be overcome with an approach we call structured modelling: CNNs compute a rich semantic intermediate representation, which is then used to solve the actual problem by applying a geometric, task-related structure.

In this work we solve the localization, segmentation and landmark recognition tasks using structured modelling, and we show that this approach can improve generalization, interpretability and robustness. We also discuss why the approach is particularly useful for real-time applications such as autonomous driving. Visual perception is a multi-module problem that requires several different computer vision tasks to be solved, and we discuss how sharing computations improves not only inference speed but also prediction performance by exploiting the structural relationships between tasks. Lastly, we demonstrate that structured modelling achieves state-of-the-art performance, making it a highly relevant approach for current and future computer vision problems.
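The shared-computation idea the abstract mentions, one encoder reused by several task heads, can be sketched as follows. This is a minimal PyTorch illustration, not the thesis's actual architecture; the backbone, channel sizes and head shapes are assumptions for demonstration.

    # Minimal sketch: shared CNN encoder with task-specific decoders.
    import torch
    import torch.nn as nn

    class MultiTaskNet(nn.Module):
        def __init__(self, n_classes=10, n_landmarks=5):
            super().__init__()
            # Shared encoder: computed once, reused by every task head.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Task-specific decoders operating on the shared representation.
            self.seg_head = nn.Conv2d(64, n_classes, 1)       # segmentation logits
            self.loc_head = nn.Sequential(                    # box regression
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 4))
            self.lmk_head = nn.Sequential(                    # landmark coordinates
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_landmarks * 2))

        def forward(self, x):
            feats = self.encoder(x)  # shared features amortize inference cost
            return self.seg_head(feats), self.loc_head(feats), self.lmk_head(feats)

    # Usage: one forward pass yields all three task outputs.
    seg, box, lmk = MultiTaskNet()(torch.randn(1, 3, 128, 128))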
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in the computer vision community because it plays a key role in video surveillance. Many algorithms have been proposed to handle this task. The goal of this paper is to review existing works using traditional methods or deep learning networks. Firstly, we introduce the background of pedestrian attribute recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and the corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criteria. Thirdly, we analyse the concepts of multi-task learning and multi-label learning, and explain the relations between these two learning paradigms and pedestrian attribute recognition. We also review some popular network architectures which have been widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attribute grouping, part-based models, \emph{etc}. Fifthly, we show some applications which take pedestrian attributes into consideration and achieve better performance. Finally, we summarize this paper and give several possible research directions for pedestrian attribute recognition. The project page of this paper can be found at the following website: \url{https://sites.google.com/view/ahu-pedestrianattributes/}.
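As a concrete illustration of the multi-label formulation the survey discusses, pedestrian attribute recognition is typically trained with one sigmoid output per attribute and a binary cross-entropy loss, rather than a single softmax over classes. The following minimal PyTorch sketch uses an assumed toy backbone and attribute count, purely for demonstration.

    # Minimal sketch: PAR as multi-label learning with per-attribute sigmoids.
    import torch
    import torch.nn as nn

    n_attributes = 26  # illustrative count (e.g., gender, backpack, hat, ...)

    model = nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, n_attributes),          # one logit per attribute
    )
    criterion = nn.BCEWithLogitsLoss()        # independent binary decisions

    images = torch.randn(8, 3, 256, 128)      # batch of pedestrian crops
    targets = torch.randint(0, 2, (8, n_attributes)).float()
    loss = criterion(model(images), targets)  # multi-label, not multi-class
    loss.backward()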
Automated High-resolution Earth Observation Image Interpretation: Outcome of the 2020 Gaofen Challenge
In this article, we introduce the 2020 Gaofen Challenge and its relevant scientific outcomes. The 2020 Gaofen Challenge is an international competition organized by the China High-Resolution Earth Observation Conference Committee and the Aerospace Information Research Institute, Chinese Academy of Sciences, and technically cosponsored by the IEEE Geoscience and Remote Sensing Society and the International Society for Photogrammetry and Remote Sensing. It aims at promoting the academic development of automated high-resolution earth observation image interpretation. Six independent tracks have been organized in this challenge, covering challenging problems in the fields of object detection and semantic segmentation. With the development of convolutional neural networks, deep-learning-based methods have achieved good performance on image interpretation. In this article, we report the details of the challenge and the best-performing methods presented so far within its scope.
SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection
Multi-modal fusion is increasingly being used for autonomous driving tasks, as images from different modalities provide unique information for feature extraction. However, existing two-stream networks fuse only at a specific network layer, which requires many manual attempts to set up. As the CNN goes deeper, the features of the two modalities become increasingly advanced and abstract, and fusing feature levels separated by a large gap can easily hurt performance. In this study, we propose a novel fusion architecture called skip-cross networks (SkipcrossNets), which adaptively combines LiDAR point clouds and camera images without being bound to a certain fusion point. Specifically, skip-cross connects each layer to each layer in a feed-forward manner: for each layer, the feature maps of all previous layers are used as input, and its own feature maps are passed as input to all subsequent layers of the other modality, enhancing feature propagation and multi-modal feature fusion. This strategy facilitates selection of the most similar feature layers from the two data pipelines, providing a complementary effect for sparse point cloud features during fusion. The network is also divided into several blocks to reduce the complexity of feature fusion and the number of model parameters. The advantages of skip-cross fusion are demonstrated on the KITTI and A2D2 datasets, achieving a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters require only 2.33 MB of memory at a speed of 68.24 FPS, making the network viable for mobile terminals and embedded devices.
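The dense cross-modal connectivity described above can be illustrated with a small PyTorch sketch, in which each layer of one stream receives, alongside its own stream's latest map, the feature maps of all previous layers of the other stream. Channel widths and block depth here are illustrative assumptions, not the paper's configuration.

    # Hedged sketch of a skip-cross block: dense cross-modal connections.
    import torch
    import torch.nn as nn

    class SkipCrossBlock(nn.Module):
        def __init__(self, ch=16, depth=3):
            super().__init__()
            # Layer i consumes its own stream's current map plus the i + 1 maps
            # produced so far by the other stream, hence ch * (i + 2) inputs.
            self.cam = nn.ModuleList(
                nn.Conv2d(ch * (i + 2), ch, 3, padding=1) for i in range(depth))
            self.lid = nn.ModuleList(
                nn.Conv2d(ch * (i + 2), ch, 3, padding=1) for i in range(depth))

        def forward(self, cam_x, lid_x):
            cam_feats, lid_feats = [cam_x], [lid_x]
            for cam_conv, lid_conv in zip(self.cam, self.lid):
                # Skip-cross: every earlier map of the other stream feeds this layer.
                new_cam = torch.relu(
                    cam_conv(torch.cat([cam_feats[-1]] + lid_feats, dim=1)))
                new_lid = torch.relu(
                    lid_conv(torch.cat([lid_feats[-1]] + cam_feats, dim=1)))
                cam_feats.append(new_cam)
                lid_feats.append(new_lid)
            return cam_feats[-1], lid_feats[-1]

    # Usage: fuse camera features with (projected) LiDAR features of equal shape.
    cam_out, lid_out = SkipCrossBlock()(
        torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64))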