17 research outputs found

    WordFences: Text localization and recognition

    In collaboration with the Universitat de Barcelona (UB) and the Universitat Rovira i Virgili (URV). In recent years, text recognition has achieved remarkable success in recognizing scanned document text. However, word recognition in natural images is still an open problem, which generally requires time-consuming post-processing steps. We present a novel architecture for individual word detection in scene images based on semantic segmentation. Our contributions are twofold: the concept of WordFence, which detects border areas surrounding each individual word, and a unique pixelwise weighted softmax loss function which penalizes background and emphasizes small text regions. WordFence ensures that each word is detected individually, and the new loss function provides a strong training signal to both text and word border localization. The proposed technique avoids intensive post-processing by combining semantic word segmentation with a voting scheme for merging segmentations of multiple scales, producing an end-to-end word detection system. We achieve superior localization recall on common benchmark datasets: 92% recall on ICDAR11 and ICDAR13 and 63% recall on SVT. Furthermore, end-to-end word recognition achieves a state-of-the-art 86% F-score on ICDAR13.
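The pixelwise weighted softmax loss mentioned in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the three-class layout (background, text, word border) and the values in `CLASS_WEIGHTS` are assumptions made for illustration, since the abstract does not state the actual weighting scheme.

```python
import numpy as np

# Hypothetical per-class weights (background, text, word border).
# The abstract does not give the real values; these are placeholders.
CLASS_WEIGHTS = np.array([0.5, 1.0, 2.0])


def weighted_softmax_loss(logits, labels, weights):
    """Weighted mean of per-pixel softmax cross-entropy.

    logits:  (H, W, C) raw class scores
    labels:  (H, W) integer class index per pixel
    weights: (H, W) non-negative weight per pixel
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Pick the log-probability of the true class at every pixel.
    rows, cols = np.indices(labels.shape)
    nll = -log_probs[rows, cols, labels]

    # Pixels with larger weights contribute more to the loss.
    return float((weights * nll).sum() / weights.sum())


# Build a weight map from the label map, then evaluate the loss.
labels = np.array([[0, 1], [2, 0]])
weights = CLASS_WEIGHTS[labels]
loss = weighted_softmax_loss(np.zeros((2, 2, 3)), labels, weights)
```

With a weight map like this, errors on border or text pixels cost more than errors on background pixels, which is the general effect the abstract describes.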

    Scene Based Text Recognition From Natural Images and Classification Based on Hybrid CNN Models with Performance Evaluation

    Like the recognition of captions, pictures, or overlapped text that typically appears horizontally, multi-oriented text recognition in video frames is challenging since the text has high contrast relative to its background. Multi-oriented text normally denotes scene text, whose unfavorable characteristics make recognition even more challenging. Hence, conventional text detection approaches may not give good results for multi-oriented scene text detection. Text detection in natural images has long been challenging, and significant progress has been made recently. For blurred, low-resolution, and small-sized images, however, most previous methods do not work well, which leaves a research gap. Scene-based text detection is a key area due to its diverse applications. One primary reason for the failure of earlier methods is that they could not generate precise alignments between feature areas and targets for such images. This research focuses on scene-based text detection with the aid of a YOLO-based object detector and a CNN-based classification approach. The experiments were conducted in MATLAB 2019A, and the packages used were RESNET50, INCEPTIONRESNETV2, and DENSENET201. The proposed Hybrid resnet-YOLO achieved the maximum accuracy of 91%, Hybrid inceptionresnetv2-YOLO 81.2%, and Hybrid densenet201-YOLO 83.1%; these results were verified by comparison with existing research works: Resnet50 at 76.9%, ResNet-101 at 79.5%, and ResNet-152 at 82%.

    Improving Efficiency for Object Detection and Temporal Modeling for Action Localization

    Despite their great predictive capability, Convolutional Neural Networks (CNNs) are computationally expensive to deploy and usually require a tremendous amount of annotated data at training time. When analyzing videos, it is very important and challenging to model temporal dynamics due to large appearance variation and complex semantics. We propose methods to improve the efficiency of model deployment for object detection in images and to capture temporal dependencies for online action detection in videos. To relieve the demand for human labor in data annotation, we introduce approaches to conduct object detection and natural language localization using weak supervision. First, we introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high-resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher-resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain for analyzing a region at a higher resolution and another model (Q-net) that sequentially selects regions to zoom in. Second, we propose a novel framework, Temporal Recurrent Network (TRN), to model greater temporal context of a video frame by simultaneously performing online action detection and anticipation of the immediate future. At each moment in time, our approach makes use of both accumulated historical evidence and predicted future information to better recognize the action that is currently occurring, and integrates both of these into a unified end-to-end architecture. We evaluate our approach on two popular online action detection datasets, HDD and TVSeries, as well as another widely used dataset, THUMOS’14.
Third, we propose StartNet to address Online Detection of Action Start (ODAS), where action starts and their associated categories are detected in untrimmed, streaming videos. Our method decomposes ODAS into two stages: action classification (using ClsNet) and start point localization (using LocNet). ClsNet focuses on per-frame labeling and predicts action score distributions online. Based on the predicted action scores of the past and current frames, LocNet conducts class-agnostic start detection by optimizing long-term localization rewards using policy gradient methods. The proposed framework is validated on two large-scale datasets, THUMOS’14 and ActivityNet. Fourth, we introduce Count-guided Weakly Supervised Localization (C-WSL), an approach that uses per-class object count as a new form of supervision to improve Weakly Supervised Localization (WSL). C-WSL uses a simple count-based region selection algorithm to select high-quality regions, each of which covers a single object instance during training, and improves existing WSL methods by training with the selected regions. To demonstrate the effectiveness of C-WSL, we integrate it into two WSL architectures and conduct extensive experiments on VOC2007 and VOC2012. Finally, we propose Weakly Supervised Language Localization Networks (WSLLN) to detect events in long, untrimmed videos given language queries. WSLLN relieves the annotation burden by training with only video-sentence pairs, without access to the temporal locations of events. With a simple end-to-end structure, WSLLN measures segment-text consistency and conducts segment selection (conditioned on the text) simultaneously. Results from both are merged and optimized as a video-sentence matching problem. Experiments are conducted on ActivityNet Captions and DiDeMo.

    Object Detection in 20 Years: A Survey

    Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development in the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold weapon era. This paper extensively reviews 400+ papers of object detection in the light of its technical evolution, spanning over a quarter-century's time (from the 1990s to 2019). A number of topics have been covered in this paper, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speed-up techniques, and the recent state-of-the-art detection methods. This paper also reviews some important detection applications, such as pedestrian detection, face detection, text detection, etc., and makes an in-depth analysis of their challenges as well as technical improvements in recent years. Comment: This work has been submitted to IEEE TPAMI for possible publication.

    RS-Net: robust segmentation of green overlapped apples

    Fruit detection and segmentation will be essential for future agronomic management, with applications in yield estimation, growth monitoring, intelligent picking, disease detection, etc. To recognize and segment apples in natural orchards more accurately and efficiently, a robust segmentation network framework developed specifically for fruit production is proposed. The model targets the more challenging problem of segmenting overlapped apples from a monochromatic background regardless of various corruptions. The method extends Mask R-CNN by embedding an attention mechanism that focuses on the informative pixels while suppressing the noise caused by adverse factors (occlusions, overlaps, etc.), making it more suitable and robust for operation in complex natural environments. Specifically, the Gaussian non-local attention mechanism is transplanted into Mask R-CNN to refine the semantic features generated continuously by the residual network and feature pyramid network; the model then performs forward processing based on the balanced feature levels and finally segments the regions where the apples are located. Experimental results verify the hypothesis of the current work and show that the proposed method outperforms other state-of-the-art detection and segmentation models: the AP box and AP mask metric values reach 85.6% and 86.2%, respectively, in a reasonable run time, which can meet the precision and robustness requirements of vision systems in agronomic management.
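As a rough illustration of the Gaussian non-local attention named in this abstract, the sketch below follows the general non-local formulation, in which each position is updated with a weighted sum over all positions and the weights are a softmax of pairwise dot products. The function name, the projection matrices `w_g` and `w_z`, and the flattened `(N, C)` feature layout are assumptions; the abstract does not say how the block is wired into Mask R-CNN.

```python
import numpy as np


def gaussian_non_local(x, w_g, w_z):
    """Gaussian non-local block on flattened features.

    x:   (N, C) features at N spatial positions
    w_g: (C, D) projection applied before aggregation (assumed)
    w_z: (D, C) projection back to C channels (assumed)
    """
    # Pairwise similarity f(x_i, x_j) = exp(x_i . x_j), normalised over j.
    # Subtracting the row maximum keeps the exponentials stable.
    scores = x @ x.T
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)

    # Aggregate projected features from all positions.
    y = attn @ (x @ w_g)

    # Residual connection: the block refines x rather than replacing it.
    return x + y @ w_z
```

Because of the residual connection, setting `w_z` to zeros makes the block an identity mapping, which is one common way to insert such a block into a pretrained network without disturbing it at initialisation.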

    Scale Robust Deep Oriented-text Detection Network

    Text detection is a prerequisite of text recognition, and multi-oriented text detection has recently become a hot topic. The existing multi-oriented text detection methods fall short when facing two issues: 1) text scales vary over a wide range, and 2) there is a foreground-background class imbalance. In this paper, we propose a scale-robust deep multi-oriented text-detection model, which not only has the efficiency of the one-stage deep detection model, but also has accuracy comparable to the two-stage deep text-detection model. We design the feature refining block to fuse multi-scale context features for the purpose of keeping text detection in a higher-resolution feature map. Moreover, in order to mitigate the foreground-background class imbalance, Focal Loss is adopted to up-weight the hard-classified samples. Our method is implemented on four benchmark text datasets: ICDAR2013, ICDAR2015, COCO-Text and MSRA-TD500. The experimental results demonstrate that our method is superior to the existing one-stage deep text-detection models and comparable to the state-of-the-art text detection methods.
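The Focal Loss this abstract adopts has the standard binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below is a generic NumPy version of that formula, not the paper's code; the defaults `alpha=0.25` and `gamma=2.0` are the values commonly used with this loss and are an assumption here, since the abstract does not report its settings.

```python
import numpy as np


def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-12):
    """Per-sample binary focal loss.

    probs:   predicted probability of the positive (text) class
    targets: 1 for text, 0 for background
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets)

    # p_t is the probability assigned to the true class.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)

    # (1 - p_t)^gamma shrinks the loss of well-classified samples,
    # so hard samples dominate the total.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


easy = focal_loss([0.9], [1])  # confident and correct
hard = focal_loss([0.1], [1])  # confident and wrong
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes the effect of `gamma` easy to check in isolation.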

    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding box, a novel data extraction framework was designed, consisting of various processes including XML processing in the case of an existing OCR engine, bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation.
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
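The pattern-based matching step and the key-value output described in this abstract might look roughly like the sketch below. It is a toy illustration only: the field names, the regular expressions in `PATTERNS`, and the sample invoice text are all hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical field patterns; a real system would need many more
# variants per field and per invoice layout.
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b"),
    "total": re.compile(
        r"\btotal\b\s*[:\-]?\s*[£$€]?\s*(\d+(?:[.,]\d{2})?)", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Return only the fields whose pattern matched, as key-value pairs."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


# Made-up OCR output for demonstration.
sample = "Invoice No: INV-2041\nDate: 12/03/2021\nTotal: £149.50"
result = extract_fields(sample)
```

Fields that cannot be matched are simply left out of the result, mirroring the statement that only successfully extracted fields are reported in key-value format.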