FOTS: Fast Oriented Text Spotting with a Unified Network
Incidental scene text spotting is considered one of the most difficult and
valuable challenges in the document analysis community. Most existing methods
treat text detection and recognition as separate tasks. In this work, we
propose a unified end-to-end trainable Fast Oriented Text Spotting (FOTS)
network for simultaneous detection and recognition, sharing computation and
visual information among the two complementary tasks. Specifically, RoIRotate is
introduced to share convolutional features between detection and recognition.
Benefiting from this convolution-sharing strategy, FOTS incurs little computational
overhead compared with the baseline text detection network, and the joint training
scheme learns more generic features, allowing our method to outperform these
two-stage methods. Experiments on the ICDAR 2015, ICDAR 2017 MLT, and ICDAR
2013 datasets demonstrate that the proposed method significantly outperforms
state-of-the-art methods, which further allows us to develop the first real-time
oriented text spotting system, surpassing all previous state-of-the-art
results by more than 5% on the ICDAR 2015 text spotting task while running at
22.6 fps.
Comment: 10 pages, 6 figures
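The core idea behind RoIRotate, as the abstract describes it, is to resample each oriented text region of the shared feature map into an axis-aligned, height-normalized crop so the recognition branch can consume it. A minimal NumPy sketch of that resampling idea follows; the function name, the single-channel simplification, and the bilinear sampler are our own illustration, not the paper's implementation:

```python
import numpy as np

def roi_rotate(feature_map, center, size, angle, out_h=8):
    """Sample an oriented RoI from a 2-D feature map into an
    axis-aligned crop of fixed height, using bilinear interpolation.

    feature_map : (H, W) array (one channel for simplicity)
    center      : (cx, cy) RoI center in feature-map coordinates
    size        : (w, h) RoI width/height before rotation
    angle       : RoI rotation in radians (counter-clockwise)
    out_h       : output height; width scales to keep aspect ratio
    """
    cx, cy = center
    w, h = size
    out_w = max(1, int(round(w * out_h / h)))
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    H, W = feature_map.shape
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # normalized offsets inside the RoI, centered at 0
            u = (j + 0.5) / out_w - 0.5   # along RoI width
            v = (i + 0.5) / out_h - 0.5   # along RoI height
            # rotate offsets back into feature-map coordinates
            x = cx + u * w * cos_a - v * h * sin_a
            y = cy + u * w * sin_a + v * h * cos_a
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            if 0 <= x0 < W - 1 and 0 <= y0 < H - 1:
                fx, fy = x - x0, y - y0
                out[i, j] = (feature_map[y0, x0] * (1 - fx) * (1 - fy)
                             + feature_map[y0, x0 + 1] * fx * (1 - fy)
                             + feature_map[y0 + 1, x0] * (1 - fx) * fy
                             + feature_map[y0 + 1, x0 + 1] * fx * fy)
    return out
```

Because the sampling is differentiable in practice (the paper trains end-to-end), a real implementation would use a GPU affine-grid sampler over all channels rather than this explicit loop.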
Visual saliency computation for image analysis
Visual saliency computation is about detecting and understanding salient regions and elements in a visual scene. Algorithms for visual saliency computation can give clues to where people will look in images, what objects are visually prominent in a scene, etc. Such algorithms could be useful in a wide range of applications in computer vision and graphics. In this thesis, we study the following visual saliency computation problems.

1) Eye Fixation Prediction. Eye fixation prediction aims to predict where people look in a visual scene. For this problem, we propose a Boolean Map Saliency (BMS) model which leverages the global surroundedness cue using a Boolean map representation. We draw a theoretical connection between BMS and the Minimum Barrier Distance (MBD) transform to provide insight into our algorithm. Experimental results show that BMS compares favorably with state-of-the-art methods on seven benchmark datasets.

2) Salient Region Detection. Salient region detection entails computing a saliency map that highlights the regions of dominant objects in a scene. We propose a salient region detection method based on the MBD transform. We present a fast approximate MBD transform algorithm with an error-bound analysis. Powered by this fast MBD transform algorithm, our method can run at about 80 FPS and achieves state-of-the-art performance on four benchmark datasets.

3) Salient Object Detection. Salient object detection aims to localize each salient object instance in an image. We propose a method using a Convolutional Neural Network (CNN) model for proposal generation and a novel subset optimization formulation for bounding box filtering. In experiments, our subset optimization formulation consistently outperforms heuristic bounding box filtering baselines, such as non-maximum suppression, and our method substantially outperforms previous methods on three challenging datasets.

4) Salient Object Subitizing. We propose a new visual saliency computation task, called Salient Object Subitizing, which is to predict the existence and the number of salient objects in an image using holistic cues. To this end, we present an image dataset of about 14K everyday images annotated using an online crowdsourcing marketplace. We show that an end-to-end trained CNN subitizing model can achieve promising performance without requiring any localization process. A method is proposed to further improve the training of the CNN subitizing model by leveraging synthetic images.

5) Top-down Saliency Detection. Unlike the aforementioned tasks, top-down saliency detection entails generating task-specific saliency maps. We propose a weakly supervised top-down saliency detection approach by modeling the top-down attention of a CNN image classifier. We propose Excitation Backprop and the concept of contrastive attention to generate highly discriminative top-down saliency maps. Our top-down saliency detection method achieves superior performance in weakly supervised localization tasks on challenging datasets. The usefulness of our method is further validated in the text-to-region association task, where it achieves state-of-the-art performance using only weakly labeled web images for training.
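The "global surroundedness cue" that BMS leverages can be illustrated concretely: threshold the image at several levels to get Boolean maps, and let pixels whose connected region does not touch the image border (i.e., is surrounded) accumulate attention. The following toy sketch captures that idea under our own simplifications (fixed threshold count, uniform normalization); it is not the thesis's implementation:

```python
import numpy as np
from collections import deque

def bms_saliency(gray, n_thresh=8):
    """Toy Boolean Map Saliency: threshold a grayscale image at
    several levels; in each Boolean map (and its complement), pixels
    belonging to regions that do NOT touch the border are "surrounded"
    and receive an attention vote."""
    H, W = gray.shape
    attention = np.zeros((H, W), dtype=float)
    levels = np.linspace(gray.min(), gray.max(), n_thresh + 2)[1:-1]
    for t in levels:
        for bmap in (gray >= t, gray < t):
            # flood-fill from the border: everything reached is NOT surrounded
            reach = np.zeros((H, W), dtype=bool)
            q = deque()
            for y in range(H):
                for x in range(W):
                    if (y in (0, H - 1) or x in (0, W - 1)) and bmap[y, x]:
                        reach[y, x] = True
                        q.append((y, x))
            while q:
                y, x = q.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W and bmap[ny, nx] and not reach[ny, nx]:
                        reach[ny, nx] = True
                        q.append((ny, nx))
            attention += bmap & ~reach   # surrounded pixels vote
    peak = attention.max()
    return attention / peak if peak > 0 else attention
```

A bright object on a darker background is surrounded in every map where it is thresholded out, so it accumulates votes at all levels, while the border-touching background never does.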
A Human Eye-based Text Color Scheme Generation Method for Image Synthesis
Synthetic data used for scene text detection and recognition tasks have
proven effective. However, two problems remain. First, the color schemes
used for text coloring in existing methods are relatively fixed color
key-value pairs learned from real datasets; dirty data in those datasets
can make the colors of text and background too similar to distinguish.
Second, generated text is uniformly limited to a single depth of a picture,
whereas in the real world text may appear across depths. To address these
problems, in this paper we design a novel method to generate color schemes
that are consistent with how human eyes perceive a scene. The advantages of
our method are as follows: (1) it overcomes the color confusion between text
and background caused by dirty data; (2) generated text can appear in most
locations of any image, even across depths; (3) it avoids analyzing the depth
of the background, so its performance exceeds that of state-of-the-art
methods; (4) image generation is fast, at roughly one picture every three
milliseconds. The effectiveness of our method is verified on several public datasets.
Comment: Accepted by EITCE 2022, No.QJE77JVOL
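The abstract does not spell out its human-eye-based coloring procedure, but the underlying constraint it addresses — text must remain distinguishable from its background — can be illustrated with a standard perceptual heuristic: score candidate text colors by WCAG-style relative-luminance contrast against the background and keep the best. This is a stand-in sketch, not the authors' method:

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color (0-255 channels)."""
    def lin(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(float(c)) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio between two colors; ranges from 1 to 21."""
    hi, lo = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

def pick_text_color(background, candidates):
    """Choose the candidate color with the highest contrast against
    the background (a simple guard against text/background confusion)."""
    return max(candidates, key=lambda c: contrast_ratio(c, background))
```

A synthesis pipeline could apply such a check per text placement region, rejecting sampled color pairs below a contrast threshold instead of trusting color pairs mined from possibly dirty data.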
EAST: An Efficient and Accurate Scene Text Detector
Previous approaches for scene text detection have already achieved promising
performance across various benchmarks. However, they usually fall short when
dealing with challenging scenarios, even when equipped with deep neural network
models, because the overall performance is determined by the interplay of
multiple stages and components in the pipelines. In this work, we propose a
simple yet powerful pipeline that yields fast and accurate text detection in
natural scenes. The pipeline directly predicts words or text lines of arbitrary
orientations and quadrilateral shapes in full images, eliminating unnecessary
intermediate steps (e.g., candidate aggregation and word partitioning), with a
single neural network. The simplicity of our pipeline allows concentrating
efforts on designing loss functions and neural network architecture.
Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500
demonstrate that the proposed algorithm significantly outperforms
state-of-the-art methods in terms of both accuracy and efficiency. On the ICDAR
2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2 fps
at 720p resolution.
Comment: Accepted to CVPR 2017, fix equation (3)
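EAST's direct prediction of "words or text lines of arbitrary orientations" is typically realized as a per-pixel rotated-box (RBOX) geometry: each pixel predicts its distances to the four sides of the enclosing rectangle plus a rotation angle. The decoding step can be sketched as below; variable names and the clockwise-from-top-left corner convention are our assumptions, not taken from the paper:

```python
import numpy as np

def decode_rbox(px, py, d_top, d_right, d_bottom, d_left, theta):
    """Turn one pixel's RBOX prediction (distances to the four box
    edges plus a rotation angle in radians) into the four corners of
    the oriented rectangle, clockwise from top-left.

    In the box's local frame the pixel sits at the origin, so the
    corners are fixed by the four distances; rotating by theta and
    translating by (px, py) places them in image coordinates.
    """
    local = np.array([
        [-d_left,  -d_top],     # top-left
        [ d_right, -d_top],     # top-right
        [ d_right,  d_bottom],  # bottom-right
        [-d_left,   d_bottom],  # bottom-left
    ], dtype=float)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s],
                    [s,  c]])
    return local @ rot.T + np.array([px, py], dtype=float)
```

In a full detector, every pixel above a score threshold decodes a candidate box this way, and the candidates are then merged (the paper uses locality-aware NMS for this step).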