Customized mask region based convolutional neural networks for un-uniformed shape text detection and text recognition
In scene images, text carries high-level information that helps to analyze and understand the particular environment. In this paper, we adapt image masking and the original detection pipeline of mask region-based convolutional neural networks (Mask R-CNN) to allow recognition at three levels: sequence, holistic, and pixel-level semantics. In particular, pixel-level and holistic semantics can be used to recognize the texts and define the text shapes, respectively. Specifically, in masking and detection, we segment and recognize both character and word instances. Furthermore, we implement text detection through the outcome of instance segmentation on a 2-D feature space. In addition, to tackle the issues of small and blurry texts, we perform text recognition with an attention-based optical character recognition (OCR) model on top of Mask R-CNN at the sequence level. The OCR module estimates character sequences from the feature maps of word instances in a sequence-to-sequence manner. Finally, we propose a fine-grained learning technique that trains a more accurate and robust model by learning from datasets annotated at the word level. Our proposed approach is evaluated on the popular benchmark datasets ICDAR 2013 and ICDAR 2015.
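The sequence-level read-out described above can be sketched as a single attention step over a flattened word-instance feature map. This is a toy illustration with random features; the decoder state `query` and the shapes are our assumptions, not the authors' implementation:

```python
import numpy as np

def attention_step(feat_map, query):
    """One attention step over a word-instance feature map.

    feat_map: (H*W, D) flattened 2-D features of one word instance
    query:    (D,) decoder state for the current character position
    Returns the context vector used to predict the next character.
    """
    scores = feat_map @ query                 # (H*W,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over spatial positions
    return weights @ feat_map                 # (D,) context vector

rng = np.random.default_rng(0)
feat = rng.standard_normal((12, 4))           # toy 3x4 grid flattened, D=4
ctx = attention_step(feat, rng.standard_normal(4))
print(ctx.shape)                              # (4,)
```

In a full sequence-to-sequence decoder this step would be repeated once per output character, with `query` updated by a recurrent cell.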
Scene Text Synthesis for Efficient and Effective Deep Network Training
A large number of annotated training images is critical for training accurate
and robust deep network models, but collecting such images is often
time-consuming and costly. Image synthesis alleviates this constraint by
generating annotated training images automatically, which has attracted
increasing interest in recent
deep learning research. We develop an innovative image synthesis technique that
composes annotated training images by realistically embedding foreground
objects of interest (OOI) into background images. The proposed technique
consists of two key components that in principle boost the usefulness of the
synthesized images in deep network training. The first is context-aware
semantic coherence which ensures that the OOI are placed around semantically
coherent regions within the background image. The second is harmonious
appearance adaptation which ensures that the embedded OOI are agreeable to the
surrounding background from both geometry alignment and appearance realism. The
proposed technique has been evaluated over two related but very different
computer vision challenges, namely, scene text detection and scene text
recognition. Experiments over a number of public datasets demonstrate the
effectiveness of our proposed image synthesis technique - the use of our
synthesized images in deep network training is capable of achieving similar or
even better scene text detection and scene text recognition performance as
compared with using real images. Comment: 8 pages, 5 figures
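The placement gating described above can be sketched as follows, assuming a binary semantic-coherence mask and simple alpha blending; the function and parameter names are ours, not the paper's:

```python
import numpy as np

def embed_ooi(background, semantic_mask, ooi, top_left, alpha=0.8):
    """Paste an object of interest (OOI) into a background image only if the
    target region is semantically coherent (all pixels allowed by the mask)."""
    y, x = top_left
    h, w = ooi.shape[:2]
    region = semantic_mask[y:y + h, x:x + w]
    if region.shape != (h, w) or not region.all():
        return None  # placement rejected: out of bounds or incoherent region
    out = background.astype(float).copy()
    out[y:y + h, x:x + w] = alpha * ooi + (1 - alpha) * out[y:y + h, x:x + w]
    return out.astype(background.dtype)

bg = np.zeros((6, 6), dtype=np.uint8)
mask = np.zeros((6, 6), dtype=bool)
mask[2:5, 2:5] = True                       # the only coherent region
ooi = np.full((2, 2), 200, dtype=np.uint8)
ok = embed_ooi(bg, mask, ooi, (2, 2))       # accepted placement
bad = embed_ooi(bg, mask, ooi, (0, 0))      # rejected placement -> None
```

The second component of the paper, appearance adaptation, would further adjust the blended pixels (geometry and color) toward the surrounding background; the fixed `alpha` here is only a stand-in.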
Convolutional Neural Networks Refitting by Bootstrapping for Tracking People in a Mobile Robot
Convolutional neural networks are usually fitted with manually labelled data. The labelling process is very time-consuming, since large datasets are required. The use of external hardware may help in some cases, but it also introduces noise into the labelled data. In this paper, we pose a new data-labelling approach that uses bootstrapping to increase the accuracy of the PeTra tool. PeTra allows a mobile robot to estimate people's locations in its environment using a LIDAR sensor and a convolutional neural network. PeTra has some limitations in specific situations, such as scenarios where there are no people. We propose to use the current PeTra release to label the LIDAR data used to fit the convolutional neural network. We have evaluated the resulting system by comparing it with the previous one, where LIDAR data were labelled with a Real Time Location System. The new release increases the MCC-score by 65.97%. The research described in this article has been partially funded by the grant RTI2018-100683-B-I00 funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe"; by Instituto Nacional de Ciberseguridad de España (INCIBE), under the grant "ADENDA 4: Detección de nuevas amenazas y patrones desconocidos (Red Regional de Ciencia y Tecnología)", addendum to the framework agreement INCIBE–Universidad de León, 2019–2021; and by the regional Government of Castilla y León under the grant BDNS (487971).
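The MCC (Matthews correlation coefficient) used to compare the two releases can be computed directly from confusion-matrix counts. A minimal helper (ours, not PeTra code):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]; 0 when any marginal is empty, by convention.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(10, 10, 0, 0))   # 1.0  (perfect classifier)
print(mcc(0, 0, 5, 5))     # -1.0 (everything wrong)
```

Unlike plain accuracy, MCC stays informative on imbalanced data, which matters in scenarios where few or no people are present.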
Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning
Due to the flexible representation of arbitrary-shaped scene text and simple
pipeline, bottom-up segmentation-based methods begin to be mainstream in
real-time scene text detection. Despite great progress, these methods show
deficiencies in robustness and still suffer from false positives and instance
adhesion. Different from existing methods which integrate multiple-granularity
features or multiple outputs, we resort to the perspective of representation
learning in which auxiliary tasks are utilized to enable the encoder to jointly
learn robust features with the main task of per-pixel classification during
optimization. For semantic representation learning, we propose global-dense
semantic contrast (GDSC), in which a vector is extracted for global semantic
representation, then used to perform element-wise contrast with the dense grid
features. To learn instance-aware representation, we propose to combine
top-down modeling (TDM) with the bottom-up framework to provide implicit
instance-level clues for the encoder. With the proposed GDSC and TDM, the
encoder network learns stronger representation without introducing any
parameters and computations during inference. Equipped with a very light
decoder, the detector can achieve more robust real-time scene text detection.
Experimental results on four public datasets show that the proposed method can
outperform or be comparable to the state-of-the-art on both accuracy and speed.
Specifically, the proposed method achieves 87.2% F-measure with 48.2 FPS on
Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500 on a single GeForce
RTX 2080 Ti GPU. Comment: Accepted by ACM MM 2023
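The global-dense contrast can be sketched as: pool a global vector from the encoder grid, then score every grid location against it. Cosine similarity here stands in for the paper's learned contrast, and the shapes are illustrative:

```python
import numpy as np

def gdsc_logits(dense_feats):
    """Global-dense semantic contrast (sketch).

    dense_feats: (H, W, D) encoder grid features.
    Pools a global semantic vector, then scores each grid location against it
    by cosine similarity, yielding an (H, W) contrast map.
    """
    g = dense_feats.mean(axis=(0, 1))                              # global vector (D,)
    g /= np.linalg.norm(g) + 1e-8
    f = dense_feats / (np.linalg.norm(dense_feats, axis=-1,
                                      keepdims=True) + 1e-8)
    return f @ g                                                   # (H, W)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 5, 8))
sim = gdsc_logits(feats)
print(sim.shape)  # (4, 5)
```

During training, a contrastive loss over such a map would push text locations toward the global text representation; at inference the auxiliary branch is dropped, so no extra cost is paid, matching the claim above.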
Towards Robust Curve Text Detection with Conditional Spatial Expansion
It is challenging to detect curve texts due to their irregular shapes and
varying sizes. In this paper, we first investigate the deficiency of the
existing curve detection methods and then propose a novel Conditional Spatial
Expansion (CSE) mechanism to improve the performance of curve text detection.
Instead of regarding the curve text detection as a polygon regression or a
segmentation problem, we treat it as a region expansion process. Our CSE starts
with a seed arbitrarily initialized within a text region and progressively
merges neighborhood regions based on the extracted local features by a CNN and
contextual information of merged regions. The CSE is highly parameterized and
can be seamlessly integrated into existing object detection frameworks.
Enhanced by the data-dependent CSE mechanism, our curve text detection system
provides robust instance-level text region extraction with minimal
post-processing. The analysis experiment shows that our CSE can handle texts
with various shapes, sizes, and orientations, and can effectively suppress the
false-positives coming from text-like textures or unexpected texts included in
the same RoI. Compared with the existing curve text detection algorithms, our
method is more robust and enjoys a simpler processing flow. It also sets a
new state-of-the-art on curve text benchmarks, with an F-score of up to
78.4. Comment: This paper has been accepted by the IEEE International
Conference on Computer Vision and Pattern Recognition (CVPR 2019).
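The region-expansion view can be sketched as seeded growth over a score map, where a fixed threshold stands in for CSE's learned merge decision conditioned on local CNN features and the already-merged region:

```python
from collections import deque

def expand_region(score_map, seed, thr=0.5):
    """Grow a text region from a seed by merging 4-neighbours whose score
    passes thr -- a toy stand-in for the conditional merge step in CSE."""
    h, w = len(score_map), len(score_map[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in region
                    and score_map[ny][nx] >= thr):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region

scores = [
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.8, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.9],
]
region = expand_region(scores, (0, 1))
print(len(region))  # 4 -- the isolated high score at (2, 3) is not merged
```

Because expansion only follows connected neighbours, a text-like texture elsewhere in the same RoI (here the cell at (2, 3)) is naturally excluded, which mirrors the false-positive suppression described above.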
Real-time Scene Text Detection with Differentiable Binarization
Recently, segmentation-based methods have become quite popular in scene text
detection, as segmentation results can more accurately describe scene text of
various shapes, such as curved text. However, the post-processing step of
binarization is essential for segmentation-based detection, which converts
probability maps produced by a segmentation method into bounding boxes/regions
of text. In this paper, we propose a module named Differentiable Binarization
(DB), which can perform the binarization process in a segmentation network.
Optimized along with a DB module, a segmentation network can adaptively set the
thresholds for binarization, which not only simplifies the post-processing but
also enhances the performance of text detection. Based on a simple segmentation
network, we validate the performance improvements of DB on five benchmark
datasets, which consistently achieves state-of-the-art results, in terms of
both detection accuracy and speed. In particular, with a light-weight backbone,
the performance improvements by DB are significant so that we can look for an
ideal tradeoff between detection accuracy and efficiency. Specifically, with a
backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at
62 FPS, on the MSRA-TD500 dataset. Code is available at:
https://github.com/MhLiao/DB
Comment: Accepted to AAAI 2020
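The binarization step itself is a steep sigmoid over the gap between the probability map and the learned threshold map (k = 50 in the paper); a minimal sketch with toy inputs:

```python
import numpy as np

def differentiable_binarization(P, T, k=50.0):
    """Approximate, differentiable binarization from DB: a steep sigmoid of
    P - T, so the threshold map T can be trained end-to-end with the network
    instead of being a fixed post-processing cutoff."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

P = np.array([0.9, 0.4, 0.1])   # probability map (toy, flattened)
T = np.array([0.3, 0.3, 0.3])   # learned threshold map (toy)
B = differentiable_binarization(P, T)
# B is ~1 where P clearly exceeds T, ~0 where it falls clearly below
```

At inference the same map can be cut at a hard threshold, but during training the smooth version lets gradients flow through the binarization, which is what allows the network to adapt T per pixel.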