Mask R-CNN with Pyramid Attention Network for Scene Text Detection
In this paper, we present a new Mask R-CNN based text detection approach
which can robustly detect multi-oriented and curved text from natural scene
images in a unified manner. To enhance the feature representation ability of
Mask R-CNN for text detection tasks, we propose to use the Pyramid Attention
Network (PAN) as a new backbone network of Mask R-CNN. Experiments demonstrate
that PAN can suppress false alarms caused by text-like backgrounds more
effectively. Our proposed approach has achieved superior performance on both
multi-oriented (ICDAR-2015, ICDAR-2017 MLT) and curved (SCUT-CTW1500) text
detection benchmarks using only single-scale, single-model testing.
Comment: Accepted by WACV 201
Pyramid Mask Text Detector
Scene text detection, an essential step of a scene text recognition system, is
to locate text instances in natural scene images automatically. Some recent
attempts benefiting from Mask R-CNN formulate scene text detection task as an
instance segmentation problem and achieve remarkable performance. In this
paper, we present a new Mask R-CNN based framework named Pyramid Mask Text
Detector (PMTD) to handle scene text detection. Instead of the binary text mask
generated by existing Mask R-CNN based methods, our PMTD performs
pixel-level regression under the guidance of location-aware supervision,
yielding a more informative soft text mask for each text instance. As for the
generation of text boxes, PMTD reinterprets the obtained 2D soft mask into 3D
space and introduces a novel plane clustering algorithm to derive the optimal
text box on the basis of 3D shape. Experiments on standard datasets demonstrate
that the proposed PMTD brings consistent and noticeable gain and clearly
outperforms state-of-the-art methods. Specifically, it achieves an F-measure of
80.13% on the ICDAR 2017 MLT dataset.
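The "2D soft mask reinterpreted in 3D space" idea can be pictured as treating the mask's (x, y, score) values as a 3D point cloud and fitting a plane to it. Below is a minimal least-squares sketch assuming a simple z = a·x + b·y + c model; the actual PMTD plane-clustering algorithm is more involved than this.

```python
import numpy as np

def fit_score_plane(soft_mask):
    """Fit a plane z = a*x + b*y + c to the (x, y, score) point cloud of a
    soft text mask by least squares. Illustrative only: PMTD's real
    plane-clustering step goes beyond a single global fit."""
    h, w = soft_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, soft_mask.ravel(), rcond=None)
    return coeffs  # (a, b, c)

# A synthetic soft mask that is exactly planar: score = 0.01*x + 0.02*y + 0.1
h, w = 16, 16
ys, xs = np.mgrid[0:h, 0:w]
mask = 0.01 * xs + 0.02 * ys + 0.1
a, b, c = fit_score_plane(mask)
print(round(a, 4), round(b, 4), round(c, 4))  # 0.01 0.02 0.1
```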
TextBoxes++: A Single-Shot Oriented Scene Text Detector
Scene text detection is an important step in a scene text recognition system
and also a challenging problem. Different from general object detection, the
main challenges of scene text detection lie in the arbitrary orientations, small
sizes, and widely varying aspect ratios of text in natural images. In
this paper, we present an end-to-end trainable fast scene text detector, named
TextBoxes++, which detects arbitrary-oriented scene text with both high
accuracy and efficiency in a single network forward pass. No post-processing
other than an efficient non-maximum suppression is involved. We have evaluated
the proposed TextBoxes++ on four public datasets. In all experiments,
TextBoxes++ outperforms competing methods in terms of text localization
accuracy and runtime. More specifically, TextBoxes++ achieves an f-measure of
0.817 at 11.6 fps for 1024×1024 ICDAR 2015 Incidental text images, and an
f-measure of 0.5591 at 19.8 fps for 768×768 COCO-Text images. Furthermore,
combined with a text recognizer, TextBoxes++ significantly outperforms the
state-of-the-art approaches for word spotting and end-to-end text recognition
tasks on popular benchmarks. Code is available at:
https://github.com/MhLiao/TextBoxes_plusplus
Comment: 15 pages
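The abstract states that the only post-processing is an efficient non-maximum suppression. For reference, here is a standard axis-aligned NMS sketch; note that TextBoxes++ itself handles oriented boxes, which would require a rotated-IoU variant of this routine.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression for axis-aligned boxes.
    boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```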
Learning Markov Clustering Networks for Scene Text Detection
A novel framework named Markov Clustering Network (MCN) is proposed for fast
and robust scene text detection. MCN predicts instance-level bounding boxes by
firstly converting an image into a Stochastic Flow Graph (SFG) and then
performing Markov Clustering on this graph. Our method can detect text objects
with arbitrary size and orientation without prior knowledge of object size. The
stochastic flow graph encodes objects' local correlations and semantic
information. An object is modeled as strongly connected nodes, which allows
flexible bottom-up detection for scale-varying and rotated objects. MCN
generates bounding boxes without using Non-Maximum Suppression, and it can be
fully parallelized on GPUs. The evaluation on public benchmarks shows that our
method outperforms existing methods by a large margin in detecting
multi-oriented text objects. MCN achieves new state-of-the-art performance on
the challenging MSRA-TD500 dataset with a precision of 0.88, recall of 0.79,
and F-score of 0.83. MCN also achieves real-time inference at a frame rate of
34 FPS, a speedup over the fastest existing scene text detection algorithms.
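The clustering step MCN builds on is the classic Markov Clustering (MCL) procedure: alternate expansion (matrix power) and inflation (elementwise power followed by column normalization) on a column-stochastic flow matrix until it converges. The sketch below shows the generic algorithm only, not MCN's learned stochastic flow graph.

```python
import numpy as np

def markov_cluster(adj, expansion=2, inflation=2.0, iters=30):
    """Plain Markov Clustering (MCL) on an adjacency matrix.
    Generic algorithm for illustration; MCN applies this idea to a
    learned Stochastic Flow Graph rather than a raw adjacency matrix."""
    m = adj + np.eye(adj.shape[0])           # self-loops stabilise MCL
    m = m / m.sum(axis=0, keepdims=True)     # make column-stochastic
    for _ in range(iters):
        m = np.linalg.matrix_power(m, expansion)  # expansion step
        m = m ** inflation                        # inflation step
        m = m / m.sum(axis=0, keepdims=True)      # renormalise columns
    # Nodes attracted to the same row end up in the same cluster
    return m.argmax(axis=0)

# Two obvious clusters: {0, 1, 2} fully connected, {3, 4} connected
adj = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    adj[u, v] = adj[v, u] = 1.0
labels = markov_cluster(adj)
```

Because flow never crosses between disconnected components, the two groups necessarily receive different labels, which is what lets instance boxes be read off without NMS.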
TextCohesion: Detecting Text for Arbitrary Shapes
In this paper, we propose a pixel-wise method named TextCohesion for scene
text detection, which splits a text instance into five key components: a Text
Skeleton and four Directional Pixel Regions. These components are easier to
handle than the entire text instance. A confidence scoring mechanism is
designed to filter out background regions that merely resemble text. Our method can
integrate text contexts intensively when backgrounds are complex. Experiments
on two challenging curved-text benchmarks demonstrate that TextCohesion
outperforms state-of-the-art methods, achieving F-measures of 84.6% on
Total-Text and 86.3% on SCUT-CTW1500.
Comment: Scene Text Detection, Instance Segmentation
Correlation Propagation Networks for Scene Text Detection
In this work, we propose a novel hybrid method for scene text detection
namely Correlation Propagation Network (CPN). It is an end-to-end trainable
framework engined by advanced Convolutional Neural Networks. Our CPN predicts
text objects according to both top-down observations and the bottom-up cues.
Multiple candidate boxes are assembled by a spatial communication mechanism
called Correlation Propagation (CP). The spatial features extracted by the CNN
are regarded as node features in a latticed graph, and the Correlation
Propagation algorithm runs distributively on each node to update the hypothesis
of the corresponding object center. The CP process can flexibly handle scale-varying
and rotated text objects without using predefined bounding box templates.
Benefiting from its distributive nature, CPN is computationally efficient and
enjoys a high level of parallelism. Moreover, we introduce deformable
convolution to the backbone network to enhance the adaptability to long texts.
The evaluation on public benchmarks shows that the proposed method achieves
state-of-the-art performance, and it significantly outperforms existing methods
in handling multi-scale and multi-oriented text objects at much lower
computational cost.
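The abstract leaves the CP update rule unspecified. As a loose sketch of the general idea of each node distributively refining a center hypothesis by exchanging information with its lattice neighbors, here is a simple neighbor-averaging scheme; every name and the averaging rule itself are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def propagate_centers(centers, adj, iters=10):
    """Each node repeatedly averages its object-center hypothesis with
    those of its graph neighbours. A hypothetical stand-in for the kind
    of distributed per-node update CP performs; not the actual algorithm."""
    for _ in range(iters):
        deg = adj.sum(axis=1, keepdims=True) + 1.0  # +1 for the node itself
        centers = (centers + adj @ centers) / deg
    return centers

# Two connected nodes voting for nearby centers converge toward agreement
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
adj = np.array([[0, 1], [1, 0]], float)
out = propagate_centers(centers, adj)
print(np.allclose(out[0], out[1], atol=1e-2))  # True
```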
PuzzleNet: Scene Text Detection by Segment Context Graph Learning
Recently, a series of decomposition-based scene text detection methods have
achieved impressive progress by decomposing challenging text regions into
pieces and linking them in a bottom-up manner. However, most of them merely
focus on linking independent text pieces, while the context information is
underexploited. In a puzzle game, the solver puts pieces together in a
logical way according to the contextual information of each piece, in order to
arrive at the correct solution. Inspired by this, we propose a novel
decomposition-based method, termed Puzzle Networks (PuzzleNet), to address the
challenging scene text detection task in this work. PuzzleNet consists of the
Segment Proposal Network (SPN) that predicts the candidate text segments
fitting arbitrary shape of text region, and the two-branch Multiple-Similarity
Graph Convolutional Network (MSGCN) that models both appearance and geometry
correlations between each segment and its contextual ones. By building segments
as context graphs, MSGCN effectively employs segment context to predict
combinations of segments. Final detections of polygon shape are produced by
merging segments according to the predicted combinations. Evaluations on three
benchmark datasets, ICDAR15, MSRA-TD500, and SCUT-CTW1500, have demonstrated
that our method achieves performance better than or comparable to current
state-of-the-art methods, benefiting from the exploitation of segment context
graphs.
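The abstract does not specify MSGCN's layers. The basic building block of any segment-context graph network is a graph-convolution propagation step, shown below in its standard normalized form D^(-1/2)(A+I)D^(-1/2)·X·W; this generic layer is an assumption for illustration, not the actual two-branch MSGCN.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One generic graph-convolution step with symmetric normalisation
    and ReLU: the kind of context aggregation a segment graph builds on.
    Not the actual MSGCN, whose details go beyond the abstract."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0, norm @ feats @ weight)   # propagate + ReLU

# Three text segments; segments 0 and 1 are adjacent, segment 2 is isolated
adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], float)
feats = np.eye(3)   # one-hot segment features, for illustration
w = np.eye(3)       # identity weights, for illustration
out = gcn_layer(adj, feats, w)
print(out.shape)  # (3, 3)
```

After one step, the isolated segment's output depends only on its own feature, while the two adjacent segments mix each other's features, which is exactly the "context" the linking decision can exploit.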
Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation
Previous deep learning based state-of-the-art scene text detection methods
can be roughly classified into two categories. The first category treats scene
text as a type of general object and follows the general object detection
paradigm, localizing scene text by regressing text box locations, but it is
troubled by the arbitrary orientations and large aspect ratios of scene text.
The second segments text regions directly but mostly needs complex
post-processing. In
this paper, we present a method that combines the ideas of the two types of
methods while avoiding their shortcomings. We propose to detect scene text by
localizing corner points of text bounding boxes and segmenting text regions in
relative positions. In the inference stage, candidate boxes are generated by
sampling and grouping corner points, which are further scored by segmentation
maps and suppressed by NMS. Compared with previous methods, our method handles
long oriented text naturally and needs no complex post-processing.
The experiments on ICDAR2013, ICDAR2015, MSRA-TD500, MLT and COCO-Text
demonstrate that the proposed algorithm achieves better or comparable results
in both accuracy and efficiency. Based on VGG16, it achieves an F-measure of
84.3% on ICDAR2015 and 81.5% on MSRA-TD500.
Comment: To appear in CVPR201
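The "scored by segmentation maps" step can be illustrated by rating each candidate box with the mean segmentation response inside it. The paper actually uses position-sensitive segmentation maps in relative positions; the single-map mean below is a deliberate simplification.

```python
import numpy as np

def score_box(seg_map, box):
    """Score a candidate box by the mean segmentation response inside it.
    Simplified sketch: the paper scores boxes with position-sensitive
    maps, not a single global map."""
    x1, y1, x2, y2 = box
    region = seg_map[y1:y2, x1:x2]
    return float(region.mean()) if region.size else 0.0

seg = np.zeros((20, 20))
seg[5:15, 5:15] = 1.0                    # ground-truth-like text region
print(score_box(seg, (5, 5, 15, 15)))    # 1.0  (box matches the text region)
print(score_box(seg, (0, 0, 5, 5)))      # 0.0  (box covers only background)
```

Candidates assembled from spurious corner groupings land mostly on background, score near zero, and are discarded before NMS.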
SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects
Object detection has been a building block in computer vision. Though
considerable progress has been made, there still exist challenges for objects
with small sizes, arbitrary orientations, and dense distributions. Apart from
natural images, such issues are especially pronounced in aerial images, which
are of great practical importance. This paper presents a novel multi-category rotation detector
for small, cluttered and rotated objects, namely SCRDet. Specifically, a
sampling fusion network is devised that fuses multi-layer features with
effective anchor sampling, to improve the sensitivity to small objects.
Meanwhile, a supervised pixel attention network and a channel attention
network are jointly explored for small and cluttered object detection by
suppressing noise and highlighting object features. For more accurate
rotation estimation, an IoU constant factor is added to the smooth L1 loss to
address the boundary problem of rotated bounding boxes. Extensive
experiments on two public remote sensing datasets, DOTA and NWPU VHR-10, as
well as the natural image datasets COCO and VOC2007 and the scene text dataset
ICDAR2015 show the
state-of-the-art performance of our detector. The code and models will be
available at https://github.com/DetectionTeamUCAS.
Comment: 10 pages, 10 figures, 6 tables, ICCV201
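The IoU-constant-factor idea can be sketched as a smooth L1 regression loss whose magnitude is scaled by a factor derived from the IoU between the predicted and ground-truth boxes, so poorly overlapping predictions are penalized more. The −log(IoU) weighting below is a rough illustration; the exact formulation in the paper differs.

```python
import numpy as np

def smooth_l1(x):
    """Standard elementwise smooth L1 (Huber) penalty."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def iou_smooth_l1(pred_offsets, target_offsets, iou):
    """Smooth L1 regression loss scaled by an IoU-derived factor, in the
    spirit of SCRDet's IoU constant factor. Rough sketch only: the -log(IoU)
    weighting is an assumption, not the paper's exact formula."""
    base = smooth_l1(pred_offsets - target_offsets).sum()
    return base * -np.log(max(iou, 1e-6))   # clamp avoids log(0)

# Same offset error, but a worse-overlapping box is penalised harder
loss_good = iou_smooth_l1(np.array([0.1, 0.1]), np.zeros(2), iou=0.9)
loss_bad = iou_smooth_l1(np.array([0.1, 0.1]), np.zeros(2), iou=0.3)
print(loss_bad > loss_good)  # True
```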
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT)