Learning Markov Clustering Networks for Scene Text Detection
A novel framework named Markov Clustering Network (MCN) is proposed for fast
and robust scene text detection. MCN predicts instance-level bounding boxes by
first converting an image into a Stochastic Flow Graph (SFG) and then
performing Markov Clustering on this graph. Our method can detect text objects
of arbitrary size and orientation without prior knowledge of object size. The
stochastic flow graph encodes objects' local correlations and semantic
information. An object is modeled as a set of strongly connected nodes, which allows
flexible bottom-up detection for scale-varying and rotated objects. MCN
generates bounding boxes without using Non-Maximum Suppression, and it can be
fully parallelized on GPUs. The evaluation on public benchmarks shows that our
method outperforms the existing methods by a large margin in detecting
multi-oriented text objects. MCN achieves new state-of-the-art performance on
the challenging MSRA-TD500 dataset with a precision of 0.88, recall of 0.79 and
F-score of 0.83. MCN also achieves real-time inference at a frame rate of 34
FPS, a speedup compared with the fastest existing scene text detection
algorithm.
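As an illustration of the clustering step the abstract refers to, here is a
minimal sketch of classic Markov Clustering (expansion and inflation on a
stochastic matrix). It does not reproduce how MCN learns the Stochastic Flow
Graph itself; the matrix A, parameters, and thresholds are illustrative
assumptions.

```python
import numpy as np

def markov_cluster(A, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    """Classic Markov Clustering on a non-negative flow/adjacency matrix A.
    Alternates expansion (matrix power) and inflation (elementwise power),
    re-normalizing columns so the matrix stays stochastic."""
    M = A / A.sum(axis=0, keepdims=True)          # column-stochastic start
    for _ in range(iters):
        prev = M
        M = np.linalg.matrix_power(M, expansion)  # expansion: flow spreads along edges
        M = M ** inflation                        # inflation: strengthen strong flows
        M = M / M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:
            break
    # Rows that keep significant mass act as attractors; the columns attached
    # to an attractor form one cluster (here, one candidate text instance).
    return [np.flatnonzero(row > 1e-3) for row in M if row.max() > 1e-3]
```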
Word Searching in Scene Image and Video Frame in Multi-Script Scenario using Dynamic Shape Coding
Retrieval of text information from natural scene images and video frames is a
challenging task due to inherent problems such as complex character shapes,
low resolution, background noise, etc. Available OCR systems often fail to
retrieve such information in scene/video frames. Keyword spotting, an
alternative way to retrieve information, performs efficient text searching in
such scenarios. However, current word spotting techniques in scene/video images
are script-specific and they are mainly developed for Latin script. This paper
presents a novel word spotting framework using dynamic shape coding for text
retrieval in natural scene image and video frames. The framework is designed to
search a query keyword across multiple scripts with the help of on-the-fly
script-wise keyword generation for the corresponding script. We have used a
two-stage word spotting approach using Hidden Markov Model (HMM) to detect the
translated keyword in a given text line by identifying the script of the line.
A novel unsupervised scheme based on dynamic shape coding has been used to
group characters of similar shape, avoiding confusion and improving text alignment.
Next, the hypotheses locations are verified to improve retrieval performance.
To evaluate the proposed system for searching keywords in natural scene images
and video frames, we have considered two popular Indic scripts, Bangla
(Bengali) and Devanagari, along with English. Inspired by the zone-wise
recognition approach in Indic scripts [1], zone-wise text information has been
used to improve the traditional word spotting performance in Indic scripts. For
our experiment, a dataset consisting of images of different scenes and video
frames in English, Bangla and Devanagari scripts was considered. The results
obtained demonstrate the effectiveness of the proposed word spotting approach.
Comment: Multimedia Tools and Applications, Springer
Detecting Text in the Wild with Deep Character Embedding Network
Most text detection methods hypothesize that text is horizontal or
multi-oriented and thus define quadrangles as the basic detection unit.
However, text in the wild is often perspectively distorted or curved, which
cannot be easily handled by existing approaches. In this paper, we propose a
deep character embedding network (CENet) which simultaneously predicts the
bounding boxes of characters and their embedding vectors, thus making text
detection a simple clustering task in the character embedding space. The
proposed method does not require the strong assumption that text forms a
straight line, which provides flexibility for arbitrarily curved or
perspectively distorted text. For the character detection task, a dense
prediction subnetwork is designed to obtain the confidence scores and bounding
boxes of characters. For the character embedding task, a subnetwork is trained
with a contrastive loss to project detected characters into the embedding
space. The two tasks share a
backbone CNN from which the multi-scale feature maps are extracted. The final
text regions can then be obtained by thresholding the character confidence
and the embedding distance of character pairs. We evaluated our method
on ICDAR13, ICDAR15, MSRA-TD500, and Total-Text. The proposed method achieves
state-of-the-art or comparable performance on all these datasets, and shows
a substantial improvement on the irregular-text dataset, i.e., Total-Text.
Comment: Asian Conference on Computer Vision 2018
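The grouping step described above (thresholding character confidence and
pairwise embedding distances) is essentially a connected-components
computation. A minimal sketch follows; the thresholds and the union-find
grouping are illustrative choices, not the released implementation.

```python
import numpy as np

def group_characters(boxes, embeddings, conf, conf_thr=0.5, dist_thr=1.0):
    """Group detected characters into text regions by keeping confident
    detections and linking pairs whose embedding distance is small."""
    keep = [i for i in range(len(boxes)) if conf[i] >= conf_thr]
    parent = {i: i for i in keep}

    def find(x):                                  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in keep:
        for b in keep:
            if a < b and np.linalg.norm(embeddings[a] - embeddings[b]) < dist_thr:
                parent[find(a)] = find(b)         # same text instance

    groups = {}
    for i in keep:
        groups.setdefault(find(i), []).append(boxes[i])
    return list(groups.values())                  # one list of boxes per text region
```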
Correlation Propagation Networks for Scene Text Detection
In this work, we propose a novel hybrid method for scene text detection,
named Correlation Propagation Network (CPN). It is an end-to-end trainable
framework powered by advanced Convolutional Neural Networks. Our CPN predicts
text objects according to both top-down observations and bottom-up cues.
Multiple candidate boxes are assembled by a spatial communication mechanism
called Correlation Propagation (CP). The spatial features extracted by the CNN
are regarded as node features in a lattice graph, and the Correlation
Propagation algorithm runs distributively on each node to update the hypothesis
of the corresponding object center. The CP process can flexibly handle
scale-varying and rotated text objects without using predefined bounding box
templates. Benefiting from its distributive nature, CPN is computationally
efficient and
enjoys a high level of parallelism. Moreover, we introduce deformable
convolution to the backbone network to enhance the adaptability to long texts.
The evaluation on public benchmarks shows that the proposed method achieves
state-of-the-art performance, and that it significantly outperforms existing
methods in handling multi-scale and multi-oriented text objects at a much lower
computation cost.
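The abstract does not detail the node update, so the following is only a
schematic of the general idea of distributive propagation on a lattice: every
cell keeps an offset hypothesis toward its object center and repeatedly
averages it with neighbour hypotheses weighted by correlation scores. The
tensor shapes, the 4-neighbour scheme, and the weighting are assumptions for
illustration, not the paper's algorithm.

```python
import numpy as np

def propagate_centers(offsets, correlation, iters=5):
    """Schematic lattice propagation. offsets: (H, W, 2) per-cell (dy, dx) toward
    the hypothesized object center; correlation: (H, W, 4) weights toward the
    neighbours in the order (up, down, left, right). Borders wrap for brevity."""
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(iters):
        acc = offsets.copy()
        wsum = np.ones(offsets.shape[:2] + (1,))
        for k, (dy, dx) in enumerate(neighbours):
            # hypothesis of the neighbour at (y+dy, x+dx), re-expressed in this cell's frame
            nb = np.roll(offsets, shift=(-dy, -dx), axis=(0, 1)) + np.array([dy, dx])
            w = correlation[..., k:k + 1]
            acc += w * nb
            wsum += w
        offsets = acc / wsum                      # weighted average of hypotheses
    return offsets
```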
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT)
Pyramid Mask Text Detector
Scene text detection, an essential step in scene text recognition systems,
aims to locate text instances in natural scene images automatically. Some
recent attempts benefiting from Mask R-CNN formulate the scene text detection
task as an
instance segmentation problem and achieve remarkable performance. In this
paper, we present a new Mask R-CNN based framework named Pyramid Mask Text
Detector (PMTD) to handle scene text detection. Instead of the binary text
mask generated by existing Mask R-CNN based methods, our PMTD performs
pixel-level regression under the guidance of location-aware supervision,
yielding a more informative soft text mask for each text instance. As for the
generation of text boxes, PMTD reinterprets the obtained 2D soft mask into 3D
space and introduces a novel plane clustering algorithm to derive the optimal
text box on the basis of 3D shape. Experiments on standard datasets demonstrate
that the proposed PMTD brings consistent and noticeable gain and clearly
outperforms state-of-the-art methods. Specifically, it achieves an F-measure of
80.13% on the ICDAR 2017 MLT dataset.
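As a toy illustration of reinterpreting the 2D soft mask in 3D, one can fit a
plane z = ax + by + c to the (x, y, score) points of the mask by least squares.
PMTD's actual plane clustering algorithm is iterative and more involved; the
helper below (fit_mask_plane) is only an assumed simplification.

```python
import numpy as np

def fit_mask_plane(soft_mask):
    """Fit a plane z = a*x + b*y + c to the (x, y, score) points of a soft text
    mask by least squares; the plane's z = 0 contour hints at the box border."""
    ys, xs = np.nonzero(soft_mask > 0)
    zs = soft_mask[ys, xs]
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)
    (a, b, c), *_ = np.linalg.lstsq(A, zs, rcond=None)
    return a, b, c
```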
Fusion Based Holistic Road Scene Understanding
This paper addresses the problem of holistic road scene understanding based
on the integration of visual and range data. To this end, we
propose an approach that jointly tackles object-level image segmentation and
semantic region labeling within a conditional random field (CRF) framework.
Specifically, we first generate semantic object hypotheses by clustering 3D
points, learning their prior appearance models, and using a deep learning
method for reasoning their semantic categories. The learned priors, together
with spatial and geometric contexts, are incorporated into the CRF. With this
formulation, visual and range data are fused thoroughly, and moreover, the
coupled segmentation and semantic labeling problem can be inferred via Graph
Cuts. Our approach is validated on the challenging KITTI dataset that contains
diverse complicated road scenarios. Both quantitative and qualitative
evaluations demonstrate its effectiveness.
Comment: 14 pages, 11 figures
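For readers unfamiliar with the formulation, a generic pairwise CRF energy of
the kind minimized via Graph Cuts has the schematic form below; the exact unary
and pairwise terms used in this paper (learned appearance priors, spatial and
geometric contexts) are not spelled out in the abstract, so this is only the
standard template.

```latex
E(\mathbf{x}) \;=\; \sum_{i \in \mathcal{V}} \psi_i(x_i)
\;+\; \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j)
```

Here \psi_i measures how well label x_i fits the visual and range evidence at
site i, and \psi_{ij} penalizes inconsistent labels between neighbouring sites.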
Latent Variable Algorithms for Multimodal Learning and Sensor Fusion
Multimodal learning has lacked principled ways of combining information
from different modalities and learning a low-dimensional manifold of meaningful
representations. We study multimodal learning and sensor fusion from a latent
variable perspective. We first present a regularized recurrent attention filter
for sensor fusion. This algorithm can dynamically combine information from
different types of sensors in a sequential decision-making task. Each sensor is
paired with a modular neural network to maximize the utility of its own
information. A gating modular neural network dynamically generates a set of
mixing weights for the outputs of the sensor networks by balancing the utility
of all sensors' information. We design a co-learning mechanism to encourage
co-adaptation and independent learning of each sensor at the same time, and
propose a regularization-based co-learning method. In the second part, we focus
on recovering the manifold of latent representations. We propose a co-learning
approach using a probabilistic graphical model that imposes a structural prior
on the generative model, the multimodal variational RNN (MVRNN), and derive a
variational lower bound for its objective function. In the third part, we
extend the Siamese structure to sensor fusion for robust acoustic event
detection. We perform experiments to investigate the extracted latent
representations; further work will be carried out in the following months. Our
experiments
show that the recurrent attention filter can dynamically combine different
sensor inputs according to the information carried in the inputs. We consider
that MVRNN can identify latent representations that are useful for many
downstream tasks such as speech synthesis, activity recognition, and control
and planning.
Both algorithms are general frameworks which can be applied to other tasks
where different types of sensors are jointly used for decision making.
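A minimal sketch of the gating idea (one subnetwork per sensor plus a gating
network that produces softmax mixing weights over the sensor outputs) is shown
below; the MLP sizes, the use of PyTorch, and feeding the raw inputs to the
gate are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedSensorFusion(nn.Module):
    """Illustrative gated fusion: each sensor has its own small network, and a
    gating network mixes the sensor outputs with softmax weights."""
    def __init__(self, sensor_dims, hidden=64, out_dim=32):
        super().__init__()
        self.sensor_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
            for d in sensor_dims
        )
        self.gate = nn.Linear(sum(sensor_dims), len(sensor_dims))

    def forward(self, inputs):                    # inputs: list of per-sensor tensors
        feats = [net(x) for net, x in zip(self.sensor_nets, inputs)]
        weights = torch.softmax(self.gate(torch.cat(inputs, dim=-1)), dim=-1)
        fused = sum(w.unsqueeze(-1) * f for w, f in zip(weights.unbind(-1), feats))
        return fused, weights                     # fused feature and mixing weights
```

The mixing weights are recomputed for every input, so an uninformative sensor
(e.g. a noisy microphone) can be down-weighted at that time step while its
subnetwork continues to learn from its own modality.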
Joint Energy-based Detection and Classification of Multilingual Text Lines
This paper proposes a new hierarchical MDL-based model for a joint detection
and classification of multilingual text lines in images taken by hand-held
cameras. The majority of related text detection methods assume alphabet-based
writing in a single language, e.g. in Latin. They use simple clustering
heuristics specific to such texts: proximity between letters within one line,
larger distance between separate lines, etc. We are interested in a
significantly more ambiguous problem where images combine alphabet and
logographic characters from multiple languages and typographic rules vary a lot
(e.g. English, Korean, and Chinese). Complexity of detecting and classifying
text lines in multiple languages calls for a more principled approach based on
information-theoretic principles. Our new MDL model includes data costs
combining geometric errors with classification likelihoods and a hierarchical
sparsity term based on label costs. This energy model can be efficiently
minimized by fusion moves. We demonstrate the robustness of the proposed
algorithm on a large new database of multilingual text images collected in the
public transit system of Seoul.
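Schematically, an MDL/label-cost energy of the kind described (a geometric fit
term plus a classification likelihood plus a sparsity term over labels) can be
written as below; the precise terms, weights, and hierarchy used in the paper
are not given in the abstract, so the symbols here are assumptions.

```latex
E(f) \;=\; \sum_{p} \Big( \mathrm{dist}^2\!\big(p,\, \ell_{f_p}\big)
\;-\; \lambda \log \Pr\!\big(c_p \mid f_p\big) \Big)
\;+\; \sum_{L \in \mathcal{L}} h_L \,\big[\exists\, p : f_p = L \big]
```

Here f_p assigns character p to a text-line model \ell_{f_p}, the first term is
the geometric error, the second is the classification likelihood, and h_L is
the label cost that keeps the set of lines actually used sparse.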
Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes
Previous scene text detection methods have progressed substantially over the
past years. However, limited by the receptive field of CNNs and the simple
representations like rectangle bounding box or quadrangle adopted to describe
text, previous methods may fall short when dealing with more challenging text
instances, such as extremely long text and arbitrarily shaped text. To address
these two problems, we present a novel text detector, named LOMO, which
localizes text progressively multiple times (or, in other words, LOok
More than Once). LOMO consists of a direct regressor (DR), an iterative
refinement module (IRM) and a shape expression module (SEM). At first, text
proposals in the form of quadrangles are generated by the DR branch. Next, IRM
progressively perceives the entire long text by iterative refinement based on
the extracted feature blocks of preliminary proposals. Finally, an SEM is
introduced to reconstruct a more precise representation of irregular text by
considering the geometric properties of a text instance, including its text region,
text center line and border offsets. The state-of-the-art results on several
public benchmarks including ICDAR2017-RCTW, SCUT-CTW1500, Total-Text, ICDAR2015
and ICDAR17-MLT confirm the striking robustness and effectiveness of LOMO.
Comment: Accepted by CVPR 2019
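The three-module flow (DR proposals, IRM refinement over several passes, SEM
geometry regression) can be summarized with the sketch below; the module
interfaces and the number of refinement steps are hypothetical placeholders,
not the authors' released code.

```python
def detect_text(image, backbone, dr, irm, sem, refine_steps=3):
    """Schematic LOMO-style pipeline (hypothetical interfaces)."""
    features = backbone(image)                 # shared CNN feature maps
    quads = dr(features)                       # DR: initial quadrangle proposals
    for _ in range(refine_steps):              # IRM: look more than once,
        quads = irm(features, quads)           # refining each long-text proposal
    return [sem(features, q) for q in quads]   # SEM: region, center line, border offsets
```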