Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation
Scene text detection attracts much attention in computer vision, because it
can be widely used in many applications such as real-time text translation,
automatic information entry, blind person assistance, robot sensing and so on.
Though many methods have been proposed for horizontal and oriented texts,
detecting irregular shape texts such as curved texts is still a challenging
problem. To solve the problem, we propose a robust scene text detection method
with adaptive text region representation. Given an input image, a text region
proposal network is first used for extracting text proposals. Then, these
proposals are verified and refined with a refinement network. Here, recurrent
neural network based adaptive text region representation is proposed for text
region refinement, where a pair of boundary points is predicted at each
time step until no new points are found. In this way, text regions of
arbitrary shapes are detected and represented with an adaptive number of
boundary points, which gives a more accurate description of text regions.
Experimental results on five benchmarks, namely CTW1500, TotalText,
ICDAR2013, ICDAR2015 and MSRA-TD500, show that the proposed method
achieves state-of-the-art performance in scene text detection.
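The refinement loop described above, in which one pair of boundary points is predicted per recurrent time step until a stop signal fires, can be sketched as follows. This is a minimal sketch: the `step_fn` interface, the 0.5 stop threshold and the toy geometry are our assumptions, not the paper's network.

```python
def decode_boundary(step_fn, hidden, max_steps=20):
    """Greedy decoding of a text-region polygon: at each time step the
    recurrent cell predicts one (top, bottom) pair of boundary points
    plus a stop probability; decoding halts once the stop probability
    exceeds 0.5, so each region gets an adaptive number of points."""
    points = []
    for _ in range(max_steps):
        pair, stop_prob, hidden = step_fn(hidden)
        if stop_prob > 0.5:
            break
        points.append(pair)
    return points

# Toy stand-in for the refinement network: emits three point pairs
# along a straight region, then signals "stop".
def toy_step(state):
    t = state["t"]
    pair = ((10.0 * t, 0.0), (10.0 * t, 8.0))  # (top point, bottom point)
    stop_prob = 0.0 if t < 3 else 1.0
    state["t"] += 1
    return pair, stop_prob, state

pts = decode_boundary(toy_step, {"t": 0})
print(len(pts))  # 3 pairs -> a 6-point polygon
```

The `max_steps` cap guards against a predictor that never emits a stop signal.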
Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation
Deep Neural Networks (DNNs) have been widely applied in various recognition
tasks. However, DNNs have recently been shown to be vulnerable to
adversarial examples, which can mislead DNNs to make arbitrary incorrect
predictions. While adversarial examples are well studied in classification
tasks, other learning problems may have different properties. For instance,
semantic segmentation requires additional components such as dilated
convolutions and multiscale processing. In this paper, we aim to characterize
adversarial examples based on spatial context information in semantic
segmentation. We observe that spatial consistency information can be
potentially leveraged to detect adversarial examples robustly even when a
strong adaptive attacker has access to the model and detection strategies. We
also show that adversarial examples based on attacks considered within the
paper barely transfer among models, even though transferability is common in
classification. Our observations shed new light on developing adversarial
attacks and defenses to better understand the vulnerabilities of DNNs.
Comment: Accepted to ECCV 201
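A consistency check along these lines can be sketched as follows. This is our reading of the abstract, not the authors' code: the patch size, the half-patch overlap and the pixel-agreement score are all illustrative choices.

```python
import numpy as np

def spatial_consistency(image, segment, patch=64, k=8, rng=None):
    """Segment k pairs of randomly chosen overlapping patches
    independently and measure how often the two predictions agree on
    the overlapping pixels. Benign images tend to give high agreement;
    adversarial ones tend not to."""
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    dy = dx = patch // 2                      # second patch shifted by half
    scores = []
    for _ in range(k):
        y = rng.integers(0, h - patch - dy)
        x = rng.integers(0, w - patch - dx)
        p1 = segment(image[y:y + patch, x:x + patch])
        p2 = segment(image[y + dy:y + dy + patch, x + dx:x + dx + patch])
        o1 = p1[dy:, dx:]                     # overlap region in patch 1
        o2 = p2[:patch - dy, :patch - dx]     # same pixels seen by patch 2
        scores.append((o1 == o2).mean())
    return float(np.mean(scores))

# Toy "model": label = quantized intensity, so predictions are
# perfectly consistent across patches on a benign image.
img = np.tile(np.linspace(0, 1, 256), (256, 1))
seg = lambda x: (x * 4).astype(int)
print(spatial_consistency(img, seg, rng=0))  # 1.0 for this consistent model
```

A detector would threshold this score; an adaptive attacker must then fool the model on every random overlapping crop simultaneously, which is what makes the defense robust.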
Single Image Dehazing through Improved Atmospheric Light Estimation
Image contrast enhancement for outdoor vision is important for smart car
auxiliary transport systems. The video frames captured in poor weather
conditions are often characterized by poor visibility. Most image dehazing
algorithms rely on hard-threshold assumptions or user input to estimate the
atmospheric light. However, the brightest pixels are sometimes objects such
as car lights or streetlights, especially in smart car auxiliary transport
systems, so simply using a hard threshold can lead to a wrong estimate. In
this paper, we propose an optimized single-image dehazing method that
estimates the atmospheric light efficiently and removes haze through the
estimation of a semi-globally adaptive filter. The enhanced images are
characterized by low noise and good exposure in dark regions. The textures
and edges of the processed images are also enhanced significantly.
Comment: Multimedia Tools and Applications (2015)
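For context, dehazing methods work with the haze model I(x) = J(x)t(x) + A(1 - t(x)), where A is the atmospheric light the abstract discusses. The sketch below illustrates why a plain brightest-pixel rule fails, using a dark-channel-style estimate in the spirit of He et al. rather than the paper's semi-globally adaptive filter; the per-pixel dark channel (no patch erosion) and the 0.1% fraction are simplifying assumptions.

```python
import numpy as np

def estimate_atmospheric_light(img, top_frac=0.001):
    """Rank pixels by their minimum over the colour channels (the
    per-pixel dark channel) and average the image over the top
    fraction of that ranking, instead of trusting the single
    brightest pixel, which may be a headlight."""
    dark = img.min(axis=2)                        # per-pixel dark channel
    n = max(1, int(top_frac * dark.size))
    idx = np.argpartition(dark.ravel(), -n)[-n:]  # n haziest pixels
    flat = img.reshape(-1, img.shape[2])
    return flat[idx].mean(axis=0)                 # estimated airlight A

# Toy scene: uniform grey haze at 0.8 plus one saturated headlight.
img = np.full((100, 100, 3), 0.8)
img[50, 50] = [1.0, 1.0, 1.0]
A = estimate_atmospheric_light(img)
print(np.round(A, 2))
```

A hard brightest-pixel rule would return 1.0 (the headlight); averaging over a small fraction keeps the estimate close to the true haze value of 0.8.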
Background Subtraction in Real Applications: Challenges, Current Models and Future Directions
Computer vision applications based on videos often require the detection of
moving objects in their first step. Background subtraction is then applied in
order to separate the background and the foreground. In the literature,
background subtraction is among the most investigated topics in computer
vision, with a large number of publications. Most of them apply
mathematical and machine-learning models to be more robust to the
challenges met in videos. However, the ultimate goal is for the background
subtraction methods developed in research to be employed in real
applications such as traffic surveillance. Looking at the literature, we
can remark that there is often a gap between the methods used in real
applications and the current methods in fundamental research. In addition,
the videos evaluated in large-scale datasets are not exhaustive, in that
they cover only part of the complete spectrum of challenges met in real
applications. In this context, we attempt to provide as exhaustive a
survey as possible of real applications that use background subtraction,
in order to identify the real challenges met in practice, the background
models currently used, and future directions. Challenges are investigated
in terms of camera, foreground objects and environments. In addition, we
identify the background models that are effectively used in these
applications in order to find recent background models that are
potentially usable in terms of robustness, time and memory requirements.
Comment: Submitted to Computer Science Review
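As background for the gap the survey discusses, the simplest background model actually used in practice is a running average with per-pixel thresholding; the sketch below (all constants illustrative) shows the basic mechanism that more robust models refine.

```python
import numpy as np

def running_average_bg(frames, alpha=0.05, thresh=0.2):
    """Minimal running-average background model: the background
    estimate B is updated as B <- (1 - alpha) * B + alpha * frame,
    and pixels farther than `thresh` from B are marked foreground."""
    bg = frames[0].astype(float)
    masks = []
    for f in frames[1:]:
        mask = np.abs(f - bg) > thresh   # foreground mask for this frame
        bg = (1 - alpha) * bg + alpha * f
        masks.append(mask)
    return masks, bg

# Static scene at intensity 0.5 with a bright 2x2 "object" entering
# in the last two frames.
frames = [np.full((8, 8), 0.5) for _ in range(5)]
for f in frames[3:]:
    f[2:4, 2:4] = 1.0
masks, _ = running_average_bg(frames)
print(masks[2].sum())  # 4 pixels flagged when the object appears
```

This baseline fails under camera jitter, illumination change and dynamic backgrounds, which is exactly the challenge spectrum the survey catalogues.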
Skeleton Matching based approach for Text Localization in Scene Images
In this paper, we propose a skeleton matching based approach which aids in
text localization in scene images. The input image is preprocessed and
segmented into blocks using connected component analysis. We obtain the
skeleton of the segmented block using a morphology-based approach. The
skeletonized images are compared with the trained templates in the database to
categorize into text and non-text blocks. Further, the newly designed
geometrical rules and morphological operations are employed on the detected
text blocks for scene text localization. The experimental results obtained on
publicly available standard datasets illustrate that the proposed method can
detect and localize text of various sizes, fonts and colors.
Comment: 10 pages, 8 figures, Eighth International Conference on Image and
Signal Processing, Elsevier Publications, pp. 145-153, held at UVCE,
Bangalore in July 2014. ISBN: 978935107252
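The abstract does not specify how skeletons are compared with the trained templates; one simple, hypothetical scoring that fits the description treats each skeleton as a point set and uses a symmetric mean nearest-neighbour distance (lower = more similar).

```python
import numpy as np

def skeleton_match(skel_a, skel_b):
    """Score two binary skeleton images by the symmetric mean
    nearest-neighbour distance between their pixel coordinate sets."""
    pa = np.argwhere(skel_a)
    pb = np.argwhere(skel_b)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return float((d.min(axis=1).mean() + d.min(axis=0).mean()) / 2)

# A vertical-stroke "template" matches a shifted vertical stroke
# better than a horizontal one.
tpl = np.zeros((7, 7), bool); tpl[1:6, 3] = True   # vertical stroke
v   = np.zeros((7, 7), bool); v[1:6, 4]  = True    # shifted vertical
hzt = np.zeros((7, 7), bool); hzt[3, 1:6] = True   # horizontal stroke
print(skeleton_match(tpl, v) < skeleton_match(tpl, hzt))  # True
```

A block would then be labelled text if its skeleton's best template score falls below a threshold; the real method's matching criterion may of course differ.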
AdaDNNs: Adaptive Ensemble of Deep Neural Networks for Scene Text Recognition
Recognizing text in the wild is a challenging task because of complex
backgrounds, various illuminations and diverse distortions, even with deep
neural networks (convolutional neural networks and recurrent neural networks).
In the end-to-end training procedure for scene text recognition, the
outputs of deep neural networks at different iterations consistently
exhibit diversity and complementarity with respect to the target object
(text). Here, a simple but effective deep learning method, an adaptive
ensemble of deep neural networks (AdaDNNs), is proposed to select and
adaptively combine classifier components from different iterations of the
whole learning system. Furthermore, the ensemble is formulated within a
Bayesian framework for classifier weighting and combination. A variety of
experiments on several widely acknowledged benchmarks, i.e., the ICDAR
Robust Reading Competition (Challenges 1, 2 and 4) datasets, verify the
surprising improvement over the baseline DNNs and the effectiveness of
AdaDNNs compared with recent state-of-the-art methods.
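The core combination step can be sketched as a weighted sum of snapshot outputs. This is a simplified stand-in for the paper's Bayesian weighting; the two snapshots and their hand-picked weights are illustrative.

```python
import numpy as np

def ensemble_predict(snapshot_probs, weights):
    """Combine class-probability outputs from classifier snapshots
    taken at different training iterations: normalise the weights,
    form the weighted sum, and return the argmax as the ensemble
    prediction."""
    probs = np.asarray(snapshot_probs, dtype=float)   # (snapshots, classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalise weights
    combined = (w[:, None] * probs).sum(axis=0)
    return int(combined.argmax()), combined

# Two snapshots disagree; the higher-weighted later snapshot wins.
p_early = [0.6, 0.4]   # early iteration favours class 0
p_late  = [0.3, 0.7]   # later iteration favours class 1
label, probs = ensemble_predict([p_early, p_late], weights=[0.2, 0.8])
print(label)  # 1
```

The interesting part of AdaDNNs is how the weights are chosen; the point here is only that snapshots which individually err can complement one another once combined.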
Compressive Hyperspectral Imaging via Approximate Message Passing
We consider a compressive hyperspectral imaging reconstruction problem, where
three-dimensional spatio-spectral information about a scene is sensed by a
coded aperture snapshot spectral imager (CASSI). The CASSI imaging process can
be modeled as superimposing three-dimensional coded and shifted voxels and
projecting these onto a two-dimensional plane, such that the number of acquired
measurements is greatly reduced. On the other hand, because the measurements
are highly compressive, the reconstruction process becomes challenging. We
previously proposed a compressive imaging reconstruction algorithm that is
applied to two-dimensional images based on the approximate message passing
(AMP) framework. AMP is an iterative algorithm that can be used in signal and
image reconstruction by performing denoising at each iteration. We employed an
adaptive Wiener filter as the image denoiser, and called our algorithm
"AMP-Wiener." In this paper, we extend AMP-Wiener to three-dimensional
hyperspectral image reconstruction, and call it "AMP-3D-Wiener." Applying the
AMP framework to the CASSI system is challenging because the matrix that
models the CASSI system is highly sparse; such a matrix is ill-suited to
AMP and makes it difficult for AMP to converge. Therefore, we modify the
adaptive Wiener filter and employ a technique called damping to mitigate
the divergence issue of AMP. Numerical experiments show that AMP-3D-Wiener
outperforms existing widely used algorithms such as gradient projection
for sparse reconstruction (GPSR) and two-step iterative
shrinkage/thresholding (TwIST) given a similar amount of runtime.
Moreover, in contrast to GPSR and TwIST, AMP-3D-Wiener does not require
parameter tuning, which simplifies the reconstruction process.
Comment: 13 pages, to appear in Journal of Selected Topics in Signal
Processing
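Damping itself is a generic stabiliser: replace the raw update with a convex combination of the old and proposed estimates. The toy fixed-point iteration below (not the CASSI system; the map and the damping factor 0.3 are illustrative) shows how a divergent update becomes convergent.

```python
def damped_update(x_old, x_proposed, beta=0.3):
    """Damped iteration step: instead of jumping to the proposed
    estimate, take x_new = beta * x_proposed + (1 - beta) * x_old,
    which slows the iteration but suppresses divergence."""
    return beta * x_proposed + (1 - beta) * x_old

# Toy fixed-point iteration x <- g(x) with fixed point x = 1.
# |g'(x)| = 1.5 > 1, so the undamped iteration diverges, while the
# damped map has slope 0.3 * (-1.5) + 0.7 = 0.25 and converges.
g = lambda x: -1.5 * x + 2.5
x_plain, x_damped = 0.0, 0.0
for _ in range(50):
    x_plain = g(x_plain)
    x_damped = damped_update(x_damped, g(x_damped), beta=0.3)
print(abs(x_plain - 1) > 1e3, abs(x_damped - 1) < 1e-6)
```

In AMP the same trade-off applies: damping costs extra iterations but keeps the estimates bounded when the measurement matrix violates AMP's usual assumptions.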
Hierarchy Composition GAN for High-fidelity Image Synthesis
Despite the rapid progress of generative adversarial networks (GANs) in image
synthesis in recent years, existing image synthesis approaches work in
either the geometry domain or the appearance domain alone, which often
introduces various synthesis artifacts. This paper presents an innovative
Hierarchical Composition GAN (HIC-GAN) that incorporates image synthesis
in the geometry and appearance domains into an end-to-end trainable
network and achieves superior synthesis realism in both domains
simultaneously. We design an innovative hierarchical composition
mechanism that is capable of learning realistic composition geometry and
handling occlusions when multiple foreground objects are involved in
image composition. In addition, we introduce a novel attention mask
mechanism that guides the adaptation of foreground-object appearance and
also helps to provide a better training reference for learning in the
geometry domain. Extensive experiments on scene text image synthesis,
portrait editing and indoor rendering tasks show that the proposed
HIC-GAN achieves superior synthesis performance both qualitatively and
quantitatively.
Comment: 11 pages, 8 figures
Audio-visual foreground extraction for event characterization
This paper presents a new method that integrates audio and visual
information for scene analysis in a typical surveillance scenario, using
only one camera and one monaural microphone. Visual information is
analyzed by a standard visual background/foreground (BG/FG) modelling
module, enhanced with a novelty detection stage and coupled with an audio
BG/FG modelling scheme. The audio-visual association is performed on-line
by exploiting the concept of synchrony. Experiments carrying out
classification and clustering of events show the potential of the
proposed approach, also in comparison with the results obtained using the
single modalities.
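A synchrony-based association can be rendered, very roughly, as correlating a per-frame audio foreground signal against each visual blob's foreground activity. This is our toy illustration; the paper's on-line scheme and its audio/visual features are more involved.

```python
import numpy as np

def synchrony(audio_fg, visual_fg):
    """Normalised correlation between a per-frame audio foreground
    signal and one visual blob's per-frame foreground size; the sound
    is associated with the best-correlated blob."""
    a = audio_fg - audio_fg.mean()
    v = visual_fg - visual_fg.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(v) + 1e-12
    return float((a * v).sum() / denom)

audio  = np.array([0., 0., 1., 1., 0., 1.])   # frames with sound
blob_a = np.array([0., 0., 5., 6., 0., 5.])   # moves whenever sound occurs
blob_b = np.array([3., 3., 3., 3., 3., 3.])   # static blob
print(synchrony(audio, blob_a) > synchrony(audio, blob_b))  # True
```

The moving blob correlates with the audio activity while the static one does not, so the sound would be attributed to the former.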
Spatio-Temporal Filter Adaptive Network for Video Deblurring
Video deblurring is a challenging task due to spatially variant blur
caused by camera shake, object motion and depth variations. Existing
methods usually estimate optical flow in the blurry video to align consecutive
frames or approximate blur kernels. However, they tend to generate artifacts or
cannot effectively remove blur when the estimated optical flow is not accurate.
To overcome the limitation of separate optical flow estimation, we propose a
Spatio-Temporal Filter Adaptive Network (STFAN) for the alignment and
deblurring in a unified framework. The proposed STFAN takes the blurry
and restored images of the previous frame as well as the blurry image of
the current frame as input, and dynamically generates spatially adaptive
filters for the alignment and deblurring. We then propose a new Filter
Adaptive Convolutional (FAC) layer to align the deblurred features of the
previous frame with the current frame and to remove the spatially variant
blur from the features of the current frame. Finally, we develop a
reconstruction network which takes the fusion of the two transformed
features to restore clear frames. Both quantitative and qualitative
evaluation results on benchmark datasets and real-world videos
demonstrate that the proposed algorithm performs favorably against
state-of-the-art methods in terms of accuracy, speed and model size.
Comment: ICCV 201
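The idea behind a filter-adaptive convolution can be sketched as follows. This is heavily simplified relative to the FAC layer: single channel, per-pixel k x k kernels given as an explicit array, no learning, and plain zero padding.

```python
import numpy as np

def filter_adaptive_conv(feat, filters):
    """Apply a different k x k kernel at every spatial position:
    filters[i, j] is the kernel used at pixel (i, j), instead of one
    kernel shared across the whole feature map as in a standard
    convolution."""
    h, w = feat.shape
    k = filters.shape[-1]
    r = k // 2
    padded = np.pad(feat, r)                  # zero padding
    out = np.empty_like(feat, dtype=float)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + k, j:j + k]
            out[i, j] = (patch * filters[i, j]).sum()
    return out

# Sanity check: with every per-pixel kernel set to the identity
# kernel, the layer reduces to a no-op.
feat = np.arange(16, dtype=float).reshape(4, 4)
ident = np.zeros((4, 4, 3, 3)); ident[..., 1, 1] = 1.0
out = filter_adaptive_conv(feat, ident)
print(np.allclose(out, feat))  # True
```

Because the kernels vary per pixel, a network predicting them can express spatially variant alignment and deblurring in a single operation, which is the role the FAC layer plays in STFAN.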