Visual Saliency Based on Multiscale Deep Features
Visual saliency is a fundamental problem in both cognitive and computational
sciences, including computer vision. In this CVPR 2015 paper, we discover that
a high-quality visual saliency model can be trained with multiscale features
extracted using a popular deep learning architecture, convolutional neural
networks (CNNs), which have had many successes in visual recognition tasks. For
learning such saliency models, we introduce a neural network architecture,
which has fully connected layers on top of CNNs responsible for extracting
features at three different scales. We then propose a refinement method to
enhance the spatial coherence of our saliency results. Finally, aggregating
multiple saliency maps computed for different levels of image segmentation can
further boost the performance, yielding saliency maps better than those
generated from a single segmentation. To promote further research and
evaluation of visual saliency models, we also construct a new large database of
4447 challenging images and their pixelwise saliency annotation. Experimental
results demonstrate that our proposed method is capable of achieving
state-of-the-art performance on all public benchmarks, improving the F-Measure
by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset
(HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively
on these two datasets.
Comment: To appear in CVPR 201
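The two metrics quoted in this abstract, F-measure and mean absolute error (MAE), can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the fixed binarisation threshold of 0.5, the weighting `beta_sq = 0.3` (a common choice in the saliency literature, assumed here), and the function names are all assumptions.

```python
def mean_absolute_error(saliency, truth):
    """Average per-pixel absolute difference between two maps with values in [0, 1]."""
    total = 0.0
    count = 0
    for s_row, t_row in zip(saliency, truth):
        for s, t in zip(s_row, t_row):
            total += abs(s - t)
            count += 1
    return total / count


def f_measure(saliency, truth, threshold=0.5, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall after thresholding.

    threshold and beta_sq are assumed defaults, not values taken from the paper.
    """
    tp = fp = fn = 0
    for s_row, t_row in zip(saliency, truth):
        for s, t in zip(s_row, t_row):
            predicted = s >= threshold
            positive = t >= 0.5
            if predicted and positive:
                tp += 1
            elif predicted:
                fp += 1
            elif positive:
                fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    predicted_map = [[0.9, 0.2], [0.6, 0.1]]
    ground_truth = [[1.0, 0.0], [0.0, 0.0]]
    print(mean_absolute_error(predicted_map, ground_truth))
    print(f_measure(predicted_map, ground_truth))
```

A lower MAE and a higher F-measure both indicate a saliency map closer to the annotation, which is why the abstract reports one as "lowered" and the other as "improved".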
Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection
Most existing RGB-D salient object detection (SOD) methods follow the
CNN-based paradigm, which is unable to model long-range dependencies across
space and modalities due to the natural locality of CNNs. Here we propose the
Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to
tackle this problem. Unlike previous multi-modal transformers that directly
connect all patches from two modalities, we explore the cross-modal
complementarity hierarchically to respect the modality gap and spatial
discrepancy in unaligned regions. Specifically, we propose to use intra-modal
self-attention to explore complementary global contexts, and measure
spatial-aligned inter-modal attention locally to capture cross-modal
correlations. In addition, we present a Feature Pyramid module for Transformer
(FPT) to boost informative cross-scale integration as well as a
consistency-complementarity module to disentangle the multi-modal integration
path and improve the fusion adaptivity. Comprehensive experiments on a large
variety of public datasets verify the efficacy of our designs and the
consistent improvement over state-of-the-art models.
Comment: 10 pages, 10 figure
Inner and Inter Label Propagation: Salient Object Detection in the Wild
In this paper, we propose a novel label propagation based method for saliency
detection. A key observation is that saliency in an image can be estimated by
propagating the labels extracted from the most certain background and object
regions. For most natural images, some boundary superpixels serve as the
background labels and the saliency of other superpixels is determined by
ranking their similarities to the boundary labels based on an inner propagation
scheme. For images of complex scenes, we further deploy a 3-cue-center-biased
objectness measure to pick out and propagate foreground labels. A
co-transduction algorithm is devised to fuse both boundary and objectness
labels based on an inter propagation scheme. The compactness criterion decides
whether the incorporation of objectness labels is necessary, thus greatly
enhancing computational efficiency. Results on five benchmark datasets with
pixel-wise accurate annotations show that the proposed method achieves superior
performance compared with the latest state-of-the-art methods in terms of different
evaluation metrics.
Comment: The full version of the TIP 2015 publicatio
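The idea of ranking superpixels by their similarity to boundary (background) labels can be illustrated with a generic graph-based propagation. The sketch below is not the paper's inner propagation scheme: it uses a manifold-ranking-style update as a stand-in, treats each superpixel as a single scalar feature, and the names `background_ranking_saliency`, `sigma`, and `alpha` are hypothetical.

```python
import math


def background_ranking_saliency(features, edges, boundary, sigma=0.1,
                                alpha=0.9, iterations=200):
    """Rank nodes by similarity to boundary seeds; saliency is the inverse rank.

    features: one scalar per superpixel (an assumed simplification).
    edges:    pairs (i, j) of adjacent superpixels.
    boundary: set of node indices used as background seeds.
    """
    n = len(features)
    # Affinity between adjacent nodes decays with feature difference.
    weights = [dict() for _ in range(n)]
    for i, j in edges:
        w = math.exp(-abs(features[i] - features[j]) / sigma)
        weights[i][j] = w
        weights[j][i] = w
    degree = [sum(neighbours.values()) for neighbours in weights]

    seeds = [1.0 if i in boundary else 0.0 for i in range(n)]
    scores = list(seeds)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            spread = 0.0
            for j, w in weights[i].items():
                norm = math.sqrt(degree[i] * degree[j])
                if norm > 0.0:
                    # Symmetrically normalised affinity times neighbour score.
                    spread += w / norm * scores[j]
            updated.append(alpha * spread + (1.0 - alpha) * seeds[i])
        scores = updated

    peak = max(scores)
    if peak == 0.0:
        return [0.0] * n  # no seeds given, so there is nothing to rank against
    # High background score means low saliency.
    return [1.0 - s / peak for s in scores]


if __name__ == "__main__":
    # A chain of five superpixels; the middle one differs from the boundary.
    feats = [0.10, 0.12, 0.90, 0.11, 0.10]
    chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
    print(background_ranking_saliency(feats, chain, boundary={0, 4}))
```

In this toy chain the middle node is only weakly connected to its boundary-like neighbours, so it receives little background score and ends up with the highest saliency.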
Automatic annotation for weakly supervised learning of detectors
PhD
Object detection in images and action detection in videos are among the most widely studied
computer vision problems, with applications in consumer photography, surveillance, and automatic
media tagging. Typically, these standard detectors are fully supervised, that is, they require
a large body of training data where the locations of the objects/actions in images/videos have
been manually annotated. With the emergence of digital media, and the rise of high-speed internet,
raw images and video are available for little to no cost. However, the manual annotation
of object and action locations remains tedious, slow, and expensive. As a result there has been
a great interest in training detectors with weak supervision where only the presence or absence
of object/action in image/video is needed, not the location. This thesis presents approaches for
weakly supervised learning of object/action detectors with a focus on automatically annotating
object and action locations in images/videos using only binary weak labels indicating the presence
or absence of object/action in images/videos.
First, a framework for weakly supervised learning of object detectors in images is presented.
In the proposed approach, a variation of multiple instance learning (MIL) technique for automatically
annotating object locations in weakly labelled data is presented which, unlike existing
approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial
annotation is then used to start an iterative process in which standard object detectors are used to
refine the location annotation. Finally, to ensure that the iterative training of detectors does not drift
from the object of interest, a scheme for detecting model drift is also presented. Furthermore,
unlike most other methods, our weakly supervised approach is evaluated on data without manual
pose (object orientation) annotation.
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues,
is carried out. From the analysis, a new method based on negative mining (NegMine) is presented
for the initial annotation of both object and action data. The NegMine based approach is a
much simpler formulation using only an inter-class measure and requires no complex combinatorial
optimisation but can still meet or outperform existing approaches including the previously presented
inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing
approaches to boost their performance.
Finally, the thesis will take a step back and look at the use of generic object detectors as prior
knowledge in weakly supervised learning of object detectors. These generic object detectors are
typically based on sampling saliency maps that indicate if a pixel belongs to the background
or foreground. A new approach to generating saliency maps is presented that, unlike existing
approaches, looks beyond the current image of interest and into images similar to the current
image. We show that our generic object proposal method can be used by itself to annotate the
weakly labelled object data with surprisingly high accuracy.
Signal processing algorithms for enhanced image fusion performance and assessment
The dissertation presents several signal processing algorithms for image fusion in noisy multimodal
conditions. It introduces a novel image fusion method which performs well for image
sets heavily corrupted by noise. As opposed to current image fusion schemes, the method has
no requirements for a priori knowledge of the noise component. The image is decomposed with
Chebyshev polynomials (CP) being used as basis functions to perform fusion at feature level. The
properties of CP, namely fast convergence and smooth approximation, render it ideal for heuristic
and indiscriminate denoising fusion tasks. Quantitative evaluation using objective fusion assessment
methods shows favourable performance of the proposed scheme compared to previous efforts
on image fusion, notably in heavily corrupted images.
The approach is further improved by incorporating the advantages of CP with a state-of-the-art
fusion technique named independent component analysis (ICA), for joint-fusion processing
based on region saliency. Whilst CP fusion is robust under severe noise conditions, it is prone to
eliminating high frequency information of the images involved, thereby limiting image sharpness.
Fusion using ICA, on the other hand, performs well in transferring edges and other salient features
of the input images into the composite output. The combination of both methods, coupled with
several mathematical morphological operations in an algorithm fusion framework, is considered a
viable solution. Again, according to the quantitative metrics the results of our proposed approach
are very encouraging as far as joint fusion and denoising are concerned.
Another focus of this dissertation is on a novel metric for image fusion evaluation that is based
on texture. The conservation of background textural details is considered important in many fusion
applications as they help define the image depth and structure, which may prove crucial in
many surveillance and remote sensing applications. Our work aims to evaluate the performance of image fusion algorithms based on their ability to retain textural details from the fusion process.
This is done by utilising the gray-level co-occurrence matrix (GLCM) model to extract second-order
statistical features for the derivation of an image textural measure, which is then used to
replace the edge-based calculations in an objective-based fusion metric. Performance evaluation
on established fusion methods verifies that the proposed metric is viable, especially for multimodal
scenarios.
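To make the texture measure more concrete, the following sketch builds a gray-level co-occurrence matrix (GLCM) for one pixel offset and derives two second-order statistics from it. It is an illustration only: the abstract does not say which GLCM features the metric uses, so contrast and energy are assumed examples, and the function names are hypothetical.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalised co-occurrence matrix for the pixel offset (dx, dy).

    image:  2D list of integer gray levels in range(levels).
    Entry [a][b] is the fraction of pixel pairs where a pixel of level a
    has a neighbour of level b at the given offset.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Weights co-occurrences by squared gray-level difference."""
    return sum((i - j) ** 2 * p[i][j]
               for i in range(len(p)) for j in range(len(p)))


def energy(p):
    """Sum of squared entries; high for uniform or very regular textures."""
    return sum(p[i][j] ** 2 for i in range(len(p)) for j in range(len(p)))


if __name__ == "__main__":
    patch = [[0, 0, 1],
             [0, 1, 1],
             [1, 1, 0]]
    matrix = glcm(patch, levels=2)
    print(contrast(matrix), energy(matrix))
```

A texture-based fusion metric in this spirit would compare such statistics between each input image and the fused output, rather than comparing edge strengths.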
Real-time object detection using monocular vision for low-cost automotive sensing systems
This work addresses the problem of real-time object detection in automotive environments
using monocular vision. The focus is on real-time feature detection,
tracking, depth estimation using monocular vision and finally, object detection by
fusing visual saliency and depth information.
Firstly, a novel feature detection approach is proposed for extracting stable and
dense features even in images with very low signal-to-noise ratio. This methodology
is based on image gradients, which are redefined to take account of noise as
part of their mathematical model. Each gradient is based on a vector connecting a
negative to a positive intensity centroid, where both centroids are symmetric about
the centre of the area for which the gradient is calculated. Multiple gradient vectors
define a feature with its strength being proportional to the underlying gradient
vector magnitude. The evaluation of the Dense Gradient Features (DeGraF) shows
superior performance over other contemporary detectors in terms of keypoint density,
tracking accuracy, illumination invariance, rotation invariance, noise resistance
and detection time.
The DeGraF features form the basis for two new approaches that perform dense
3D reconstruction from a single vehicle-mounted camera. The first approach tracks
DeGraF features in real-time while performing image stabilisation with minimal
computational cost. This means that despite camera vibration the algorithm can
accurately predict the real-world coordinates of each image pixel in real-time by comparing
each motion-vector to the ego-motion vector of the vehicle. The performance
of this approach has been compared to different 3D reconstruction methods in order
to determine their accuracy, depth-map density, noise-resistance and computational
complexity. The second approach proposes the use of local frequency analysis of
gradient features for estimating relative depth. This novel method is based on the
fact that DeGraF gradients can accurately measure local image variance with subpixel
accuracy. It is shown that the local frequency by which the centroid oscillates
around the gradient window centre is proportional to the depth of each gradient
centroid in the real world. The lower computational complexity of this methodology
comes at the expense of depth map accuracy as the camera velocity increases, but
it is at least five times faster than the other evaluated approaches.
This work also proposes a novel technique for deriving visual saliency maps by
using Division of Gaussians (DIVoG). In this context, saliency maps express how
different each image pixel is from its surrounding pixels across multiple pyramid
levels. This approach is shown to be both fast and accurate when evaluated against
other state-of-the-art approaches. Subsequently, the saliency information is combined
with depth information to identify salient regions close to the host vehicle.
The fused map allows faster detection of high-risk areas where obstacles are likely
to exist. As a result, existing object detection algorithms, such as the Histogram of
Oriented Gradients (HOG), can execute at least five times faster.
In conclusion, through a step-wise approach, computationally expensive algorithms
have been optimised or replaced by novel methodologies to produce a fast object
detection system that is aligned to the requirements of the automotive domain.
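The pyramid-based saliency idea described in this abstract, where a pixel is salient if it differs from its surroundings, can be sketched with a ratio between the original image and a coarse version of itself. This is a simplified stand-in and not the thesis's DIVoG implementation: it assumes a single grayscale channel, uses 2x2 box averaging in place of Gaussian filtering, compares against only one coarse level, and the names `downsample`, `upsample`, `ratio_saliency`, and `eps` are hypothetical.

```python
def downsample(img):
    """Halve resolution by averaging 2x2 blocks (box filter in place of a Gaussian)."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)]
            for r in range(rows)]


def upsample(img):
    """Double resolution by pixel replication."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def ratio_saliency(img, levels=2, eps=1.0):
    """Saliency as 1 - min(pixel / surround, surround / pixel).

    img:    2D list of non-negative intensities whose height and width
            are divisible by 2 ** levels.
    eps:    assumed offset that avoids division by zero.
    """
    surround = img
    for _ in range(levels):
        surround = downsample(surround)
    for _ in range(levels):
        surround = upsample(surround)
    saliency = []
    for pixel_row, surround_row in zip(img, surround):
        row = []
        for p, s in zip(pixel_row, surround_row):
            a, b = p + eps, s + eps
            # The ratio is 1 where the pixel matches its surround.
            row.append(1.0 - min(a / b, b / a))
        saliency.append(row)
    return saliency


if __name__ == "__main__":
    image = [[10.0] * 4 for _ in range(4)]
    image[1][1] = 200.0  # one bright pixel on a flat background
    for row in ratio_saliency(image):
        print([round(v, 3) for v in row])
```

Because the operations are only averaging, replication, and one division per pixel, a scheme of this kind is cheap to compute, which fits the real-time focus of the abstract.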