Review of Visual Saliency Detection with Comprehensive Information
A visual saliency detection model simulates the human visual system's perception
of a scene and has been widely used in many vision tasks. With the development
of acquisition technology, more comprehensive information, such as depth cues,
inter-image correspondence, or temporal relationships, is available to extend
image saliency detection to RGBD saliency detection, co-saliency detection, or
video saliency detection. An RGBD saliency detection model focuses on extracting
salient regions from RGBD images by incorporating depth information. A
co-saliency detection model introduces an inter-image correspondence constraint
to discover the common salient object in an image group. The goal of a video
saliency detection model is to locate motion-related salient objects in video
sequences by jointly considering motion cues and spatiotemporal constraints. In
this paper, we review these different types of saliency detection algorithms,
summarize the important issues of existing methods, and discuss open problems
and future work. Moreover, the evaluation datasets and quantitative measurements
are briefly introduced, and experimental analysis and discussion are conducted
to provide a holistic overview of different saliency detection methods.
Comment: 18 pages, 11 figures, 7 tables, Accepted by IEEE Transactions on
Circuits and Systems for Video Technology 2018, https://rmcong.github.io
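As a rough illustration of how a depth cue can extend 2D saliency to RGBD saliency, the sketch below fuses an existing RGB saliency map with a simple near-is-salient depth prior. It is a minimal example, not any particular model from the survey; the inputs rgb_saliency and depth, and the weight alpha, are assumptions for illustration.

```python
import numpy as np

def rgbd_saliency_fusion(rgb_saliency: np.ndarray,
                         depth: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Illustrative late fusion of an RGB saliency map with a depth cue.

    rgb_saliency : HxW map in [0, 1] from any 2D saliency model.
    depth        : HxW depth map; closer regions are assumed more salient.
    alpha        : mixing weight between appearance and depth cues.
    """
    # Normalize depth and invert it so that near regions score high.
    d = (depth - depth.min()) / (depth.ptp() + 1e-8)
    depth_prior = 1.0 - d
    # Convex combination of the two cues, followed by renormalization.
    fused = alpha * rgb_saliency + (1.0 - alpha) * depth_prior
    return (fused - fused.min()) / (fused.ptp() + 1e-8)
```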
A Review of Co-saliency Detection Technique: Fundamentals, Applications, and Challenges
Co-saliency detection is a newly emerging and rapidly growing research area
in the computer vision community. As a novel branch of visual saliency, co-saliency
detection refers to the discovery of common and salient foregrounds from two or
more relevant images, and can be widely used in many computer vision tasks. The
existing co-saliency detection algorithms mainly consist of three components:
extracting effective features to represent the image regions, exploring the
informative cues or factors to characterize co-saliency, and designing
effective computational frameworks to formulate co-saliency. Although numerous
methods have been developed, the literature still lacks a thorough review and
evaluation of co-saliency detection techniques. In this paper, we aim at
providing a comprehensive review of the fundamentals, challenges, and
applications of co-saliency detection. Specifically, we provide an overview of
some related computer vision works, review the history of co-saliency
detection, summarize and categorize the major algorithms in this research area,
discuss some open issues in this area, present the potential applications of
co-saliency detection, and finally point out some unsolved challenges and
promising directions for future work. We expect this review to benefit both new
and senior researchers in this field, and to give researchers in other related
areas insight into the utility of co-saliency detection algorithms.
Comment: 28 pages, 12 figures, 3 tables
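The inter-image correspondence cue described above can be illustrated with a toy sketch: each region's single-image saliency is re-weighted by how well its features match regions in the other images of the group. This is a schematic example under assumed inputs (region_feats, single_saliency), not a reviewed algorithm.

```python
import numpy as np

def co_saliency_scores(region_feats, single_saliency):
    """Toy inter-image correspondence cue for co-saliency.

    region_feats    : list of (N_i, D) arrays, one feature matrix per image.
    single_saliency : list of (N_i,) arrays, per-region single-image saliency.
    Returns single-image saliency re-weighted by how well each region matches
    regions in the *other* images of the group.
    """
    out = []
    for i, feats in enumerate(region_feats):
        sims = []
        for j, other in enumerate(region_feats):
            if i == j:
                continue
            # Cosine similarity of every region in image i to every region in image j.
            a = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
            b = other / (np.linalg.norm(other, axis=1, keepdims=True) + 1e-8)
            sims.append((a @ b.T).max(axis=1))      # best match per region
        consistency = np.mean(sims, axis=0)         # inter-image consistency cue
        out.append(single_saliency[i] * consistency)  # combine with intra-image cue
    return out
```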
Augmented Semantic Signatures of Airborne LiDAR Point Clouds for Comparison
LiDAR point clouds provide rich geometric information, which is particularly
useful for the analysis of complex scenes of urban regions. Finding structural
and semantic differences between two different three-dimensional point clouds,
say, of the same region but acquired at different time instances, is an
important problem. A comparison of point clouds involves computationally
expensive registration and segmentation. We are interested in capturing the
relative differences in the geometric uncertainty and semantic content of the
point cloud without the registration process. Hence, we propose an
orientation-invariant geometric signature of the point cloud, which integrates
its probabilistic geometric and semantic classifications. We study different
properties of the geometric signature, which are an image-based encoding of
geometric uncertainty and semantic content. We explore different metrics to
determine differences between these signatures, which in turn compare point
clouds without performing point-to-point registration. Our results show that
the differences in the signatures corroborate the geometric and semantic
differences of the point clouds.
Comment: 18 pages, 6 figures, 1 table
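To make the registration-free comparison concrete, the following sketch treats two image-based signatures as distributions and scores their difference with a chi-square distance. The metric choice and the inputs sig_a and sig_b are illustrative assumptions; the paper explores several metrics.

```python
import numpy as np

def signature_distance(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    """Compare two image-based point-cloud signatures without registration.

    sig_a, sig_b : HxW non-negative arrays (e.g. encodings of geometric
                   uncertainty or semantic class probabilities). The
                   chi-square distance is only one candidate metric.
    """
    a = sig_a.ravel().astype(float)
    b = sig_b.ravel().astype(float)
    a /= a.sum() + 1e-12            # treat each signature as a distribution
    b /= b.sum() + 1e-12
    return 0.5 * float(np.sum((a - b) ** 2 / (a + b + 1e-12)))
```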
D2D: Keypoint Extraction with Describe to Detect Approach
In this paper, we present a novel approach that exploits the information
within the descriptor space to propose keypoint locations. Detect then
describe, or detect and describe jointly are two typical strategies for
extracting local descriptors. In contrast, we propose an approach that inverts
this process by first describing and then detecting the keypoint locations.
Describe-to-Detect (D2D) leverages successful descriptor models without the
need for any additional training. Our method selects keypoints as salient
locations with high information content, which is defined by the descriptors
rather than by independent detection operators. We perform experiments on multiple
benchmarks including image matching, camera localisation, and 3D
reconstruction. The results indicate that our method improves the matching
performance of various descriptors and that it generalises across methods and
tasks.
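A simplified sketch of the describe-then-detect idea is given below: dense descriptors are scored for distinctiveness, and the top-scoring locations are kept as keypoints. The distinctiveness measure used here (distance to the mean descriptor) is a stand-in assumption, not the saliency definition from the paper.

```python
import numpy as np

def describe_to_detect(dense_desc: np.ndarray, num_keypoints: int = 500):
    """Simplified describe-then-detect keypoint selection.

    dense_desc : HxWxD dense descriptor map from any pretrained descriptor.
    Scores each location by how much its descriptor deviates from the mean
    descriptor of the image (a crude proxy for 'high information content')
    and returns the top-scoring (row, col) locations.
    """
    h, w, d = dense_desc.shape
    flat = dense_desc.reshape(-1, d)
    mean = flat.mean(axis=0, keepdims=True)
    score = np.linalg.norm(flat - mean, axis=1)     # distinctiveness score
    idx = np.argsort(score)[::-1][:num_keypoints]   # keep most distinctive
    return np.stack(np.unravel_index(idx, (h, w)), axis=1)
```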
Video Salient Object Detection Using Spatiotemporal Deep Features
This paper presents a method for detecting salient objects in videos where
temporal information in addition to spatial information is fully taken into
account. Following recent reports on the advantage of deep features over
conventional hand-crafted features, we propose a new set of SpatioTemporal Deep
(STD) features that utilize local and global contexts over frames. We also
propose a new SpatioTemporal Conditional Random Field (STCRF) to compute saliency
from STD features. STCRF is our extension of CRF to the temporal domain and
describes the relationships among neighboring regions both in a frame and over
frames. STCRF leads to temporally consistent saliency maps over frames,
contributing to the accurate detection of salient objects' boundaries and noise
reduction during detection. Our proposed method first segments an input video
into multiple scales and then computes a saliency map at each scale level using
STD features with STCRF. The final saliency map is computed by fusing saliency
maps at different scale levels. Our experiments, using publicly available
benchmark datasets, confirm that the proposed method significantly outperforms
state-of-the-art methods. We also applied our saliency computation to the video
object segmentation task, showing that our method outperforms existing video
object segmentation methods.
Comment: accepted at TI
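The final fusion step described above can be sketched as follows, assuming per-scale saliency maps have already been computed and resized to frame resolution; the plain average used here is an illustrative choice, not necessarily the paper's fusion rule.

```python
import numpy as np

def fuse_scale_maps(saliency_maps):
    """Fuse per-scale saliency maps into a final map for one frame.

    saliency_maps : list of HxW arrays, one per segmentation scale, all
                    resized to frame resolution and in [0, 1].
    """
    fused = np.mean(np.stack(saliency_maps, axis=0), axis=0)  # average over scales
    return (fused - fused.min()) / (fused.ptp() + 1e-8)       # renormalize to [0, 1]
```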
Graph-Theoretic Spatiotemporal Context Modeling for Video Saliency Detection
As an important and challenging problem in computer vision, video saliency
detection is typically cast as a spatiotemporal context modeling problem over
consecutive frames. As a result, a key issue in video saliency detection is how
to effectively capture the intrinsic properties of atomic video structures as
well as their associated contextual interactions along the spatial and temporal
dimensions. Motivated by this observation, we propose a graph-theoretic video
saliency detection approach based on adaptive video structure discovery, which
is carried out within a spatiotemporal atomic graph. Through graph-based
manifold propagation, the proposed approach is capable of effectively modeling
the semantically contextual interactions among atomic video structures for
saliency detection while preserving spatial smoothness and temporal
consistency. Experiments demonstrate the effectiveness of the proposed approach
over several benchmark datasets.
Comment: ICIP 201
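Graph-based manifold propagation of this kind is commonly written in the closed form f = (I - alpha * S)^-1 y with a symmetrically normalized affinity S; the sketch below shows that generic formulation over an assumed affinity matrix W and seed vector y, not the paper's exact model.

```python
import numpy as np

def manifold_ranking(W: np.ndarray, y: np.ndarray, alpha: float = 0.99):
    """Saliency propagation over a graph of atomic video structures.

    W     : (N, N) symmetric affinity matrix between graph nodes
            (e.g. spatiotemporal segments).
    y     : (N,) initial saliency seeds.
    alpha : propagation strength for the standard manifold-ranking closed form.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]     # normalized affinity
    f = np.linalg.solve(np.eye(len(y)) - alpha * S, y)    # closed-form ranking
    return (f - f.min()) / (f.ptp() + 1e-8)
```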
Attend Before you Act: Leveraging human visual attention for continual learning
When humans perform a task, such as playing a game, they selectively pay
attention to certain parts of the visual input, gathering relevant information
and sequentially combining it to build a representation from the sensory data.
In this work, we explore leveraging where humans look in an image as an
implicit indication of what is salient for decision making. We build on top of
the UNREAL architecture in DeepMind Lab's 3D navigation maze environment. We
train the agent both with original images and foveated images, which were
generated by overlaying the original images with saliency maps generated using
a real-time spectral residual technique. We investigate the effectiveness of
this approach in transfer learning by measuring performance in the context of
noise in the environment.
Comment: Lifelong Learning: A Reinforcement Learning Approach (LLARLA)
Workshop, ICML 201
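The spectral residual technique used to generate the saliency maps can be sketched in a few lines: the log-amplitude spectrum is compared with its local average, and the residual is transformed back with the original phase. The smoothing sizes below are illustrative defaults, and the foveation step (overlaying the map on the frame) is left to the caller.

```python
import numpy as np
from scipy import ndimage

def spectral_residual_saliency(gray: np.ndarray) -> np.ndarray:
    """Spectral residual saliency (Hou & Zhang style) for foveating frames.

    gray : HxW grayscale image, float in [0, 1].
    """
    f = np.fft.fft2(gray)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    # Spectral residual: log amplitude minus its local average.
    residual = log_amp - ndimage.uniform_filter(log_amp, size=3)
    # Back to the image domain with the original phase, then smooth.
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = ndimage.gaussian_filter(sal, sigma=3)
    return (sal - sal.min()) / (sal.ptp() + 1e-8)
```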
Exploiting Egocentric Object Prior for 3D Saliency Detection
On a minute-to-minute basis people undergo numerous fluid interactions with
objects that barely register on a conscious level. Recent neuroscientific
research demonstrates that humans have a fixed size prior for salient objects.
This suggests that a salient object in 3D undergoes a consistent transformation
such that people's visual system perceives it with an approximately fixed size.
This finding indicates that there exists a consistent egocentric object prior
that can be characterized by shape, size, depth, and location in the first
person view.
In this paper, we develop an EgoObject Representation, which encodes these
characteristics by incorporating shape, location, size and depth features from
an egocentric RGBD image. We empirically show that this representation can
accurately characterize the egocentric object prior by testing it on an
egocentric RGBD dataset for three tasks: 3D saliency detection, future
saliency prediction, and interaction classification. This representation is
evaluated on our new Egocentric RGBD Saliency dataset that includes various
activities such as cooking, dining, and shopping. By using our EgoObject
representation, we outperform previously proposed models for saliency detection
(a relative 30% improvement on the 3D saliency detection task) on our dataset.
Additionally, we demonstrate that this representation allows us to predict
future salient objects based on the gaze cue and classify people's interactions
with objects.
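A toy version of such an egocentric object descriptor is sketched below: given a candidate region mask and an aligned depth map, it collects normalized location, size, shape, and depth statistics. The feature set is a simplified assumption; the paper's EgoObject representation is richer.

```python
import numpy as np

def egoobject_features(mask: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Toy per-region descriptor in the spirit of the EgoObject representation.

    mask  : HxW boolean array for one candidate object region.
    depth : HxW depth map aligned with the RGB frame.
    Returns a small vector of location, size, shape, and depth statistics.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    area = len(ys) / (h * w)                       # normalized size
    cy, cx = ys.mean() / h, xs.mean() / w          # normalized centroid (location)
    box_h = (ys.max() - ys.min() + 1) / h          # shape via bounding-box extents
    box_w = (xs.max() - xs.min() + 1) / w
    region_depth = depth[mask]                     # depth statistics of the region
    return np.array([cx, cy, area, box_h, box_w,
                     region_depth.mean(), region_depth.std()])
```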
Benchmark 3D eye-tracking dataset for visual saliency prediction on stereoscopic 3D video
Visual Attention Models (VAMs) predict which regions of an image or video are
most likely to attract human attention. Although saliency detection is well
explored for 2D image and video content, only a few attempts have been made to
design 3D saliency prediction models. Newly proposed 3D visual
attention models have to be validated over large-scale video saliency
prediction datasets, which also contain eye-tracking data. There are several
publicly available eye-tracking datasets for 2D image and
video content. In the case of 3D, however, there is still a need for
large-scale video saliency datasets for the research community for validating
different 3D-VAMs. In this paper, we introduce a large-scale dataset containing
eye-tracking data collected over 61 stereoscopic 3D videos (and their 2D
versions) from 24 subjects in a free-viewing test. We
evaluate the performance of the existing saliency detection methods over the
proposed dataset. In addition, we created an online benchmark for validating
the performance of existing 2D and 3D visual attention models and to
facilitate the addition of new VAMs. Our benchmark currently
contains 50 different VAMs.
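Benchmarks of this kind typically score a predicted saliency map against recorded fixations; as one representative metric (an assumption here, since the abstract does not list the benchmark's metrics), the sketch below computes Normalized Scanpath Saliency (NSS).

```python
import numpy as np

def nss(saliency_map: np.ndarray, fixation_map: np.ndarray) -> float:
    """Normalized Scanpath Saliency, a common eye-tracking evaluation metric.

    saliency_map : HxW predicted map from a 2D or 3D visual attention model.
    fixation_map : HxW binary map of recorded fixation locations.
    """
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(s[fixation_map > 0].mean())   # mean normalized value at fixations
```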
Interpretable Learning for Self-Driving Cars by Visualizing Causal Attention
Deep neural perception and control networks are likely to be a key component
of self-driving vehicles. These models need to be explainable - they should
provide easy-to-interpret rationales for their behavior - so that passengers,
insurance companies, law enforcement, developers, etc., can understand what
triggered a particular behavior. Here we explore the use of visual
explanations. These explanations take the form of real-time highlighted regions
of an image that causally influence the network's output (steering control).
Our approach is two-stage. In the first stage, we use a visual attention model
to train a convolutional network end-to-end from images to steering angle. The
attention model highlights image regions that potentially influence the
network's output. Some of these are true influences, but some are spurious. We
then apply a causal filtering step to determine which input regions actually
influence the output. This produces more succinct visual explanations and more
accurately exposes the network's behavior. We demonstrate the effectiveness of
our model on three datasets totaling 16 hours of driving. We first show that
training with attention does not degrade the performance of the end-to-end
network. Then we show that the network causally cues on a variety of features
that are used by humans while driving.
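The causal filtering stage can be sketched generically: each attention-highlighted region is masked out, and only regions whose removal measurably changes the predicted steering are kept. The model interface, masking by zeroing, and the threshold are hypothetical choices for illustration, not the paper's actual procedure.

```python
import numpy as np

def causal_filter(model, image, region_masks, threshold=0.1):
    """Toy causal filtering over attended regions of a steering network.

    model        : callable mapping an HxWx3 image to a scalar steering angle
                   (hypothetical interface).
    region_masks : list of HxW boolean masks for attention-highlighted regions.
    Keeps only regions whose removal changes the prediction by more than
    `threshold`, i.e. the regions with a measurable causal effect.
    """
    baseline = model(image)
    causal_regions = []
    for mask in region_masks:
        perturbed = image.copy()
        perturbed[mask] = 0                      # blank out one candidate region
        if abs(model(perturbed) - baseline) > threshold:
            causal_regions.append(mask)          # masking it changed the output
    return causal_regions
```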