2,005 research outputs found
A Review of Co-saliency Detection Technique: Fundamentals, Applications, and Challenges
Co-saliency detection is an emerging and rapidly growing research area in the computer vision community. As a branch of visual saliency, co-saliency
detection refers to the discovery of common and salient foregrounds from two or
more relevant images, and can be widely used in many computer vision tasks. The
existing co-saliency detection algorithms mainly consist of three components:
extracting effective features to represent the image regions, exploring the
informative cues or factors to characterize co-saliency, and designing
effective computational frameworks to formulate co-saliency. Although numerous
methods have been developed, the literature still lacks a thorough review and
evaluation of co-saliency detection techniques. In this paper, we aim at
providing a comprehensive review of the fundamentals, challenges, and
applications of co-saliency detection. Specifically, we provide an overview of
some related computer vision works, review the history of co-saliency
detection, summarize and categorize the major algorithms in this research area,
discuss some open issues in this area, present the potential applications of
co-saliency detection, and finally point out some unsolved challenges and
promising future work. We expect this review to benefit both new and senior researchers in this field, and to give researchers in other related areas insight into the utility of co-saliency detection algorithms.
Comment: 28 pages, 12 figures, 3 tables
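For a concrete sense of the three components listed above, here is a minimal, illustrative Python sketch of a co-saliency pipeline: grid-based color features, an intra-image contrast cue, and an inter-image correspondence cue combined multiplicatively. The function names, the grid-based regions, and the combination rule are simplifying assumptions for illustration, not any particular surveyed method.

```python
# Toy co-saliency sketch; images are assumed to be float RGB arrays in [0, 1].
import numpy as np

def region_features(image, grid=8):
    """Represent an image as mean-color features over a coarse grid of regions."""
    h, w, _ = image.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = image[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            feats.append(block.reshape(-1, 3).mean(axis=0))
    return np.array(feats)                       # (grid*grid, 3)

def intra_saliency(feats):
    """Contrast cue: regions far from the image's mean color are salient."""
    d = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
    return d / (d.max() + 1e-8)

def inter_correspondence(feats, other_feats_list):
    """Repeatedness cue: similarity to the closest region in every other image."""
    scores = np.ones(len(feats))
    for other in other_feats_list:
        dists = np.linalg.norm(feats[:, None] - other[None, :], axis=2)
        scores *= np.exp(-dists.min(axis=1))     # reward a close match per image
    return scores / (scores.max() + 1e-8)

def co_saliency(images):
    """Per-region co-saliency scores: salient within an image AND common across images."""
    all_feats = [region_features(im) for im in images]
    maps = []
    for k, feats in enumerate(all_feats):
        others = all_feats[:k] + all_feats[k+1:]
        maps.append(intra_saliency(feats) * inter_correspondence(feats, others))
    return maps                                   # one score vector per image
```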
Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition
We present a new computational model for gaze prediction in egocentric videos
by exploring patterns in the temporal shift of gaze fixations (attention transition) that depend on egocentric manipulation tasks. Our assumption is that the high-level context of how a task is carried out strongly influences attention transition and should be modeled for gaze
prediction in natural dynamic scenes. Specifically, we propose a hybrid model
based on deep neural networks which integrates task-dependent attention
transition with bottom-up saliency prediction. In particular, the
task-dependent attention transition is learned with a recurrent neural network
to exploit the temporal context of gaze fixations, e.g. looking at a cup after
moving gaze away from a grasped bottle. Experiments on public egocentric
activity datasets show that our model significantly outperforms
state-of-the-art gaze prediction methods and is able to learn meaningful transitions of human attention.
Comment: Accepted as an oral presentation at ECCV 2018
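As a rough illustration of the hybrid design described above, the following PyTorch sketch fuses a recurrent attention-transition branch with a bottom-up saliency map. The module name HybridGazePredictor, all layer sizes, and the additive fusion are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HybridGazePredictor(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, map_size=32):
        super().__init__()
        self.map_size = map_size
        # Attention-transition branch: an LSTM over per-frame fixation features.
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.to_map = nn.Linear(hidden, map_size * map_size)

    def forward(self, fixation_feats, bottom_up_saliency):
        # fixation_feats: (B, T, feat_dim) features around past gaze points
        # bottom_up_saliency: (B, map_size, map_size) from any saliency model
        out, _ = self.rnn(fixation_feats)
        transition = self.to_map(out[:, -1])              # next-fixation map
        transition = transition.view(-1, self.map_size, self.map_size)
        fused = transition + bottom_up_saliency           # simple additive fusion
        return torch.softmax(fused.flatten(1), dim=1).view_as(fused)

# Usage: predict a gaze map from 10 past fixation features and a saliency prior.
model = HybridGazePredictor()
gaze_map = model(torch.randn(2, 10, 256), torch.rand(2, 32, 32))
```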
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose what we call spatio-temporal saliency networks. The key to our
models is the architecture of two-stream networks where we investigate
different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
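A minimal sketch of the two-stream idea with two candidate fusion mechanisms follows; the tiny stand-in encoders, channel counts, and the fusion choices (channel concatenation followed by a 1x1 convolution versus element-wise max) are illustrative assumptions rather than the paper's exact networks.

```python
import torch
import torch.nn as nn

def stream(in_ch):
    """A tiny stand-in encoder; either stream could be any CNN backbone."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    )

class TwoStreamSaliency(nn.Module):
    def __init__(self, fusion="conv"):
        super().__init__()
        self.spatial = stream(3)    # RGB frame
        self.temporal = stream(2)   # optical flow (dx, dy)
        self.fusion = fusion
        self.head = nn.Conv2d(64 if fusion == "conv" else 32, 1, 1)

    def forward(self, rgb, flow):
        fs, ft = self.spatial(rgb), self.temporal(flow)
        if self.fusion == "conv":               # learned mixing of the streams
            fused = torch.cat([fs, ft], dim=1)
        else:                                   # parameter-free element-wise max
            fused = torch.max(fs, ft)
        return torch.sigmoid(self.head(fused))  # per-pixel saliency map

saliency = TwoStreamSaliency()(torch.rand(1, 3, 64, 64), torch.rand(1, 2, 64, 64))
```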
Review of Visual Saliency Detection with Comprehensive Information
Visual saliency detection models simulate how the human visual system perceives a scene and have been widely used in many vision tasks. With advances in acquisition technology, more comprehensive information, such as depth cues, inter-image correspondence, or temporal relationships, has become available, extending image saliency detection to RGBD saliency detection, co-saliency detection, and video saliency detection. RGBD saliency detection models focus on extracting salient regions from RGBD images by incorporating depth information. Co-saliency detection models introduce an inter-image correspondence constraint to discover the common salient objects in an image group. Video saliency detection models aim to locate motion-related salient objects in video sequences, considering motion cues and spatiotemporal constraints jointly. In this paper, we review different types of saliency
detection algorithms, summarize the important issues of the existing methods,
and discuss the existent problems and future works. Moreover, the evaluation
datasets and quantitative measurements are briefly introduced, and the
experimental analysis and discussion are conducted to provide a holistic overview of different saliency detection methods.
Comment: 18 pages, 11 figures, 7 tables, accepted by IEEE Transactions on Circuits and Systems for Video Technology 2018, https://rmcong.github.io
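Since the review covers quantitative measurements, the sketch below shows two measures that are standard in saliency evaluation: mean absolute error (MAE) and the F-measure with the conventional beta^2 = 0.3 weighting and an adaptive threshold at twice the mean saliency value. The helper names are ours; consult the review for the full set of metrics.

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute error between a saliency map and binary ground truth, both in [0, 1]."""
    return np.abs(saliency - gt).mean()

def f_measure(saliency, gt, beta2=0.3):
    """F-measure after adaptive thresholding at twice the mean saliency value."""
    thresh = min(2 * saliency.mean(), 1.0)
    pred = saliency >= thresh
    tp = np.logical_and(pred, gt > 0.5).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```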
Attentional Pooling for Action Recognition
We introduce a simple yet surprisingly powerful model to incorporate
attention in action recognition and human object interaction tasks. Our
proposed attention module can be trained with or without extra supervision, and
gives a sizable boost in accuracy while keeping the network size and
computational cost nearly the same. It leads to significant improvements over state-of-the-art base architectures on three standard action recognition benchmarks across still images and videos, and establishes a new state of the art on the MPII dataset with a 12.5% relative improvement. We also perform an extensive
analysis of our attention module both empirically and analytically. In terms of
the latter, we introduce a novel derivation of bottom-up and top-down attention
as low-rank approximations of bilinear pooling methods (typically used for
fine-grained classification). From this perspective, our attention formulation
suggests a novel characterization of action recognition as a fine-grained
recognition problem.
Comment: In NIPS 2017. Project page:
https://rohitgirdhar.github.io/AttentionalPoolingAction
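The low-rank derivation mentioned above can be checked numerically: with a rank-1 weight matrix W_k = a_k b^T, the bilinear score tr(X W_k X^T) factorizes into a class-specific (top-down) and a class-agnostic (bottom-up) attention map. The shapes and random tensors below are illustrative only.

```python
import torch

n, f, classes = 16, 64, 10           # n spatial locations, f-dim features
X = torch.randn(n, f)                # flattened conv feature map
a = torch.randn(classes, f)          # top-down, per-class weights
b = torch.randn(f)                   # bottom-up, class-agnostic weights

top_down = X @ a.T                   # (n, classes) attention per class
bottom_up = X @ b                    # (n,) saliency shared by all classes

# Rank-1 bilinear score per class: attention-weighted sum over locations.
scores = (top_down * bottom_up[:, None]).sum(dim=0)       # (classes,)

# Equivalent full bilinear form tr(X W_k X^T) with W_k = a_k b^T.
scores_check = torch.stack(
    [(X @ torch.outer(a[k], b) * X).sum() for k in range(classes)]
)
assert torch.allclose(scores, scores_check, rtol=1e-4, atol=1e-3)
```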
Saliency Detection for Stereoscopic Images Based on Depth Confidence Analysis and Multiple Cues Fusion
Stereoscopic perception is an important part of human visual system that
allows the brain to perceive depth. However, depth information has not been
well explored in existing saliency detection models. In this letter, a novel
saliency detection method for stereoscopic images is proposed. Firstly, we
propose a measure to evaluate the reliability of the depth map, and use it to reduce the influence of poor depth maps on saliency detection. Then, the input
image is represented as a graph, and the depth information is introduced into
graph construction. After that, a new definition of compactness using color and
depth cues is put forward to compute the compactness saliency map. To compensate for the detection errors of compactness saliency when the salient regions have an appearance similar to the background, a foreground saliency map is
calculated based on depth-refined foreground seeds selection mechanism and
multiple cues contrast. Finally, these two saliency maps are integrated into a
final saliency map through a weighted-sum method according to their importance.
Experiments on two publicly available stereo datasets demonstrate that the
proposed method performs better than 10 other state-of-the-art approaches.
Comment: 5 pages, 6 figures, published in IEEE Signal Processing Letters 2016,
Project URL: https://rmcong.github.io/proj_RGBD_sal.htm
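As a toy version of the final integration step, the sketch below combines two saliency maps by a weighted sum. The confidence score used for the weights (foreground/background contrast of each map) is our own stand-in; the paper's importance weighting and depth-confidence measure are more elaborate.

```python
import numpy as np

def map_confidence(s):
    """Score a saliency map by how well it separates foreground from background."""
    fg, bg = s[s >= s.mean()], s[s < s.mean()]
    return float(fg.mean() - bg.mean()) if fg.size and bg.size else 0.0

def fuse(compactness_map, foreground_map):
    """Weighted-sum integration of two saliency maps, normalized to [0, 1]."""
    w1 = map_confidence(compactness_map)
    w2 = map_confidence(foreground_map)
    fused = (w1 * compactness_map + w2 * foreground_map) / (w1 + w2 + 1e-8)
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
```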
Spontaneous Facial Micro-Expression Recognition using 3D Spatiotemporal Convolutional Neural Networks
Facial expression recognition in videos is an active area of research in
computer vision. However, fake facial expressions are difficult to recognize, even for humans. Facial micro-expressions, on the other hand, generally reveal a person's actual emotion, as they are spontaneous reactions expressed through the human face. Despite a few attempts at recognizing micro-expressions, the problem remains far from solved, as reflected in the poor accuracy of state-of-the-art methods. A few CNN-based approaches in the literature recognize facial micro-expressions from still images, whereas a spontaneous micro-expression video contains multiple frames that must be processed together to encode both spatial and temporal information. This paper
proposes two 3D-CNN methods: MicroExpSTCNN and MicroExpFuseNet, for spontaneous
facial micro-expression recognition by exploiting the spatiotemporal
information in a CNN framework. The MicroExpSTCNN considers the full spatial
information, whereas the MicroExpFuseNet is based on the 3D-CNN feature fusion
of the eyes and mouth regions. The experiments are performed over CAS(ME)^2 and
SMIC micro-expression databases. The proposed MicroExpSTCNN model outperforms
the state-of-the-art methods.
Comment: Accepted at the 2019 International Joint Conference on Neural Networks (IJCNN)
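For orientation, here is a minimal PyTorch 3D-CNN over a grayscale micro-expression clip, in the spirit of the full-face MicroExpSTCNN variant; the layer sizes, clip length, and class count are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinySTCNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),  # convolves time, H, W jointly
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global spatiotemporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip):                 # clip: (B, 1, T, H, W) grayscale frames
        x = self.features(clip).flatten(1)   # (B, 32)
        return self.classifier(x)

logits = TinySTCNN()(torch.randn(2, 1, 16, 64, 64))  # two 16-frame 64x64 clips
```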
LookinGood: Enhancing Performance Capture with Real-time Neural Re-Rendering
Motivated by augmented and virtual reality applications such as telepresence,
there has been a recent focus in real-time performance capture of humans under
motion. However, given the real-time constraint, these systems often suffer
from artifacts in geometry and texture such as holes and noise in the final
rendering, poor lighting, and low-resolution textures. We take the novel approach of augmenting such real-time performance capture systems with a deep
architecture that takes a rendering from an arbitrary viewpoint, and jointly
performs completion, super resolution, and denoising of the imagery in
real-time. We call this approach neural (re-)rendering, and our live system
"LookinGood". Our deep architecture is trained to produce high resolution and
high quality images from a coarse rendering in real-time. First, we propose a
self-supervised training method that does not require manual ground-truth
annotation. We contribute a specialized reconstruction error that uses semantic
information to focus on relevant parts of the subject, e.g. the face. We also
introduce a saliency reweighting scheme for the loss function that is able to
discard outliers. We specifically design the system for virtual and augmented
reality headsets where the consistency between the left and right eye plays a
crucial role in the final user experience. Finally, we generate temporally
stable results by explicitly minimizing the difference between two consecutive
frames. We tested the proposed system in two different scenarios: one involving
a single RGB-D sensor, and upper body reconstruction of an actor, the second
consisting of full body 360 degree capture. Through extensive experimentation,
we demonstrate how our system generalizes across unseen sequences and subjects.
The supplementary video is available at http://youtu.be/Md3tdAKoLGU.
Comment: To be presented at SIGGRAPH Asia 2018
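A hedged sketch of the loss structure described above: an L1 reconstruction term reweighted by a semantic mask (e.g. emphasizing the face) plus a temporal term that penalizes deviation between the change in consecutive predictions and the change in consecutive targets. The L1 choice and the weight lam are assumptions; the paper's actual losses are more involved.

```python
import torch

def rerender_loss(pred_t, pred_t1, target_t, target_t1, semantic_weight, lam=0.1):
    # Semantic-reweighted reconstruction: important regions count for more.
    recon = (semantic_weight * (pred_t - target_t).abs()).mean()
    # Temporal stability: consecutive predictions should change like the targets do.
    temporal = ((pred_t1 - pred_t) - (target_t1 - target_t)).abs().mean()
    return recon + lam * temporal

# Usage with dummy frames and a uniform semantic mask.
x = lambda: torch.rand(1, 3, 64, 64)
loss = rerender_loss(x(), x(), x(), x(), semantic_weight=torch.ones(1, 1, 64, 64))
```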
Salient Object Detection: A Distinctive Feature Integration Model
We propose a novel method for salient object detection in images.
Our method integrates spatial features for efficient and robust representation
to capture meaningful information about the salient objects. We then train a
conditional random field (CRF) using the integrated features. The trained CRF
model is then used to detect salient objects during the online testing stage.
We perform experiments on two standard datasets and compare the performance of
our method with different reference methods. Our experiments show that our
method outperforms the compared methods in terms of precision, recall, and
F-measure.
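The abstract trains a CRF on integrated features, which is not reproduced here; as a related, concrete illustration, the sketch below refines a saliency map with a fully connected CRF at test time using the third-party pydensecrf package, a common post-processing step in salient object detection.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, prob_fg, iters=5):
    """Refine a foreground-probability map with a dense CRF over position and color.

    image: HxWx3 uint8 RGB array; prob_fg: HxW float array in (0, 1).
    """
    h, w = prob_fg.shape
    probs = np.stack([1 - prob_fg, prob_fg])          # (2, H, W): background, foreground
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)            # spatial smoothness
    d.addPairwiseBilateral(sxy=60, srgb=10,
                           rgbim=np.ascontiguousarray(image), compat=5)
    q = np.array(d.inference(iters)).reshape(2, h, w)
    return q[1]                                        # refined foreground map
```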
Graph-Theoretic Spatiotemporal Context Modeling for Video Saliency Detection
As an important and challenging problem in computer vision, video saliency
detection is typically cast as a spatiotemporal context modeling problem over
consecutive frames. As a result, a key issue in video saliency detection is how
to effectively capture the intrinsic properties of atomic video structures as
well as their associated contextual interactions along the spatial and temporal
dimensions. Motivated by this observation, we propose a graph-theoretic video
saliency detection approach based on adaptive video structure discovery, which
is carried out within a spatiotemporal atomic graph. Through graph-based
manifold propagation, the proposed approach is capable of effectively modeling
the semantically contextual interactions among atomic video structures for
saliency detection while preserving spatial smoothness and temporal
consistency. Experiments demonstrate the effectiveness of the proposed approach
over several benchmark datasets.
Comment: ICIP 2017
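The graph-based manifold propagation named above has a standard closed form; the sketch below implements generic manifold ranking over an affinity matrix, leaving out the paper's spatiotemporal atomic graph construction.

```python
import numpy as np

def manifold_propagate(W, seeds, alpha=0.99):
    """Propagate seed saliency scores over a graph with affinity matrix W.

    Solves f = (I - alpha * S)^(-1) y with S the symmetrically normalized
    affinity matrix, so scores diffuse along strong edges (e.g. spatial
    neighbors, temporally adjacent regions) while staying anchored to seeds y.
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    n = W.shape[0]
    f = np.linalg.solve(np.eye(n) - alpha * S, seeds)
    return (f - f.min()) / (f.max() - f.min() + 1e-12)
```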