Saliency Supervision: An Intuitive and Effective Approach for Pain Intensity Regression
Estimating pain intensity from face images is an important problem in autonomous
nursing systems. However, due to the limitations of the available data sources and the
subjectivity of pain intensity labels, it is hard to apply modern deep neural
networks to this problem without domain-specific auxiliary design. Inspired by
human vision priors, we propose a novel approach called saliency supervision,
where we directly regularize deep networks to focus on the facial areas that are
discriminative for pain regression. Through alternating training between
saliency supervision and a global loss, our method learns sparse and robust
features, which proves helpful for pain intensity regression. We verified
saliency supervision with a face-verification network backbone on a widely used
dataset and achieved state-of-the-art performance without bells and whistles. Our
saliency supervision is intuitive in spirit, yet effective in performance. We
believe such saliency supervision is essential for dealing with ill-posed
datasets, and has potential in a wide range of vision tasks.
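The abstract does not spell out the form of the saliency term, but the idea of regularizing where a regression network looks can be sketched in a few lines of PyTorch. Everything below is illustrative: `PainRegressor`, the input-gradient saliency, the binary `face_masks`, and the even/odd alternation schedule are assumptions, not the paper's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical backbone: any CNN mapping a face image to a scalar pain score.
class PainRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def saliency_loss(model, images, face_masks):
    """Penalize input-gradient saliency mass that falls outside the face mask."""
    images = images.clone().requires_grad_(True)
    scores = model(images)
    grads, = torch.autograd.grad(scores.sum(), images, create_graph=True)
    saliency = grads.abs().sum(dim=1, keepdim=True)               # (B,1,H,W)
    saliency = saliency / (saliency.sum(dim=(2, 3), keepdim=True) + 1e-8)
    return (saliency * (1.0 - face_masks)).sum(dim=(2, 3)).mean()

model = PainRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, masks, labels, step):
    opt.zero_grad()
    if step % 2 == 0:   # alternate between the global regression loss ...
        loss = F.mse_loss(model(images).squeeze(1), labels)
    else:               # ... and the saliency-supervision loss
        loss = saliency_loss(model, images, masks)
    loss.backward()
    opt.step()
```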
Half-CNN: A General Framework for Whole-Image Regression
The Convolutional Neural Network (CNN) has achieved great success in image
classification. The classification model can also be utilized at the image or patch
level for many other applications, such as object detection and segmentation.
In this paper, we propose a whole-image CNN regression model, built by removing the
fully connected layer and training the network with continuous feature maps.
This is a generic regression framework that fits many applications. We
demonstrate the method on two tasks: simultaneous face detection &
segmentation, and scene saliency prediction. The results are comparable with
other models in the respective fields, using only a small-scale network. Since
the regression model is trained on corresponding image / feature map pairs,
there is no requirement of a uniform input size, unlike the
classification model. Our framework avoids classifier design, a process that
may introduce too much manual intervention in model development. Yet it is
highly correlated with the classification network and offers some in-depth
insight into CNN structures.
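As a rough illustration of the framework, a minimal fully convolutional regressor can be trained directly against continuous target maps. The layer sizes and the MSE loss below are placeholder choices, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# No fully connected layer: any input size yields a matching output map.
class WholeImageRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),   # 1x1 conv in place of a classifier head
        )

    def forward(self, x):
        return self.net(x)

model = WholeImageRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training pair: an image and a continuous target map of the same spatial
# size (e.g. a ground-truth saliency map), regressed pixel-wise.
image = torch.rand(4, 3, 96, 128)       # arbitrary size, no resizing needed
target_map = torch.rand(4, 1, 96, 128)

loss = F.mse_loss(model(image), target_map)
loss.backward()
opt.step()
```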
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent
years, thanks to advances in deep learning and large-scale annotated data.
Despite enormous effort and huge breakthroughs, however, models still fall
short of reaching human-level accuracy. In this work, I explore the landscape
of the field with an emphasis on new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared
over two image benchmarks and two large-scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparison, and what the
emerging applications of saliency models are.
FaceSpoof Buster: a Presentation Attack Detector Based on Intrinsic Image Properties and Deep Learning
Nowadays, the adoption of face recognition for biometric authentication
systems is common, mainly because this is one of the most accessible biometric
modalities. Techniques that attempt to bypass such systems by using
a forged biometric sample, such as a printed paper or a recorded video of a
genuine access, are known as presentation attacks, but may also be referred to in
the literature as face spoofing. Presentation attack detection (PAD) is a crucial
step for preventing this kind of unauthorized access to restricted areas
and/or devices. In this paper, we propose a novel approach that relies on a
combination of intrinsic image properties and deep neural networks to
detect presentation attack attempts. Our method explores depth, salience, and
illumination maps, associated with a pre-trained Convolutional Neural Network,
in order to produce robust and discriminant features. Each of these
properties is individually classified and, at the end of the process, they are
combined by a meta-learning classifier, which achieves outstanding results on
the most popular datasets for PAD. Results show that the proposed method is able to
surpass state-of-the-art results in an inter-dataset protocol, which is
defined as the most challenging in the literature.
Comment: 7 pages, 1 figure, 7 tables
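The meta-learning combination step can be pictured as simple stacking: each property-specific classifier emits a score, and a second-level classifier is trained on the stacked scores. This is a sketch only; the logistic-regression combiner and the random scores below stand in for the paper's actual classifiers and data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder outputs of three property-specific classifiers (depth,
# salience, illumination), one attack probability per image.
rng = np.random.default_rng(0)
depth_scores = rng.random((200, 1))
salience_scores = rng.random((200, 1))
illum_scores = rng.random((200, 1))
labels = rng.integers(0, 2, size=200)   # 0 = genuine, 1 = attack

# Stacking: train a second-level (meta) classifier on the per-property scores.
meta_features = np.hstack([depth_scores, salience_scores, illum_scores])
meta_clf = LogisticRegression().fit(meta_features, labels)
print(meta_clf.predict_proba(meta_features[:5]))
```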
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural
network (OM-CNN) to learn spatio-temporal features for predicting
intra-frame saliency by exploring information on both objectness and
object motion. We further find from our database that there exists a temporal
correlation of human attention with a smooth saliency transition across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
Comment: Jiang, Lai; Xu, Mai; Liu, Tie; Qiao, Minglang; Wang, Zulin. DeepVS: A
Deep Learning Based Video Saliency Prediction Approach. The European Conference
on Computer Vision (ECCV), September 2018.
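A two-layer convolutional LSTM of the kind described can be sketched as follows. The cell below is a generic ConvLSTM, and the channel sizes and the random per-frame tensors standing in for OM-CNN features are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: all four gates from a single conv."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c

feat_ch, hid_ch, H, W, T = 8, 16, 28, 28, 5
layer1 = ConvLSTMCell(feat_ch, hid_ch)
layer2 = ConvLSTMCell(hid_ch, hid_ch)
to_saliency = nn.Conv2d(hid_ch, 1, 1)

h1 = c1 = h2 = c2 = torch.zeros(1, hid_ch, H, W)
for t in range(T):
    frame_feat = torch.rand(1, feat_ch, H, W)   # stand-in OM-CNN features
    h1, c1 = layer1(frame_feat, (h1, c1))       # first ConvLSTM layer
    h2, c2 = layer2(h1, (h2, c2))               # second ConvLSTM layer
    saliency_t = to_saliency(h2).sigmoid()      # inter-frame saliency map
```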
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers
presented at CVPR2015, the premier annual computer vision event held in
June 2015, in order to grasp the trends in the field. Further, we propose
"DeepSurvey" as a mechanism embodying the entire process, from reading
all the papers, through the generation of ideas, to the writing of a paper.
Comment: Survey Paper
Semantic and Contrast-Aware Saliency
In this paper, we propose an integrated model of semantic-aware and
contrast-aware saliency combining both bottom-up and top-down cues for
effective saliency estimation and eye fixation prediction. The proposed model
processes visual information using two pathways. The first pathway aims to
capture the attractive semantic information in images, especially for the
presence of meaningful objects and object parts such as human faces. The second
pathway is based on multi-scale on-line feature learning and information
maximization, which learns an adaptive sparse representation for the input and
discovers the high contrast salient patterns within the image context. The two
pathways characterize both long-term and short-term attention cues and are
integrated dynamically using maxima normalization. We investigate two different
implementations of the semantic pathway, including an end-to-end deep neural
network solution and a dynamic feature integration solution, resulting in the
SCA and SCAFI models, respectively. Experimental results on artificial images and
5 popular benchmark datasets demonstrate the superior performance and better
plausibility of the proposed model over both classic approaches and recent deep
models.
Comment: arXiv admin note: text overlap with arXiv:1710.04071 by other authors
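The dynamic integration step can be illustrated with an Itti-style maxima normalization, which boosts maps containing one dominant peak and suppresses maps with many comparable peaks. Whether the paper uses exactly this operator is an assumption, and the random maps are placeholders for the two pathway outputs.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def maxima_normalize(s_map, local_size=16):
    """Itti-style N(.): scale a map by (global max - mean local max)^2."""
    s = (s_map - s_map.min()) / (s_map.max() - s_map.min() + 1e-8)
    local_max = maximum_filter(s, size=local_size)
    peaks = s[(s == local_max) & (s > 0)]       # values at local maxima
    others = peaks[peaks < s.max()]             # exclude the global maximum
    m_bar = others.mean() if others.size else 0.0
    return s * (s.max() - m_bar) ** 2

semantic_map = np.random.rand(64, 64)   # stand-in for the semantic pathway
contrast_map = np.random.rand(64, 64)   # stand-in for the contrast pathway
combined = maxima_normalize(semantic_map) + maxima_normalize(contrast_map)
```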
Richer and Deeper Supervision Network for Salient Object Detection
Recent Salient Object Detection (SOD) systems are mostly based on
Convolutional Neural Networks (CNNs). In particular, the Deeply Supervised Saliency
(DSS) system has shown that adding short connections to the network and
supervising the side outputs is very useful. In this work, we propose a new SOD
system that aims at designing a more efficient and effective way to pass back
global information. Richer and Deeper Supervision (RDS) is applied to better
combine features from each side output without demanding much extra
computational space. Meanwhile, the backbone network used for SOD is normally
pre-trained on an object classification dataset, ImageNet. However, the
pre-trained model has been trained on cropped images in order to focus only on
distinguishing features within the object region, whereas the ignored
background information is also significant for the task of SOD. We try to solve
this problem by introducing training data designed for object detection:
coarse global information is learned from an entire image together with its
bounding boxes before training on the SOD dataset. The large scale of the object
images slightly improves the performance of SOD. Our experiments show that the
proposed RDS network achieves state-of-the-art results on five public SOD datasets.
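The side-output supervision that RDS builds on can be sketched as deep supervision: every side output is upsampled to input resolution and penalized against the ground-truth mask. The toy two-stage network below illustrates the principle only; the real RDS connectivity is richer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputNet(nn.Module):
    """Toy encoder with a supervised side output at each stage."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.MaxPool2d(2),
                                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.side1 = nn.Conv2d(16, 1, 1)
        self.side2 = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        s1 = self.side1(f1)
        # Upsample the deeper side output back to input resolution.
        s2 = F.interpolate(self.side2(f2), size=x.shape[2:],
                           mode='bilinear', align_corners=False)
        return [s1, s2]

model = SideOutputNet()
image = torch.rand(2, 3, 64, 64)
mask = torch.randint(0, 2, (2, 1, 64, 64)).float()
# Supervise every side output, not only the final prediction.
loss = sum(F.binary_cross_entropy_with_logits(s, mask) for s in model(image))
loss.backward()
```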
Multi-source weak supervision for saliency detection
The high cost of pixel-level annotations makes it appealing to train saliency
detection models with weak supervision. However, a single weak supervision
source usually does not contain enough information to train a well-performing
model. To this end, we propose a unified framework to train saliency detection
models with diverse weak supervision sources. In this paper, we use category
labels, captions, and unlabelled data for training, yet other supervision
sources can also be plugged into this flexible framework. We design a
classification network (CNet) and a caption generation network (PNet), which
learn to predict object categories and generate captions, respectively,
while highlighting the most important regions for the corresponding tasks. An
attention transfer loss is designed to transmit supervision signals between the
networks, such that a network trained with one supervision
source can benefit from another. An attention coherence loss is defined on
unlabelled data to encourage the networks to detect generally salient regions
instead of task-specific regions. We use CNet and PNet to generate pixel-level
pseudo labels to train a saliency prediction network (SNet). During the testing
phase, we only need SNet to predict saliency maps. Experiments demonstrate that
the performance of our method compares favourably against unsupervised and
weakly supervised methods and even some supervised methods.
Comment: CVPR 2019
Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation
Despite the tremendous achievements of deep convolutional neural networks
(CNNs) in many computer vision tasks, understanding how they actually work
remains a significant challenge. In this paper, we propose a novel two-step
understanding method, namely Salient Relevance (SR) map, which aims to shed
light on how deep CNNs recognize images and learn features from areas, referred
to as attention areas, therein. Our proposed method starts out with a
layer-wise relevance propagation (LRP) step which estimates a pixel-wise
relevance map over the input image. Next, we construct a context-aware
saliency map, the SR map, from the LRP-generated map, which highlights areas close to
the foci of attention instead of the isolated pixels that LRP reveals. In the human
visual system, region-level information is more important than pixel-level
information for recognition. Consequently, our proposed approach closely simulates human
recognition. Experimental results using the ILSVRC2012 validation dataset in
conjunction with two well-established deep CNN models, AlexNet and VGG-16,
clearly demonstrate that our proposed approach concisely identifies not only
key pixels but also attention areas that contribute to the underlying neural
network's comprehension of the given images. As such, our proposed SR map
constitutes a convenient visual interface which unveils the visual attention of
the network and reveals which type of objects the model has learned to
recognize after training. The source code is available at
https://github.com/Hey1Li/Salient-Relevance-Propagation.
Comment: 35 pages, 15 figures
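The two-step pipeline can be approximated in a few lines. Note that the gradient-times-input relevance below is only a lightweight proxy for true LRP, and the local average pooling stands in for the paper's context-aware saliency step.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Step 1 (proxy): a pixel-wise relevance map. The paper uses layer-wise
# relevance propagation; gradient * input is substituted here for brevity.
# weights=None loads an untrained AlexNet; in practice use trained weights.
model = models.alexnet(weights=None).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)
model(image).max().backward()
relevance = (image.grad.detach() * image.detach()).abs().sum(1, keepdim=True)

# Step 2 (proxy): aggregate pixel relevance into region-level attention
# areas so that contiguous regions stand out rather than isolated pixels.
sr_map = F.avg_pool2d(relevance, kernel_size=15, stride=1, padding=7)
sr_map = (sr_map - sr_map.min()) / (sr_map.max() - sr_map.min() + 1e-8)
```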