Object Discovery From a Single Unlabeled Image by Mining Frequent Itemset With Multi-scale Features
The goal of our work is to discover dominant objects in a very general
setting where only a single unlabeled image is given. This is far more
challenging than typical co-localization or weakly-supervised localization tasks.
To tackle this problem, we propose a simple but effective pattern mining-based
method, called Object Location Mining (OLM), which exploits the advantages of
data mining and feature representation of pre-trained convolutional neural
networks (CNNs). Specifically, we first convert the feature maps from a
pre-trained CNN model into a set of transactions, and then discover frequent
patterns from the transaction database through pattern mining techniques. We
observe that those discovered patterns, i.e., co-occurrence highlighted
regions, typically hold appearance and spatial consistency. Motivated by this
observation, we can easily discover and localize possible objects by merging
relevant meaningful patterns. Extensive experiments on a variety of benchmarks
demonstrate that OLM achieves competitive localization performance compared
with the state-of-the-art methods. We also compare our approach against
unsupervised saliency detection methods and achieve competitive results on
seven benchmark datasets. Moreover, we conduct experiments on fine-grained
classification to show that our proposed method can locate the entire object
and its parts accurately, which significantly benefits the classification
results.
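As a rough illustration of the transaction-mining pipeline described above, the numpy sketch below turns each spatial position of a CNN activation tensor into a transaction of strongly activated channel indices, mines frequent channels, and uses their joint support as a crude localization mask. The keep ratio, support threshold, and all function names are illustrative assumptions, not the authors' OLM implementation.

```python
import numpy as np

def feature_maps_to_transactions(fmap, keep_ratio=0.2):
    """One transaction per spatial position: the channel indices whose
    activation is in the top `keep_ratio` fraction at that position."""
    C, H, W = fmap.shape
    k = max(1, int(C * keep_ratio))
    flat = fmap.reshape(C, -1).T             # (H*W, C)
    topk = np.argsort(flat, axis=1)[:, -k:]  # top-k channels per position
    return [frozenset(row) for row in topk]

def frequent_items(transactions, min_support=0.15):
    """Toy 1-itemset mining: channels firing in >= min_support of positions."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    n = len(transactions)
    return [i for i, c in counts.items() if c / n >= min_support]

def localization_mask(fmap, frequent):
    """Positions jointly supported by frequent channels form the object mask."""
    support = fmap[frequent].sum(axis=0)
    return support > support.mean()

fmap = np.random.rand(64, 14, 14)  # stand-in for a pre-trained CNN's conv output
mask = localization_mask(fmap, frequent_items(feature_maps_to_transactions(fmap)))
print(mask.shape, int(mask.sum()))
```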
Top-Down Saliency Detection Driven by Visual Classification
This paper presents an approach for top-down saliency detection guided by
visual classification tasks. We first learn how to compute visual saliency when
a specific visual task has to be accomplished, as opposed to most
state-of-the-art methods which assess saliency merely through bottom-up
principles. Afterwards, we investigate if and to what extent visual saliency
can support visual classification in nontrivial cases. To achieve this, we
propose SalClassNet, a CNN framework consisting of two networks jointly
trained: a) the first one computing top-down saliency maps from input images,
and b) the second one exploiting the computed saliency maps for visual
classification. To test our approach, we collected a dataset of eye-gaze maps,
using a Tobii T60 eye tracker, by asking several subjects to look at images
from the Stanford Dogs dataset, with the objective of distinguishing dog
breeds. Performance analysis on our dataset and other saliency benchmarking
datasets, such as POET, showed that SalClassNet outperforms state-of-the-art
saliency detectors, such as SalNet and SALICON. Finally, we analyzed the
performance of SalClassNet in a fine-grained recognition task and found out
that it generalizes better than existing visual classifiers. The achieved
results, thus, demonstrate that 1) conditioning saliency detectors with object
classes reaches state-of-the-art performance, and 2) explicitly providing
top-down saliency maps to visual classifiers enhances classification accuracy.
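A hedged PyTorch sketch of the two-network coupling this abstract describes: a saliency branch produces a map, a classifier consumes the image concatenated with that map, and both are trained jointly from a gaze-regression loss plus a classification loss. The tiny layer stacks, loss weighting, and the 120-class head are placeholders, not the published SalClassNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SalClassNetSketch(nn.Module):
    def __init__(self, num_classes=120):  # e.g. 120 Stanford Dogs breeds
        super().__init__()
        self.saliency = nn.Sequential(     # image -> 1-channel saliency map
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
        self.classifier = nn.Sequential(   # image + map -> class logits
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def forward(self, x):
        sal = self.saliency(x)
        logits = self.classifier(torch.cat([x, sal], dim=1))
        return sal, logits

model = SalClassNetSketch()
x = torch.randn(2, 3, 224, 224)
sal, logits = model(x)
# joint objective: regress toward eye-gaze maps and classify breeds, so the
# classification gradient also shapes the saliency branch (random targets here)
loss = F.mse_loss(sal, torch.rand_like(sal)) \
     + F.cross_entropy(logits, torch.randint(0, 120, (2,)))
loss.backward()
```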
DeepSaliency: Multi-Task Deep Neural Network Model for Salient Object Detection
A key problem in salient object detection is how to effectively model the
semantic properties of salient objects in a data-driven manner. In this paper,
we propose a multi-task deep saliency model based on a fully convolutional
neural network (FCNN) with global input (whole raw images) and global output
(whole saliency maps). In principle, the proposed saliency model takes a
data-driven strategy for encoding the underlying saliency prior information,
and then sets up a multi-task learning scheme for exploring the intrinsic
correlations between saliency detection and semantic image segmentation.
Through collaborative feature learning from such two correlated tasks, the
shared fully convolutional layers produce effective features for object
perception. Moreover, the model captures semantic information on salient
objects across different levels via the fully convolutional layers, which
exploit the feature-sharing properties of salient object detection while
greatly reducing feature redundancy. Finally, we present a graph Laplacian
regularized nonlinear regression model for saliency refinement. Experimental
results demonstrate the effectiveness of our approach in comparison with the
state-of-the-art approaches.
Comment: To appear in IEEE Transactions on Image Processing (TIP), Project
Website: http://www.zhaoliming.net/research/deepsalienc
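The multi-task scheme reads naturally as one shared fully convolutional trunk feeding a saliency head and a segmentation head. The sketch below is a minimal stand-in under that reading; the layer sizes and 21-class segmentation head are assumptions, and the graph Laplacian refinement step is omitted.

```python
import torch
import torch.nn as nn

class MultiTaskFCNSketch(nn.Module):
    """Shared conv features feed both a saliency head and a segmentation
    head, so the two correlated tasks learn collaboratively."""
    def __init__(self, num_seg_classes=21):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.saliency_head = nn.Conv2d(64, 1, 1)           # whole saliency map
        self.seg_head = nn.Conv2d(64, num_seg_classes, 1)  # per-pixel classes

    def forward(self, x):
        f = self.shared(x)   # features shared by both tasks
        return torch.sigmoid(self.saliency_head(f)), self.seg_head(f)

sal, seg = MultiTaskFCNSketch()(torch.randn(1, 3, 224, 224))
print(sal.shape, seg.shape)  # global input, global output for both tasks
```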
DISC: Deep Image Saliency Computing via Progressive Representation Learning
Salient object detection increasingly receives attention as an important
component or step in several pattern recognition and image processing tasks.
Although a variety of powerful saliency models have been intensively proposed,
they usually involve heavy feature (or model) engineering based on priors (or
assumptions) about the properties of objects and backgrounds. Inspired by the
effectiveness of recently developed feature learning, we provide a novel Deep
Image Saliency Computing (DISC) framework for fine-grained image saliency
computing. In particular, we model the image saliency from both the coarse- and
fine-level observations, and utilize the deep convolutional neural network
(CNN) to learn the saliency representation in a progressive manner.
Specifically, our saliency model is built upon two stacked CNNs. The first CNN
generates a coarse-level saliency map by taking the overall image as the input,
roughly identifying saliency regions in the global context. Furthermore, we
integrate superpixel-based local context information in the first CNN to refine
the coarse-level saliency map. Guided by the coarse saliency map, the second
CNN focuses on the local context to produce fine-grained and accurate saliency
map while preserving object details. For a testing image, the two CNNs
collaboratively conduct the saliency computing in one shot. Our DISC framework
is capable of uniformly highlighting the objects-of-interest against complex
backgrounds while preserving object details well. Extensive experiments on
several standard benchmarks suggest that DISC outperforms other
state-of-the-art methods and it also generalizes well across datasets without
additional training. The executable version of DISC is available online:
http://vision.sysu.edu.cn/projects/DISC.
Comment: This manuscript is the accepted version for IEEE Transactions on
Neural Networks and Learning Systems (T-NNLS), 201
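The coarse-to-fine stacking can be sketched as two small CNNs where the second sees the image together with the first's map. Everything below (layer widths, input sizes) is an illustrative assumption, and the superpixel-based refinement of the real DISC pipeline is omitted.

```python
import torch
import torch.nn as nn

def tiny_map_net(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

class DISCSketch(nn.Module):
    """Stage 1: whole image -> coarse saliency map (global context).
    Stage 2: image + coarse map -> fine-grained map, computed in one shot."""
    def __init__(self):
        super().__init__()
        self.coarse = tiny_map_net(3)
        self.fine = tiny_map_net(4)  # extra channel carries coarse guidance

    def forward(self, x):
        c = self.coarse(x)
        return c, self.fine(torch.cat([x, c], dim=1))

coarse, fine = DISCSketch()(torch.randn(1, 3, 128, 128))
```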
Improving Landmark Recognition using Saliency detection and Feature classification
Image Landmark Recognition has been one of the most sought-after
classification challenges in the field of vision and perception. After so many
years of generic classification of buildings and monuments from images, people
are now focusing on fine-grained problems: recognizing the category of each
building or monument. We propose an ensemble network for the
classification of Indian Landmark Images. To this end, our method gives robust
classification by ensembling the predictions from a Graph-Based Visual Saliency
(GBVS) network along with supervised feature-based classification algorithms
such as kNN and Random Forest. The final architecture is an adaptive learning
of all the mentioned networks. The proposed network produces a reliable score
to eliminate false category cases. Evaluation of our model was done on a new
dataset, which involves challenges such as landmark clutter, variable scaling,
partial occlusion, etc.
Comment: Pre-print of the paper to be published in Springer, accepted in the
proceedings of the 2nd Workshop on Digital Heritage at the 11th Indian
Conference on Computer Vision, Graphics and Image Processing
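The ensemble of supervised feature-based classifiers with soft-vote fusion and score-based rejection could look roughly like the scikit-learn sketch below; the 50/50 weights, the 0.4 rejection threshold, and the random stand-in features (the GBVS branch is omitted) are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # stand-in per-image descriptors
y = rng.integers(0, 5, size=200)  # stand-in landmark categories

knn = KNeighborsClassifier(n_neighbors=5).fit(X[:150], y[:150])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:150], y[:150])

# soft-voting fusion of the two supervised branches; a score threshold
# rejects low-confidence predictions as "false category" cases
proba = 0.5 * knn.predict_proba(X[150:]) + 0.5 * rf.predict_proba(X[150:])
pred = proba.argmax(axis=1)
rejected = proba.max(axis=1) < 0.4
print(pred[:10], rejected.mean())
```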
A-Lamp: Adaptive Layout-Aware Multi-Patch Deep Convolutional Neural Network for Photo Aesthetic Assessment
Deep convolutional neural networks (CNN) have recently been shown to generate
promising results for aesthetics assessment. However, the performance of these
deep CNN methods is often compromised by the constraint that the neural network
only takes the fixed-size input. To accommodate this requirement, input images
need to be transformed via cropping, warping, or padding, which often alter
image composition, reduce image resolution, or cause image distortion. Thus
the aesthetics of the original images are impaired by the potential loss of
fine-grained details and holistic image layout, both of which are critical
for evaluating an image's aesthetics. In
this paper, we present an Adaptive Layout-Aware Multi-Patch Convolutional
Neural Network (A-Lamp CNN) architecture for photo aesthetic assessment. This
novel scheme is able to accept arbitrary sized images, and learn from both
fine-grained details and holistic image layout simultaneously. To enable
training on these hybrid inputs, we extend the method by developing a dedicated
double-subnet neural network structure, i.e. a Multi-Patch subnet and a
Layout-Aware subnet. We further construct an aggregation layer to effectively
combine the hybrid features from these two subnets. Extensive experiments on
the large-scale aesthetics assessment benchmark (AVA) demonstrate significant
performance improvement over the state-of-the-art in photo aesthetic
assessment.
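The double-subnet design reads as one branch pooling features over cropped patches (fine detail) and one encoding a layout descriptor (holistic composition), fused by an aggregation layer. The sketch below assumes a fixed patch count and a precomputed 16-d layout vector, whereas the real A-Lamp selects patches and models layout adaptively.

```python
import torch
import torch.nn as nn

class ALampSketch(nn.Module):
    def __init__(self, layout_dim=16):
        super().__init__()
        self.patch_net = nn.Sequential(          # Multi-Patch subnet
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.layout_net = nn.Sequential(         # Layout-Aware subnet
            nn.Linear(layout_dim, 16), nn.ReLU())
        self.aggregate = nn.Linear(16 + 16, 1)   # aggregation layer -> score

    def forward(self, patches, layout):          # patches: B x N x 3 x H x W
        b, n = patches.shape[:2]
        pf = self.patch_net(patches.flatten(0, 1)).view(b, n, -1).mean(1)
        return self.aggregate(torch.cat([pf, self.layout_net(layout)], dim=1))

score = ALampSketch()(torch.randn(2, 5, 3, 64, 64), torch.randn(2, 16))
```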
Cross-Modal Attentional Context Learning for RGB-D Object Detection
Recognizing objects from simultaneously sensed photometric (RGB) and depth
channels is a fundamental yet practical problem in many machine vision
applications such as robot grasping and autonomous driving. In this paper, we
address this problem by developing a Cross-Modal Attentional Context (CMAC)
learning framework, which enables the full exploitation of the context
information from both RGB and depth data. Compared to existing RGB-D object
detection frameworks, our approach has several appealing properties. First, it
consists of an attention-based global context model for exploiting adaptive
contextual information and incorporating this information into a region-based
CNN (e.g., Fast RCNN) framework to achieve improved object detection
performance. Second, our CMAC framework further contains a fine-grained object
part attention module to harness multiple discriminative object parts inside
each possible object region for superior local feature representation. While
greatly improving the accuracy of RGB-D object detection, the effective
cross-modal information fusion as well as attentional context modeling in our
proposed model provide an interpretable visualization scheme. Experimental
results demonstrate that the proposed method significantly improves upon the
state of the art on all public benchmarks.
Comment: Accepted as a regular paper to IEEE Transactions on Image Processing
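One plausible reading of the attention-based global context model is a joint RGB-D attention map pooled into a context vector that is appended to every region (RoI) feature before the detection head. The channel sizes, the 21-way head, and the omission of the fine-grained part-attention module are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CrossModalContextSketch(nn.Module):
    def __init__(self, ch=64, num_classes=21):
        super().__init__()
        self.attn = nn.Conv2d(2 * ch, 1, 1)          # joint RGB-D attention
        self.head = nn.Linear(ch + 2 * ch, num_classes)

    def forward(self, rgb_map, depth_map, roi_feats):
        joint = torch.cat([rgb_map, depth_map], dim=1)           # B x 2C x H x W
        w = torch.softmax(self.attn(joint).flatten(2), dim=-1)   # B x 1 x HW
        ctx = (joint.flatten(2) * w).sum(-1)                     # global context
        ctx = ctx.unsqueeze(1).expand(-1, roi_feats.size(1), -1)
        # every region feature is scored together with the shared context
        return self.head(torch.cat([roi_feats, ctx], dim=-1))

logits = CrossModalContextSketch()(torch.randn(2, 64, 14, 14),
                                   torch.randn(2, 64, 14, 14),
                                   torch.randn(2, 8, 64))        # 8 RoIs
```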
Saliency-Guided Attention Network for Image-Sentence Matching
This paper studies the task of matching image and sentence, where learning
appropriate representations across the multi-modal data appears to be the main
challenge. Unlike previous approaches that predominantly deploy symmetrical
architecture to represent both modalities, we propose Saliency-guided Attention
Network (SAN) that asymmetrically employs visual and textual attention modules
to learn the fine-grained correlation intertwined between vision and language.
The proposed SAN mainly includes three components: saliency detector,
Saliency-weighted Visual Attention (SVA) module, and Saliency-guided Textual
Attention (STA) module. Concretely, the saliency detector provides the visual
saliency information as the guidance for the two attention modules. SVA is
designed to leverage the advantage of the saliency information to improve
discrimination of visual representations. By fusing the visual information from
SVA and textual information as a multi-modal guidance, STA learns
discriminative textual representations that are highly sensitive to visual
clues. Extensive experiments demonstrate that SAN substantially improves the
state-of-the-art results on the benchmark Flickr30K and MSCOCO datasets.
Comment: 10 pages, 5 figures
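The SVA module can be approximated as saliency-weighted pooling of region features into a visual vector that is matched against a sentence embedding by cosine similarity. The dimensions, the region count, and the omission of the STA branch are assumptions here, not the published SAN design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyWeightedAttention(nn.Module):
    """SVA-style pooling: detector saliency scores re-weight region features."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions, saliency):    # B x R x D, B x R
        w = torch.softmax(saliency, dim=-1).unsqueeze(-1)
        v = (regions * w).sum(1)              # saliency-weighted visual vector
        return F.normalize(self.proj(v), dim=-1)

v = SaliencyWeightedAttention()(torch.randn(2, 36, 256), torch.randn(2, 36))
t = F.normalize(torch.randn(2, 256), dim=-1)  # stand-in sentence embedding
similarity = (v * t).sum(-1)                  # score used for matching
```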
Hierarchical Cellular Automata for Visual Saliency
Saliency detection, finding the most important parts of an image, has become
increasingly popular in computer vision. In this paper, we introduce
Hierarchical Cellular Automata (HCA) -- a temporally evolving model to
intelligently detect salient objects. HCA consists of two main components:
Single-layer Cellular Automata (SCA) and Cuboid Cellular Automata (CCA). As an
unsupervised propagation mechanism, Single-layer Cellular Automata can exploit
the intrinsic relevance of similar regions through interactions with neighbors.
Low-level image features as well as high-level semantic information extracted
from deep neural networks are incorporated into the SCA to measure the
correlation between different image patches. With these hierarchical deep
features, an impact factor matrix and a coherence matrix are constructed to
balance the influences on each cell's next state. The saliency values of all
cells are iteratively updated according to a well-defined update rule.
Furthermore, we propose CCA to integrate multiple saliency maps generated by
SCA at different scales in a Bayesian framework. Therefore, single-layer
propagation and multi-layer integration are jointly modeled in our unified HCA.
Surprisingly, we find that SCA improves all existing methods we applied it
to, bringing them to a similar precision level regardless of their original
results. The CCA can act as an efficient pixel-wise aggregation
algorithm that can integrate state-of-the-art methods, resulting in even better
results. Extensive experiments on four challenging datasets demonstrate that
the proposed algorithm outperforms state-of-the-art conventional methods and is
competitive with deep learning-based approaches.
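The SCA update rule can be illustrated as each cell pulling its saliency toward a feature-similarity-weighted average of its neighbors, with a coherence weight balancing the cell's own state. The 4-neighborhood, the exponential similarity, and lambda below are illustrative choices, not the paper's exact impact-factor and coherence matrices.

```python
import numpy as np

def sca_update(s, feats, iters=10, lam=0.8):
    """Each cell's next saliency = lam * own state +
    (1 - lam) * similarity-weighted average of its 4-neighbors."""
    H, W = s.shape
    for _ in range(iters):
        nxt = s.copy()
        for y in range(H):
            for x in range(W):
                nbrs = [(y + dy, x + dx)
                        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= y + dy < H and 0 <= x + dx < W]
                # impact factor: feature similarity to each neighbor
                w = np.array([np.exp(-np.linalg.norm(feats[y, x] - feats[ny, nx]))
                              for ny, nx in nbrs])
                w /= w.sum()
                influence = sum(wi * s[ny, nx] for wi, (ny, nx) in zip(w, nbrs))
                nxt[y, x] = lam * s[y, x] + (1 - lam) * influence
        s = nxt
    return s

s0 = np.random.rand(8, 8)         # initial saliency of each cell
feats = np.random.rand(8, 8, 16)  # stand-in deep feature per cell
print(sca_update(s0, feats).round(2))
```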
Explainable and Advisable Learning for Self-driving Vehicles
Deep neural perception and control networks are likely to be a key component of self-driving vehicles. These models need to be explainable: they should provide easy-to-interpret rationales for their behavior, so that passengers, insurance companies, law enforcement, developers, etc., can understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rationalizations. Our work has focused on the challenge of generating introspective explanations of deep models for self-driving vehicles.

In Chapter 3, we begin by exploring the use of visual explanations. These explanations take the form of real-time highlighted regions of an image that causally influence the network's output (steering control). In the first stage, we use a visual attention model to train a convolutional network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network's output. Some of these are true influences, but some are spurious. We then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network's behavior.

In Chapter 4, we add an attention-based video-to-text model to produce textual explanations of model actions, e.g. "the car slows down because the road is wet". The attention maps of the controller and the explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment: strong and weak alignment.

These explainable systems represent an externalization of tacit knowledge. The network's opaque reasoning is simplified to a situation-specific dependence on a visible object in the image. This makes them brittle and potentially unsafe in situations that do not match the training data. In Chapter 5, we propose to address this issue by augmenting the training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice-giving, where we train an end-to-end vehicle controller that accepts advice. The controller adapts the way it attends to the scene (visual attention) and the control outputs (steering and speed).

Further, in Chapter 6, we propose a new approach that learns vehicle control with the help of long-term (global) human advice. Specifically, our system learns to summarize its visual observations in natural language, predict an appropriate action response (e.g. "I see a pedestrian crossing, so I stop"), and predict the controls accordingly.
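The introspective-explanation pipeline of Chapter 3 can be caricatured as a controller whose spatial attention map both pools features for the steering prediction and serves as the visual explanation. The layer sizes below are invented, and the causal filtering step is left out of this sketch.

```python
import torch
import torch.nn as nn

class AttentiveControllerSketch(nn.Module):
    """Image -> attention map -> steering; the attention map doubles as a
    real-time visual explanation of what influenced the control output."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.attn = nn.Conv2d(64, 1, 1)
        self.control = nn.Linear(64, 1)   # steering angle

    def forward(self, img):
        f = self.encoder(img)                            # B x 64 x h x w
        a = torch.softmax(self.attn(f).flatten(2), -1)   # B x 1 x (h*w)
        ctx = (f.flatten(2) * a).sum(-1)                 # attention-pooled features
        return self.control(ctx), a.view(f.size(0), *f.shape[2:])

steering, attn_map = AttentiveControllerSketch()(torch.randn(1, 3, 128, 128))
```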