841 research outputs found
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent
years, thanks to advances in deep learning and large-scale annotated data.
Despite enormous effort and huge breakthroughs, however, models still fall
short of human-level accuracy. In this work, I explore the landscape
of the field, with an emphasis on new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared
over two image benchmarks and two large-scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparison, and what the
emerging applications of saliency models are.
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our
models is the architecture of two-stream networks where we investigate
different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
Object-based visual attention for computer vision
In this paper, a novel model of object-based visual attention extending Duncan's Integrated Competition Hypothesis [Phil. Trans. R. Soc. London B 353 (1998) 1307–1317] is presented. In contrast to the attention mechanisms used in most previous machine vision systems, which drive attention based on the spatial location hypothesis, the mechanisms which direct visual attention in our system are object-driven as well as feature-driven. The competition to gain visual attention occurs not only within an object but also between objects. For this purpose, two new mechanisms in the proposed model are described and analyzed in detail. The first mechanism computes the visual salience of objects and groupings; the second one implements the hierarchical selectivity of attentional shifts. The results of the new approach on synthetic and natural images are reported.
Semantic segmentation priors for object discovery
Reliable object discovery in realistic indoor scenes is a necessity for many computer vision and service robot applications. In these scenes, semantic segmentation methods have made huge advances in recent years. Such methods can provide useful prior information for object discovery by removing false positives and by delineating object boundaries. We propose a novel method that combines bottom-up object discovery and semantic priors for producing generic object candidates in RGB-D images. We use a deep learning method for semantic segmentation to classify colour and depth superpixels into meaningful categories. Separately for each category, we use saliency to estimate the location and scale of objects, and superpixels to find their precise boundaries. Finally, object candidates of all categories are combined and ranked. We evaluate our approach on the NYU Depth V2 dataset and show that we outperform other state-of-the-art object discovery methods in terms of recall.
Peer Reviewed. Postprint (author's final draft)
Segmentation of Skin Lesions and their Attributes Using Multi-Scale Convolutional Neural Networks and Domain Specific Augmentations
Computer-aided diagnosis systems for classification of different types of skin
lesions have been an active field of research in recent decades. It has been
shown that introducing masks of lesions and their attributes into the lesion
classification pipeline can greatly improve performance. In this paper, we
propose a framework that incorporates transfer learning to segment lesions
and their attributes using convolutional neural networks. The proposed
framework is based on the encoder-decoder architecture, which utilizes a variety
of pre-trained networks in the encoding path and generates the prediction map
by combining multi-scale information in the decoding path in a pyramid pooling
manner. To address the lack of training data and improve the proposed model's
generalization, an extensive set of novel domain-specific augmentation routines
has been applied to simulate the real variations in dermoscopy images.
Finally, by performing broad experiments on three different data sets obtained
from the International Skin Imaging Collaboration archive (ISIC2016, ISIC2017, and
ISIC2018 challenge data sets), we show that the proposed method outperforms
other state-of-the-art approaches on the ISIC2016 and ISIC2017 segmentation tasks
and achieves first rank on the leaderboard of the ISIC2018 attribute detection
task.
Comment: 18 page
Explainable and Advisable Learning for Self-driving Vehicles
Deep neural perception and control networks are likely to be a key component of self-driving vehicles. These models need to be explainable - they should provide easy-to-interpret rationales for their behavior - so that passengers, insurance companies, law enforcement, developers, etc., can understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rationalizations. Our work has focused on the challenge of generating introspective explanations of deep models for self-driving vehicles. In Chapter 3, we begin by exploring the use of visual explanations. These explanations take the form of real-time highlighted regions of an image that causally influence the network's output (steering control). In the first stage, we use a visual attention model to train a convolution network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network's output. Some of these are true influences, but some are spurious. We then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network's behavior. In Chapter 4, we add an attention-based video-to-text model to produce textual explanations of model actions, e.g. "the car slows down because the road is wet". The attention maps of controller and explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment, strong- and weak-alignment. These explainable systems represent an externalization of tacit knowledge. The network's opaque reasoning is simplified to a situation-specific dependence on a visible object in the image. This makes them brittle and potentially unsafe in situations that do not match training data. 
In Chapter 5, we propose to address this issue by augmenting training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice-giving, where we train an end-to-end vehicle controller that accepts advice. The controller adapts the way it attends to the scene (visual attention) and the control (steering and speed). Further, in Chapter 6, we propose a new approach that learns vehicle control with the help of long-term (global) human advice. Specifically, our system learns to summarize its visual observations in natural language, predict an appropriate action response (e.g. "I see a pedestrian crossing, so I stop"), and predict the controls accordingly.
SBNet: Sparse Blocks Network for Fast Inference
Conventional deep convolutional neural networks (CNNs) apply convolution
operators uniformly in space across all feature maps for hundreds of layers -
this incurs a high computational cost for real-time applications. For many
problems such as object detection and semantic segmentation, we are able to
obtain a low-cost computation mask, either from a priori problem knowledge, or
from a low-resolution segmentation network. We show that such computation masks
can be used to reduce computation in the high-resolution main network. Variants
of sparse activation CNNs have previously been explored on small-scale tasks
and showed no degradation in terms of object classification accuracy, but often
measured gains in terms of theoretical FLOPs without realizing a practical
speed-up when compared to highly optimized dense convolution implementations.
In this work, we leverage the sparsity structure of computation masks and
propose a novel tiling-based sparse convolution algorithm. We verified the
effectiveness of our sparse CNN on LiDAR-based 3D object detection, and we
report significant wall-clock speed-ups compared to dense convolution without
noticeable loss of accuracy.
Comment: 10 pages, CVPR 201
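The idea of restricting convolution to the blocks selected by a computation mask can be illustrated with a small NumPy sketch. This is not the SBNet implementation: the function names (`active_blocks`, `conv3x3_dense`, `conv3x3_sparse_blocks`), the single-channel 3x3 kernel, the block size, and the "block is active if any mask value exceeds a threshold" rule are all assumptions made for illustration. The sketch gathers each active tile with a one-pixel halo, convolves only that tile, and scatters the result back.

```python
import numpy as np

def active_blocks(mask, block, threshold=0.0):
    """Top-left corners of blocks whose maximum mask value exceeds threshold (assumed rule)."""
    h, w = mask.shape
    return [(r, c)
            for r in range(0, h, block)
            for c in range(0, w, block)
            if mask[r:r + block, c:c + block].max() > threshold]

def conv3x3_dense(x, k):
    """Reference: zero-padded 'same' 3x3 cross-correlation over the whole map."""
    h, w = x.shape
    p = np.pad(x, 1)
    out = np.zeros((h, w), dtype=float)
    for dr in range(3):
        for dc in range(3):
            out += k[dr, dc] * p[dr:dr + h, dc:dc + w]
    return out

def conv3x3_sparse_blocks(x, k, mask, block):
    """Compute the same 3x3 operation only inside active blocks; elsewhere stays zero."""
    h, w = x.shape
    p = np.pad(x, 1)
    out = np.zeros((h, w), dtype=float)
    for r, c in active_blocks(mask, block):
        r1, c1 = min(r + block, h), min(c + block, w)
        th, tw = r1 - r, c1 - c
        tile = p[r:r1 + 2, c:c1 + 2]          # gather: tile plus 1-pixel halo
        res = np.zeros((th, tw), dtype=float)
        for dr in range(3):
            for dc in range(3):
                res += k[dr, dc] * tile[dr:dr + th, dc:dc + tw]
        out[r:r1, c:c1] = res                  # scatter: write the tile back
    return out
```

Inside active blocks the result matches the dense computation, while inactive blocks are skipped entirely, so the work scales with the number of active tiles rather than the full map. Any practical speed-up depends on an optimized implementation, which this sketch does not attempt.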