A Dilated Inception Network for Visual Saliency Prediction
Recently, with the advent of deep convolutional neural networks (DCNNs), visual
saliency prediction research has made impressive progress. One promising
direction for further improvement is to fully characterize the multi-scale
saliency-influential factors with a computationally friendly module in DCNN
architectures. In this work, we propose an end-to-end dilated inception network
(DINet) for visual saliency prediction. It captures multi-scale contextual
features effectively with very few extra parameters.
Instead of utilizing parallel standard convolutions with different kernel
sizes, as in the existing inception module, our proposed dilated inception
module (DIM) uses parallel dilated convolutions with different dilation rates,
which significantly reduces the computational load while enriching the
diversity of receptive fields in the feature maps.
receptive fields in feature maps. Moreover, the performance of our saliency
model is further improved by using a set of linear normalization-based
probability distribution distance metrics as loss functions. As such, we can
formulate saliency prediction as a probability distribution prediction task for
global saliency inference instead of a typical pixel-wise regression problem.
Experimental results on several challenging saliency benchmark datasets
demonstrate that our DINet with proposed loss functions can achieve
state-of-the-art performance with shorter inference time.Comment: Accepted by IEEE Transactions on Multimedia. The source codes are
available at https://github.com/ysyscool/DINe
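To make the DIM idea concrete, here is a minimal PyTorch sketch of a dilated-inception-style block together with a linearly normalized KL-divergence loss; the branch widths, the dilation rates (1, 4, 8, 16), and the particular distance metric are illustrative assumptions, not the exact DINet configuration.

```python
# A minimal sketch, assuming PyTorch; branch widths and dilation rates
# are illustrative, not the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedInceptionModule(nn.Module):
    """Parallel 3x3 dilated convolutions with different dilation rates,
    concatenated to enrich receptive-field diversity at low extra cost."""
    def __init__(self, in_ch, branch_ch, rates=(1, 4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3,
                      padding=r, dilation=r)  # padding=r keeps spatial size
            for r in rates
        ])
        # 1x1 convolution to fuse the concatenated branch outputs
        self.fuse = nn.Conv2d(branch_ch * len(rates), in_ch, kernel_size=1)

    def forward(self, x):
        out = torch.cat([F.relu(b(x)) for b in self.branches], dim=1)
        return F.relu(self.fuse(out))

def kld_loss(pred, target, eps=1e-7):
    """KL divergence between linearly normalized saliency maps, treating
    each map as a probability distribution over pixels (global inference)
    rather than a per-pixel regression target."""
    pred = pred / (pred.sum(dim=(2, 3), keepdim=True) + eps)
    target = target / (target.sum(dim=(2, 3), keepdim=True) + eps)
    return (target * torch.log(target / (pred + eps) + eps)).sum(dim=(2, 3)).mean()
```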
Ultrafast Video Attention Prediction with Coupled Knowledge Distillation
Large convolutional neural network models have recently demonstrated
impressive performance on video attention prediction. However, these models
involve intensive computation and a large memory footprint. To address these
issues, we design an extremely lightweight network with ultrafast speed, named
UVA-Net. The network is built on depth-wise convolutions and takes
low-resolution images as input. However, this straightforward acceleration
would degrade performance dramatically. To this end, we propose a coupled
knowledge distillation strategy to augment and train the network effectively.
With this strategy, the model can further automatically discover and emphasize
implicit useful cues contained in the data. Both spatial and temporal knowledge
learned by the complex high-resolution teacher networks can also be distilled
and transferred into the proposed low-resolution lightweight spatiotemporal
network. Experimental results show that the performance of our model is
comparable to that of ten state-of-the-art models in video attention
prediction, while it requires only a 0.68 MB memory footprint and runs at about
10,106 FPS on a GPU and 404 FPS on a CPU, 206 times faster than previous
models.
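As a rough illustration of distilling spatial and temporal teachers into a low-resolution student, the PyTorch sketch below combines ground-truth supervision with two soft teacher targets; the loss weights, MSE terms, and teacher interfaces are assumptions, and the paper's coupled strategy is more involved than this plain two-teacher loss.

```python
# A minimal sketch, assuming PyTorch; the weighting scheme and teacher
# interfaces are assumptions, not the paper's coupled strategy.
import torch.nn.functional as F

def distillation_loss(student_map, spatial_teacher_map, temporal_teacher_map,
                      ground_truth, alpha=0.5, beta=0.5):
    """Combine ground-truth supervision with soft targets from a
    high-resolution spatial teacher and a temporal teacher."""
    # Resize teacher maps down to the student's low resolution
    size = student_map.shape[-2:]
    spatial_t = F.interpolate(spatial_teacher_map, size=size,
                              mode='bilinear', align_corners=False)
    temporal_t = F.interpolate(temporal_teacher_map, size=size,
                               mode='bilinear', align_corners=False)
    hard = F.mse_loss(student_map, ground_truth)   # supervised term
    soft_s = F.mse_loss(student_map, spatial_t)    # spatial knowledge
    soft_t = F.mse_loss(student_map, temporal_t)   # temporal knowledge
    return hard + alpha * soft_s + beta * soft_t
```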
Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition
In this paper we address the problem of human action recognition from video
sequences. Inspired by the exemplary results obtained via automatic feature
learning and deep learning approaches in computer vision, we focus on learning
salient spatial features via a convolutional neural network (CNN) and then map
their temporal relationships with the aid of Long Short-Term Memory (LSTM)
networks. Our contribution in this paper is a
deep fusion framework that more effectively fuses spatial features from CNNs
with temporal features from LSTM models. We also extensively evaluate their
strengths and weaknesses. We find that when the two sets of features are
combined, the fully connected features effectively act as an attention
mechanism that directs the LSTM to interesting parts of the convolutional
feature sequence. The significance of our fusion method lies in its simplicity
and effectiveness compared to other state-of-the-art methods. The evaluation
results demonstrate that this hierarchical multi-stream fusion method
outperforms single-stream mapping methods, achieving high accuracy that exceeds
current state-of-the-art results on three widely used databases: UCF11,
UCFSports, and jHMDB.
Comment: Published as a conference paper at WACV 201
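A minimal PyTorch sketch of the CNN-to-LSTM pattern described above follows; the ResNet-18 backbone, hidden size, and last-step classification are illustrative choices, not the paper's exact two-stream fusion architecture.

```python
# A minimal sketch, assuming PyTorch and torchvision; backbone and
# hidden size are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmActionNet(nn.Module):
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Drop the final FC layer; output is (B*T, 512, 1, 1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden,
                            batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clip):                 # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        x = self.features(clip.flatten(0, 1)).flatten(1)  # per-frame features
        x = x.view(b, t, -1)                 # (B, T, 512)
        out, _ = self.lstm(x)                # temporal mapping
        return self.classifier(out[:, -1])   # logits from the last step
```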
Unified Image and Video Saliency Modeling
Visual saliency modeling for images and videos is treated as two independent
tasks in recent computer vision literature. While image saliency modeling is a
well-studied problem and progress on benchmarks like SALICON and MIT300 is
slowing, video saliency models have shown rapid gains on the recent DHF1K
benchmark. Here, we take a step back and ask: Can image and video saliency
modeling be approached via a unified model, with mutual benefit? We identify
different sources of domain shift between image and video saliency data and
between different video saliency datasets as a key challenge for effective
joint modeling. To address this, we propose four novel domain adaptation
techniques - Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive
Smoothing and Bypass-RNN - in addition to an improved formulation of learned
Gaussian priors. We integrate these techniques into a simple and lightweight
encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and
video saliency data. We evaluate our method on the video saliency datasets
DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and
MIT300. With one set of parameters, UNISAL achieves state-of-the-art
performance on all video saliency datasets and is on par with the
state-of-the-art for image saliency datasets, despite a faster runtime and a 5-
to 20-fold smaller model size compared to all competing deep methods. We provide
retrospective analyses and ablation studies which confirm the importance of the
domain shift modeling. The code is available at
https://github.com/rdroste/unisal
Comment: Presented at the European Conference on Computer Vision (ECCV) 2020.
R. Droste and J. Jiao contributed equally to this work. v3: Updated Fig. 5a)
and added new MIT300 benchmark results to supp. material
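As a sketch of what domain-adaptive learned Gaussian priors might look like, the PyTorch module below keeps one set of learnable 2D Gaussian parameters per dataset; this is an assumption-laden illustration, not UNISAL's actual formulation or its other domain-adaptive modules.

```python
# A minimal sketch, assuming PyTorch; per-domain Gaussian parameters are
# an assumption based on the abstract, not UNISAL's exact formulation.
import torch
import torch.nn as nn

class DomainGaussianPriors(nn.Module):
    """Per-domain learnable center-bias priors: each domain keeps its own
    set of 2D Gaussians trained jointly with the network."""
    def __init__(self, num_domains, num_gaussians=4):
        super().__init__()
        # (domain, gaussian, [mu_x, mu_y, log_sigma_x, log_sigma_y])
        self.params = nn.Parameter(torch.zeros(num_domains, num_gaussians, 4))

    def forward(self, domain_idx, height, width):
        p = self.params[domain_idx]                       # (G, 4)
        ys = torch.linspace(-1, 1, height, device=p.device)
        xs = torch.linspace(-1, 1, width, device=p.device)
        yy, xx = torch.meshgrid(ys, xs, indexing='ij')
        priors = []
        for mu_x, mu_y, ls_x, ls_y in p:
            g = torch.exp(-((xx - mu_x) ** 2 / (2 * ls_x.exp() ** 2)
                            + (yy - mu_y) ** 2 / (2 * ls_y.exp() ** 2)))
            priors.append(g)
        return torch.stack(priors)                        # (G, H, W) maps
```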