Video Prediction via Example Guidance
In video prediction tasks, one major challenge is to capture the multi-modal
nature of future contents and dynamics. In this work, we propose a simple yet
effective framework that can efficiently predict plausible future states. The
key insight is that the potential distribution of a sequence can be approximated by analogous sequences retrieved from the training pool, namely, expert examples. By further incorporating a novel optimization scheme into the training procedure, plausible predictions can be sampled efficiently from the distribution constructed from the retrieved examples. Meanwhile, our method
could be seamlessly integrated with existing stochastic predictive models;
significant enhancement is observed with comprehensive experiments in both
quantitative and qualitative aspects. We also demonstrate the generalization
ability to predict the motion of unseen classes, i.e., without access to the corresponding data during the training phase. Comment: Project Page: https://sites.google.com/view/vpeg-supp/hom
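As a rough illustration of the retrieval idea above, the sketch below (plain NumPy with hypothetical names such as retrieve_examples and toy feature vectors, not the authors' code) retrieves the training sequences whose contexts are closest to the query and samples futures from a simple Gaussian fit to their continuations.

# Minimal sketch of example-guided sampling (not the authors' implementation):
# retrieve training sequences whose context is closest to the query context,
# then sample plausible futures from a distribution fit to their continuations.
import numpy as np

def retrieve_examples(query_ctx, train_ctx, train_fut, k=5):
    """Return the k training futures whose contexts are nearest to query_ctx."""
    dists = np.linalg.norm(train_ctx - query_ctx, axis=1)   # (N,)
    idx = np.argsort(dists)[:k]
    return train_fut[idx]                                    # (k, D)

def sample_futures(query_ctx, train_ctx, train_fut, k=5, n_samples=3, rng=None):
    """Fit a diagonal Gaussian to the retrieved futures and draw samples from it."""
    rng = rng or np.random.default_rng(0)
    experts = retrieve_examples(query_ctx, train_ctx, train_fut, k)
    mu, sigma = experts.mean(axis=0), experts.std(axis=0) + 1e-6
    return rng.normal(mu, sigma, size=(n_samples, experts.shape[1]))

# Toy usage with random "feature" vectors standing in for encoded sequences.
train_ctx = np.random.randn(100, 16)   # contexts of 100 training sequences
train_fut = np.random.randn(100, 16)   # their future encodings
samples = sample_futures(np.random.randn(16), train_ctx, train_fut)
print(samples.shape)  # (3, 16)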
Generating Video Descriptions with Topic Guidance
Generating video descriptions in natural language (a.k.a. video captioning)
is a more challenging task than image captioning as the videos are
intrinsically more complicated than images in two aspects. First, videos cover
a broader range of topics, such as news, music, sports and so on. Second,
multiple topics could coexist in the same video. In this paper, we propose a
novel caption model, topic-guided model (TGM), to generate topic-oriented
descriptions for videos in the wild via exploiting topic information. In
addition to predefined topics, i.e., category tags crawled from the web, we
also mine topics in a data-driven way based on training captions by an
unsupervised topic mining model. We show that data-driven topics reflect a
better topic schema than the predefined topics. As for testing video topic
prediction, we treat the topic mining model as a teacher to train a student topic prediction model, utilizing all available modalities in the video, especially the speech modality. We propose a series of caption models to exploit topic guidance, including implicitly using the topics as input features to generate topic-related words, and explicitly modifying the decoder weights with topics so that the decoder acts as an ensemble of topic-aware language decoders. Our comprehensive experimental results on the current largest video
caption dataset MSR-VTT prove the effectiveness of our topic-guided model,
which significantly surpasses the winning performance in the 2016 MSR video to
language challenge. Comment: Appeared at ICMR 201
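The "implicit" form of topic guidance can be pictured as feeding a topic embedding to the caption decoder at every step. The PyTorch sketch below is an illustrative stand-in, not the TGM implementation; all class and parameter names are assumptions.

# Minimal sketch of topic-conditioned decoding: a topic embedding is
# concatenated to the word embedding at every decoder step.
import torch
import torch.nn as nn

class TopicGuidedDecoder(nn.Module):
    def __init__(self, vocab_size, num_topics, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.topic_emb = nn.Embedding(num_topics, embed_dim)
        self.rnn = nn.GRU(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words, topic, video_feat):
        # words: (B, T) token ids, topic: (B,) topic ids,
        # video_feat: (B, hidden_dim) used to initialise the decoder state.
        w = self.word_emb(words)                              # (B, T, E)
        t = self.topic_emb(topic).unsqueeze(1).expand_as(w)   # (B, T, E)
        h0 = video_feat.unsqueeze(0)                          # (1, B, H)
        out, _ = self.rnn(torch.cat([w, t], dim=-1), h0)
        return self.out(out)                                  # (B, T, V) word logits

dec = TopicGuidedDecoder(vocab_size=10000, num_topics=20)
logits = dec(torch.randint(0, 10000, (2, 7)), torch.tensor([3, 8]), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 7, 10000])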
Digging Deeper into Egocentric Gaze Prediction
This paper digs deeper into factors that influence egocentric gaze. Instead
of training deep models for this purpose in a blind manner, we propose to
inspect factors that contribute to gaze guidance during daily tasks. Bottom-up
saliency and optical flow are assessed versus strong spatial prior baselines.
Task-specific cues such as vanishing point, manipulation point, and hand
regions are analyzed as representatives of top-down information. We also look
into the contribution of these factors by investigating a simple recurrent neural model for egocentric gaze prediction. First, deep features are
extracted for all input video frames. Then, a gated recurrent unit is employed
to integrate information over time and to predict the next fixation. We also
propose an integrated model that combines the recurrent model with several
top-down and bottom-up cues. Extensive experiments over multiple datasets
reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up
saliency models perform poorly in predicting gaze and underperform spatial
biases, (3) deep features perform better compared to traditional features, (4)
as opposed to hand regions, the manipulation point is a strongly influential cue for gaze prediction, (5) combining the proposed recurrent model with bottom-up cues, vanishing points and, in particular, the manipulation point results in the
best gaze prediction accuracy over egocentric videos, (6) the knowledge
transfer works best for cases where the tasks or sequences are similar, and (7)
task and activity recognition can benefit from gaze prediction. Our findings
suggest that (1) there should be more emphasis on hand-object interaction and
(2) the egocentric vision community should consider larger datasets including
diverse stimuli and more subjects. Comment: presented at WACV 201
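Since the abstract spells out the recurrent component (per-frame deep features integrated by a gated recurrent unit that predicts the next fixation), a minimal PyTorch sketch of that pipeline is given below; layer sizes and names are assumptions, not the authors' model.

# Illustrative sketch of the recurrent gaze predictor: per-frame deep features
# are integrated by a GRU and mapped to the next fixation as normalized (x, y).
import torch
import torch.nn as nn

class GazeGRU(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fixation = nn.Sequential(nn.Linear(hidden_dim, 2), nn.Sigmoid())

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) CNN features for T consecutive frames.
        out, _ = self.gru(frame_feats)
        return self.fixation(out)  # (B, T, 2): predicted (x, y) per time step

model = GazeGRU()
pred = model(torch.randn(4, 10, 2048))   # 4 clips of 10 frames
print(pred.shape)  # torch.Size([4, 10, 2])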
A probabilistic tour of visual attention and gaze shift computational models
In this paper, a number of problems are considered which are related to the modelling of eye guidance under visual attention in a natural setting. From a crude discussion of a variety of available models spelled out in probabilistic terms, it appears that current approaches in computational vision are hitherto far from achieving the goal of an active observer relying upon eye guidance to accomplish real-world tasks. We argue that this challenging goal not only requires embodying, in a principled way, the problem of eye guidance within the action/perception loop, but also facing the inextricable link tying together visual attention, emotion and executive control, insofar as recent neurobiological findings are taken into account.
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
Automatic saliency prediction in 360° videos is critical for viewpoint guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal network which is (1) trained in a weakly-supervised manner and (2) tailor-made for the 360° viewing sphere. Note that most existing methods are less scalable since they rely on annotated saliency maps for training. Most importantly, they convert the 360° sphere to 2D images (e.g., a single equirectangular image or multiple separate Normal Field-of-View (NFoV) images), which introduces distortion and image boundaries. In contrast, we propose a simple and effective Cube Padding (CP) technique as follows. First, we render the 360° view onto the six faces of a cube using perspective projection, which introduces very little distortion. Then, we concatenate all six faces while utilizing the connectivity between faces on the cube for image padding (i.e., Cube Padding) in convolution, pooling, and convolutional LSTM layers. In this way, CP introduces no image boundary while being applicable to almost all Convolutional Neural Network (CNN) structures. To evaluate our method, we propose Wild-360, a new 360° video saliency dataset containing challenging videos with saliency heatmap annotations. In experiments, our method outperforms baseline methods in both speed and quality. Comment: CVPR 201
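A simplified sketch of the padding idea follows: each cube face's feature map is padded with pixels borrowed from its neighbouring faces rather than zeros. This toy version only wraps the four side faces horizontally and replicate-pads vertically, whereas the actual Cube Padding also uses the top and bottom faces with the proper rotations.

# Simplified sketch of cube padding (not the paper's implementation): borders
# are borrowed from the adjacent faces in the front→right→back→left ring.
import torch
import torch.nn.functional as F

def cube_pad_sides(faces, pad=1):
    # faces: (4, C, H, W) features of the side faces ordered front, right, back, left.
    left_nb = torch.roll(faces, shifts=1, dims=0)    # face to the left of each face
    right_nb = torch.roll(faces, shifts=-1, dims=0)  # face to the right of each face
    padded = torch.cat([left_nb[..., -pad:], faces, right_nb[..., :pad]], dim=-1)
    # Vertical borders: replicate padding as a placeholder for the top/bottom faces.
    return F.pad(padded, (0, 0, pad, pad), mode="replicate")

faces = torch.randn(4, 64, 32, 32)
print(cube_pad_sides(faces).shape)  # torch.Size([4, 64, 34, 34])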
DISC: Deep Image Saliency Computing via Progressive Representation Learning
Salient object detection increasingly receives attention as an important
component or step in several pattern recognition and image processing tasks.
Although a variety of powerful saliency models have been proposed,
they usually involve heavy feature (or model) engineering based on priors (or
assumptions) about the properties of objects and backgrounds. Inspired by the
effectiveness of recently developed feature learning, we provide a novel Deep
Image Saliency Computing (DISC) framework for fine-grained image saliency
computing. In particular, we model the image saliency from both the coarse- and
fine-level observations, and utilize the deep convolutional neural network
(CNN) to learn the saliency representation in a progressive manner.
Specifically, our saliency model is built upon two stacked CNNs. The first CNN
generates a coarse-level saliency map by taking the overall image as the input,
roughly identifying saliency regions in the global context. Furthermore, we
integrate superpixel-based local context information in the first CNN to refine
the coarse-level saliency map. Guided by the coarse saliency map, the second
CNN focuses on the local context to produce fine-grained and accurate saliency
map while preserving object details. For a testing image, the two CNNs
collaboratively conduct the saliency computing in one shot. Our DISC framework
is capable of uniformly highlighting the objects of interest against complex backgrounds while preserving object details well. Extensive experiments on
several standard benchmarks suggest that DISC outperforms other
state-of-the-art methods and it also generalizes well across datasets without
additional training. The executable version of DISC is available online:
http://vision.sysu.edu.cn/projects/DISC. Comment: This manuscript is the accepted version for IEEE Transactions on Neural Networks and Learning Systems (T-NNLS), 201
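The coarse-to-fine cascade can be pictured as two stacked networks, the second of which consumes the image together with the coarse map. The sketch below uses toy layers and assumed shapes; it is not the actual DISC architecture.

# Coarse-to-fine sketch mirroring the two-stage design described above:
# CNN #1 maps the whole image to a coarse saliency map; CNN #2 refines it
# from the image concatenated with that map.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class CoarseToFineSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.coarse = nn.Sequential(conv_block(3, 32), conv_block(32, 32),
                                    nn.Conv2d(32, 1, 1), nn.Sigmoid())
        self.fine = nn.Sequential(conv_block(4, 32), conv_block(32, 32),
                                  nn.Conv2d(32, 1, 1), nn.Sigmoid())

    def forward(self, image):
        coarse = self.coarse(image)                       # global, coarse-level map
        fine = self.fine(torch.cat([image, coarse], 1))   # refined, detail-preserving map
        return coarse, fine

coarse, fine = CoarseToFineSaliency()(torch.randn(1, 3, 224, 224))
print(coarse.shape, fine.shape)  # both torch.Size([1, 1, 224, 224])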
Unified Depth Prediction and Intrinsic Image Decomposition from a Single Image via Joint Convolutional Neural Fields
We present a method for jointly predicting a depth map and intrinsic images
from single-image input. The two tasks are formulated in a synergistic manner
through a joint conditional random field (CRF) that is solved using a novel
convolutional neural network (CNN) architecture, called the joint convolutional
neural field (JCNF) model. Tailored to our joint estimation problem, JCNF
differs from previous CNNs in its sharing of convolutional activations and
layers between networks for each task, its inference in the gradient domain
where there exists greater correlation between depth and intrinsic images, and
the incorporation of a gradient scale network that learns the confidence of
estimated gradients in order to effectively balance them in the solution. This
approach is shown to surpass state-of-the-art methods both on single-image
depth estimation and on intrinsic image decomposition.
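A highly simplified sketch of the shared-network idea follows: one shared encoder feeds separate heads for depth gradients, reflectance gradients, and a confidence map, with no CRF inference. It is illustrative only and omits most of what makes JCNF work; all names and layer sizes are assumptions.

# Shared encoder with task-specific heads predicting gradient-domain outputs
# plus a confidence map used to weight the estimated gradients.
import torch
import torch.nn as nn

class SharedEncoderJointHeads(nn.Module):
    def __init__(self, width=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(width, width, 3, padding=1), nn.ReLU())
        self.depth_grad = nn.Conv2d(width, 2, 3, padding=1)     # (dx, dy) of depth
        self.reflect_grad = nn.Conv2d(width, 2, 3, padding=1)   # (dx, dy) of reflectance
        self.confidence = nn.Sequential(nn.Conv2d(width, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, image):
        feat = self.shared(image)   # activations shared between the two tasks
        return self.depth_grad(feat), self.reflect_grad(feat), self.confidence(feat)

d_grad, r_grad, conf = SharedEncoderJointHeads()(torch.randn(1, 3, 128, 128))
print(d_grad.shape, r_grad.shape, conf.shape)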
Human-Guided Learning of Column Networks: Augmenting Deep Learning with Advice
Recently, deep models have been successfully applied in several applications,
especially with low-level representations. However, sparse, noisy samples and
structured domains (with multiple objects and interactions) are some of the
open challenges in most deep models. Column Networks, a deep architecture, can
succinctly capture such domain structure and interactions, but may still be
prone to sub-optimal learning from sparse and noisy samples. Inspired by the
success of human-advice guided learning in AI, especially in data-scarce
domains, we propose Knowledge-augmented Column Networks that leverage human
advice/knowledge for better learning with noisy/sparse samples. Our experiments
demonstrate that our approach leads to either superior overall performance or
faster convergence (i.e., both effective and efficient). Comment: Under Review at 'Machine Learning Journal' (MLJ)
Deep Motion Boundary Detection
Motion boundary detection is a crucial yet challenging problem. Prior methods
focus on analyzing the gradients and distributions of optical flow fields, or
use hand-crafted features for motion boundary learning. In this paper, we
propose the first dedicated end-to-end deep learning approach for motion
boundary detection, which we term MoBoNet. We introduce a refinement network
structure which takes source input images, initial forward and backward optical
flows as well as corresponding warping errors as inputs and produces
high-resolution motion boundaries. Furthermore, we show that the obtained
motion boundaries, through a fusion sub-network we design, can in turn guide
the optical flows to remove artifacts. The proposed MoBoNet is generic and works with any optical flow input. Our motion boundary detection and the refined optical flow estimation achieve results superior to the state of the art. Comment: 17 pages, 5 figures
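To make the input assembly concrete, the sketch below concatenates the two source frames, the forward and backward flows, and the warping errors, and maps them to a boundary probability map with a toy network; channel counts and names are assumptions, not the MoBoNet design.

# Illustrative refinement network: frames + flows + warping errors in,
# motion-boundary probability map out.
import torch
import torch.nn as nn

class BoundaryRefineNet(nn.Module):
    def __init__(self, width=32):
        super().__init__()
        # input channels: 2 RGB frames (6) + fwd/bwd flow (4) + 2 warping-error maps (2) = 12
        self.net = nn.Sequential(
            nn.Conv2d(12, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 1, 1), nn.Sigmoid())

    def forward(self, frame1, frame2, flow_fwd, flow_bwd, err_fwd, err_bwd):
        x = torch.cat([frame1, frame2, flow_fwd, flow_bwd, err_fwd, err_bwd], dim=1)
        return self.net(x)   # (B, 1, H, W) motion-boundary probabilities

net = BoundaryRefineNet()
args = [torch.randn(1, c, 64, 64) for c in (3, 3, 2, 2, 1, 1)]
print(net(*args).shape)  # torch.Size([1, 1, 64, 64])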
HMS-Net: Hierarchical Multi-scale Sparsity-invariant Network for Sparse Depth Completion
Dense depth cues are important and have wide applications in various computer
vision tasks. In autonomous driving, LIDAR sensors are adopted to acquire depth
measurements around the vehicle to perceive the surrounding environment. However, depth maps obtained by LIDAR are generally sparse because of hardware limitations. The task of depth completion, which aims at generating a dense depth map from an input sparse depth map, has attracted increasing attention. To effectively utilize multi-scale features, we propose three novel sparsity-invariant operations, based on which we build a sparsity-invariant multi-scale encoder-decoder network (HMS-Net) that handles sparse inputs and sparse feature maps. Additional RGB features can be incorporated to further
improve the depth completion performance. Our extensive experiments and
component analysis on two public benchmarks, KITTI depth completion benchmark
and NYU-depth-v2 dataset, demonstrate the effectiveness of the proposed
approach. As of Aug. 12th, 2018, on the KITTI depth completion leaderboard, our proposed model without RGB guidance ranks first among all peer-reviewed methods that do not use RGB information, and our model with RGB guidance ranks second among all RGB-guided methods. Comment: IEEE Trans. on Image Processing
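For context, the sketch below implements a basic sparsity-invariant convolution of the kind such networks build on: the convolution result is normalized by the number of observed pixels in each window and the validity mask is propagated by max pooling. The paper's three novel multi-scale operations are not reproduced here; names and sizes are assumptions.

# Basic sparsity-invariant convolution sketch for sparse depth inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, k, padding=k // 2, bias=False)
        self.bias = nn.Parameter(torch.zeros(cout))
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.k = k

    def forward(self, x, mask):
        # x: (B, C, H, W) sparse depth/features, mask: (B, 1, H, W) with 1 = observed.
        feat = self.conv(x * mask)
        norm = F.avg_pool2d(mask, self.k, stride=1, padding=self.k // 2) * self.k ** 2
        feat = feat / norm.clamp(min=1e-5) + self.bias.view(1, -1, 1, 1)
        return feat, self.pool(mask)   # propagate the validity mask

conv = SparseConv(1, 16)
depth = torch.randn(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.9).float()   # ~10% of pixels observed
out, new_mask = conv(depth, mask)
print(out.shape, new_mask.shape)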