Non-local Neural Networks
Both convolutional and recurrent operations are building blocks that process
one local neighborhood at a time. In this paper, we present non-local
operations as a generic family of building blocks for capturing long-range
dependencies. Inspired by the classical non-local means method in computer
vision, our non-local operation computes the response at a position as a
weighted sum of the features at all positions. This building block can be
plugged into many computer vision architectures. On the task of video
classification, even without any bells and whistles, our non-local models can
compete with or outperform current competition winners on both the Kinetics and Charades
datasets. In static image recognition, our non-local models improve object
detection/segmentation and pose estimation on the COCO suite of tasks. Code is
available at https://github.com/facebookresearch/video-nonlocal-net.
Comment: CVPR 2018, code is available at: https://github.com/facebookresearch/video-nonlocal-net
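As a worked illustration of the weighted sum described above: in the embedded-Gaussian form the paper discusses, the non-local operation amounts to a softmax over pairwise dot products of embedded features, followed by a residual connection. The sketch below is a minimal 2D PyTorch version; the module name, the halved inner channel count, and the omission of the paper's normalization details are simplifications, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2d(nn.Module):
    """Sketch of an embedded-Gaussian non-local block:
    y_i = sum_j softmax_j(theta(x_i) . phi(x_j)) * g(x_j), then z = W_z(y) + x.
    """
    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2                          # common choice; an assumption here
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)
        self.g = nn.Conv2d(channels, inner, kernel_size=1)
        self.w_z = nn.Conv2d(inner, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (n, hw, inner)
        k = self.phi(x).flatten(2)                     # (n, inner, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (n, hw, inner)
        attn = F.softmax(q @ k, dim=-1)                # weights over *all* positions
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.w_z(y)                         # residual, same shape as the input
```

Because the output shape matches the input shape, the block can be dropped between existing layers, which is what "plugged into many computer vision architectures" refers to.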
Recurrent Segmentation for Variable Computational Budgets
State-of-the-art systems for semantic image segmentation use feed-forward
pipelines with fixed computational costs. Building an image segmentation system
that works across a range of computational budgets is challenging and
time-intensive as new architectures must be designed and trained for every
computational setting. To address this problem, we develop a recurrent neural
network that successively improves prediction quality with each iteration.
Importantly, the RNN may be deployed across a range of computational budgets by
merely running the model for a variable number of iterations. We find that this
architecture is uniquely suited for efficiently segmenting videos. By
exploiting the segmentation of past frames, the RNN can perform video
segmentation at similar quality but reduced computational cost compared to
state-of-the-art image segmentation methods. When applied to static images in
the PASCAL VOC 2012 and Cityscapes segmentation datasets, the RNN traces out a
speed-accuracy curve that saturates near the performance of state-of-the-art
segmentation methods.
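The budget-to-iterations idea can be made concrete with a small sketch. The RecurrentSegmenter below is a hypothetical interface, not the paper's architecture: each step consumes the image plus the current prediction and refines it, so deployment simply chooses how many steps to run.

```python
import torch
import torch.nn as nn

class RecurrentSegmenter(nn.Module):
    """Hypothetical recurrent refiner (an illustration, not the paper's model)."""
    def __init__(self, in_channels: int, num_classes: int, hidden: int = 64):
        super().__init__()
        self.encode = nn.Conv2d(in_channels + num_classes, hidden, 3, padding=1)
        self.head = nn.Conv2d(hidden, num_classes, 3, padding=1)

    def forward(self, image, iterations: int, init_logits=None):
        n, _, h, w = image.shape
        logits = init_logits if init_logits is not None \
            else image.new_zeros(n, self.head.out_channels, h, w)
        for _ in range(iterations):               # the computational budget knob
            feat = torch.relu(self.encode(torch.cat([image, logits], dim=1)))
            logits = logits + self.head(feat)     # additive refinement of the mask
        return logits
```

For video, seeding init_logits with the previous frame's output lets a small iteration count suffice, which is the cost saving the abstract describes.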
Two-Stream Action Recognition-Oriented Video Super-Resolution
We study the video super-resolution (SR) problem for facilitating video
analytics tasks, e.g. action recognition, instead of for visual quality. The
popular action recognition methods based on convolutional networks, exemplified
by two-stream networks, are not directly applicable to videos of low spatial
resolution. This can be remedied by performing video SR prior to recognition,
which motivates us to improve the SR procedure for recognition accuracy.
Tailored for two-stream action recognition networks, we propose two video SR
methods for the spatial and temporal streams, respectively. On the one hand, we
observe that regions with action are more important to recognition, and we
propose an optical-flow guided weighted mean-squared-error loss for our
spatial-oriented SR (SoSR) network to emphasize the reconstruction of moving
objects. On the other hand, we observe that existing video SR methods incur
temporal discontinuity between frames, which also worsens the recognition
accuracy, and we propose a siamese network for our temporal-oriented SR (ToSR)
training that emphasizes the temporal continuity between consecutive frames. We
perform experiments using two state-of-the-art action recognition networks and
two well-known datasets: UCF101 and HMDB51. Results demonstrate the
effectiveness of our proposed SoSR and ToSR in improving recognition accuracy.
Comment: Accepted to ICCV 2019. Code: https://github.com/AlanZhang1995/TwoStreamS
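To make the two proposed losses concrete, here is a hedged sketch of how an optical-flow guided weighted MSE and a temporal-continuity term might look; the weighting scheme and the continuity formulation are illustrative assumptions, not the paper's exact definitions.

```python
import torch

def flow_weighted_mse(sr, hr, flow, eps: float = 1.0):
    """Sketch of an optical-flow guided weighted MSE for SoSR.
    The weighting (eps + motion magnitude, normalized) is an assumption;
    the paper derives its own weighting from the optical flow.
    sr, hr: (n, c, h, w) SR output and ground truth; flow: (n, 2, h, w).
    """
    motion = flow.norm(dim=1, keepdim=True)                   # per-pixel motion magnitude
    weight = eps + motion                                     # keep static regions non-zero
    weight = weight / weight.mean(dim=(2, 3), keepdim=True)   # unit mean per image
    return (weight * (sr - hr) ** 2).mean()

def temporal_continuity_loss(sr_t, sr_t1, hr_t, hr_t1):
    """Sketch of a ToSR-style continuity term: the SR output's frame-to-frame
    change should match the ground truth's, discouraging flicker."""
    return (((sr_t1 - sr_t) - (hr_t1 - hr_t)) ** 2).mean()
```

The first term up-weights reconstruction error in moving regions, which the abstract argues matter most for recognition; the second penalizes temporal discontinuity between consecutive SR frames, the failure mode the siamese ToSR training targets.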