27 research outputs found

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Author: Asano Y.
Campbell D.
Feichtenhofer C.
Henriques J.
Metze F.
Misra I.
Patrick M.
Vedaldi A.
Publication venue: Neural Information Processing Systems Foundation
Publication date: 01/01/2022
Field of study

International Migration, Integration and Social Cohesion online publications

Convolutional two-stream network fusion for video action recognition

Author: Feichtenhofer C
Pinz A
Zisserman AP
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2016
Field of study

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Detect to track and track to detect

Author: Feichtenhofer C
Pinz A
Zisserman A
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 25/12/2017
Field of study

Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.

What have we learned from deep representations for action recognition?

Author: Feichtenhofer C
Pinz A
Wildes RP
Zisserman AP
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018
Field of study

As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video. We show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.

Oxford University Research Archive

What have we learned from deep representations for action recognition?

Author: Feichtenhofer C
Pinz A
Wildes RP
Zisserman AP
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018
Field of study

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Deep insights into convolutional networks for video recognition

Author: Feichtenhofer C
Pinz A
Wildes RP
Zisserman A
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2019
Field of study

As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representation of models that have been trained to recognize actions in video. We visualize multiple two-stream architectures to show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.

Oxford University Research Archive

Keeping your eye on the ball: Trajectory attention in video transformers

Author: Asano Y
Campbell D
Feichtenhofer C
Henriques JF
Metze F
Misra I
Patrick M
Vedaldi A
Publication venue: Neural Information Processing Systems Foundation
Publication date: 01/01/2021
Field of study

In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame t may be entirely unrelated to what is found at that location in frame t + k. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers—trajectory attention—that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. While these ideas are useful in a range of settings, we apply them to the specific task of video action recognition with a transformer model and obtain state-of-the-art results on the Kinetics, Something–Something V2, and Epic-Kitchens datasets. Code and models are available at: https://github.com/facebookresearch/Motionformer

arXiv.org e-Print Archive

Oxford University Research Archive

International Migration, Integration and Social Cohesion online publications

UvA-DARE

Recurrent Residual Learning for Action Recognition

Author: A Richard
B Fernando
C Feichtenhofer
I Laptev
K Simonyan
L Wang
M Riesenhuber
O Russakovsky
S Ji
Publication venue: Springer Science and Business Media LLC
Publication date
Field of study

Crossref

Image quality assessment (IQA) using high-frequency and image variance (HFIV) for colour image

Author: Choi M G
Feichtenhofer C
Gonzalez R C
Haniza Yazid
Larson E. C.
Li Chien Tan
Ponomarenko N
Yalman Y
Yen Fook Chong
Publication venue: IOP Publishing
Publication date
Field of study

Crossref