Multi-Temporal Convolutions for Human Action Recognition in Videos

Poppe, Ronald; Stergiou, Alexandros

Authors: Ronald Poppe, Alexandros Stergiou
Publication date: 31 March 2021

Abstract

Effective extraction of temporal patterns is crucial for recognizing temporally varying actions in video. We argue that the fixed-size spatio-temporal convolution kernels used in convolutional neural networks (CNNs) can be improved to extract informative motions that are executed at different time scales. To address this challenge, we present a novel spatio-temporal convolution block capable of extracting spatio-temporal patterns at multiple temporal resolutions. Our proposed multi-temporal convolution (MTConv) blocks use two branches that focus on brief and prolonged spatio-temporal patterns, respectively. A third branch aligns the extracted time-varying features with respect to global motion patterns through recurrent cells. The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture; integrating them substantially reduces computational costs. Extensive experiments on the Kinetics, Moments in Time, and HACS action recognition benchmark datasets demonstrate competitive performance of MTConvs compared to the state of the art, with a significantly lower computational footprint.
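
The three-branch idea in the abstract can be illustrated with a toy, one-dimensional sketch in plain Python. This is not the authors' MTConv implementation. The abstract does not say how each branch is built, so everything here is an assumption made for illustration: the function names, the averaging kernel, the use of temporal pooling to reach longer time spans, and the exponential moving average that stands in for a recurrent cell. Each "frame" is reduced to a single number so that only the temporal behaviour is visible.

```python
import math

def temporal_conv(frames, kernel):
    """Same-length 1D correlation over time, replicating edge frames as padding."""
    pad = len(kernel) // 2  # assumes an odd kernel length
    padded = [frames[0]] * pad + list(frames) + [frames[-1]] * pad
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(frames))]

def pool_time(frames, stride=2):
    """Average consecutive groups of `stride` frames (coarser temporal resolution)."""
    return [sum(frames[t:t + stride]) / len(frames[t:t + stride])
            for t in range(0, len(frames), stride)]

def upsample_time(frames, stride, length):
    """Repeat each coarse frame `stride` times, then trim to the original length."""
    return [v for v in frames for _ in range(stride)][:length]

def recurrent_summary(frames, alpha=0.5):
    """Hypothetical stand-in for a recurrent cell: a running state over time."""
    h, states = 0.0, []
    for x in frames:
        h = alpha * h + (1.0 - alpha) * x
        states.append(h)
    return states

def multi_temporal_block(frames, kernel=(1 / 3, 1 / 3, 1 / 3), stride=2):
    # Branch 1 (assumed): brief patterns, kernel applied at full temporal rate.
    brief = temporal_conv(frames, kernel)
    # Branch 2 (assumed): prolonged patterns, same kernel on a pooled signal,
    # so each output covers roughly `stride` times as many original frames.
    coarse = temporal_conv(pool_time(frames, stride), kernel)
    prolonged = upsample_time(coarse, stride, len(frames))
    # Branch 3 (assumed): a recurrent state over the fused features gates them.
    fused = [(b + p) / 2.0 for b, p in zip(brief, prolonged)]
    gates = [1.0 / (1.0 + math.exp(-h)) for h in recurrent_summary(fused)]
    return [f * g for f, g in zip(fused, gates)]

features = multi_temporal_block([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

Reusing one small kernel on a pooled signal is one inexpensive way to cover a longer time span without a larger kernel, which fits the abstract's emphasis on a low computational footprint. Whether the paper's prolonged branch works this way is not stated in the abstract.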

Available Versions

Explore Bristol Research

oai:research-information.bris....

Last time updated on 08/06/2022

Utrecht University Repository

oai:dspace.library.uu.nl:1874/...

Last time updated on 16/05/2023