Effective extraction of temporal patterns is crucial for the recognition of
temporally varying actions in video. We argue that the fixed-size
spatio-temporal convolution kernels used in convolutional neural networks
(CNNs) can be improved to extract informative motions that are executed at
different time scales. To address this challenge, we present a novel
spatio-temporal convolution block that is capable of extracting spatio-temporal
patterns at multiple temporal resolutions. Our proposed multi-temporal
convolution (MTConv) blocks utilize two branches that focus on brief and
prolonged spatio-temporal patterns, respectively. A third branch uses recurrent
cells to align the extracted time-varying features with global motion patterns.
The proposed blocks are lightweight and can be integrated into any 3D-CNN
architecture, where they substantially reduce computational cost. Extensive
experiments on Kinetics, Moments in
Time and HACS action recognition benchmarks show that MTConvs perform
competitively with the state of the art at a significantly lower computational
cost.
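
To make the three-branch idea concrete, the following is a minimal toy sketch on a one-dimensional sequence of per-frame feature values, not the MTConv implementation itself. Everything in it is an assumption made for illustration: the function names, the averaging kernels, the use of dilation to obtain a longer temporal window, the exponential-decay update standing in for a recurrent cell, and the multiplicative fusion standing in for the alignment step.

```python
from typing import List, Sequence


def temporal_conv(signal: Sequence[float], kernel: Sequence[float],
                  dilation: int = 1) -> List[float]:
    """Same-length 1D temporal cross-correlation with zero padding."""
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t - half + i * dilation
            if 0 <= idx < len(signal):  # positions outside the clip count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def recurrent_summary(signal: Sequence[float], decay: float = 0.5) -> List[float]:
    """Toy recurrent cell (assumption): h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h = 0.0
    states = []
    for x in signal:
        h = decay * h + (1.0 - decay) * x
        states.append(h)
    return states


def multi_temporal_block(signal: Sequence[float]) -> List[float]:
    """Hypothetical three-branch block over a sequence of per-frame features."""
    kernel = [1.0 / 3.0] * 3
    # Branch 1: brief patterns, 3 consecutive frames.
    short = temporal_conv(signal, kernel, dilation=1)
    # Branch 2: prolonged patterns, same 3 weights spread over a 7-frame span.
    long = temporal_conv(signal, kernel, dilation=3)
    # Branch 3: running global state from the toy recurrent cell.
    global_state = recurrent_summary(signal)
    # Fusion (assumption): modulate the local features by the global state.
    return [g * (s + l) for s, l, g in zip(short, long, global_state)]


if __name__ == "__main__":
    features = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
    print(multi_temporal_block(features))
```

In this sketch the long branch reuses the same number of weights as the short branch and only widens their spacing, which is one inexpensive way to cover a longer time scale; whether MTConv obtains its two temporal resolutions this way is not stated in the text above.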