Learning temporal variations for action recognition

Abstract

As a core problem in video analysis, action recognition is of great significance for many higher-level tasks, in both research and industrial applications. With ever more video data being produced and shared daily, effective automatic action recognition methods are needed. Although many deep-learning methods have been proposed to solve the problem, recent research reveals that single-stream, RGB-based networks are consistently outperformed by two-stream networks that use both RGB and optical flow as inputs. This dependence on optical flow, which indicates a deficiency in learning motion, is present not only in 2D networks but also in 3D networks. This is somewhat surprising, since 3D networks are explicitly designed for spatio-temporal learning. In this thesis, we hypothesize that this deficiency stems from the difficulty of learning from videos exhibiting strong temporal variations, such as sudden motion, occlusions, acceleration, or deceleration. Such variations are common in real-world videos and force a neural network to account for them, yet they are often not useful for recognizing actions at a coarse granularity. We propose a Dynamic Equilibrium Module (DEM) for spatio-temporal learning through adaptive Eulerian motion manipulation. The proposed module can be inserted into existing networks with separate spatial and temporal convolutions, such as the R(2+1)D model, to effectively handle temporal variations in video and learn more robust spatio-temporal features. We demonstrate performance gains from adding DEM to the R(2+1)D model on the miniKinetics, UCF-101, and HMDB-51 datasets.
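To make the integration point concrete, the sketch below shows, in PyTorch, how a plug-in temporal module could sit between the spatial (1×k×k) and temporal (k×1×1) convolutions of a factorized R(2+1)D-style block. This is only a minimal illustration of the insertion pattern described above; the `TemporalGate` internals are hypothetical placeholders and do not reproduce the Dynamic Equilibrium Module, whose design is given in the thesis body.

```python
# Minimal sketch (not the thesis implementation): a plug-in module placed
# between the spatial and temporal convolutions of an R(2+1)D-style block.
import torch
import torch.nn as nn


class TemporalGate(nn.Module):
    """Hypothetical stand-in: re-weights features along the temporal axis
    before the temporal convolution (NOT the actual DEM)."""

    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep time, squeeze space
        self.fc = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):                               # x: (N, C, T, H, W)
        gate = torch.sigmoid(self.fc(self.pool(x)))     # per-frame channel weights
        return x * gate


class R2Plus1DBlock(nn.Module):
    """Factorized 3D block: spatial conv -> optional plug-in -> temporal conv."""

    def __init__(self, in_ch, out_ch, plugin=None):
        super().__init__()
        mid_ch = out_ch  # simplification; R(2+1)D derives this from a parameter budget
        self.spatial = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.plugin = plugin if plugin is not None else nn.Identity()
        self.temporal = nn.Sequential(
            nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.temporal(self.plugin(self.spatial(x)))


if __name__ == "__main__":
    block = R2Plus1DBlock(3, 64, plugin=TemporalGate(64))
    clip = torch.randn(2, 3, 16, 112, 112)              # (batch, C, T, H, W)
    print(block(clip).shape)                            # torch.Size([2, 64, 16, 112, 112])
```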
