Among the existing modalities for 3D action recognition, 3D flow has been
little examined, although it conveys rich motion cues for human
actions. Presumably, its susceptibility to noise makes it difficult to exploit and
complicates the learning process within deep models. This work demonstrates the
use of 3D flow sequences by a deep spatiotemporal model and further proposes an
incremental two-level spatial attention mechanism, guided by the skeleton domain,
that emphasizes motion features close to the body joint areas according to
their informativeness. To this end, an extended deep skeleton model is
also introduced to learn the most discriminative action motion dynamics, so as to
estimate an informativeness score for each joint. Subsequently, a late fusion
scheme is adopted between the two models for learning the high-level
cross-modal correlations. Experimental results on NTU RGB+D, currently the
largest and most challenging dataset, demonstrate the effectiveness of the
proposed approach, which achieves state-of-the-art results.

Comment: 18 pages, 3 figures, 3 tables, conferenc