Most modern approaches in temporal action localization divide this problem
into two parts: (i) short-term feature extraction and (ii) long-range temporal
boundary localization. Due to the high GPU memory cost caused by processing
long untrimmed videos, many methods sacrifice the representational power of the
short-term feature extractor by either freezing the backbone or using a very
small spatial video resolution. This issue becomes even worse with the recent
video transformer models, many of which have quadratic memory complexity. To
address these issues, we propose TALLFormer, a memory-efficient and end-to-end
trainable Temporal Action Localization transformer with Long-term memory. Our
long-term memory mechanism eliminates the need for processing hundreds of
redundant video frames during each training iteration, thus significantly
reducing the GPU memory consumption and training time. These efficiency savings
allow us (i) to use a powerful video transformer-based feature extractor
without freezing the backbone or reducing the spatial video resolution, while
(ii) also maintaining long-range temporal boundary localization capability.
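
To make the idea concrete, here is a minimal sketch (not the paper's implementation) of a long-term feature memory: in each training iteration only a short window of frames is encoded by the trainable backbone, features for the remaining frames are read from a cached memory bank, and the bank is then refreshed with the freshly computed features. All names, shapes, and the sampling scheme are illustrative assumptions.

```python
# Illustrative sketch of a long-term feature memory for temporal action localization.
# Only `window` frames are encoded (and receive gradients) per iteration; the rest of
# the full-length feature sequence comes from a cache populated in earlier iterations.
import torch
import torch.nn as nn


class LongTermFeatureMemory:
    """Caches per-frame features across training iterations (hypothetical helper)."""

    def __init__(self, num_frames: int, feat_dim: int):
        self.bank = torch.zeros(num_frames, feat_dim)

    def read(self) -> torch.Tensor:
        return self.bank.clone()

    def write(self, idx: torch.Tensor, feats: torch.Tensor) -> None:
        self.bank[idx] = feats.detach()  # cached features carry no gradient


def forward_with_memory(backbone: nn.Module,
                        memory: LongTermFeatureMemory,
                        frames: torch.Tensor,
                        window: int) -> torch.Tensor:
    """Encode only `window` sampled frames; reuse cached features for the rest."""
    num_frames = frames.shape[0]
    idx = torch.randperm(num_frames)[:window]   # frames processed this iteration
    fresh = backbone(frames[idx])               # gradients flow only through these
    feats = memory.read()                       # long-range context from the cache
    feats[idx] = fresh                          # splice in the fresh features
    memory.write(idx, fresh)                    # refresh the cache
    return feats                                # full-length sequence for the localization head


if __name__ == "__main__":
    # Toy backbone standing in for a video transformer feature extractor.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 256))
    memory = LongTermFeatureMemory(num_frames=512, feat_dim=256)
    video = torch.randn(512, 3, 16, 16)         # a long untrimmed video at toy resolution
    feats = forward_with_memory(backbone, memory, video, window=32)
    print(feats.shape)                          # torch.Size([512, 256])
```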
With only RGB frames as input and no external action recognition classifier,
TALLFormer outperforms previous state-of-the-art methods by a large margin,
achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The
code will be available at https://github.com/klauscc/TALLFormer.