This work aims to advance temporal action detection (TAD) using an
encoder-decoder framework with action queries, similar to DETR, which has shown
great success in object detection. However, the framework suffers from several
problems when applied directly to TAD: insufficient exploration of
inter-query relations in the decoder, inadequate classification training due
to the limited number of training samples, and unreliable classification
scores at inference. To address these problems, we first propose a relational attention
mechanism in the decoder, which guides the attention among queries based on
their relations. Moreover, we propose two losses to facilitate and stabilize
the training of action classification. Lastly, we propose to predict the
localization quality of each action query at inference in order to distinguish
high-quality queries. The proposed method, named ReAct, achieves
state-of-the-art performance on THUMOS14 at a much lower computational cost
than previous methods. In addition, extensive ablation studies
verify the effectiveness of each proposed component. The code is available at
https://github.com/sssste/React.

Comment: ECCV202