SBAT: Video Captioning with Sparse Boundary-Aware Transformer
In this paper, we focus on the problem of applying the transformer structure
to video captioning effectively. The vanilla transformer was proposed for
uni-modal language generation tasks such as machine translation. However, video
captioning is a multimodal learning problem, and the video features have much
redundancy between different time steps. Based on these concerns, we propose a
novel method called sparse boundary-aware transformer (SBAT) to reduce the
redundancy in the video representation. SBAT applies a boundary-aware pooling
operation to the scores from multi-head attention and selects diverse features from
different scenarios. SBAT also includes a local correlation scheme to
compensate for the local information loss caused by the sparse operation. Based on
SBAT, we further propose an aligned cross-modal encoding scheme to boost the
multimodal interaction. Experimental results on two benchmark datasets show
that SBAT outperforms the state-of-the-art methods on most of the metrics.
Comment: Appearing at IJCAI 202