Hierarchical Memory Decoding for Video Captioning
Recent advances in video captioning often employ a recurrent neural network
(RNN) as the decoder. However, an RNN is prone to diluting long-term information.
Recent works have demonstrated that a memory network (MemNet) has the advantage of
storing long-term information. However, as a decoder, it has not been well
exploited for video captioning. The reason partially comes from the difficulty
of sequence decoding with MemNet. Instead of the common practice, i.e.,
sequence decoding with RNN, in this paper, we devise a novel memory decoder for
video captioning. Concretely, after obtaining a representation of each frame
through a pre-trained network, we first fuse the visual and lexical
information. Then, at each time step, we construct a multi-layer MemNet-based
decoder, i.e., in each layer, we employ a memory set to store previous
information and an attention mechanism to select the information related to the
current input. Thus, this decoder avoids the dilution of long-term information.
Moreover, the multi-layer architecture helps capture dependencies between
frames and word sequences. Experimental results show that even without the
encoding network, our decoder still achieves competitive performance and
outperforms the RNN decoder. Furthermore, compared with a one-layer
RNN decoder, our decoder has fewer parameters
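
The decoding step described above (fuse visual and lexical information, then pass it through stacked layers that each store previous inputs in a memory set and attend over it) can be illustrated with a minimal sketch. This is not the paper's implementation: the names `fuse`, `attend`, and `decode_step` are hypothetical, fusion is assumed to be an element-wise sum, attention is assumed to be plain dot-product attention without learned projections, and each layer is assumed to write its current input into memory before reading.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(frame_feature, word_embedding):
    """Assumed fusion: element-wise sum of visual and lexical vectors."""
    return [v + w for v, w in zip(frame_feature, word_embedding)]


def attend(query, memory):
    """Read from a memory set with dot-product attention.

    Returns the attention-weighted sum of the memory slots and the weights.
    """
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    weights = softmax(scores)
    read = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]
    return read, weights


def decode_step(fused_input, memories):
    """One time step through a stack of memory layers.

    `memories` holds one memory set (a list of vectors) per layer.
    Each layer stores its input (assumption: the current input is
    written before reading), then attends over everything stored so far.
    """
    hidden = fused_input
    for memory in memories:
        memory.append(hidden)
        hidden, _ = attend(hidden, memory)
    return hidden


# Usage: two layers, two time steps with toy 2-dimensional vectors.
memories = [[], []]
step1 = decode_step(fuse([1.0, 0.0], [0.0, 0.0]), memories)
step2 = decode_step(fuse([0.0, 0.5], [0.0, 0.5]), memories)
```

Because every earlier input stays in the memory set unchanged, the second step can still attend directly to the first step's vector, which is the property the abstract contrasts with the gradual dilution in an RNN's hidden state.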