Existing approaches to image captioning usually generate a sentence word by
word from left to right, conditioned only on local context, namely the given
image and the previously generated words. Many studies have sought to exploit
global information during decoding, e.g., through iterative refinement, but
how to effectively and efficiently incorporate future context remains
under-explored.
To address this issue, inspired by the fact that
Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations
through a modified mask operation, we aim to graft this advantage onto the
conventional Autoregressive Image Captioning (AIC) model while preserving
inference efficiency, i.e., without extra time cost.
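As a rough illustration (our own sketch, not the released code; the function names are hypothetical), the difference amounts to swapping the decoder's causal attention mask for a full, two-sided one:

```python
# Minimal sketch contrasting the two attention masks; assumes PyTorch.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # AIC: position i may attend only to positions j <= i
    # (left-to-right decoding, no future context).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def two_sided_mask(seq_len: int) -> torch.Tensor:
    # NAIC: every position attends to the whole sequence, so masked
    # words can be predicted from both left and right context.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)
```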
Specifically, the AIC and NAIC
models are first trained jointly with a shared visual encoder, which forces
the visual encoder to embed sufficient and valid future context.
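A hypothetical sketch of this joint step is given below, assuming PyTorch modules `encoder`, `aic_decoder`, and `naic_decoder` (all names are ours, and the NAIC masking details are simplified); the point is that gradients from both captioning losses flow into the shared encoder:

```python
import torch.nn.functional as F

def joint_training_step(encoder, aic_decoder, naic_decoder, image, caption):
    feats = encoder(image)                      # shared visual features
    aic_logits = aic_decoder(feats, caption)    # left-to-right branch
    naic_logits = naic_decoder(feats, caption)  # masked, two-sided branch
    # Both branches are supervised by the ground-truth caption; the shared
    # encoder receives gradients from both, so its features must also
    # carry future context (optimizer step omitted).
    loss = (F.cross_entropy(aic_logits.transpose(1, 2), caption)
            + F.cross_entropy(naic_logits.transpose(1, 2), caption))
    loss.backward()
    return loss
```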
Then, on its unconfident words, the AIC model is encouraged to capture the
causal dynamics of cross-layer interchanging from the NAIC model, following a
teacher-student paradigm optimized with a distribution calibration training
objective.
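One plausible reading of the calibration objective, sketched below under our own assumptions (the confidence threshold `tau` and the exact divergence are hypothetical, not taken from the paper): wherever the AIC student assigns low maximum probability, its word distribution is pulled toward the NAIC teacher's.

```python
import torch.nn.functional as F

def distribution_calibration_loss(student_logits, teacher_logits, tau=0.5):
    # student_logits, teacher_logits: [batch, seq_len, vocab]
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1).detach()  # teacher frozen
    # A position counts as "unconfident" when the student's top
    # probability falls below the (hypothetical) threshold tau.
    unconfident = log_p_student.exp().max(dim=-1).values < tau
    # KL(teacher || student) per position, summed over the vocabulary.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)
    # Calibrate only the unconfident positions.
    mask = unconfident.float()
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)
```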
Empirical evidence demonstrates that our proposed approach clearly surpasses
state-of-the-art baselines in both automatic metrics and human evaluation on
the MS COCO benchmark. The source code is available at:
https://github.com/feizc/Future-Caption