We study the task of object interaction anticipation in egocentric videos.
Successful prediction of future actions and objects requires an understanding
of the spatio-temporal context formed by past actions and object relationships.
We propose TransFusion, a multimodal transformer-based architecture that
effectively exploits the representational power of language by summarizing
past actions concisely. TransFusion leverages pre-trained image captioning
models and distills their captions into summaries of past actions and objects.
This action context, together with a single input frame, is processed by a
multimodal fusion module to forecast the next object interactions. Our model enables more
efficient end-to-end learning by replacing dense video features with language
representations, allowing us to benefit from knowledge encoded in large
pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 demonstrate
the effectiveness of our multimodal fusion model and the benefits of
language-based context summaries. On the Ego4D test set, our method
outperforms state-of-the-art approaches by 40.4% in overall mAP, and its
performance on EPIC-KITCHENS-100 shows the generality of TransFusion. Video and code
are available at: https://eth-ait.github.io/transfusion-proj/
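
To make the described pipeline concrete, below is a minimal PyTorch-style sketch: a tokenized language summary of past actions is fused with patch embeddings of a single input frame by a transformer encoder, which then predicts the next object (noun) and action (verb). All module names, dimensions, and the two-head output are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a TransFusion-style fusion module (hypothetical:
# names, sizes, and output heads are assumptions, not the paper's code).
import torch
import torch.nn as nn

class TransFusionSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, n_heads=8,
                 n_layers=4, num_nouns=100, num_verbs=100):
        super().__init__()
        # Embeds the tokenized language summary of past actions
        # (produced upstream by a pre-trained image captioner).
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # ViT-style patch embedding for the single input frame.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Learned [CLS] token pools the fused sequence for prediction.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Transformer encoder attends jointly over text and image tokens
        # (positional embeddings omitted here for brevity).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        # Placeholder heads for the anticipated object and action.
        self.noun_head = nn.Linear(d_model, num_nouns)
        self.verb_head = nn.Linear(d_model, num_verbs)

    def forward(self, summary_tokens, frame):
        # summary_tokens: (B, T) token ids; frame: (B, 3, H, W)
        txt = self.text_embed(summary_tokens)                      # (B, T, D)
        img = self.patch_embed(frame).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(txt.size(0), -1, -1)           # (B, 1, D)
        fused = self.fusion(torch.cat([cls, txt, img], dim=1))
        pooled = fused[:, 0]                                       # CLS output
        return self.noun_head(pooled), self.verb_head(pooled)

# Toy usage: a 24-token summary (e.g. "took knife, cut onion, ...")
# and one 224x224 frame.
model = TransFusionSketch()
tokens = torch.randint(0, 30522, (2, 24))
frame = torch.randn(2, 3, 224, 224)
noun_logits, verb_logits = model(tokens, frame)
```

Replacing dense video features with a short token sequence, as in this sketch, keeps the fused sequence length small, which is what makes end-to-end training of the fusion module comparatively cheap.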