We study the task of object interaction anticipation in egocentric videos.
Successful prediction of future actions and objects requires an understanding
of the spatio-temporal context formed by past actions and object relationships.
We propose TransFusion, a multimodal transformer-based architecture that
effectively exploits the representational power of language by summarizing
past actions concisely. TransFusion leverages pre-trained image captioning
models and distills their captions into summaries of past actions and objects.
This action context, together with a single input frame, is processed by a
multimodal fusion module to forecast the next object interactions. Our model enables more
efficient end-to-end learning by replacing dense video features with language
representations, allowing us to benefit from knowledge encoded in large
pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 demonstrate
the effectiveness of our multimodal fusion model and the benefits of
language-based context summaries. On the Ego4D test set, our method
outperforms state-of-the-art approaches by 40.4% in overall mAP, and its
performance on EPIC-KITCHENS-100 shows the generality of TransFusion. Video and code
are available at: https://eth-ait.github.io/transfusion-proj/
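
To make the described pipeline concrete, below is a minimal PyTorch-style sketch: a tokenized language summary of past actions is fused with patch embeddings of a single input frame by a transformer encoder, which then predicts the next object (noun) and action (verb). All module names, dimensions, and the two-head output are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a TransFusion-style fusion module (hypothetical:
# names, sizes, and output heads are assumptions, not the paper's code).
import torch
import torch.nn as nn

class TransFusionSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, n_heads=8,
                 n_layers=4, num_nouns=100, num_verbs=100):
        super().__init__()
        # Embeds the tokenized language summary of past actions
        # (produced upstream by a pre-trained image captioner).
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # ViT-style patch embedding for the single input frame.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Learned [CLS] token pools the fused sequence for prediction.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Transformer encoder attends jointly over text and image tokens
        # (positional embeddings omitted here for brevity).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        # Placeholder heads for the anticipated object and action.
        self.noun_head = nn.Linear(d_model, num_nouns)
        self.verb_head = nn.Linear(d_model, num_verbs)

    def forward(self, summary_tokens, frame):
        # summary_tokens: (B, T) token ids; frame: (B, 3, H, W)
        txt = self.text_embed(summary_tokens)                      # (B, T, D)
        img = self.patch_embed(frame).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(txt.size(0), -1, -1)           # (B, 1, D)
        fused = self.fusion(torch.cat([cls, txt, img], dim=1))
        pooled = fused[:, 0]                                       # CLS output
        return self.noun_head(pooled), self.verb_head(pooled)

# Toy usage: a 24-token summary (e.g. "took knife, cut onion, ...")
# and one 224x224 frame.
model = TransFusionSketch()
tokens = torch.randint(0, 30522, (2, 24))
frame = torch.randn(2, 3, 224, 224)
noun_logits, verb_logits = model(tokens, frame)
```

Replacing dense video features with a short token sequence, as in this sketch, keeps the fused sequence length small, which is what makes end-to-end training of the fusion module comparatively cheap.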