We introduce an object-aware decoder for improving the performance of
spatio-temporal representations on ego-centric videos. The key idea is to
enhance object-awareness during training by tasking the model to predict hand
positions, object positions, and the semantic labels of the objects using paired
captions when available. At inference time the model only requires RGB frames
as inputs, and is able to track and ground objects (although it has not been
trained explicitly for this). We demonstrate the performance of the
object-aware representations learned by our model by: (i) evaluating them for
strong transfer, i.e. zero-shot testing, on a number of downstream
video-text retrieval and classification benchmarks; and (ii) using the learned
representations as input for long-term video understanding tasks (e.g.
Episodic Memory in Ego4D). In all cases the performance improves over the state
of the art -- even compared to networks trained with far larger batch sizes. We
also show that, by using noisy image-level detections as pseudo-labels during
training, the model learns to exploit video consistency to provide better
bounding boxes, as well as to ground the words in the associated text
descriptions. Overall, we show that the model can act as a drop-in replacement
for an ego-centric video model to improve performance through visual-text
grounding.
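
As a rough illustration of the auxiliary objective described above (predicting hand boxes, object boxes, and object nouns from paired captions, supervised only where pseudo-labels or captions are available), here is a minimal PyTorch sketch. It is not the authors' implementation: the function name, tensor shapes, and the specific loss choices (L1 box regression, cross-entropy over caption nouns) are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def object_aware_loss(pred_hand_boxes,   # (B, 2, 4)  left/right hand box per frame
                      pred_obj_boxes,    # (B, N, 4)  N object queries
                      pred_noun_logits,  # (B, N, V)  V = noun vocabulary size
                      gt_hand_boxes,     # (B, 2, 4)  pseudo-labels from an image detector
                      gt_obj_boxes,      # (B, N, 4)
                      gt_noun_ids,       # (B, N)     noun ids parsed from paired captions
                      valid,             # (B, N)     1 where a pseudo-label / noun exists
                      w_box=1.0, w_noun=1.0):
    """Hypothetical auxiliary training objective; no targets are needed at inference."""
    # Box regression for hands and objects (plain L1 as a simple stand-in
    # for the box losses usually paired with detection-style decoders).
    hand_loss = F.l1_loss(pred_hand_boxes, gt_hand_boxes)
    obj_l1 = F.l1_loss(pred_obj_boxes, gt_obj_boxes, reduction="none").mean(-1)  # (B, N)

    # Semantic grounding: classify each object query into a caption noun.
    noun_ce = F.cross_entropy(pred_noun_logits.transpose(1, 2),   # (B, V, N)
                              gt_noun_ids, reduction="none")      # (B, N)

    # Only queries with an available pseudo-label / caption noun are supervised.
    mask = valid.float()
    denom = mask.sum().clamp(min=1.0)
    obj_loss = (obj_l1 * mask).sum() / denom
    noun_loss = (noun_ce * mask).sum() / denom
    return w_box * (hand_loss + obj_loss) + w_noun * noun_loss


# Toy usage with random tensors (shapes only; values are meaningless):
B, N, V = 4, 8, 512
loss = object_aware_loss(
    torch.rand(B, 2, 4), torch.rand(B, N, 4), torch.randn(B, N, V),
    torch.rand(B, 2, 4), torch.rand(B, N, 4),
    torch.randint(V, (B, N)), torch.ones(B, N),
)
```

Consistent with the abstract, the target boxes and noun labels are consumed only at training time; at inference the model sees RGB frames alone.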