Learning to understand and infer object functionalities is an important step
towards robust visual intelligence. Significant research efforts have recently
focused on segmenting the object parts that enable specific types of
human-object interaction, the so-called "object affordances". However, most
works treat affordance segmentation as a static semantic segmentation problem, focusing solely on
object appearance and relying on strong supervision and object detection. In
this paper, we propose a novel approach that exploits the spatio-temporal
nature of human-object interaction for affordance segmentation. In particular,
we design an autoencoder that is trained using ground-truth labels of only the
last frame of the sequence, and is able to infer pixel-wise affordance labels
in both videos and static images. Our model obviates the need for object
labels and bounding boxes by using a soft-attention mechanism that enables the
implicit localization of the interaction hotspot. For evaluation purposes, we
introduce the SOR3D-AFF corpus, which consists of human-object interaction
sequences and provides pixel-wise annotations for 9 affordance types,
covering typical manipulations of tool-like objects. We show that
our model achieves competitive results compared to strongly supervised methods
on SOR3D-AFF, while being able to predict affordances for similar unseen
objects in two image-only affordance datasets.
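
To make the described pipeline concrete, the following is a minimal PyTorch sketch of an attention-gated autoencoder trained with last-frame supervision only. Every layer size, the simple frame-averaging temporal aggregation, and all identifiers are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AFFORDANCES = 9  # SOR3D-AFF supports 9 affordance types

class AffordanceAutoencoder(nn.Module):
    def __init__(self, num_classes=NUM_AFFORDANCES):
        super().__init__()
        # Encoder: frame -> feature map (hypothetical small CNN)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Soft-attention head: 1-channel score map, softmax over spatial positions
        self.att = nn.Conv2d(64, 1, 1)
        # Decoder: attended features -> pixel-wise affordance logits
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        # frames: (B, T, 3, H, W) video clip; T == 1 handles static images
        B, T, C, H, W = frames.shape
        feats = self.encoder(frames.reshape(B * T, C, H, W))   # (B*T, 64, h, w)
        scores = self.att(feats)                               # (B*T, 1, h, w)
        attn = F.softmax(scores.flatten(2), dim=-1).view_as(scores)
        # Attention weighting acts as implicit interaction-hotspot localization,
        # replacing explicit object labels and bounding boxes
        attended = feats * attn
        # Aggregate temporal evidence by averaging over frames (assumption)
        attended = attended.view(B, T, *attended.shape[1:]).mean(dim=1)
        return self.decoder(attended)  # (B, num_classes, H, W) logits

# Supervision uses ground-truth labels of only the last frame of the sequence:
model = AffordanceAutoencoder()
clip = torch.randn(2, 8, 3, 64, 64)                      # batch of 8-frame clips
last_frame_labels = torch.randint(0, NUM_AFFORDANCES, (2, 64, 64))
loss = F.cross_entropy(model(clip), last_frame_labels)
loss.backward()
```

Because the spatial softmax is computed inside the network, the attention map is learned end-to-end from the segmentation loss alone, which is what allows inference on both videos and single images without detection-based supervision.
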