Accurate affordance detection and segmentation with pixel precision is an
important component of many complex interaction-based systems, such as robots
and assistive devices. We present a new approach to affordance perception which
enables accurate multi-label segmentation. Our approach can be used to
automatically extract grounded affordances from first-person videos of
interactions, using a 3D map of the environment to provide pixel-level precision
for the affordance location. We use this method to build the largest and most
complete dataset on affordances based on the EPIC-Kitchens dataset, EPIC-Aff,
which provides interaction-grounded, multi-label, metric and spatial affordance
annotations. Then, we propose a new approach to affordance segmentation based
on multi-label detection, which enables multiple affordances to co-exist in the
same space, for example when they are associated with the same object. We present
several strategies for multi-label detection using different segmentation
architectures. The experimental results highlight the importance of
multi-label detection. Finally, we show how our metric representation can be
exploited to build a map of interaction hotspots in spatial action-centric
zones and how that representation can be used to perform task-oriented navigation.

Comment: International Conference on Computer Vision (ICCV) 202
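The distinction between multi-label and conventional single-label segmentation can be illustrated with a minimal sketch. This is not the paper's architecture; it only shows, on toy per-pixel scores, why independent per-class sigmoids let several affordances (e.g. "cut" and "wash" on the same object) be active at one pixel, whereas a softmax-argmax head forces exactly one winner:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_mask(logits, threshold=0.5):
    """Independent sigmoid per class: (C, H, W) logits -> (C, H, W) booleans.
    Any number of classes may be active at each pixel."""
    return sigmoid(logits) >= threshold

def singlelabel_mask(logits):
    """Softmax/argmax baseline: exactly one class per pixel, returned
    as a one-hot (C, H, W) boolean array."""
    winners = np.argmax(logits, axis=0)                  # (H, W)
    return np.eye(logits.shape[0], dtype=bool)[winners].transpose(2, 0, 1)

# Toy example: 2 hypothetical affordance classes on a 1x1 "image",
# both with strongly positive scores at the same pixel.
logits = np.array([[[2.0]], [[1.5]]])
print(multilabel_mask(logits).squeeze())   # both classes active: co-existence
print(singlelabel_mask(logits).squeeze())  # only the argmax class survives
```

The design point is only the output head: thresholded sigmoids decouple the classes, so co-occurring affordances on one object are all retained instead of competing.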