This paper presents a new method for describing spatio-temporal relations
between objects and hands, in order to recognize both interactions and
activities in video demonstrations of manual tasks. The approach exploits
Scene Graphs to extract key interaction features from image sequences while
simultaneously encoding motion patterns and context. Additionally, the method introduces an
event-based automatic video segmentation and clustering scheme that groups
similar events and detects on the fly whether a monitored activity is executed
correctly. The effectiveness of the approach was demonstrated in two
multi-subject experiments, showing the ability to recognize and cluster
hand-object and object-object interactions without prior knowledge of the
activity, as well as to match the same activity performed by different
subjects.

Comment: 8 pages, 8 figures, submitted to IEEE RAS International Symposium on
Robot and Human Interactive Communication (RO-MAN), for associated video see
https://youtu.be/Ftu_EHAtH4