Many videos depict people, and it is their interactions that inform us of
their activities, relation to one another and the cultural and social setting.
With advances in human action recognition, researchers have begun to address
the automated recognition of these human-human interactions from video. The
main challenges stem from dealing with the considerable variation in recording
setting, the appearance of the people depicted and the coordinated performance
of their interaction. This survey summarizes these challenges and the datasets
available to address them, followed by an in-depth discussion of relevant
vision-based recognition and detection methods. We focus on recent, promising
work based on deep learning and convolutional neural networks (CNNs). Finally,
we outline directions to overcome the limitations of the current
state-of-the-art to analyze and, eventually, understand social human actions.