Communication shapes our social world. For a robot to be considered social and, consequently, to be integrated into our social environment, it is fundamental to understand some of the dynamics that rule human-human communication. In this
work, we tackle the problem of Addressee Estimation, the ability to understand
an utterance's addressee, by interpreting and exploiting non-verbal bodily cues
from the speaker. We do so by implementing a hybrid deep learning model
composed of convolutional layers and LSTM cells taking as input images
portraying the face of the speaker and 2D vectors of the speaker's body
posture. Our implementation choices were guided by the aim of developing a model
that could be deployed on social robots and be efficient in ecological
scenarios. We demonstrate that our model is able to solve the Addressee
Estimation problem in terms of addressee localisation in space, from the robot's ego-centric point of view.

Comment: Accepted version of a paper published at the 2023 International Joint
Conference on Neural Networks (IJCNN). Please find the published version and
information on how to cite the paper at https://doi.org/10.1109/IJCNN54540.2023.10191452.
10 pages, 8 Figures, 3 Tables.
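The abstract does not specify implementation details, but a minimal sketch of the kind of hybrid CNN+LSTM architecture it describes might look as follows. All layer sizes, the face-crop resolution, the number of pose dimensions, and the three output classes are illustrative assumptions, not the published configuration; PyTorch is used here purely for concreteness.

```python
# Hypothetical sketch of the hybrid model described in the abstract:
# per-frame face crops pass through convolutional layers, the speaker's
# 2D body-pose vectors are concatenated with the visual features, and an
# LSTM integrates the sequence before classifying the addressee's location
# (e.g. robot / left / right). All hyperparameters are assumptions.

import torch
import torch.nn as nn

class AddresseeEstimator(nn.Module):
    def __init__(self, n_pose_dims: int = 34, n_classes: int = 3):
        super().__init__()
        # Convolutional encoder for face crops (assumed 3x50x50 inputs).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        cnn_out = 32 * 12 * 12  # 50 -> 25 -> 12 after two 2x2 poolings
        # LSTM fuses per-frame face features with the 2D pose vector.
        self.lstm = nn.LSTM(cnn_out + n_pose_dims, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, faces: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # faces: (batch, time, 3, 50, 50); poses: (batch, time, n_pose_dims)
        b, t = faces.shape[:2]
        feats = self.cnn(faces.flatten(0, 1)).view(b, t, -1)
        fused = torch.cat([feats, poses], dim=-1)
        out, _ = self.lstm(fused)
        return self.head(out[:, -1])  # classify from the last time step

# Example: a batch of 2 sequences, 10 frames each.
model = AddresseeEstimator()
logits = model(torch.randn(2, 10, 3, 50, 50), torch.randn(2, 10, 34))
print(logits.shape)  # torch.Size([2, 3])
```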