A robot needs contextual awareness, effective speech production and
complementing non-verbal gestures for successful communication in society. In
this paper, we present our end-to-end system that tries to enhance the
effectiveness of non-verbal gestures. For achieving this, we identified
prominently used gestures in performances by TED speakers and mapped them to
their corresponding speech context and modulated speech based upon the
attention of the listener. The proposed method utilized Convolutional Pose
Machine [4] to detect the human gesture. Dominant gestures of TED speakers were
used for learning the gesture-to-speech mapping. The speeches by them were used
for training the model. We also evaluated the engagement of the robot with
people by conducting a social survey. The effectiveness of the performance was
monitored by the robot and it self-improvised its speech pattern on the basis
of the attention level of the audience, which was calculated using visual
feedback from the camera. The effectiveness of interaction as well as the
decisions made during improvisation was further evaluated based on the
head-pose detection and interaction survey.Comment: 8 pages, 9 figures, Under review in IRC 201