Speaker extraction aims to extract the clean speech of a target speaker from
a multi-talker speech mixture. Prior studies have used a pre-recorded speech
sample or a face image of the target speaker as the speaker cue. In human
communication, co-speech gestures that are naturally timed with speech also
contribute to speech perception. In this work, we explore the use of the
co-speech gesture sequence, e.g., hand and body movements, as the speaker cue
for speaker extraction. Such gestures can be easily obtained from
low-resolution video recordings and are thus more widely available than face
recordings. We propose two networks that use the co-speech gesture cue to
perform attentive listening to the target speaker: one implicitly fuses the
gesture cue into the speaker extraction process, while the other performs
speech separation first and then explicitly uses the gesture cue to associate
a separated speech stream with the target speaker. The experimental results
show that the co-speech gesture
cue is informative in associating the target speaker, and that the quality of
the extracted speech shows significant improvement over the unprocessed
mixture speech.