Multi-person motion prediction is a challenging task, especially in
real-world scenarios with densely interacting persons. Most previous works
study weak interactions (e.g., hand-shaking) and typically forecast each
person's pose in isolation. In this paper, we focus
on motion prediction for multiple persons engaged in extreme collaboration and
explore the relationships between the motion trajectories of highly
interactive persons. Specifically, we propose a novel cross-query attention
(XQA) module, tailored to this setting, that bilaterally learns the
cross-dependencies between the two pose sequences. Additionally, we introduce a
proxy entity to bridge the involved persons; acting as a motion intermediary,
it cooperates with our proposed XQA module and subtly controls the
bidirectional information flow. We then integrate these designs into a Transformer-based
architecture and devise a simple yet effective end-to-end framework called
proxy-bridged game Transformer (PGformer) for multi-person interactive motion
prediction. We evaluate our method on the challenging ExPI dataset, which
involves highly interactive actions, and show that PGformer consistently
outperforms state-of-the-art methods by a large margin in both short- and
long-term prediction. Our approach is also compatible with the weakly
interactive CMU-Mocap and MuPoTS-3D datasets, on which it achieves
encouraging results. Our code will be made publicly available upon
acceptance.
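
To make the idea of bilaterally learning cross-dependencies between two pose sequences concrete, the following is a minimal sketch of bidirectional cross-attention in NumPy. It is not the XQA module itself: it assumes plain scaled dot-product attention in which each person's sequence supplies the queries and the other person's sequence supplies the keys and values, and it omits the proxy entity entirely. All names (`cross_attend`, `bilateral_cross_attention`), the random projection weights, and the sizes (50 frames, 54-dimensional flattened poses, 32 hidden units) are hypothetical choices for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_seq, context_seq, w_q, w_k, w_v):
    # Queries come from one person's sequence; keys and values come
    # from the other person's sequence (an assumption of this sketch).
    q = query_seq @ w_q                      # (T_query, d)
    k = context_seq @ w_k                    # (T_context, d)
    v = context_seq @ w_v                    # (T_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T_query, T_context)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


def bilateral_cross_attention(pose_a, pose_b, params):
    # Run attention in both directions: A attends to B, and B attends to A.
    a_from_b, w_ab = cross_attend(pose_a, pose_b, *params["a"])
    b_from_a, w_ba = cross_attend(pose_b, pose_a, *params["b"])
    return a_from_b, b_from_a, w_ab, w_ba


# Hypothetical sizes: frames, flattened pose dimension, hidden dimension.
T, D, d = 50, 54, 32
rng = np.random.default_rng(0)
pose_a = rng.standard_normal((T, D))
pose_b = rng.standard_normal((T, D))

# Random stand-ins for learned projection matrices (W_q, W_k, W_v per direction).
params = {
    name: tuple(0.1 * rng.standard_normal((D, d)) for _ in range(3))
    for name in ("a", "b")
}

out_a, out_b, w_ab, w_ba = bilateral_cross_attention(pose_a, pose_b, params)
print(out_a.shape, out_b.shape, w_ab.shape, w_ba.shape)
```

In this sketch the two directions use separate weights, so the information flowing from person A to person B is not constrained to mirror the reverse flow; a mechanism such as the proxy entity described above would sit between the two directions to regulate that exchange.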