Accurate and reliable 3D detection is vital for many applications, including
autonomous driving and service robots. In this paper, we present a
flexible and high-performance 3D detection framework, named MPPNet, for 3D
temporal object detection with point cloud sequences. We propose a novel
three-hierarchy framework with proxy points for multi-frame feature encoding
and interactions to achieve better detection. The three hierarchies conduct
per-frame feature encoding, short-clip feature fusion, and whole-sequence
feature aggregation, respectively. To process long point cloud sequences
with reasonable computational resources, intra-group feature mixing and
inter-group feature attention are proposed to form the second and third feature
encoding hierarchies, which are recurrently applied for aggregating multi-frame
trajectory features. The proxy points not only act as consistent object
representations for each frame, but also serve as couriers that facilitate
feature interaction between frames. Experiments on the large-scale Waymo Open
Dataset show that our approach outperforms state-of-the-art methods by large
margins when applied to both short (e.g., 4-frame) and long (e.g., 16-frame) point
cloud sequences. Code is available at https://github.com/open-mmlab/OpenPCDet.

Comment: Accepted by ECCV 202
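
The grouping scheme described in the abstract (per-frame proxy-point features,
short-clip fusion within a group, then attention across groups) can be
illustrated with a minimal NumPy sketch. This is not the authors'
implementation: the function names, the tensor shapes, the use of a plain mean
as a stand-in for learned intra-group mixing, and the projection-free
dot-product attention are all assumptions made for illustration only.

```python
import numpy as np


def split_into_groups(frame_feats, num_groups):
    """Split (T, N, C) per-frame proxy-point features into short clips.

    T = frames, N = proxy points, C = channels (all hypothetical sizes).
    Consecutive frames land in the same group.
    """
    T, N, C = frame_feats.shape
    if T % num_groups != 0:
        raise ValueError("number of frames must be divisible by num_groups")
    return frame_feats.reshape(num_groups, T // num_groups, N, C)


def intra_group_mix(grouped):
    """Stand-in for intra-group feature mixing: average frames in each clip."""
    return grouped.mean(axis=1)  # (G, N, C)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inter_group_attention(group_feats):
    """Stand-in for inter-group attention: each proxy point attends across groups."""
    G, N, C = group_feats.shape
    per_point = group_feats.transpose(1, 0, 2)  # (N, G, C)
    scores = per_point @ per_point.transpose(0, 2, 1) / np.sqrt(C)  # (N, G, G)
    weights = softmax(scores, axis=-1)
    attended = weights @ per_point  # (N, G, C)
    return attended.transpose(1, 0, 2)  # back to (G, N, C)


rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(16, 64, 32))  # 16 frames, 64 proxy points, 32 channels
grouped = split_into_groups(frame_feats, num_groups=4)
group_feats = intra_group_mix(grouped)
fused = inter_group_attention(group_feats)
print(fused.shape)
```

Grouping keeps the attention cost proportional to the square of the number of
groups rather than the number of frames, which is one plausible reading of how
long sequences stay affordable; the actual model uses learned layers in both
stages.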