Image-based multi-person reconstruction in wide-field large scenes is
critical for crowd analysis and security alert. However, existing methods
cannot deal with large scenes containing hundreds of people, which encounter
the challenges of large number of people, large variations in human scale, and
complex spatial distribution. In this paper, we propose Crowd3D, the first
framework to reconstruct the 3D poses, shapes and locations of hundreds of
people with global consistency from a single large-scene image. The core of our
approach is to convert the problem of complex crowd localization into pixel
localization with the help of our newly defined concept, Human-scene Virtual
Interaction Point (HVIP). To reconstruct the crowd with global consistency, we
propose a progressive reconstruction network based on HVIP by pre-estimating a
scene-level camera and a ground plane. To deal with a large number of persons
and various human sizes, we also design an adaptive human-centric cropping
scheme. Besides, we contribute a benchmark dataset, LargeCrowd, for crowd
reconstruction in a large scene. Experimental results demonstrate the
effectiveness of the proposed method. The code and datasets will be made
public.Comment: 8 pages (not including reference