Bird's-eye-view (BEV) semantic segmentation is critical for autonomous driving
because of its powerful spatial representation ability. Estimating BEV semantic
maps from monocular images is challenging due to the spatial gap, since a model
must implicitly realize both the perspective-to-BEV transformation and
segmentation. We present a novel two-stage Geometry Prior-based Transformation
framework named GitNet, consisting of (i) geometry-guided pre-alignment and
(ii) a ray-based transformer. In the first stage, we decouple the BEV
segmentation into perspective image segmentation and geometric prior-based
mapping. Explicit supervision, obtained by projecting the BEV semantic labels
onto the image plane, is used to learn visibility-aware features and a learnable
geometry that translates them into BEV space. In the second stage, the
pre-aligned coarse BEV features are
further deformed by ray-based transformers to take visibility knowledge into
account. GitNet achieves leading performance on the challenging nuScenes
and Argoverse datasets. The code will be publicly available.
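
As an illustration of the label projection used for supervision in the first
stage, the following minimal sketch projects ground-plane BEV cells onto the
image plane. It is not the GitNet implementation: it assumes a flat ground
plane, an ideal pinhole camera with no rotation relative to the ground, and a
known camera height. The function names, the grid layout, and all parameters
(fx, fy, cx, cy, cam_height, cell_size) are hypothetical and chosen only for
illustration.

```python
def project_bev_to_image(x, z, fx, fy, cx, cy, cam_height):
    """Project a ground-plane point to a pixel (u, v).

    Assumptions (illustrative, not from the paper): x is the lateral offset
    and z the forward distance in metres, the image y-axis points down, and
    the camera sits cam_height metres above a flat ground plane.
    """
    if z <= 0:
        return None  # points at or behind the camera are not visible
    u = fx * x / z + cx
    v = fy * cam_height / z + cy
    return (u, v)


def project_bev_labels(bev_labels, cell_size, fx, fy, cx, cy,
                       cam_height, width, height):
    """Map each BEV cell label to the pixel its cell centre projects to.

    bev_labels is a list of rows; row i lies at forward distance
    (i + 0.5) * cell_size and column j is centred laterally around x = 0.
    Returns a dict {(u, v): label} holding only pixels inside the image.
    """
    n_cols = len(bev_labels[0])
    projected = {}
    for i, row in enumerate(bev_labels):
        z = (i + 0.5) * cell_size
        for j, label in enumerate(row):
            x = (j - n_cols / 2 + 0.5) * cell_size
            uv = project_bev_to_image(x, z, fx, fy, cx, cy, cam_height)
            if uv is None:
                continue
            u, v = int(round(uv[0])), int(round(uv[1]))
            if 0 <= u < width and 0 <= v < height:
                projected[(u, v)] = label  # later cells overwrite earlier ones
    return projected
```

Under these assumptions, distant BEV cells project close to the horizon row
(v near cy), which is one reason a purely geometric mapping yields only coarse
BEV features that benefit from further refinement.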