We present PolyBuilding, a fully end-to-end polygon Transformer for building
extraction. PolyBuilding directly predicts vector representations of buildings
from remote sensing images. It builds upon an encoder-decoder Transformer
architecture and simultaneously outputs building bounding boxes and polygons.
Given a set of polygon queries, the model learns the relations among them and
encodes context information from the image to predict the final set of building
polygons with fixed vertex numbers. Corner classification is performed to
distinguish the building corners from the sampled points, which can be used to
remove redundant vertices along the building walls during inference. A 1-D
non-maximum suppression (NMS) is further applied to reduce vertex redundancy
near the building corners. With the refinement operations, polygons with
regular shapes and low complexity can be effectively obtained. Comprehensive
experiments are conducted on the CrowdAI dataset. Quantitative and qualitative
results show that our approach outperforms prior polygonal building extraction
methods by a large margin. It also achieves a new state of the art in terms of
pixel-level coverage, instance-level precision and recall, and geometry-level
properties (including contour regularity and polygon complexity).
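
The two inference-time refinement steps described above (dropping sampled points that are not classified as corners, then applying 1-D NMS along the vertex sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `refine_polygon`, the parameters `score_thresh` and `window`, and the choice of a greedy suppression over a circular index window are all assumptions made for illustration, since the abstract does not specify these details.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def refine_polygon(
    vertices: Sequence[Point],
    corner_scores: Sequence[float],
    score_thresh: float = 0.5,  # hypothetical corner-classification threshold
    window: int = 1,            # hypothetical NMS radius, in vertex indices
) -> List[Point]:
    """Keep likely corners, then suppress near-duplicates along the contour."""
    n = len(vertices)
    if n != len(corner_scores):
        raise ValueError("vertices and corner_scores must have equal length")

    # Step 1: corner classification. Points scoring below the threshold are
    # treated as redundant samples along building walls and dropped.
    candidates = [i for i in range(n) if corner_scores[i] >= score_thresh]

    # Step 2: greedy 1-D NMS. Visit candidates from highest to lowest score;
    # each kept vertex suppresses its neighbours within `window` positions.
    # Indices wrap around because the polygon is a closed sequence.
    order = sorted(candidates, key=lambda i: -corner_scores[i])
    suppressed = [False] * n
    keep: List[int] = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        for offset in range(1, window + 1):
            suppressed[(i + offset) % n] = True
            suppressed[(i - offset) % n] = True

    # Restore the original ordering so the output is still a valid polygon.
    keep.sort()
    return [vertices[i] for i in keep]


# Illustrative input: a square sampled at 8 points, with one redundant
# vertex at (1.0, 0.05) right next to the corner at (1.0, 0.0).
square = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.0, 0.05),
          (1.0, 1.0), (0.5, 1.0), (0.0, 1.0), (0.0, 0.5)]
scores = [0.9, 0.1, 0.8, 0.6, 0.95, 0.2, 0.85, 0.1]
refined = refine_polygon(square, scores)
```

In this toy input, thresholding removes the wall midpoints, while the suppression step removes the extra vertex near a corner that thresholding alone would have kept.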