Real-world aerial scene understanding is limited by a lack of datasets that
contain densely annotated images curated under a diverse set of conditions. Due
to inherent challenges in obtaining such images in controlled real-world
settings, we present SkyScenes, a synthetic dataset of densely annotated aerial
images captured from Unmanned Aerial Vehicle (UAV) perspectives. We carefully
curate SkyScenes images from CARLA to comprehensively capture diversity across
layout (urban and rural maps), weather conditions, times of day, pitch angles,
and altitudes, with corresponding semantic, instance, and depth annotations.
Through our experiments using SkyScenes, we show that (1) models trained on
SkyScenes generalize well to different real-world scenarios, (2) augmenting
training on real images with SkyScenes data can improve real-world performance,
(3) controlled variations in SkyScenes can offer insights into how models
respond to changes in viewpoint conditions, and (4) incorporating additional
sensor modalities (depth) can improve aerial scene understanding.