Large, pretrained models are commonly finetuned with imagery that is heavily
augmented to mimic different conditions and scales, with the resulting models
used for various tasks with imagery from a range of spatial scales. Such models
overlook scale-specific information in the data for scale-dependent domains,
such as remote sensing. In this paper, we present Scale-MAE, a pretraining
method that explicitly learns relationships between data at different, known
scales throughout the pretraining process. Scale-MAE pretrains a network by
masking an input image at a known input scale, where the area of the Earth
covered by the image determines the scale of the ViT positional encoding, not
the image resolution. Scale-MAE encodes the masked image with a standard ViT
backbone, and then decodes the masked image through a bandpass filter to
reconstruct low/high frequency images at lower/higher scales. We find that
tasking the network with reconstructing both the low and high frequency images leads to
robust multiscale representations for remote sensing imagery. Scale-MAE
achieves an average of a 2.4–5.6% non-parametric kNN classification
improvement across eight remote sensing datasets compared to the current
state-of-the-art, and obtains a 0.9 mIoU to 1.7 mIoU improvement on the
SpaceNet building segmentation transfer task for a range of evaluation scales.
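
To make the scale-aware positional encoding concrete, the following is a minimal sketch rather than the authors' implementation: it scales a standard 2D sin-cos ViT positional encoding by an image's ground sample distance (GSD) relative to a reference GSD, so that two images covering the same ground extent receive comparable encodings regardless of pixel resolution. The function name `gsd_positional_encoding` and the `reference_gsd` parameter are assumptions for illustration.

```python
import numpy as np

def gsd_positional_encoding(grid_size, embed_dim, gsd, reference_gsd):
    """2D sin-cos ViT positional encoding whose extent is set by ground
    sample distance (GSD), so the encoding reflects ground coverage
    rather than pixel resolution."""
    # Patch-grid coordinates, stretched by the image's GSD relative to a
    # reference GSD: a coarser image (larger GSD) covers more ground and
    # therefore spans a larger positional extent.
    coords = np.arange(grid_size, dtype=np.float32) * (gsd / reference_gsd)
    grid_y, grid_x = np.meshgrid(coords, coords, indexing="ij")

    def encode_1d(pos, dim):
        # Standard transformer sin-cos encoding along one axis.
        omega = 1.0 / (10000 ** (np.arange(dim // 2, dtype=np.float32) / (dim // 2)))
        angles = pos.reshape(-1, 1) * omega.reshape(1, -1)
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

    # Half the channels encode the y-axis, half the x-axis.
    return np.concatenate(
        [encode_1d(grid_y, embed_dim // 2), encode_1d(grid_x, embed_dim // 2)], axis=1
    )  # shape: (grid_size**2, embed_dim)

# A 0.3 m/pixel image and a 3.0 m/pixel image with the same 14x14 patch grid
# receive encodings spanning different positional extents.
pe_fine = gsd_positional_encoding(14, 768, gsd=0.3, reference_gsd=0.3)
pe_coarse = gsd_positional_encoding(14, 768, gsd=3.0, reference_gsd=0.3)
```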
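
The bandpass reconstruction targets can be pictured with a similar sketch. This is an illustrative assumption, not the paper's decoder: the low-frequency target is a downscaled (blurred) view of the image and the high-frequency target is a Laplacian-style residual at a higher scale; the helper name `frequency_targets` and the default scale factors are hypothetical.

```python
import torch
import torch.nn.functional as F

def frequency_targets(img, low_scale=0.5, high_scale=2.0):
    """Build low/high frequency reconstruction targets from an image
    batch of shape (N, C, H, W)."""
    # Low-frequency target: a downscaled view that keeps only coarse structure.
    low = F.interpolate(img, scale_factor=low_scale, mode="bilinear",
                        align_corners=False)
    # High-frequency target: the upscaled image minus its blurred version,
    # i.e. a Laplacian-style band of fine detail at the larger scale.
    up = F.interpolate(img, scale_factor=high_scale, mode="bilinear",
                       align_corners=False)
    blurred = F.interpolate(low, size=up.shape[-2:], mode="bilinear",
                            align_corners=False)
    return low, up - blurred

low_target, high_target = frequency_targets(torch.randn(2, 3, 224, 224))
```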