Deep learning applied to Earth observation (EO) yields impressive results. However, a significant challenge in EO is that data volume is increasing rapidly while annotation resources remain limited. Self-supervised representation learning (SSL) addresses this by exploiting large amounts of unlabeled data. Recently, Masked Image Modelling (MIM) has demonstrated scalability in both model and data size. MIM masks a defined ratio of the input image and trains a model to predict the masked patches; the learnt encoder is then transferred to downstream tasks. In this work, we explore a new MIM approach for EO that combines two state-of-the-art SSL methodologies. The first is the Masked Autoencoder (MAE), an asymmetric design in which an encoder operates only on the visible patches and a smaller decoder reconstructs the raw input. The second is Masked Feature Prediction (MFP), in which image feature descriptors are reconstructed. We test our approach on the SSL4EO-S12 dataset, reconstructing Histograms of Oriented Gradients (HOG).
We evaluate the pre-trained model on multi-class classification on EuroSAT. Experimental results indicate stable performance, with more than 90% accuracy when using as little as 10% of the labeled data. An ablation study on data normalization reveals that linear-classification accuracy on the downstream task benefits from normalization by up to 6%. In contrast, fine-tuning accuracies are robust to data normalization.
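
The masking step described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the patch size of 16, the mask ratio of 0.75, and the helper names `patchify`, `random_patch_mask`, and `masked_mse` are assumptions chosen for illustration. The sketch splits an image into patches, masks a fixed ratio of them at random, and computes a reconstruction loss on the masked patches only. Raw pixels serve as the target here; in a feature-prediction setting, a descriptor such as HOG computed per patch would take their place.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop to a multiple of the patch size, then regroup pixels per patch.
    cropped = image[: gh * patch_size, : gw * patch_size]
    grid = cropped.reshape(gh, patch_size, gw, patch_size, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size * patch_size * c)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector where True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error averaged over the masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # stand-in for an EO image tile
patches = patchify(image, patch_size=16)   # assumed patch size
mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)  # assumed ratio

visible = patches[~mask]                   # only these would reach the encoder
dummy_prediction = np.zeros_like(patches)  # placeholder for a decoder output
loss = masked_mse(dummy_prediction, patches, mask)
```

Passing only `visible` to the encoder is what makes the MAE-style design asymmetric: the encoder never processes the masked patches, and the loss is restricted to the positions it did not see.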