Current transformer-based change detection (CD) approaches either employ a
backbone pre-trained on the large-scale ImageNet classification dataset or
first pre-train on another CD dataset and then fine-tune on the target
benchmark. This strategy is driven by the fact that transformers typically
require a large amount of training data to learn inductive biases, and
standard CD datasets are too small to provide it. We
develop an end-to-end CD approach with transformers that is trained from
scratch and yet achieves state-of-the-art performance on four public
benchmarks. Instead of using conventional self-attention that struggles to
capture inductive biases when trained from scratch, our architecture utilizes a
shuffled sparse-attention operation that focuses on selected sparse informative
regions to capture the inherent characteristics of the CD data. Moreover, we
introduce a change-enhanced feature fusion (CEFF) module to fuse the features
from input image pairs by performing a per-channel re-weighting. Our CEFF
module enhances the relevant semantic changes while suppressing the noisy
ones. Extensive experiments on four CD datasets reveal the merits of the
proposed contributions, achieving gains as high as 14.27\% in
intersection-over-union (IoU) score, compared to the best published results in
the literature. Code is available at
\url{https://github.com/mustansarfiaz/ScratchFormer}.

Comment: 5 figures and 4 tables
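The per-channel re-weighting idea behind the CEFF module can be illustrated with a minimal sketch. The function name, the additive fusion, and the sigmoid gate over pooled change evidence are all assumptions for illustration; they are not the authors' exact module, whose details are in the paper and repository.

```python
import numpy as np

def ceff_fuse(feat_a, feat_b):
    """Hypothetical sketch of change-enhanced feature fusion:
    fuse bi-temporal features via per-channel re-weighting.
    feat_a, feat_b: (C, H, W) feature maps from the two images.
    """
    # Channel-wise change evidence between the two time steps.
    diff = np.abs(feat_a - feat_b)          # (C, H, W)
    # Global average pooling: one scalar of change evidence per channel.
    pooled = diff.mean(axis=(1, 2))         # (C,)
    # Sigmoid gate (assumed): channels with strong change get weights near 1.
    weights = 1.0 / (1.0 + np.exp(-pooled))  # (C,)
    # Simple additive fusion of the pair (assumed), then re-weight
    # each channel to emphasize change-relevant semantics.
    fused = feat_a + feat_b                  # (C, H, W)
    return fused * weights[:, None, None]

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16, 16))
b = rng.standard_normal((8, 16, 16))
out = ceff_fuse(a, b)
print(out.shape)
```

The key property sketched here is that the gate is computed per channel, not per pixel, so channels carrying noisy or change-irrelevant responses are uniformly down-weighted before the fused features reach the decoder.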