Like masked language modeling (MLM) in natural language processing, masked
image modeling (MIM) aims to extract valuable insights from image patches to
enhance the feature extraction capabilities of the underlying deep neural
network (DNN). In contrast to other training paradigms such as supervised
learning and unsupervised contrastive learning, MIM pretraining typically
demands substantial computational resources to accommodate large training batch
sizes (e.g., 4096). These memory and computation requirements pose a
considerable barrier to its broad adoption.
To mitigate this, we introduce a novel learning framework,
termed~\textit{Block-Wise Masked Image Modeling} (BIM), which decomposes the
MIM task into several sub-tasks with independent computation patterns, enabling
block-wise back-propagation in place of the traditional end-to-end approach.
Our proposed BIM maintains superior
performance compared to conventional MIM while greatly reducing peak memory
consumption. Moreover, BIM naturally enables the concurrent training of
multiple DNN backbones of varying depths, yielding a set of pretrained
backbones, each tailored to hardware platforms with distinct computing
capabilities, at a significantly lower computational cost than training each
backbone individually.
Our framework offers a promising solution for resource-constrained MIM
training.
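
Below is a minimal, PyTorch-style sketch of the block-wise back-propagation
idea described above. It is an illustrative assumption of how such training
could be organized, not the paper's actual implementation; the names
(\texttt{Block}, \texttt{bim\_step}), tensor shapes, and hyperparameters are
hypothetical. Each block carries its own lightweight reconstruction head and
optimizer, its input features are detached so that gradients never cross block
boundaries, and the prefix formed by the first $k$ blocks can serve as a
standalone backbone of depth $k$.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One encoder block paired with its own lightweight reconstruction head."""
    def __init__(self, dim, patch_dim):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                  batch_first=True)
        self.head = nn.Linear(dim, patch_dim)   # local MIM decoder

    def forward(self, x):
        return self.encoder(x)

def bim_step(blocks, tokens, targets, optimizers):
    """One training step: every block back-propagates only its local loss."""
    feats = tokens
    for block, opt in zip(blocks, optimizers):
        # Detach the incoming features so gradients never flow across block
        # boundaries; peak memory is bounded by one block's activations.
        feats = block(feats.detach())
        recon = block.head(feats)               # local reconstruction
        loss = F.mse_loss(recon, targets)       # e.g., masked-patch regression
        opt.zero_grad()
        loss.backward()                         # block-wise back-propagation
        opt.step()
    return feats

# Hypothetical usage: three blocks trained concurrently; the first k blocks
# also form a standalone backbone of depth k for a less capable device.
blocks = nn.ModuleList([Block(dim=256, patch_dim=768) for _ in range(3)])
opts = [torch.optim.AdamW(b.parameters(), lr=1.5e-4) for b in blocks]
tokens = torch.randn(8, 49, 256)    # embeddings of the visible patches
targets = torch.randn(8, 49, 768)   # reconstruction targets (placeholder)
bim_step(blocks, tokens, targets, opts)
\end{verbatim}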