A whole slide image (WSI) is a high-resolution scanned tissue image that is
widely used in computer-assisted diagnosis (CAD). The
extremely high resolution and limited availability of region-level annotations
make it challenging to employ deep learning methods for WSI-based digital
diagnosis. Multiple instance learning (MIL) is a powerful tool for addressing
the weak annotation problem, while the Transformer has shown great success in
visual tasks. Combining the two should provide new insights for
deep-learning-based image diagnosis. However, due to the limitations of single-level
MIL and the attention mechanism's constraints on sequence length, directly
applying Transformer to WSI-based MIL tasks is not practical. To tackle this
issue, we propose a Multi-level MIL with Transformer (MMIL-Transformer)
approach. By introducing a hierarchical structure to MIL, this approach enables
efficient handling of MIL tasks that involve a large number of instances. To
validate its effectiveness, we conducted a set of experiments on the WSI
classification task, where MMIL-Transformer demonstrates superior performance
compared to existing state-of-the-art methods. Our proposed approach achieves a
test AUC of 94.74% and a test accuracy of 93.41% on the CAMELYON16 dataset, and
a test AUC of 99.04% and a test accuracy of 94.37% on the TCGA-NSCLC dataset.
All code and
pre-trained models are available at: https://github.com/hustvl/MMIL-Transforme
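
The hierarchical idea can be illustrated with a minimal sketch. It is not the
released implementation: the function names, the fixed query vector, and the
use of simple attention pooling in place of Transformer layers are all
assumptions made for illustration. Instances are split into groups, each group
is pooled into one embedding, and the group embeddings are pooled again into a
bag embedding. The cost helper shows why grouping eases the sequence-length
constraint of attention.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances: List[Vector], query: Vector) -> Vector:
    """Softmax-weighted average of instances, scored by scaled dot product
    with a query vector (hypothetical stand-in for a Transformer block)."""
    scale = math.sqrt(len(query))
    scores = [sum(x * q for x, q in zip(inst, query)) / scale
              for inst in instances]
    weights = softmax(scores)
    return [sum(w * inst[d] for w, inst in zip(weights, instances))
            for d in range(len(query))]


def two_level_pool(instances: List[Vector], group_size: int,
                   query: Vector) -> Vector:
    """Level 1: pool each group of instances. Level 2: pool the group
    embeddings into a single bag embedding."""
    groups = [instances[i:i + group_size]
              for i in range(0, len(instances), group_size)]
    group_embeddings = [attention_pool(g, query) for g in groups]
    return attention_pool(group_embeddings, query)


def pairwise_cost(n: int, group_size: int) -> int:
    """Pairwise interactions if full self-attention ran inside each group
    and then once across the group embeddings."""
    sizes = [min(group_size, n - i) for i in range(0, n, group_size)]
    return sum(s * s for s in sizes) + len(sizes) ** 2


# Toy bag of four 2-d instance embeddings, pooled in groups of two.
bag = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0], [7.0, 0.0]]
bag_embedding = two_level_pool(bag, group_size=2, query=[0.0, 0.0])

flat_cost = 10_000 ** 2                       # single-level attention
grouped_cost = pairwise_cost(10_000, 100)     # two-level attention
```

Under these assumptions, a bag of 10,000 instances in groups of 100 needs far
fewer pairwise interactions than attention over the full sequence, which is the
kind of saving a multi-level structure is meant to provide.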