Modeling spatiotemporal brain dynamics from high-dimensional data, such as
functional Magnetic Resonance Imaging (fMRI), is a formidable task in
neuroscience. Existing approaches to fMRI analysis rely on hand-crafted
features, but this feature-extraction step risks discarding essential
information in the scans. To address this challenge, we present SwiFT (Swin 4D
fMRI Transformer), a Swin Transformer architecture that can learn brain
dynamics directly from fMRI volumes in a memory- and computation-efficient
manner. SwiFT achieves this by implementing a 4D window multi-head
self-attention mechanism and absolute positional embeddings.
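To make the mechanism concrete, below is a minimal PyTorch sketch of 4D window partitioning followed by standard multi-head self-attention with a learned absolute positional embedding inside each window. This is an illustrative approximation under our own assumptions, not the authors' implementation: the names (window_partition_4d, Window4DAttention), the window size, and the per-window placement of the positional embedding are all hypothetical.

```python
# Minimal sketch: 4D window partitioning + in-window multi-head attention.
# Illustrative only; names and window size are hypothetical, not SwiFT's code.
import torch
import torch.nn as nn


def window_partition_4d(x, ws):
    """Split a 4D feature map (B, H, W, D, T, C) into non-overlapping
    windows of size ws = (h, w, d, t), flattened to token sequences."""
    B, H, W, D, T, C = x.shape
    h, w, d, t = ws
    x = x.view(B, H // h, h, W // w, w, D // d, d, T // t, t, C)
    # Group the window-grid axes together, then flatten each window to tokens.
    x = x.permute(0, 1, 3, 5, 7, 2, 4, 6, 8, 9).contiguous()
    return x.view(-1, h * w * d * t, C)  # (num_windows * B, tokens, C)


class Window4DAttention(nn.Module):
    """Self-attention applied independently inside each 4D window, so cost
    scales with the window volume rather than the full 4D volume."""

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned absolute positional embedding, one vector per in-window token.
        n_tokens = window_size[0] * window_size[1] * window_size[2] * window_size[3]
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))

    def forward(self, x):  # x: (B, H, W, D, T, C)
        windows = window_partition_4d(x, self.window_size) + self.pos
        out, _ = self.attn(windows, windows, windows)
        return out  # (num_windows * B, tokens, C)


# Toy usage on a tiny 4D fMRI-like feature map.
feat = torch.randn(2, 8, 8, 8, 4, 32)               # (B, H, W, D, T, C)
attn = Window4DAttention(dim=32, num_heads=4, window_size=(4, 4, 4, 2))
print(attn(feat).shape)                             # torch.Size([32, 128, 32])
```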
We evaluate SwiFT on multiple large-scale resting-state fMRI datasets, including the Human
Connectome Project (HCP), Adolescent Brain Cognitive Development (ABCD), and UK
Biobank (UKB) datasets, to predict sex, age, and cognitive intelligence. Our
experiments show that SwiFT consistently outperforms recent state-of-the-art
models. Furthermore, by leveraging its end-to-end learning capability, we show
that contrastive loss-based self-supervised pre-training of SwiFT can enhance
performance on downstream tasks.
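As a rough illustration of what such pre-training can look like, here is a minimal sketch of an InfoNCE-style contrastive objective on paired embeddings. The paper's exact loss, positive-pair construction, and hyperparameters may differ; the function name info_nce, the temperature value, and the random embeddings standing in for SwiFT outputs are assumptions.

```python
# Minimal sketch of a contrastive (InfoNCE-style) pre-training objective on
# paired embeddings. Illustrative only; the paper's exact loss and pairing
# strategy may differ, and the encoder producing z1/z2 stands in for SwiFT.
import torch
import torch.nn.functional as F


def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two views of the same N subjects/clips.
    Matching rows are positives; all other rows act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature         # (N, N) similarity matrix
    targets = torch.arange(z1.size(0))         # positives on the diagonal
    # Symmetrized cross-entropy over both pairing directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage with random "embeddings" standing in for SwiFT outputs.
z_a, z_b = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(z_a, z_b).item())
```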
Additionally, we employ an explainable AI method to identify the brain regions
associated with sex classification.
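For intuition, the sketch below shows how voxel-level attributions for a sex classifier could be computed. The abstract does not name the XAI method, so Integrated Gradients via the captum library is shown purely as one plausible choice; the stand-in model, input shape, and target class index are hypothetical.

```python
# Minimal sketch: attributing a sex-classification prediction to input voxels.
# The method shown (Integrated Gradients via captum) is an assumption, and a
# tiny stand-in classifier replaces SwiFT.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Stand-in classifier over a small 4D fMRI clip, flattened to a vector.
model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8 * 8 * 4, 2))
model.eval()

clip = torch.randn(1, 8, 8, 8, 4, requires_grad=True)  # (B, H, W, D, T)
ig = IntegratedGradients(model)
# Attribution score per voxel-timepoint for class index 1 (hypothetical label
# ordering); large magnitudes mark influential regions.
attr = ig.attribute(clip, target=1)
print(attr.shape)  # torch.Size([1, 8, 8, 8, 4])
```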
To our knowledge, SwiFT is the first Swin Transformer
architecture to process 4-dimensional spatiotemporal brain functional data in an
end-to-end fashion. Our work has substantial potential to facilitate scalable
learning of functional brain imaging in neuroscience research by lowering the
barriers to applying Transformer models to
high-dimensional fMRI.