VMamba: Visual State Space Model
Designing computationally efficient network architectures persists as an
ongoing necessity in computer vision. In this paper, we transplant Mamba, a
state-space language model, into VMamba, a vision backbone that works in linear
time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS)
blocks with the 2D Selective Scan (SS2D) module. By traversing along four
scanning routes, SS2D helps bridge the gap between the ordered nature of 1D
selective scan and the non-sequential structure of 2D vision data, which
facilitates the gathering of contextual information from various sources and
perspectives. Based on the VSS blocks, we develop a family of VMamba
architectures and accelerate them through a succession of architectural and
implementation enhancements. Extensive experiments showcase VMamba's promising
performance across diverse visual perception tasks, highlighting its advantages
in input scaling efficiency compared to existing benchmark models. Source code
is available at https://github.com/MzeroMiko/VMamba.
Comment: 25 pages, 14 figures, 15 tables
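To make the four-route idea in SS2D concrete, here is a minimal, hedged PyTorch sketch of scanning a 2D feature map along row-major, reversed row-major, column-major, and reversed column-major routes and then merging the results. The function scan1d is a hypothetical placeholder (a causal cumulative mean) standing in for the real 1D selective scan, so this is an illustration of the traversal pattern, not the VMamba implementation.

# Minimal sketch of SS2D-style four-route scanning (illustrative only).
# `scan1d` is a stand-in for the 1D selective scan; here it is a plain
# cumulative mean so the example runs without the real Mamba kernel.
import torch

def scan1d(x):
    # x: (B, L, C) -> causal cumulative mean as a placeholder for selective scan
    denom = torch.arange(1, x.shape[1] + 1, device=x.device).view(1, -1, 1)
    return x.cumsum(dim=1) / denom

def ss2d_four_routes(x):
    # x: (B, C, H, W) feature map
    B, C, H, W = x.shape
    rowwise = x.flatten(2).transpose(1, 2)                  # (B, H*W, C), row-major order
    colwise = x.transpose(2, 3).flatten(2).transpose(1, 2)  # (B, H*W, C), column-major order
    routes = [rowwise, rowwise.flip(1), colwise, colwise.flip(1)]
    outs = [scan1d(r) for r in routes]
    outs[1] = outs[1].flip(1)          # undo the reversed row-major route
    outs[3] = outs[3].flip(1)          # undo the reversed column-major route
    for i in (2, 3):                   # map column-major results back to row-major order
        outs[i] = outs[i].transpose(1, 2).reshape(B, C, W, H).transpose(2, 3).flatten(2).transpose(1, 2)
    merged = sum(outs) / 4.0           # simple average merge of the four routes
    return merged.transpose(1, 2).reshape(B, C, H, W)

y = ss2d_four_routes(torch.randn(2, 8, 14, 14))
print(y.shape)  # torch.Size([2, 8, 14, 14])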
Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model
Despite the significant achievements of Vision Transformers (ViTs) in various
vision tasks, they are constrained by the quadratic complexity. Recently, State
Space Models (SSMs) have garnered widespread attention due to their global
receptive field and linear complexity with respect to the input length,
demonstrating substantial potential across fields including natural language
processing and computer vision. To improve the performance of SSMs in vision
tasks, a multi-scan strategy is widely adopted, which leads to significant
redundancy of SSMs. For a better trade-off between efficiency and performance,
we analyze the underlying reasons behind the success of the multi-scan
strategy, where long-range dependency plays an important role. Based on the
analysis, we introduce Multi-Scale Vision Mamba (MSVMamba) to preserve the
superiority of SSMs in vision tasks with limited parameters. It employs a
multi-scale 2D scanning technique on both original and downsampled feature
maps, which not only benefits long-range dependency learning but also reduces
computational costs. Additionally, we integrate a Convolutional Feed-Forward
Network (ConvFFN) to address the lack of channel mixing. Our experiments
demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model
achieving 82.8% top-1 accuracy on ImageNet, 46.9% box mAP, and 42.2% instance
mAP with the Mask R-CNN framework, 1x training schedule on COCO, and 47.6% mIoU
with single-scale testing on ADE20K. Code is available at
https://github.com/YuHengsss/MSVMamba
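The multi-scale scanning described above can be pictured with a hedged sketch: run a scan over the full-resolution map and over a 2x-downsampled copy (roughly a quarter of the tokens), then upsample and fuse. The placeholder_scan below is a hypothetical stand-in for an SS2D-style scan, and the pooling and fusion choices are assumptions rather than the MSVMamba code.

# Illustrative multi-scale scanning: run a (placeholder) 2D scan on the full-resolution
# map and on a 2x-downsampled copy, then upsample and fuse. Not the MSVMamba code.
import torch
import torch.nn.functional as F

def placeholder_scan(x):
    # Stand-in for an SS2D-style scan over a (B, C, H, W) map: causal cumulative
    # mean along the flattened spatial dimension, reshaped back.
    B, C, H, W = x.shape
    seq = x.flatten(2)                                    # (B, C, H*W)
    denom = torch.arange(1, H * W + 1, device=x.device).view(1, 1, -1)
    return (seq.cumsum(-1) / denom).reshape(B, C, H, W)

def multi_scale_scan(x):
    full = placeholder_scan(x)                            # full-resolution branch
    small = placeholder_scan(F.avg_pool2d(x, kernel_size=2))  # cheaper branch on 1/4 of the tokens
    small_up = F.interpolate(small, size=x.shape[-2:], mode="nearest")
    return full + small_up                                # fuse the two scales

y = multi_scale_scan(torch.randn(1, 16, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])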
Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining
Accurate medical image segmentation demands the integration of multi-scale
information, spanning from local features to global dependencies. However, it
is challenging for existing methods to model long-range global information,
where convolutional neural networks (CNNs) are constrained by their local
receptive fields, and vision transformers (ViTs) suffer from the quadratic
complexity of their attention mechanism. Recently, Mamba-based models have
gained great attention for their impressive ability in long sequence modeling.
Several studies have demonstrated that these models can outperform popular
vision models in various tasks, offering higher accuracy, lower memory
consumption, and less computational burden. However, existing Mamba-based
models are mostly trained from scratch and do not explore the power of
pretraining, which has been proven to be quite effective for data-efficient
medical image analysis. This paper introduces a novel Mamba-based model,
Swin-UMamba, designed specifically for medical image segmentation tasks,
leveraging the advantages of ImageNet-based pretraining. Our experimental
results reveal the vital role of ImageNet-based training in enhancing the
performance of Mamba-based models. Swin-UMamba demonstrates superior
performance by a large margin compared to CNNs, ViTs, and the latest Mamba-based
models. Notably, on the AbdomenMRI, Endoscopy, and Microscopy datasets, Swin-UMamba
outperforms its closest counterpart U-Mamba_Enc by an average score of 2.72%.
Comment: Code and models of Swin-UMamba are publicly available at:
https://github.com/JiarunLiu/Swin-UMamb
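The pretraining point above is essentially about initializing the segmentation encoder from ImageNet weights rather than training from scratch. As a generic, hedged illustration of that pattern (using a torchvision ResNet-50 in place of the Mamba-based encoder and a hypothetical SegModel wrapper; this is not the Swin-UMamba code):

# Generic pattern for reusing ImageNet-pretrained weights in a segmentation encoder.
# Hypothetical wrapper; Swin-UMamba itself builds on a pretrained Mamba-based encoder.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class SegModel(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Downloads ImageNet-pretrained weights on first use.
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.head = nn.Conv2d(2048, num_classes, kernel_size=1)        # randomly initialized head

    def forward(self, x):
        feats = self.head(self.encoder(x))
        # Upsample logits back to the input resolution.
        return nn.functional.interpolate(feats, size=x.shape[-2:], mode="bilinear", align_corners=False)

model = SegModel(num_classes=2)
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2, 224, 224])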
Mamba meets crack segmentation
Cracks pose safety risks to infrastructure and cannot be overlooked. The
prevailing structures in existing crack segmentation networks predominantly
consist of CNNs or Transformers. However, CNNs exhibit a deficiency in global
modeling capability, hindering the representation of entire crack features.
Transformers can capture long-range dependencies but suffer from high,
quadratic computational complexity. Recently, Mamba has garnered extensive attention due to
its linear spatial and computational complexity and its powerful global
perception. This study explores the representation capabilities of Mamba for
crack features. Specifically, this paper uncovers the connection between Mamba
and the attention mechanism, providing a profound insight, an attention
perspective, into interpreting Mamba and devising a novel Mamba module
following the principles of attention blocks, namely CrackMamba. We compare
CrackMamba with the most prominent visual Mamba modules, Vim and Vmamba, on two
datasets comprising asphalt pavement and concrete pavement cracks, and steel
cracks, respectively. The quantitative results show that CrackMamba stands out
as the sole Mamba block consistently enhancing the baseline model's performance
across all evaluation measures, while reducing its parameters and computational
costs. Moreover, this paper substantiates that Mamba can achieve global
receptive fields through both theoretical analysis and visual interpretability.
The discoveries of this study offer a dual contribution. First, as a
plug-and-play and simple yet effective Mamba module, CrackMamba exhibits
immense potential for integration into various crack segmentation models.
Second, the proposed innovative Mamba design concept, integrating Mamba with
the attention mechanism, holds significant reference value for all Mamba-based
computer vision models, not limited to crack segmentation networks, as
investigated in this study.
Comment: 32 pages, 8 figures. Preprint submitted to Elsevier
VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation
In the field of medical image segmentation, models based on both CNN and
Transformer have been thoroughly investigated. However, CNNs have limited
modeling capabilities for long-range dependencies, making it challenging to
exploit the semantic information within images fully. On the other hand, the
quadratic computational complexity poses a challenge for Transformers.
Recently, State Space Models (SSMs), such as Mamba, have been recognized as a
promising method. They not only demonstrate superior performance in modeling
long-range interactions, but also preserve a linear computational complexity.
Inspired by the Mamba architecture, we propose Vision Mamba-UNetV2, in which the
Visual State Space (VSS) block is introduced to capture extensive contextual
information and the Semantics and Detail Infusion (SDI) module is introduced to
augment the fusion of low-level and high-level features. We conduct comprehensive
experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB
and ETIS-LaribPolypDB public datasets. The results indicate that VM-UNetV2
exhibits competitive performance in medical image segmentation tasks. Our code
is available at https://github.com/nobodyplayer1/VM-UNetV2.
Comment: 12 pages, 4 figures
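As a hedged sketch of the kind of low-level/high-level infusion an SDI-style module performs, the snippet below projects both feature maps to a common width, resizes the semantic map, and lets it gate the detail features. The class name and the gating formulation are illustrative assumptions, not the VM-UNetV2 implementation.

# Simplified semantics-and-detail style fusion: project both feature maps to a common
# width, resize the high-level (semantic) map to the low-level (detail) resolution,
# and modulate the detail features with it. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSDIFusion(nn.Module):
    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        self.proj_low = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.proj_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)

    def forward(self, low, high):
        # low:  (B, low_ch, H, W)   fine detail features
        # high: (B, high_ch, h, w)  coarse semantic features, h <= H, w <= W
        low = self.proj_low(low)
        high = F.interpolate(self.proj_high(high), size=low.shape[-2:], mode="bilinear", align_corners=False)
        return low * torch.sigmoid(high) + high   # semantics gate the details, plus a residual path

fuse = SimpleSDIFusion(low_ch=64, high_ch=256, out_ch=128)
out = fuse(torch.randn(1, 64, 56, 56), torch.randn(1, 256, 14, 14))
print(out.shape)  # torch.Size([1, 128, 56, 56])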
VMambaCC: A Visual State Space Model for Crowd Counting
As a deep learning model, Visual Mamba (VMamba) has low computational
complexity and a global receptive field, and it has been successfully applied to
image classification and detection. To extend its applications, we apply VMamba
to crowd counting and propose a novel VMambaCC (VMamba Crowd Counting) model.
Naturally, VMambaCC inherits the merits of VMamba, namely global modeling for
images and low computational cost. Additionally, we design a Multi-head
High-level Feature (MHF) attention mechanism for VMambaCC. MHF is a new
attention mechanism that leverages high-level semantic features to augment
low-level semantic features, thereby enhancing spatial feature representation
with greater precision. Building upon MHF, we further present a High-level
Semantic Supervised Feature Pyramid Network (HS2PFN) that progressively
integrates and enhances high-level semantic information with low-level semantic
information. Extensive experimental results on five public datasets validate
the efficacy of our approach. For example, our method achieves a mean absolute
error of 51.87 and a mean squared error of 81.3 on the ShanghaiTech Part A
dataset. Our code is coming soon.
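One plausible, hedged reading of an attention mechanism in which high-level semantics augment low-level features is multi-head cross-attention, with low-level tokens as queries and high-level tokens as keys and values. The sketch below follows that reading with hypothetical names; it is not the actual MHF definition.

# Cross-attention sketch: low-level feature tokens attend to high-level semantic tokens,
# so semantics augment spatial detail. Illustrative only; not the MHF module itself.
import torch
import torch.nn as nn

class HighLevelCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm_low = nn.LayerNorm(dim)
        self.norm_high = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, low_tokens, high_tokens):
        # low_tokens: (B, N_low, C) detail features; high_tokens: (B, N_high, C) semantic features
        q = self.norm_low(low_tokens)
        kv = self.norm_high(high_tokens)
        augmented, _ = self.attn(q, kv, kv, need_weights=False)
        return low_tokens + augmented    # residual connection keeps the original detail

mhf_like = HighLevelCrossAttention(dim=64)
out = mhf_like(torch.randn(1, 56 * 56, 64), torch.randn(1, 14 * 14, 64))
print(out.shape)  # torch.Size([1, 3136, 64])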
A Survey on Visual Mamba
State space models (SSMs) with selection mechanisms and hardware-aware
architectures, namely Mamba, have recently demonstrated significant promise in
long-sequence modeling. Since the self-attention mechanism in transformers has
quadratic complexity with respect to image size and growing computational demands,
researchers are now exploring how to adapt Mamba for computer vision tasks.
This paper is the first comprehensive survey aiming to provide an in-depth
analysis of Mamba models in the field of computer vision. It begins by
exploring the foundational concepts contributing to Mamba's success, including
the state space model framework, selection mechanisms, and hardware-aware
design. Next, we review these vision Mamba models by categorizing them into
foundational models and those enhanced with techniques such as convolution,
recurrence, and attention to improve their sophistication. We further delve
into the widespread applications of Mamba in vision tasks, which include their
use as a backbone in various levels of vision processing. This encompasses
general visual tasks, medical visual tasks (e.g., 2D/3D segmentation,
classification, and image registration), and remote sensing visual tasks.
In particular, we introduce general visual tasks at two levels: high/mid-level
vision (e.g., object detection, segmentation, and video classification) and
low-level vision (e.g., image super-resolution, image restoration, and visual
generation). We hope this endeavor will spark additional interest within
the community to address current challenges and further apply Mamba models in
computer vision.
VMambaMorph: a Multi-Modality Deformable Image Registration Framework based on Visual State Space Model with Cross-Scan Module
Image registration, a critical process in medical imaging, involves aligning
different sets of medical imaging data into a single unified coordinate system.
Deep learning networks, such as the Convolutional Neural Network (CNN)-based
VoxelMorph, Vision Transformer (ViT)-based TransMorph, and State Space Model
(SSM)-based MambaMorph, have demonstrated effective performance in this domain.
The recent Visual State Space Model (VMamba), which incorporates a cross-scan
module with SSM, has exhibited promising improvements in modeling global-range
dependencies with efficient computational cost in computer vision tasks. This
paper introduces VMambaMorph, an exploration of VMamba for image registration.
This novel hybrid VMamba-CNN network is designed specifically for
3D image registration. Utilizing a U-shaped network architecture, VMambaMorph
computes the deformation field based on target and source volumes. The
VMamba-based block with 2D cross-scan module is redesigned for 3D volumetric
feature processing. To overcome the complex motion and structure in
multi-modality images, we further propose a fine-tuned recursive registration
framework. We validate VMambaMorph using a public benchmark brain MR-CT
registration dataset, comparing its performance against current
state-of-the-art methods. The results indicate that VMambaMorph achieves
competitive registration quality. The code for VMambaMorph with all baseline
methods is available on GitHub.
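For readers unfamiliar with deformable registration, "computes the deformation field based on target and source volumes" can be illustrated with a hedged VoxelMorph-style sketch: a network maps the concatenated volumes to a 3-channel displacement field, which is added to an identity grid and used to resample the source via grid_sample. The single Conv3d below is a hypothetical stand-in for the U-shaped network; this is not the VMambaMorph code.

# Minimal deformation-field warping sketch for 3D registration. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(source, flow):
    # source: (B, 1, D, H, W) volume; flow: (B, 3, D, H, W) voxel displacements, channels (dx, dy, dz)
    B, _, D, H, W = source.shape
    zz, yy, xx = torch.meshgrid(
        torch.arange(D, device=source.device, dtype=source.dtype),
        torch.arange(H, device=source.device, dtype=source.dtype),
        torch.arange(W, device=source.device, dtype=source.dtype),
        indexing="ij",
    )
    grid = torch.stack((xx, yy, zz), dim=-1).unsqueeze(0)    # (1, D, H, W, 3) identity grid, xyz order
    disp = flow.permute(0, 2, 3, 4, 1)                       # (B, D, H, W, 3)
    sizes = torch.tensor([W, H, D], device=source.device, dtype=source.dtype)
    new_grid = 2.0 * (grid + disp) / (sizes - 1) - 1.0       # normalize to [-1, 1] for grid_sample
    return F.grid_sample(source, new_grid, mode="bilinear", align_corners=True)

field_net = nn.Conv3d(2, 3, kernel_size=3, padding=1)        # hypothetical stand-in for the U-shaped network
source, target = torch.randn(1, 1, 16, 32, 32), torch.randn(1, 1, 16, 32, 32)
flow = field_net(torch.cat([source, target], dim=1))         # (1, 3, 16, 32, 32) deformation field
moved = warp(source, flow)
print(moved.shape)  # torch.Size([1, 1, 16, 32, 32])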
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision,
which is specifically tailored for vision applications. Our core contribution
includes redesigning the Mamba formulation to enhance its capability for
efficient modeling of visual features. In addition, we conduct a comprehensive
ablation study on the feasibility of integrating Vision Transformers (ViT) with
Mamba. Our results demonstrate that equipping the Mamba architecture with
several self-attention blocks at the final layers greatly improves the modeling
capacity to capture long-range spatial dependencies. Based on our findings, we
introduce a family of MambaVision models with a hierarchical architecture to
meet various design criteria. For image classification on the ImageNet-1K dataset,
MambaVision model variants achieve new state-of-the-art (SOTA) performance in
terms of Top-1 accuracy and image throughput. In downstream tasks such as
object detection, instance segmentation and semantic segmentation on MS COCO
and ADE20K datasets, MambaVision outperforms comparably-sized backbones and
demonstrates more favorable performance. Code:
https://github.com/NVlabs/MambaVision
Comment: Tech. report
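The hybrid layout the abstract describes, a stage of sequence-mixing blocks capped by a few self-attention blocks at the final layers, can be sketched as follows. PlaceholderMixerBlock is a hypothetical stand-in for a Mamba block (here a depthwise 1D convolution) so the snippet stays self-contained; nothing below is the MambaVision code.

# Illustrative hybrid stage: N sequence-mixer blocks (placeholder for Mamba blocks)
# followed by M self-attention blocks at the end, as the abstract describes.
import torch
import torch.nn as nn

class PlaceholderMixerBlock(nn.Module):
    # Stand-in for a Mamba block: depthwise 1D mixing over the token dimension.
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                         # x: (B, N, C)
        y = self.mix(self.norm(x).transpose(1, 2)).transpose(1, 2)
        return x + y

class AttentionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        y = self.norm(x)
        out, _ = self.attn(y, y, y, need_weights=False)
        return x + out

class HybridStage(nn.Module):
    def __init__(self, dim: int, n_mixer: int = 4, n_attn: int = 2):
        super().__init__()
        blocks = [PlaceholderMixerBlock(dim) for _ in range(n_mixer)]
        blocks += [AttentionBlock(dim) for _ in range(n_attn)]    # self-attention only at the final layers
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.blocks(x)

stage = HybridStage(dim=64)
print(stage(torch.randn(2, 49, 64)).shape)  # torch.Size([2, 49, 64])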
