SPFormer: Enhancing Vision Transformer with Superpixel Representation
In this work, we introduce SPFormer, a novel Vision Transformer enhanced by
superpixel representation. Addressing the limitations of traditional Vision
Transformers' fixed-size, non-adaptive patch partitioning, SPFormer employs
superpixels that adapt to the image's content. This approach divides the image
into irregular, semantically coherent regions that effectively capture
intricate details, and it can be applied at both the initial and intermediate
feature levels.
SPFormer, trainable end-to-end, delivers superior performance across various
benchmarks. Notably, it achieves significant improvements on the challenging
ImageNet benchmark, with gains of 1.4% over DeiT-T and 1.1% over DeiT-S. A
standout feature of SPFormer is its inherent explainability.
The superpixel structure offers a window into the model's internal processes,
providing valuable insights that enhance the model's interpretability. This
level of clarity significantly improves SPFormer's robustness, particularly in
challenging scenarios such as image rotations and occlusions, demonstrating its
adaptability and resilience.
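As a rough, hypothetical illustration of the superpixel-token idea sketched above (not the authors' implementation; the function name, shapes, and softmax-based association are assumptions), the following PyTorch snippet pools pixel features into adaptive superpixel tokens via soft assignments to a set of center features:

import torch

def superpixel_tokens(pixel_feats, centers, temperature=0.1):
    # pixel_feats: (B, N, C) flattened pixel features
    # centers:     (B, K, C) current superpixel center features
    # Soft association between every pixel and every superpixel center.
    logits = torch.einsum("bnc,bkc->bnk", pixel_feats, centers) / temperature
    assign = logits.softmax(dim=-1)                               # (B, N, K)
    # Each superpixel token is the association-weighted average of its pixels.
    tokens = torch.einsum("bnk,bnc->bkc", assign, pixel_feats)
    tokens = tokens / assign.sum(dim=1).clamp(min=1e-6).unsqueeze(-1)
    return tokens, assign                                         # (B, K, C), (B, N, K)

tokens, assign = superpixel_tokens(torch.randn(2, 196, 64), torch.randn(2, 49, 64))
print(tokens.shape, assign.shape)  # torch.Size([2, 49, 64]) torch.Size([2, 196, 49])

In a sketch like this, the soft assignment map is also what provides built-in explainability: visualizing it shows which image regions each token covers.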
3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation
Regression-based methods for 3D human pose estimation directly predict the 3D
pose parameters from a 2D image using deep networks. While achieving
state-of-the-art performance on standard benchmarks, their performance degrades
under occlusion. In contrast, optimization-based methods fit a parametric body
model to 2D features in an iterative manner. The localized reconstruction loss
can potentially make them robust to occlusion, but they suffer from the 2D-3D
ambiguity.
Motivated by the recent success of generative models in rigid object pose
estimation, we propose 3D-aware Neural Body Fitting (3DNBF), an approximate
analysis-by-synthesis approach to 3D human pose estimation with state-of-the-art
performance and occlusion robustness. In particular, we propose a generative
model of deep features based on a volumetric human representation with Gaussian
ellipsoidal kernels emitting 3D pose-dependent feature vectors. The neural
features are trained with contrastive learning to become 3D-aware and hence to
overcome the 2D-3D ambiguity.
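To make the analysis-by-synthesis idea more concrete, here is a heavily simplified, hypothetical PyTorch sketch of a feature-level score: Gaussian kernels placed by a hypothesized pose emit feature vectors, and the score measures how well they explain the extracted feature map. The function name, shapes, and scoring formula are assumptions for illustration, not the released 3DNBF code.

import torch

def feature_score(feat_map, kernel_uv, kernel_feats, sigma=4.0):
    # feat_map:     (C, H, W) L2-normalized image feature map
    # kernel_uv:    (K, 2) projected 2D centers of the Gaussian kernels
    # kernel_feats: (K, C) L2-normalized pose-dependent kernel features
    C, H, W = feat_map.shape
    gy, gx = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).reshape(-1, 2)           # (H*W, 2)
    # Gaussian spatial weight of each kernel over the image plane.
    d2 = ((grid[None] - kernel_uv[:, None]) ** 2).sum(-1)         # (K, H*W)
    w = torch.exp(-d2 / (2 * sigma ** 2))
    # Cosine agreement between kernel features and pixel features.
    sim = kernel_feats @ feat_map.reshape(C, -1)                  # (K, H*W)
    return (w * sim).sum() / w.sum().clamp(min=1e-6)

Under this simplification, pose fitting would amount to gradient ascent on the score with respect to the pose parameters that determine the kernel positions and their emitted features.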
Experiments show that 3DNBF outperforms other approaches on both occluded and
standard benchmarks. Code is available at https://github.com/edz-o/3DNBFComment: ICCV 2023, project page: https://3dnbf.github.io
In Defense of Image Pre-Training for Spatiotemporal Recognition
Image pre-training, the current de-facto paradigm for a wide range of visual
tasks, is generally less favored in the field of video recognition. By
contrast, a common strategy is to directly train with spatiotemporal
convolutional neural networks (CNNs) from scratch. Interestingly, however, a
closer look at these from-scratch learned CNNs reveals that certain 3D kernels
exhibit much stronger appearance modeling ability than others, arguably
suggesting that appearance information is already well disentangled during
learning. Inspired by this observation, we hypothesize that the key to
effectively leveraging image pre-training lies in decomposing the learning of
spatial and temporal features, and in revisiting image pre-training as an
appearance prior for initializing 3D kernels. In addition, we propose
Spatial-Temporal Separable (STS) convolution, which explicitly splits the
feature channels into spatial and temporal groups, to further enable a more
thorough decomposition of spatiotemporal features for fine-tuning 3D CNNs. Our
experiments show that simply replacing 3D convolution with STS notably improves
a wide range of 3D CNNs without increasing parameters or computation, on both
Kinetics-400 and Something-Something V2. Moreover, this new training pipeline
consistently achieves better results on video recognition with significant
speedup. For instance, we achieve a +0.6% top-1 gain for SlowFast on Kinetics-400 over
the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with
4 GPUs. The code and models are available at
https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
Comment: Published as a conference paper at ECCV 2022
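The channel-splitting idea of STS can be sketched as follows in PyTorch (the layer name, split ratio, and exact kernel shapes are assumptions rather than the released implementation): one channel group is convolved with a spatial 1xkxk kernel, the other with a temporal kx1x1 kernel, and the outputs are concatenated.

import torch
import torch.nn as nn

class STSConv3d(nn.Module):
    """Hypothetical sketch of a Spatial-Temporal Separable 3D convolution."""
    def __init__(self, in_channels, out_channels, kernel_size=3, spatial_ratio=0.5):
        super().__init__()
        self.in_spatial = int(in_channels * spatial_ratio)
        self.in_temporal = in_channels - self.in_spatial
        out_spatial = int(out_channels * spatial_ratio)
        p = kernel_size // 2
        # Spatial group: convolves over H and W only.
        self.spatial = nn.Conv3d(self.in_spatial, out_spatial,
                                 (1, kernel_size, kernel_size), padding=(0, p, p), bias=False)
        # Temporal group: convolves over T only.
        self.temporal = nn.Conv3d(self.in_temporal, out_channels - out_spatial,
                                  (kernel_size, 1, 1), padding=(p, 0, 0), bias=False)

    def forward(self, x):  # x: (B, C, T, H, W)
        xs, xt = torch.split(x, [self.in_spatial, self.in_temporal], dim=1)
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

print(STSConv3d(64, 64)(torch.randn(2, 64, 8, 56, 56)).shape)  # torch.Size([2, 64, 8, 56, 56])

Splitting the channels this way keeps the output shape of a standard 3D convolution while using strictly fewer parameters, which is consistent with the "no extra parameters or computation" claim above.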
FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning
Federated learning (FL) is an emerging paradigm in machine learning, where a
shared model is collaboratively learned using data from multiple devices to
mitigate the risk of data leakage. While recent studies posit that Vision
Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in
addressing data heterogeneity in FL, the specific architectural components that
underpin this advantage have yet to be elucidated. In this paper, we
systematically investigate the impact of different architectural elements, such
as activation functions and normalization layers, on the performance within
heterogeneous FL. Through rigorous empirical analyses, we are able to offer the
first-of-its-kind general guidance on micro-architecture design principles for
heterogeneous FL.
Intriguingly, our findings indicate that with strategic architectural
modifications, pure CNNs can achieve a level of robustness that either matches
or even exceeds that of ViTs when handling heterogeneous data clients in FL.
Additionally, our approach is compatible with existing FL techniques and
delivers state-of-the-art solutions across a broad spectrum of FL benchmarks.
The code is publicly available at https://github.com/UCSC-VLAA/FedConv
Comment: 9 pages, 6 figures. Equal contribution by P. Xu and Z. Wan
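For context on the FL setting assumed above, the snippet below sketches a standard FedAvg aggregation step in PyTorch (this is the generic algorithm, not the FedConv-specific architectural changes studied in the paper): locally trained client models are merged into the shared global model by a sample-weighted average of their parameters.

import torch

def fedavg(client_state_dicts, client_num_samples):
    # Sample-weighted average of client model parameters (standard FedAvg).
    total = float(sum(client_num_samples))
    global_state = {}
    for key in client_state_dicts[0]:
        global_state[key] = torch.stack(
            [sd[key].float() * (n / total)
             for sd, n in zip(client_state_dicts, client_num_samples)]).sum(dim=0)
    return global_state

Data heterogeneity enters this picture because each client's update is driven by a differently distributed local dataset, which is where the choice of micro-architectural components such as activation functions and normalization layers becomes relevant.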
SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image Segmentation
Recent advancements in large-scale Vision Transformers have made significant
strides in improving pre-trained models for medical image segmentation.
However, these methods face a notable challenge in acquiring a substantial
amount of pre-training data, particularly within the medical field. To address
this limitation, we present Masked Multi-view with Swin Transformers (SwinMM),
a novel multi-view pipeline for enabling accurate and data-efficient
self-supervised medical image analysis. Our strategy harnesses the potential of
multi-view information by incorporating two principal components. In the
pre-training phase, we deploy a masked multi-view encoder devised to train
concurrently on masked multi-view observations through a range of diverse
proxy tasks. These tasks span image reconstruction, rotation, contrastive
learning, and a novel task that employs a mutual learning paradigm. This new
task capitalizes on the consistency between predictions from various
perspectives, enabling the extraction of hidden multi-view information from 3D
medical data. In the fine-tuning stage, a cross-view decoder is developed to
aggregate the multi-view information through a cross-attention block. Compared
with the previous state-of-the-art self-supervised learning method Swin UNETR,
SwinMM demonstrates a notable advantage on several medical image segmentation
tasks. It allows for a smooth integration of multi-view information,
significantly boosting both the accuracy and data-efficiency of the model. Code
and models are available at https://github.com/UCSC-VLAA/SwinMM/.
Comment: MICCAI 2023; project page: https://github.com/UCSC-VLAA/SwinMM
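A cross-view decoder of the kind described above can be sketched with a single cross-attention block in PyTorch (the layer below is a hypothetical simplification, not the released SwinMM code): tokens of the view being decoded attend to tokens gathered from the other views and are fused through a residual connection.

import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, main_tokens, other_tokens):
        # main_tokens:  (B, N, C) tokens of the view being decoded
        # other_tokens: (B, M, C) concatenated tokens from the remaining views
        fused, _ = self.attn(self.norm_q(main_tokens),
                             self.norm_kv(other_tokens),
                             self.norm_kv(other_tokens))
        return main_tokens + fused  # residual fusion of cross-view information

print(CrossViewFusion(96)(torch.randn(2, 216, 96), torch.randn(2, 432, 96)).shape)  # torch.Size([2, 216, 96])

The residual formulation lets the primary view keep its own representation while selectively borrowing complementary information from the other views.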
Towards Networks with Efficiency, Explainability and Robustness
Over the past decades, the field of Computer Vision has experienced remarkable success, largely attributed to the evolution of various underlying network architectures. However, the real-world deployment of these visual recognition systems has surfaced several challenges. Key among them is achieving operational efficiency, particularly in terms of computational cost. Additionally, there is a pressing need to move beyond black-box models towards systems that are explainable and capable of error rectification. Moreover, ensuring robustness against unforeseen scenarios and malicious attacks is crucial.
This dissertation focuses on addressing these critical aspects of network architecture. The first part delves into enhancing the efficiency of Convolutional Neural Networks, focusing on refining architectural building blocks and designing structures that leverage data characteristics. The second part examines network robustness, particularly through the strategic use of adversarial examples to enhance network performance and resilience. The final part demonstrates the integration of superpixel representation with transformers, synergistically combining efficiency, explainability, and robustness.
The Double-Leucine Motifs Affect Internalization, Stability, and Function of Organic Anion Transporting Polypeptide 1B1
Organic anion transporting polypeptide 1B1 (OATP1B1) is specifically expressed at the basolateral membrane of human hepatocytes and plays important roles in the uptake of various endogenous and exogenous compounds, including many drugs. The proper functioning of OATP1B1 is therefore essential for the bioavailability of various therapeutic agents and needs to be tightly regulated. Dileucine-based signals are involved in lysosomal targeting, internalization, and trans-Golgi network-to-endosome transport of membrane proteins. In the current study, we analyzed the 3 intracellular and 13 transmembrane dileucine motifs (DLMs) within the sequence of OATP1B1. It was found that the simultaneous replacement of I332 and L333 with alanine resulted in a significantly reduced level of the mature form of OATP1B1. The cell surface expression of I332A/L333A could be partially rescued by MG132, as well as by agents that prevent clathrin-dependent protein internalization, suggesting that this dileucine motif may be involved in the endocytosis of OATP1B1. On the other hand, I376/L377 and I642/L643, which are localized at transmembrane helices (TM) 8 and 12, respectively, are involved in the interaction of the transporter with its substrates. I642A/L643A exhibited a significantly decreased protein level compared to that of the wild type, implying that this motif is also important for maintaining the stability of OATP1B1.