
    SPFormer: Enhancing Vision Transformer with Superpixel Representation

    In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer employs superpixels that adapt to the image's content. This approach divides the image into irregular, semantically coherent regions, effectively capturing intricate details, and is applicable at both initial and intermediate feature levels. SPFormer, trainable end-to-end, exhibits superior performance across various benchmarks. Notably, it achieves significant improvements on the challenging ImageNet benchmark: a 1.4% increase over DeiT-T and a 1.1% increase over DeiT-S. A standout feature of SPFormer is its inherent explainability. The superpixel structure offers a window into the model's internal processes, providing valuable insights that enhance the model's interpretability. This level of clarity also improves SPFormer's robustness, particularly in challenging scenarios such as image rotations and occlusions, demonstrating its adaptability and resilience.
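
    A minimal PyTorch sketch of the core idea, assuming a soft superpixel assignment predicted by a 1x1 convolution and standard multi-head attention over the pooled tokens; module names, the number of superpixels, and dimensions are illustrative, not SPFormer's actual implementation.

```python
# Illustrative sketch only: soft superpixel pooling followed by self-attention
# over the resulting content-adaptive tokens (hypothetical module names).
import torch
import torch.nn as nn

class SuperpixelTokenizer(nn.Module):
    """Aggregates a dense feature map into K superpixel tokens."""
    def __init__(self, dim: int, num_superpixels: int = 196):
        super().__init__()
        # Soft assignment logits: each pixel distributes membership over K superpixels.
        self.assign = nn.Conv2d(dim, num_superpixels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) features from a stem or an intermediate stage
        a = self.assign(feat).flatten(2).softmax(dim=1)        # (B, K, H*W)
        x = feat.flatten(2)                                    # (B, C, H*W)
        tokens = torch.einsum('bkn,bcn->bkc', a, x)            # weighted pooling per superpixel
        return tokens / (a.sum(-1, keepdim=True) + 1e-6)       # normalize by superpixel "area"

class SuperpixelBlock(nn.Module):
    def __init__(self, dim: int = 192, heads: int = 3, num_superpixels: int = 196):
        super().__init__()
        self.tokenizer = SuperpixelTokenizer(dim, num_superpixels)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        tokens = self.tokenizer(feat)                          # (B, K, C) irregular "patches"
        t = self.norm(tokens)
        return tokens + self.attn(t, t, t, need_weights=False)[0]
```

    In a sketch like this, visualizing the soft assignment map is also what provides the explainability described above: it shows which pixels each token summarizes.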

    3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation

    Regression-based methods for 3D human pose estimation directly predict the 3D pose parameters from a 2D image using deep networks. While achieving state-of-the-art performance on standard benchmarks, their performance degrades under occlusion. In contrast, optimization-based methods fit a parametric body model to 2D features in an iterative manner. The localized reconstruction loss can potentially make them robust to occlusion, but they suffer from the 2D-3D ambiguity. Motivated by the recent success of generative models in rigid object pose estimation, we propose 3D-aware Neural Body Fitting (3DNBF) - an approximate analysis-by-synthesis approach to 3D human pose estimation with SOTA performance and occlusion robustness. In particular, we propose a generative model of deep features based on a volumetric human representation with Gaussian ellipsoidal kernels emitting 3D pose-dependent feature vectors. The neural features are trained with contrastive learning to become 3D-aware and hence to overcome the 2D-3D ambiguity. Experiments show that 3DNBF outperforms other approaches on both occluded and standard benchmarks. Code is available at https://github.com/edz-o/3DNBF.
    Comment: ICCV 2023, project page: https://3dnbf.github.io
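
    The analysis-by-synthesis loop can be illustrated with a deliberately simplified sketch: features are "rendered" from Gaussian kernels whose placement is optimized by gradient descent to match features extracted from the image. Here 2D kernel positions stand in for the pose-driven volumetric body model, and all names and shapes below are assumptions rather than 3DNBF's implementation.

```python
# Toy analysis-by-synthesis sketch (all names and shapes are assumptions):
# optimize kernel placements so that Gaussian-weighted feature "renderings"
# match the features extracted from the image.
import torch

def render_features(centers, kernel_feats, grid, bandwidth=0.01):
    """centers: (K, 2) 2D kernel locations; kernel_feats: (K, C) feature vectors
    emitted by each kernel; grid: (N, 2) pixel coordinates in [0, 1]^2."""
    d2 = ((grid[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    w = torch.softmax(-d2 / bandwidth, dim=1)                      # Gaussian-like responsibilities
    return w @ kernel_feats                                        # (N, C) rendered features

def fit(target_feats, kernel_feats, grid, steps=200, lr=0.05):
    # In the real method the kernels are driven by 3D pose parameters through a
    # volumetric body model; here their 2D positions are optimized directly.
    centers = torch.rand(kernel_feats.shape[0], 2, requires_grad=True)
    opt = torch.optim.Adam([centers], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = render_features(centers, kernel_feats, grid)
        # Per-pixel (localized) feature reconstruction loss, as motivated above.
        loss = (pred - target_feats).pow(2).sum(-1).mean()
        loss.backward()
        opt.step()
    return centers.detach()
```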

    In Defense of Image Pre-Training for Spatiotemporal Recognition

    Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to train spatiotemporal convolutional neural networks (CNNs) directly from scratch. Nonetheless, by taking a closer look at these from-scratch learned CNNs, we note that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as an appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to further enable a more thorough decomposition of spatiotemporal features for fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs without increasing parameters or computation on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with a significant speedup. For instance, we achieve a +0.6% top-1 gain for SlowFast on Kinetics-400 over the strong 256-epoch, 128-GPU baseline while fine-tuning for only 50 epochs on 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
    Comment: Published as a conference paper at ECCV 2022
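
    A hedged sketch of the Spatial-Temporal Separable idea described above: input channels are split into a spatial group (convolved with a 1 x k x k kernel) and a temporal group (convolved with a k x 1 x 1 kernel), then concatenated. The kernel sizes and the even channel split are assumptions for illustration, not taken from the paper's code.

```python
# Sketch of an STS-style convolution: channels are split into a spatial group
# (1 x k x k kernel) and a temporal group (k x 1 x 1 kernel). Split ratio and
# kernel sizes are assumptions.
import torch
import torch.nn as nn

class STSConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        self.s_in, self.t_in = in_ch // 2, in_ch - in_ch // 2
        s_out, t_out = out_ch // 2, out_ch - out_ch // 2
        # Spatial group: convolves H and W only.
        self.spatial = nn.Conv3d(self.s_in, s_out, (1, k, k),
                                 stride=(1, stride, stride), padding=(0, k // 2, k // 2))
        # Temporal group: convolves T only.
        self.temporal = nn.Conv3d(self.t_in, t_out, (k, 1, 1),
                                  stride=(1, stride, stride), padding=(k // 2, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        xs, xt = torch.split(x, [self.s_in, self.t_in], dim=1)
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

# Drop-in for a dense 3D convolution: for (B, 64, 16, 56, 56) inputs,
# STSConv3d(64, 128) uses far fewer parameters than nn.Conv3d(64, 128, 3, padding=1).
```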

    FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning

    Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage have yet to be elucidated. In this paper, we systematically investigate the impact of different architectural elements, such as activation functions and normalization layers, on performance under heterogeneous FL. Through rigorous empirical analyses, we offer first-of-its-kind general guidance on micro-architecture design principles for heterogeneous FL. Intriguingly, our findings indicate that with strategic architectural modifications, pure CNNs can achieve a level of robustness that matches or even exceeds that of ViTs when handling heterogeneous data clients in FL. Additionally, our approach is compatible with existing FL techniques and delivers state-of-the-art solutions across a broad spectrum of FL benchmarks. The code is publicly available at https://github.com/UCSC-VLAA/FedConv.
    Comment: 9 pages, 6 figures. Equal contribution by P. Xu and Z. Wan
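
    An illustrative FedAvg skeleton (not the FedConv recipe) showing where micro-architectural choices such as the normalization layer and the activation function enter a CNN trained across heterogeneous clients; the GroupNorm/GELU placeholders below mark the design space studied, not the paper's recommended configuration.

```python
# Illustrative FedAvg skeleton: swapping the `act` / `norm` arguments is the kind
# of micro-architecture change studied; the specific choices here are placeholders.
import copy
import torch
import torch.nn as nn

def make_block(in_ch, out_ch, act=nn.GELU, norm=lambda c: nn.GroupNorm(8, c)):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), norm(out_ch), act())

def make_cnn(num_classes=10):
    return nn.Sequential(make_block(3, 32), nn.MaxPool2d(2),
                         make_block(32, 64), nn.AdaptiveAvgPool2d(1),
                         nn.Flatten(), nn.Linear(64, num_classes))

def fedavg_round(global_model, client_loaders, local_epochs=1, lr=0.01):
    states, sizes = [], []
    for loader in client_loaders:                      # each client holds non-IID data
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_epochs):
            for x, y in loader:
                opt.zero_grad()
                nn.functional.cross_entropy(model(x), y).backward()
                opt.step()
        states.append(model.state_dict())
        sizes.append(len(loader.dataset))
    # Weighted average of client parameters (plain FedAvg aggregation).
    total = sum(sizes)
    avg = {k: sum(s[k].float() * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```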

    SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image Segmentation

    Recent advancements in large-scale Vision Transformers have made significant strides in improving pre-trained models for medical image segmentation. However, these methods face a notable challenge in acquiring a substantial amount of pre-training data, particularly within the medical field. To address this limitation, we present Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view pipeline for accurate and data-efficient self-supervised medical image analysis. Our strategy harnesses the potential of multi-view information by incorporating two principal components. In the pre-training phase, we deploy a masked multi-view encoder devised to concurrently train on masked multi-view observations through a range of diverse proxy tasks. These tasks span image reconstruction, rotation, contrastive learning, and a novel task that employs a mutual learning paradigm. This new task capitalizes on the consistency between predictions from various perspectives, enabling the extraction of hidden multi-view information from 3D medical data. In the fine-tuning stage, a cross-view decoder is developed to aggregate the multi-view information through a cross-attention block. Compared with the previous state-of-the-art self-supervised learning method Swin UNETR, SwinMM demonstrates a notable advantage on several medical image segmentation tasks. It allows for a smooth integration of multi-view information, significantly boosting both the accuracy and data-efficiency of the model. Code and models are available at https://github.com/UCSC-VLAA/SwinMM/.
    Comment: MICCAI 2023; project page: https://github.com/UCSC-VLAA/SwinMM
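
    A hedged sketch of the multi-view consistency idea: the same 3D volume is viewed under different axis permutations, each view is predicted separately, predictions are mapped back to a common orientation, and a mutual-learning loss encourages them to agree. The chosen views, the symmetric KL loss, and the model interface are assumptions, not SwinMM's code.

```python
# Illustrative multi-view consistency loss for a 3D segmentation model that maps
# (B, C, D, H, W) volumes to logits of the same spatial shape (assumed interface).
import torch
import torch.nn.functional as F

# Axis permutations producing axial / coronal / sagittal style views.
VIEWS = [(0, 1, 2, 3, 4), (0, 1, 3, 2, 4), (0, 1, 4, 3, 2)]

def inverse_perm(p):
    inv = list(p)
    for i, j in enumerate(p):
        inv[j] = i
    return tuple(inv)

def multiview_consistency(volume, model):
    logits = []
    for p in VIEWS:
        out = model(volume.permute(*p))               # predict in the permuted orientation
        logits.append(out.permute(*inverse_perm(p)))  # map back to the canonical orientation
    loss = 0.0
    for i in range(len(logits)):
        for j in range(i + 1, len(logits)):
            # Symmetric KL between softened predictions of every pair of views.
            pi = F.log_softmax(logits[i], dim=1)
            pj = F.log_softmax(logits[j], dim=1)
            loss = loss + F.kl_div(pi, pj, log_target=True, reduction='batchmean')
            loss = loss + F.kl_div(pj, pi, log_target=True, reduction='batchmean')
    return loss / (len(logits) * (len(logits) - 1))
```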

    Towards Networks with Efficiency, Explainability and Robustness

    Over the past decades, the field of Computer Vision has experienced remarkable success, largely attributed to the evolution of various underlying network architectures. However, the real-world deployment of these visual recognition systems has surfaced several challenges. Key among them is achieving operational efficiency, particularly in terms of computational cost. Additionally, there is a pressing need to move beyond black-box models towards systems that are explainable and capable of error rectification. Moreover, ensuring robustness against unforeseen scenarios and malicious attacks is crucial. This dissertation focuses on addressing these critical aspects of network architecture. The first part delves into enhancing the efficiency of Convolutional Neural Networks, focusing on refining architectural building blocks and designing structures that leverage data characteristics. The second part examines network robustness, particularly through the strategic use of adversarial examples to enhance network performance and resilience. The final part demonstrates the integration of superpixel representation with transformers, synergistically combining efficiency, explainability, and robustness.

    The Double-Leucine Motifs Affect Internalization, Stability, and Function of Organic Anion Transporting Polypeptide 1B1

    Organic anion transporting polypeptide 1B1 (OATP1B1) is specifically expressed at the basolateral membrane of human hepatocytes and plays important roles in the uptake of various endogenous and exogenous compounds, including many drugs. The proper functioning of OATP1B1, hence, is essential for the bioavailability of various therapeutic agents and needs to be tightly regulated. Dileucine-based signals are involved in the lysosomal targeting, internalization, and trans-Golgi network-to-endosome trafficking of membrane proteins. In the current study, we analyzed the 3 intracellular and 13 transmembrane dileucine motifs (DLMs) within the sequence of OATP1B1. It was found that the simultaneous replacement of I332 and L333 with alanine resulted in a significantly reduced level of the mature form of OATP1B1. The cell surface expression of I332A/L333A could be partially rescued by MG132, as well as by agents that prevent clathrin-dependent protein internalization, suggesting that this dileucine motif may be involved in the endocytosis of OATP1B1. On the other hand, I376/L377 and I642/L643, which are localized at transmembrane helices (TM) 8 and 12, respectively, are involved in the interaction of the transporter with its substrates. I642A/L643A exhibited a significantly decreased protein level compared to that of the wild-type, implying that the motif is also important for maintaining the stability of OATP1B1.