Object-centric Learning with Cyclic Walks between Parts and Whole
Learning object-centric representations from complex natural environments
equips both humans and machines with the ability to reason from low-level
perceptual features. To capture the compositional entities of a scene, we
propose cyclic walks between perceptual features extracted from CNNs or
transformers and object entities. First, a slot-attention module interfaces
with these perceptual features and produces a finite set of slot
representations. These slots can bind to any object entities in the scene via
inter-slot competition for attention. Next, we establish entity-feature
correspondence with cyclic walks along high-transition-probability paths derived
from the pairwise similarity between perceptual features (the "parts") and
slot-bound object representations (the "whole"). The whole is greater than its
parts and the parts constitute the whole. These part-whole interactions form
cycle consistencies that serve as supervisory signals to train the slot-attention
module. We empirically demonstrate that networks trained with our cyclic walks
can extract object-centric representations on seven image datasets across three
unsupervised learning tasks. In contrast to object-centric models that attach a
decoder for image or feature reconstruction, our cyclic walks provide strong
supervision signals while avoiding computation overhead and improving memory
efficiency.
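A minimal sketch of a parts-to-whole-to-parts cycle-consistency loss of the kind the abstract describes. It assumes slot-attention features `slots` and backbone features `feats` are already computed; the temperature and the two-step walk against an identity target are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def cyclic_walk_loss(feats: torch.Tensor, slots: torch.Tensor, tau: float = 0.1):
    """Cycle consistency for walks parts -> whole -> parts.

    feats: (N, D) perceptual features ("parts")
    slots: (K, D) slot representations ("whole")
    """
    feats = F.normalize(feats, dim=-1)
    slots = F.normalize(slots, dim=-1)
    sim = feats @ slots.t() / tau                 # (N, K) pairwise similarity
    parts_to_whole = sim.softmax(dim=1)           # transition probs over slots
    whole_to_parts = sim.t().softmax(dim=1)       # transition probs over features
    round_trip = parts_to_whole @ whole_to_parts  # (N, N) two-step walk
    target = torch.arange(feats.size(0), device=feats.device)
    # each feature should return to itself after the cycle
    return F.nll_loss(torch.log(round_trip + 1e-8), target)

loss = cyclic_walk_loss(torch.randn(196, 64), torch.randn(7, 64))
```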
TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter
Visual foundation models like CLIP excel in learning feature representations
from extensive datasets through self-supervised methods, demonstrating
remarkable transfer learning and generalization capabilities. A growing number
of applications based on visual foundation models are emerging, including
innovative solutions such as BLIP-2. These applications employ pre-trained CLIP
models as upstream feature extractors and train various downstream modules to
accomplish diverse tasks. In situations involving system upgrades that require
updating the upstream foundation model, it becomes essential to re-train all
downstream modules to adapt to the new foundation model, which is inflexible
and inefficient. In this paper, we introduce a parameter-efficient and
task-agnostic adapter, dubbed TaCA, that facilitates compatibility across
distinct foundation models while ensuring enhanced performance for the new
models. TaCA allows downstream applications to seamlessly integrate
better-performing foundation models without necessitating retraining. We
conduct extensive experimental validation of TaCA using different scales of
models with up to one billion parameters on various tasks such as video-text
retrieval, video recognition, and visual question answering. The results
consistently demonstrate the emergent ability of TaCA for hot-plugging upgrades
of visual foundation models. Code and models will be available at
https://github.com/TencentARC/TaCA.
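A hedged sketch of a compatible adapter in the spirit of TaCA: a small trainable module maps the new foundation model's features into the old model's embedding space so frozen downstream heads keep working. The MLP shape and the cosine alignment loss are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompatAdapter(nn.Module):
    def __init__(self, new_dim: int, old_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(new_dim, hidden), nn.GELU(), nn.Linear(hidden, old_dim)
        )

    def forward(self, new_feat: torch.Tensor) -> torch.Tensor:
        return self.net(new_feat)

adapter = CompatAdapter(new_dim=1024, old_dim=512)
old_feat = torch.randn(8, 512)   # features from the previously deployed backbone
new_feat = torch.randn(8, 1024)  # features from the upgraded backbone
aligned = adapter(new_feat)
# align the adapted new features with the old embedding space
loss = 1.0 - F.cosine_similarity(aligned, old_feat, dim=-1).mean()
loss.backward()
```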
Revisiting Vision Transformer from the View of Path Ensemble
Vision Transformers (ViTs) are normally regarded as a stack of transformer
layers. In this work, we propose a novel view of ViTs showing that they can be
seen as ensemble networks containing multiple parallel paths with different
lengths. Specifically, we equivalently transform the traditional cascade of
multi-head self-attention (MSA) and feed-forward network (FFN) into three
parallel paths in each transformer layer. Then, we utilize the identity
connection in our new transformer form and further transform the ViT into an
explicit multi-path ensemble network. From the new perspective, these paths
perform two functions: the first is to provide the feature for the classifier
directly, and the second is to provide the lower-level feature representation
for subsequent longer paths. We investigate the influence of each path on the
final prediction and find that some paths even degrade performance. We therefore
propose two techniques, path pruning and EnsembleScale, which remove the
underperforming paths and re-weight the ensemble components, respectively, to
optimize the path combination and let the short paths focus on providing
high-quality representations for subsequent paths. We also show that our
path-combination strategies help ViTs go deeper and act as high-pass filters
that suppress part of the low-frequency signal. To further enhance the
representations that the short paths provide to subsequent paths,
self-distillation is applied to transfer knowledge from the long paths to the
short paths. This work calls for more future research to explain and design
ViTs from new perspectives. Comment: Accepted by ICCV 2023, oral presentation.
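A small sketch of the path-ensemble view: the usual cascade y = x + MSA(x), z = y + FFN(y) expands into three parallel branches x, MSA(x), and FFN(x + MSA(x)), which can then be re-weighted (an EnsembleScale-style idea). The learnable scalar weights and the omission of LayerNorm are simplifying assumptions.

```python
import torch
import torch.nn as nn

class PathView(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.scale = nn.Parameter(torch.ones(3))  # one weight per parallel path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.msa(x, x, x)
        p0 = x                       # identity path
        p1 = attn_out                # MSA path
        p2 = self.ffn(x + attn_out)  # FFN path built on top of the shorter paths
        return self.scale[0] * p0 + self.scale[1] * p1 + self.scale[2] * p2

x = torch.randn(2, 16, 64)
layer = PathView()
cascade = x + layer.msa(x, x, x)[0]
cascade = cascade + layer.ffn(cascade)
assert torch.allclose(layer(x), cascade, atol=1e-5)  # equivalent when all scales are 1
```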
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
Recent text-to-image diffusion models have demonstrated an astonishing
capacity to generate high-quality images. However, research has mainly focused
on synthesizing images from text prompts alone. While some works have explored
using other modalities as conditions, considerable paired data, e.g.,
box/mask-image pairs, and fine-tuning time are required to train such models.
As such paired data is time-consuming and labor-intensive to acquire and
restricted to a closed set, this potentially becomes a bottleneck for
applications in an open world. This paper focuses on the simplest forms of
user-provided conditions, e.g., boxes or scribbles. To mitigate the aforementioned
problem, we propose a training-free method to control objects and contexts in
the synthesized images so that they adhere to the given spatial conditions.
Specifically, three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner
Constraints, are designed and seamlessly integrated into the denoising steps of
diffusion models, requiring neither additional training nor massive annotated
layout data. Extensive results show that the proposed constraints can control
what to present in the images and where, while retaining the ability of the
Stable Diffusion model to synthesize with high fidelity and diverse concept
coverage. The code is publicly available at
https://github.com/Sierkinhane/BoxDiff. Comment: Accepted by ICCV 2023. The
paper is still being revised for better organization and comparison. Code is
available at https://github.com/Sierkinhane/BoxDiff.
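A hedged sketch of box-constrained guidance in the spirit described above: given a cross-attention map for the target token, encourage high responses inside the user box and low responses outside it, then nudge the latent with the gradient at each denoising step. The top-k size, the stand-in attention map, and the step size are illustrative assumptions, not the paper's settings.

```python
import torch

def box_constraint_loss(attn: torch.Tensor, box_mask: torch.Tensor, k: int = 20):
    """attn: (H, W) cross-attention map for one token; box_mask: (H, W) bool."""
    inside = attn[box_mask]
    outside = attn[~box_mask]
    inner = 1.0 - inside.topk(min(k, inside.numel())).values.mean()  # pull attention into the box
    outer = outside.topk(min(k, outside.numel())).values.mean()      # push attention out of the rest
    return inner + outer

# toy usage inside one denoising step (latent and attn would come from the UNet)
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
attn = torch.sigmoid(latent.mean(dim=1)[0, :16, :16])  # stand-in 16x16 attention map
mask = torch.zeros(16, 16, dtype=torch.bool)
mask[4:12, 4:12] = True                                 # user-provided box
loss = box_constraint_loss(attn, mask)
grad = torch.autograd.grad(loss, latent)[0]
latent = (latent - 0.1 * grad).detach()                 # training-free guidance update
```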
Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm
Parameter-Efficient Transfer Learning (PETL) aims at efficiently adapting
large models pre-trained on massive data to downstream tasks with limited
task-specific data. In view of the practicality of PETL, previous works focus
on tuning a small set of parameters for each downstream task in an end-to-end
manner while rarely considering the task distribution shift issue between the
pre-training task and the downstream task. This paper proposes a novel
two-stage paradigm, where the pre-trained model is first aligned to the target
distribution. Then the task-relevant information is leveraged for effective
adaptation. Specifically, the first stage narrows the task distribution shift
by tuning the scale and shift in the LayerNorm layers. In the second stage, to
efficiently learn the task-relevant information, we propose a Taylor
expansion-based importance score to identify task-relevant channels for the
downstream task and then tune only this small portion of channels, making the
adaptation parameter-efficient. Overall, we present a promising new
direction for PETL, and the proposed paradigm achieves state-of-the-art
performance on the average accuracy of 19 downstream tasks. Comment: 11 pages
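A brief sketch of a first-order Taylor importance score of the kind the second stage describes: the loss change from zeroing channel c is approximated by |activation_c * gradient_c| summed over positions. The toy model, the batch aggregation, and the 1/8 selection ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))

acts = {}
def hook(module, inp, out):
    out.retain_grad()          # keep the gradient of the intermediate feature map
    acts["feat"] = out
model[0].register_forward_hook(hook)

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

feat = acts["feat"]                                   # (B, C, H, W) activations
taylor = (feat * feat.grad).sum(dim=(0, 2, 3)).abs()  # per-channel Taylor importance
topk = taylor.topk(k=feat.size(1) // 8).indices       # keep the most task-relevant channels
print(topk)
```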
Open-World Weakly-Supervised Object Localization
While remarkable success has been achieved in weakly-supervised object
localization (WSOL), current frameworks are not capable of locating objects of
novel categories in open-world settings. To address this issue, we are the
first to introduce a new weakly-supervised object localization task called
OWSOL (Open-World Weakly-Supervised Object Localization). During training, all
labeled data comes from known categories, while both known and novel categories
exist in the unlabeled data. To handle such data, we propose a novel paradigm
of contrastive representation co-learning using both labeled and unlabeled data
to generate a complete G-CAM (Generalized Class Activation Map) for object
localization, without the requirement of bounding box annotation. As no class
label is available for the unlabeled data, we conduct clustering over the full
training set and design a novel multiple semantic centroids-driven contrastive
loss for representation learning. We re-organize two widely used datasets,
i.e., ImageNet-1K and iNatLoc500, and propose OpenImages150 to serve as
evaluation benchmarks for OWSOL. Extensive experiments demonstrate that the
proposed method can surpass all baselines by a large margin. We believe that
this work can shift closed-set localization towards the open-world setting
and serve as a foundation for subsequent works. Code will be released at
https://github.com/ryylcc/OWSOL
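A hedged sketch of a centroid-driven contrastive objective of the kind described above: features are clustered over the full training set, and each feature is pulled toward its assigned semantic centroid and pushed away from the others. The single-centroid-per-cluster setup and the temperature are simplifying assumptions relative to the paper's multi-centroid design.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

feats = F.normalize(torch.randn(512, 128), dim=-1)            # pooled image features
assign = KMeans(n_clusters=50, n_init=10).fit(feats.numpy())  # cluster the full training set
centroids = F.normalize(torch.tensor(assign.cluster_centers_, dtype=torch.float32), dim=-1)
labels = torch.tensor(assign.labels_, dtype=torch.long)

logits = feats @ centroids.t() / 0.07   # similarity to every semantic centroid
loss = F.cross_entropy(logits, labels)  # attract to own centroid, repel the rest
print(loss.item())
```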
Bridging Sensor Gaps via Single-Direction Tuning for Hyperspectral Image Classification
Recently, researchers have started exploring the use of Vision Transformers
(ViTs) for hyperspectral image (HSI) classification and have achieved
remarkable results. However, the training of ViT
models requires a considerable number of training samples, while hyperspectral
data, due to its high annotation costs, typically has a relatively small number
of training samples. This contradiction has not been effectively addressed. In
this paper, aiming to solve this problem, we propose the single-direction
tuning (SDT) strategy, which serves as a bridge, allowing us to leverage
existing labeled HSI datasets, and even RGB datasets, to enhance the performance
on new HSI datasets with limited samples. The proposed SDT inherits the idea of
prompt tuning, aiming to reuse pre-trained models with minimal modifications
for adaptation to new tasks. But unlike prompt tuning, SDT is custom-designed
to accommodate the characteristics of HSIs. The proposed SDT utilizes a
parallel architecture, an asynchronous cold-hot gradient update strategy, and
unidirectional interaction. It aims to fully harness the potent representation
learning capabilities derived from training on heterologous, even cross-modal
datasets. In addition, we introduce a novel Triplet-structured transformer
(Tri-Former), in which spectral attention and spatial attention modules are
merged in parallel to form the token-mixing component and reduce computation
cost, and a 3D convolution-based channel mixer is integrated to enhance
stability and preserve structural information. Comparison experiments on
three representative HSI datasets captured by different sensors demonstrate
that the proposed Tri-Former achieves better performance than several
state-of-the-art methods. Homologous, heterologous, and cross-modal tuning
experiments verify the effectiveness of the proposed SDT.
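A hedged sketch of a parallel spectral/spatial token mixer in the spirit of the Tri-Former description: one attention branch mixes tokens along the spectral axis, the other along the spatial axis, and their outputs are merged. The dimensions, the simple averaging merge, and the omission of the 3D-conv channel mixer are assumptions for brevity, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ParallelSpectralSpatialMixer(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.spectral_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, S, P, D) tokens over S spectral bands and P spatial positions
        b, s, p, d = x.shape
        spec_in = x.permute(0, 2, 1, 3).reshape(b * p, s, d)  # attend across bands
        spec_out = self.spectral_attn(spec_in, spec_in, spec_in)[0]
        spec_out = spec_out.reshape(b, p, s, d).permute(0, 2, 1, 3)
        spat_in = x.reshape(b * s, p, d)                      # attend across positions
        spat_out = self.spatial_attn(spat_in, spat_in, spat_in)[0].reshape(b, s, p, d)
        return x + 0.5 * (spec_out + spat_out)                # merge the two branches

tokens = torch.randn(2, 30, 49, 64)  # 30 bands, 7x7 spatial patches
print(ParallelSpectralSpatialMixer()(tokens).shape)
```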
Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning
Contrastive learning, which relies on effective positive and negative sample
pairs, is beneficial for learning informative skeleton representations in
unsupervised skeleton-based action recognition. To obtain these positive and
negative pairs, existing weak/strong data augmentation methods randomly
change the appearance of skeletons to indirectly pursue semantic
perturbations. However, such approaches have two limitations: 1) perturbing
appearance alone cannot fully capture the intrinsic semantic information of
skeletons, and 2) random perturbation may change the original
positive/negative pairs into soft positive/negative ones. To address this
dilemma, we make the first attempt to explore an attack-based augmentation
scheme that additionally brings in direct semantic perturbation, for
constructing hard positive pairs and further assisting in constructing hard
negative pairs. In particular, we propose a novel Attack-Augmentation
Mixing-Contrastive learning (AMC) to contrast hard positive features and
hard negative features for learning more robust skeleton representations. In
AMC, Attack-Augmentation (Att-Aug) is designed to collaboratively perform
targeted and untargeted perturbations of skeletons via attack and augmentation
respectively, for generating high-quality hard positive features. Meanwhile,
Positive-Negative Mixer (PNM) is presented to mix hard positive features and
negative features for generating hard negative features, which are adopted for
updating the mixed memory banks. Extensive experiments on three public datasets
demonstrate that AMC is competitive with state-of-the-art methods.
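A hedged sketch of positive-negative feature mixing of the kind the Positive-Negative Mixer describes: a hard negative is a convex mix of a hard positive feature and a negative drawn from the memory bank. The Beta mixing coefficient, bank size, and sampling scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mix_hard_negatives(pos: torch.Tensor, bank: torch.Tensor, alpha: float = 1.0):
    """pos: (B, D) hard positive features; bank: (M, D) negative memory bank."""
    neg = bank[torch.randint(0, bank.size(0), (pos.size(0),))]  # sample negatives
    lam = torch.distributions.Beta(alpha, alpha).sample((pos.size(0), 1))
    hard_neg = lam * pos + (1.0 - lam) * neg                    # blend toward the positive
    return F.normalize(hard_neg, dim=-1)

pos = F.normalize(torch.randn(32, 128), dim=-1)
bank = F.normalize(torch.randn(4096, 128), dim=-1)
print(mix_hard_negatives(pos, bank).shape)  # (32, 128) hard negatives for contrast
```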
AVA-AVD: Audio-Visual Speaker Diarization in the Wild
Audio-visual speaker diarization aims at detecting "who spoke when" using
both auditory and visual signals. Existing audio-visual diarization datasets
are mainly focused on indoor environments like meeting rooms or news studios,
which are quite different from in-the-wild videos in many scenarios such as
movies, documentaries, and audience sitcoms. To develop diarization methods for
these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD)
dataset. Our experiments demonstrate that adding AVA-AVD to the training set
produces significantly better diarization models for in-the-wild videos, even
though the dataset is relatively small. Moreover, this benchmark is challenging due
to the diverse scenes, complicated acoustic conditions, and completely
off-screen speakers. As a first step towards addressing the challenges, we
design the Audio-Visual Relation Network (AVR-Net) which introduces a simple
yet effective modality mask to capture discriminative information based on face
visibility. Experiments show that our method not only outperforms
state-of-the-art methods but is also more robust as the ratio of off-screen
speakers varies. Our data and code have been made publicly available at
https://github.com/showlab/AVA-AVD. Comment: ACMMM 202
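A hedged sketch of a visibility-aware modality mask of the kind attributed to AVR-Net: when no face is visible for a speech segment, the visual branch is replaced by a learned "missing" embedding and the relation score falls back to audio alone. The fusion by concatenation and the learned placeholder are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityMaskFusion(nn.Module):
    def __init__(self, audio_dim: int = 192, face_dim: int = 512):
        super().__init__()
        self.missing_face = nn.Parameter(torch.zeros(face_dim))  # stand-in for off-screen speakers
        self.head = nn.Linear(audio_dim + face_dim, 256)

    def forward(self, audio: torch.Tensor, face: torch.Tensor, visible: torch.Tensor):
        # audio: (B, Da), face: (B, Df), visible: (B,) bool face-visibility mask
        face = torch.where(visible.unsqueeze(1), face, self.missing_face.expand_as(face))
        return self.head(torch.cat([audio, face], dim=-1))

fusion = ModalityMaskFusion()
emb = fusion(torch.randn(4, 192), torch.randn(4, 512),
             torch.tensor([True, False, True, False]))
print(emb.shape)
```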
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
Pre-trained vision transformers provide strong representations that benefit
various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT)
methods have been proposed, and their experiments demonstrate that tuning only
1% of extra parameters could surpass full fine-tuning in low-data resource
scenarios. However, these methods overlook the task-specific information when
fine-tuning diverse downstream tasks. In this paper, we propose a simple yet
effective method called "Salient Channel Tuning" (SCT) that leverages
task-specific information by forwarding task images through the model to
select salient channels in a feature map, which allows us to tune only 1/8 of
the channels and leads to significantly lower parameter costs. SCT outperforms
full fine-tuning on 18 of the 19 tasks in the VTAB-1K benchmark while adding only
0.11M parameters to ViT-B, 780× fewer than its full fine-tuning counterpart.
Furthermore, experiments on domain generalization and few-shot learning show
that SCT surpasses other PEFT methods at lower parameter cost, demonstrating
the strong capability and effectiveness of our tuning technique in the
low-data regime. Comment: This work has been accepted by IJCV202
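A hedged sketch of salient-channel selection along the lines of the abstract: task images are forwarded through a frozen backbone, channels are scored (here by mean absolute activation, an assumed criterion), and only the top 1/8 of channels receive a small trainable offset while everything else stays frozen.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad_(False)                       # backbone stays frozen

with torch.no_grad():
    feats = backbone(torch.randn(16, 3, 32, 32))  # forward a batch of task images
scores = feats.abs().mean(dim=(0, 2, 3))          # per-channel saliency score (assumed)
salient = scores.topk(feats.size(1) // 8).indices # keep 1/8 of the channels

class SalientChannelTuner(nn.Module):
    def __init__(self, channel_ids: torch.Tensor, num_channels: int):
        super().__init__()
        self.register_buffer("ids", channel_ids)
        self.num_channels = num_channels
        self.delta = nn.Parameter(torch.zeros(len(channel_ids)))  # only these offsets are trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_full = torch.zeros(self.num_channels, dtype=x.dtype, device=x.device)
        delta_full = delta_full.index_add(0, self.ids, self.delta)  # offsets on salient channels only
        return x + delta_full.view(1, -1, 1, 1)

tuner = SalientChannelTuner(salient, feats.size(1))
out = tuner(backbone(torch.randn(2, 3, 32, 32)))
print(out.shape, sum(p.numel() for p in tuner.parameters()))  # only 8 trainable parameters
```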