Learning Aligned Cross-Modal Representations from Weakly Aligned Data
People can recognize scenes across many different modalities beyond natural
images. In this paper, we investigate how to learn cross-modal scene
representations that transfer across modalities. To study this problem, we
introduce a new cross-modal scene dataset. While convolutional neural networks
can categorize cross-modal scenes well, they also learn an intermediate
representation not aligned across modalities, which is undesirable for
cross-modal transfer applications. We present methods to regularize cross-modal
convolutional neural networks so that they have a shared representation that is
agnostic of the modality. Our experiments suggest that our scene representation
can help transfer representations across modalities for retrieval. Moreover,
our visualizations suggest that units emerge in the shared representation that
tend to activate on consistent concepts independently of the modality.

Comment: Conference paper at CVPR 201
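The abstract describes regularizing a cross-modal CNN so that paired inputs from different modalities map to a shared, modality-agnostic representation. A minimal numpy sketch of one common way to do this is below; it is an illustrative alignment penalty between paired features, not the paper's exact regularizer, and all names (`alignment_penalty`, the toy features) are assumptions for illustration.

```python
import numpy as np

def alignment_penalty(feat_a, feat_b):
    """Mean squared distance between paired features from two modalities.

    Added to the usual classification loss, this term pushes the shared
    layers toward a representation that is agnostic of the modality.
    """
    return float(np.mean((feat_a - feat_b) ** 2))

# Toy paired features: the same scenes encoded from two modalities,
# modeled here as a shared code plus modality-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(4, 8))                     # modality-agnostic code
feat_sketch = shared + 0.1 * rng.normal(size=(4, 8))  # e.g. line-drawing branch
feat_photo = shared + 0.1 * rng.normal(size=(4, 8))   # e.g. natural-image branch

loss = alignment_penalty(feat_sketch, feat_photo)
```

In training, this penalty would be minimized jointly with the per-modality classification losses, so the networks stay discriminative while their intermediate features converge.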
Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations
In recent years, discriminative self-supervised methods have made significant
strides in advancing various visual tasks. The central idea of learning a data
encoder that is robust to data distortions/augmentations is straightforward yet
highly effective. Although many studies have demonstrated the empirical success
of various learning methods, the resulting learned representations can exhibit
instability and hinder downstream performance. In this study, we analyze
discriminative self-supervised methods from a causal perspective to explain
these unstable behaviors and propose solutions to overcome them. Our approach
draws inspiration from prior works that empirically demonstrate the ability of
discriminative self-supervised methods to demix ground truth causal sources to
some extent. Unlike previous work on causality-empowered representation
learning, we do not apply our solutions during the training process but rather
during the inference process to improve time efficiency. Through experiments on
both controlled image datasets and realistic image datasets, we show that our
proposed solutions, which involve tempering a linear transformation with
controlled synthetic data, are effective in addressing these issues.

Comment: ICCV 2023 accepted paper
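The abstract's inference-time fix involves a linear transformation fitted with controlled synthetic data. A hedged numpy sketch of that general idea follows: since the encoder is assumed to demix the causal sources only up to a linear mixing, one can fit a least-squares linear map from embeddings of synthetic images (whose generative factors are known) back to those factors, then apply it to real embeddings at inference without retraining. The specific shapes, noise level, and variable names are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Controlled synthetic data: ground-truth causal sources are known.
s_synth = rng.normal(size=(64, 4))                        # known factors
A = rng.normal(size=(4, 10))                              # unknown linear mixing
z_synth = s_synth @ A + 0.05 * rng.normal(size=(64, 10))  # encoder embeddings

# Fit a linear demixing map W once: z_synth @ W ~= s_synth.
W, *_ = np.linalg.lstsq(z_synth, s_synth, rcond=None)

# At inference, apply W to embeddings of real images (here random stand-ins).
z_real = rng.normal(size=(5, 10))
s_hat = z_real @ W                                        # stabilized representation
```

Because `W` is estimated once offline and applied as a single matrix multiply, the correction adds negligible inference cost, consistent with the abstract's emphasis on time efficiency.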
Deformable Capsules for Object Detection
In this study, we introduce a new family of capsule networks, deformable
capsules (DeformCaps), to address a very important problem in computer vision:
object detection. We propose two new algorithms associated with our DeformCaps:
a novel capsule structure (SplitCaps), and a novel dynamic routing algorithm
(SE-Routing), which together balance computational efficiency with the need to
model a large number of objects and classes, a combination that had never been achieved with
capsule networks before. We demonstrate that the proposed methods allow
capsules to efficiently scale up to large-scale computer vision tasks for the
first time, and create the first-ever capsule network for object detection in
the literature. Our proposed architecture is a one-stage detection framework
and obtains results on MS COCO that are on par with state-of-the-art one-stage
CNN-based methods, while producing fewer false positive detections and
generalizing to unusual poses/viewpoints of objects.
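For background on the routing that DeformCaps builds on, here is a numpy sketch of classic dynamic routing-by-agreement (Sabour et al.), not the paper's novel SE-Routing, whose details the abstract does not give. The `squash` nonlinearity and the iteration count are standard; the array shapes are illustrative.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Capsule nonlinearity: shrinks short vectors toward 0 and long
    # vectors toward unit length, so vector norm can encode confidence.
    n2 = np.sum(v ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    """u_hat: (n_in, n_out, d) prediction vectors from lower capsules.

    Returns (n_out, d) output capsules after routing-by-agreement.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                   # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)    # weighted vote per output capsule
        v = squash(s)
        b = b + (u_hat * v[None]).sum(axis=-1)    # reward agreeing predictions
    return v

v = dynamic_routing(np.random.default_rng(2).normal(size=(6, 3, 4)))
```

The per-pair routing logits `b` are what make this scheme expensive at scale, which is the cost that SplitCaps and SE-Routing are designed to avoid.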