Deep Pyramidal Residual Networks
Deep convolutional neural networks (DCNNs) have shown remarkable performance
in image classification tasks in recent years. Generally, deep neural network
architectures are stacks consisting of a large number of convolutional layers,
and they perform downsampling along the spatial dimension via pooling to reduce
memory usage. Concurrently, the feature map dimension (i.e., the number of
channels) is sharply increased at downsampling locations, which is essential to
ensure effective performance because it increases the diversity of high-level
attributes. This also applies to residual networks and is very closely related
to their performance. In this research, instead of sharply increasing the
feature map dimension at units that perform downsampling, we gradually increase
the feature map dimension at all units to involve as many locations as
possible. This design, which is discussed in depth together with our new
insights, has proven to be an effective means of improving generalization
ability. Furthermore, we propose a novel residual unit capable of further
improving the classification accuracy with our new network architecture.
Experiments on the CIFAR-10, CIFAR-100, and ImageNet benchmark datasets have shown
that our network architecture has superior generalization ability compared to
the original residual networks. Code is available at
https://github.com/jhkim89/PyramidNet. (Accepted to CVPR 2017.)
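To make the widening schedule concrete, here is a minimal sketch of the additive, per-unit channel growth described above, contrasted with the conventional schedule that doubles channels only at downsampling stages. The base width, widening amount `alpha`, and stage layout are illustrative assumptions, not values taken from the paper.

```python
# Sketch of per-unit channel widths: gradual (pyramidal) widening vs. the
# conventional schedule that doubles channels only at downsampling stages.
# base_channels, alpha, and the stage layout below are illustrative choices.

def pyramidal_widths(base_channels, alpha, num_units):
    """Additive widening: every unit grows the width by alpha / num_units."""
    widths, d = [], float(base_channels)
    for _ in range(num_units):
        d += alpha / num_units
        widths.append(int(d))
    return widths

def stagewise_widths(base_channels, units_per_stage):
    """Conventional schedule: width doubles only when a new (downsampled) stage begins."""
    widths = []
    for stage, n in enumerate(units_per_stage):
        widths.extend([base_channels * (2 ** stage)] * n)
    return widths

if __name__ == "__main__":
    # A CIFAR-style depth: 3 stages x 18 units = 54 residual units in total.
    print(pyramidal_widths(16, alpha=48, num_units=54))   # smooth growth across all units
    print(stagewise_widths(16, [18, 18, 18]))             # flat widths with sharp jumps
```

In the pyramidal schedule every residual unit contributes a small increase in width, so no single unit carries the burden of a sharp dimensionality jump.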
Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning
Recent self-supervised video representation learning methods focus on
maximizing the similarity between multiple augmented views from the same video
and largely rely on the quality of generated views. In this paper, we propose
frequency augmentation (FreqAug), a spatio-temporal data augmentation method in
the frequency domain for video representation learning. FreqAug stochastically
removes undesirable information from the video by filtering out specific
frequency components so that the learned representation captures essential features
of the video for various downstream tasks. Specifically, FreqAug pushes the
model to focus more on dynamic features rather than static features in the
video via dropping spatial or temporal low-frequency components. In other
words, learning invariance between the remaining frequency components results in a
high-frequency-enhanced representation with less static bias. To verify the
generality of the proposed method, we experiment with FreqAug on multiple
self-supervised learning frameworks along with standard augmentations.
Transferring the improved representation to five video action recognition and
two temporal action localization downstream tasks shows consistent improvements
over baselines.
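As a rough illustration of the idea (not the authors' released code), the sketch below drops the low-frequency band of a clip along either the temporal or the spatial axes using an FFT mask; the cutoff fraction, probabilities, and (C, T, H, W) tensor layout are assumptions.

```python
# Sketch of frequency-selective augmentation for a video clip (C, T, H, W):
# with probability p, zero out the low-frequency band along either the temporal
# or the spatial axes and keep the high-frequency remainder.
import torch

def freq_drop(video, dims, cutoff):
    """Remove frequencies below `cutoff` (fraction of the spectrum) along `dims`."""
    spec = torch.fft.fftshift(torch.fft.fftn(video, dim=dims), dim=dims)
    mask = torch.zeros_like(spec, dtype=torch.bool)        # low-frequency region to drop
    slices = [slice(None)] * video.ndim
    for d in dims:
        n = video.shape[d]
        half = max(int(n * cutoff / 2), 1)
        slices[d] = slice(n // 2 - half, n // 2 + half)    # band centered on DC after fftshift
    mask[tuple(slices)] = True
    spec = torch.where(mask, torch.zeros_like(spec), spec)
    out = torch.fft.ifftn(torch.fft.ifftshift(spec, dim=dims), dim=dims)
    return out.real

def freq_aug(video, p=0.5, cutoff=0.25):
    """Randomly drop temporal OR spatial low frequencies from a (C, T, H, W) clip."""
    if torch.rand(1).item() > p:
        return video
    if torch.rand(1).item() < 0.5:
        return freq_drop(video, dims=(1,), cutoff=cutoff)   # temporal low-frequency drop
    return freq_drop(video, dims=(2, 3), cutoff=cutoff)     # spatial low-frequency drop
```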
Gramian Attention Heads are Strong yet Efficient Vision Learners
We introduce a novel architecture design that enhances expressiveness by
incorporating multiple head classifiers (i.e., classification heads) instead of
relying on channel expansion or additional building blocks. Our approach
employs attention-based aggregation, utilizing pairwise feature similarity to
enhance multiple lightweight heads with minimal resource overhead. We compute
the Gramian matrices to reinforce class tokens in an attention layer for each
head. This enables the heads to learn more discriminative representations,
enhancing their aggregation capabilities. Furthermore, we propose a learning
algorithm that encourages heads to complement each other by reducing
correlation for aggregation. Our models eventually surpass state-of-the-art
CNNs and ViTs regarding the accuracy-throughput trade-off on ImageNet-1K and
deliver remarkable performance across various downstream tasks, such as COCO
object instance segmentation, ADE20k semantic segmentation, and fine-grained
visual classification datasets. The effectiveness of our framework is
substantiated by practical experimental results and further underpinned by
a generalization error bound. We release the code publicly at:
https://github.com/Lab-LVM/imagenet-models
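The sketch below is one hedged reading of the abstract, not the released model: each lightweight head owns a class token, a Gram (pairwise feature similarity) matrix reweights the backbone tokens, and the class token attends over the reinforced tokens before classification. All shapes, the head count, and aggregation by simple averaging are assumptions.

```python
# Sketch: lightweight classification heads whose class tokens are refined with a
# Gram (pairwise feature similarity) matrix, then aggregated by averaging logits.
import torch
import torch.nn as nn

class GramianHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, feats):
        # feats: (B, N, D) spatial tokens from the backbone.
        gram = torch.softmax(feats @ feats.transpose(1, 2) / feats.shape[-1] ** 0.5, dim=-1)
        feats = gram @ feats                              # reinforce tokens via pairwise similarity
        cls = self.cls_token.expand(feats.shape[0], -1, -1)
        k, v = self.kv(feats).chunk(2, dim=-1)
        attn = torch.softmax(self.q(cls) @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return self.fc((attn @ v).squeeze(1))             # class token attends over reinforced tokens

class MultiHeadAggregator(nn.Module):
    def __init__(self, dim, num_classes, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList([GramianHead(dim, num_classes) for _ in range(num_heads)])

    def forward(self, feats):
        # The paper additionally decorrelates heads during training; here we simply average logits.
        return torch.stack([h(feats) for h in self.heads]).mean(dim=0)
```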
Contrastive Vicinal Space for Unsupervised Domain Adaptation
Recent unsupervised domain adaptation methods have utilized vicinal space
between the source and target domains. However, the equilibrium collapse of
labels, a problem where the source labels are dominant over the target labels
in the predictions of vicinal instances, has never been addressed. In this
paper, we propose an instance-wise minimax strategy that minimizes the entropy
of high uncertainty instances in the vicinal space to tackle the stated
problem. We divide the vicinal space into two subspaces through the solution of
the minimax problem: contrastive space and consensus space. In the contrastive
space, inter-domain discrepancy is mitigated by constraining instances to have
contrastive views and labels, and the consensus space reduces the confusion
between intra-domain categories. The effectiveness of our method is
demonstrated on public benchmarks, including Office-31, Office-Home, and
VisDA-C, achieving state-of-the-art performance. We further show that our
method outperforms the current state-of-the-art methods on PACS, which
indicates that our instance-wise approach works well for multi-source domain
adaptation as well. Code is available at https://github.com/NaJaeMin92/CoVi. (10 pages, 7 figures, 5 tables.)
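A minimal sketch of an instance-wise minimax step of this flavor, assuming mixup-style vicinal instances: the max step searches a few mixing ratios for the most uncertain vicinal instance per source-target pair, and the min step penalizes its prediction entropy. The candidate ratios and loss form are illustrative, not the authors' implementation.

```python
# Sketch: entropy minimax over vicinal (mixed) source-target instances.
import torch

def entropy(logits):
    p = logits.softmax(dim=-1)
    return -(p * p.log().clamp(min=-30)).sum(dim=-1)        # per-instance entropy

def vicinal_minimax_loss(model, x_src, x_tgt, lambdas=(0.25, 0.5, 0.75)):
    # Max step: pick, per pair, the mixing ratio that yields the most uncertain prediction.
    with torch.no_grad():
        ents = torch.stack(
            [entropy(model(lam * x_src + (1 - lam) * x_tgt)) for lam in lambdas]
        )                                                    # (num_lambdas, B)
        best = ents.argmax(dim=0)                            # (B,)
    lam_star = torch.tensor(lambdas, device=x_src.device)[best].view(-1, 1, 1, 1)
    x_vicinal = lam_star * x_src + (1 - lam_star) * x_tgt
    # Min step: minimize the entropy of these high-uncertainty vicinal instances.
    return entropy(model(x_vicinal)).mean()
```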
The Devil is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation
In this paper, we introduce a novel learning scheme named weakly
semi-supervised instance segmentation (WSSIS) with point labels for
budget-efficient and high-performance instance segmentation. Namely, we
consider a dataset setting consisting of a few fully-labeled images and a lot
of point-labeled images. Motivated by the observation that the main challenge of
semi-supervised approaches derives from the trade-off between false-negative and
false-positive instance proposals, we propose a method for WSSIS that can
effectively leverage the budget-friendly point labels as a powerful weak
supervision source to resolve the challenge. Furthermore, to deal with the hard
case where the amount of fully-labeled data is extremely limited, we propose a
MaskRefineNet that refines noise in rough masks. We conduct extensive
experiments on COCO and BDD100K datasets, and the proposed method achieves
promising results comparable to those of the fully-supervised model, even with
50% of the fully labeled COCO data (38.8% vs. 39.7%). Moreover, when using as
little as 5% of fully labeled COCO data, our method shows significantly
superior performance over the state-of-the-art semi-supervised learning method
(33.7% vs. 24.9%). The code is available at
https://github.com/clovaai/PointWSSIS. (Accepted to CVPR 2023.)
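As a hedged illustration of why point labels address the false-positive/false-negative trade-off (this is not the paper's pipeline), the snippet below keeps a predicted instance only if its mask covers an annotated point of the same class and reports points left uncovered as likely misses; the data structures are hypothetical.

```python
# Sketch: point labels as weak supervision for filtering instance proposals.
def filter_proposals_with_points(masks, classes, points):
    """masks: list of (H, W) boolean arrays; classes: list of int; points: list of (y, x, cls)."""
    kept, matched = [], set()
    for i, (mask, cls) in enumerate(zip(masks, classes)):
        for j, (y, x, p_cls) in enumerate(points):
            if p_cls == cls and mask[y][x]:
                kept.append(i)                       # proposal supported by a point label
                matched.add(j)
                break
    missed = [j for j in range(len(points)) if j not in matched]   # likely false negatives
    return kept, missed
```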
Match me if you can: Semantic Correspondence Learning with Unpaired Images
Recent approaches for semantic correspondence have focused on obtaining
high-quality correspondences using a complicated network, refining the
ambiguous or noisy matching points. Despite their performance improvements,
they remain constrained by the limited training pairs due to costly point-level
annotations. This paper proposes a simple yet effective method that performs
training with unlabeled pairs to complement both limited image pairs and sparse
point pairs, requiring neither extra labeled keypoints nor trainable modules.
We fundamentally extend the data quantity and variety by augmenting new
unannotated pairs not originally provided as training pairs in benchmarks.
Using a simple teacher-student framework, we offer reliable pseudo
correspondences to the student network via machine supervision. Finally, the
performance of our network is steadily improved by the proposed iterative
training, putting back the student as a teacher to generate refined labels and
train a new student repeatedly. Our models outperform the milestone baselines,
including state-of-the-art methods on semantic correspondence benchmarks. (12 pages.)
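A minimal sketch of the iterative teacher-student loop described above: the teacher produces pseudo correspondences on unlabeled pairs, confident matches supervise the student, and the student is put back as the next teacher. The helper names (`predict_matches`, `train_step`), the confidence threshold, and the round count are assumptions.

```python
# Sketch: iterative self-training with pseudo correspondences on unlabeled pairs.
import copy

def iterative_self_training(student, unlabeled_pairs, train_step, predict_matches,
                            rounds=3, conf_thresh=0.8):
    teacher = copy.deepcopy(student)
    for _ in range(rounds):
        pseudo = []
        for src_img, tgt_img in unlabeled_pairs:
            matches, conf = predict_matches(teacher, src_img, tgt_img)
            keep = conf >= conf_thresh                       # keep only reliable matches
            pseudo.append((src_img, tgt_img, matches[keep]))
        for src_img, tgt_img, keypoints in pseudo:
            train_step(student, src_img, tgt_img, keypoints)  # machine supervision
        teacher = copy.deepcopy(student)                     # put the student back as teacher
    return student
```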