Deep Pyramidal Residual Networks
Deep convolutional neural networks (DCNNs) have shown remarkable performance
in image classification tasks in recent years. Generally, deep neural network
architectures are stacks consisting of a large number of convolutional layers,
and they perform downsampling along the spatial dimension via pooling to reduce
memory usage. Concurrently, the feature map dimension (i.e., the number of
channels) is sharply increased at downsampling locations, which is essential to
ensure effective performance because it increases the diversity of high-level
attributes. This also applies to residual networks and is very closely related
to their performance. In this research, instead of sharply increasing the
feature map dimension at units that perform downsampling, we gradually increase
the feature map dimension at all units to involve as many locations as
possible. This design, which is discussed in depth together with our new
insights, has proven to be an effective means of improving generalization
ability. Furthermore, we propose a novel residual unit capable of further
improving the classification accuracy with our new network architecture.
Experiments on benchmark CIFAR-10, CIFAR-100, and ImageNet datasets have shown
that our network architecture has superior generalization ability compared to
the original residual networks. Code is available at
https://github.com/jhkim89/PyramidNet (Accepted to CVPR 2017)
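To make the widening schedule concrete, the toy Python sketch below (a minimal illustration, not the authors' code) contrasts the additive pyramidal rule, where the channel count grows by roughly alpha/N at every residual unit, with the conventional stage-wise doubling; the base width of 16 and widening factor alpha = 48 are assumed example values.

def pyramidal_widths(n_units, base=16, alpha=48):
    """Channel count D_k = floor(base + alpha * k / N): a small increase
    at every residual unit instead of sharp jumps at downsampling."""
    return [int(base + alpha * (k + 1) / n_units) for k in range(n_units)]

def stepwise_widths(n_units, base=16):
    """Conventional ResNet-style widths: double only at stage boundaries."""
    per_stage = n_units // 3
    return [base * 2 ** (k // per_stage) for k in range(n_units)]

print(pyramidal_widths(9))  # [21, 26, 32, 37, 42, 48, 53, 58, 64]
print(stepwise_widths(9))   # [16, 16, 16, 32, 32, 32, 64, 64, 64]

Both schedules end at the same final width; the pyramidal one simply spreads the increase over every unit rather than concentrating it at the downsampling locations.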
Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning
Recent self-supervised video representation learning methods focus on
maximizing the similarity between multiple augmented views from the same video
and largely rely on the quality of generated views. In this paper, we propose
frequency augmentation (FreqAug), a spatio-temporal data augmentation method in
the frequency domain for video representation learning. FreqAug stochastically
removes undesirable information from the video by filtering out specific
frequency components so that learned representation captures essential features
of the video for various downstream tasks. Specifically, FreqAug pushes the
model to focus more on dynamic features than on static features in the
video by dropping spatial or temporal low-frequency components. In other
words, learning invariance between remaining frequency components results in
high-frequency enhanced representation with less static bias. To verify the
generality of the proposed method, we experiment with FreqAug on multiple
self-supervised learning frameworks along with standard augmentations.
Transferring the improved representation to five video action recognition and
two temporal action localization downstream tasks shows consistent improvements
over baselines.
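As a rough illustration of the core operation, the numpy sketch below high-pass filters a clip along selected axes; the cutoff radius, the FFT-based filter, and the function names are assumptions for illustration, not the paper's implementation (which additionally applies the filtering stochastically during training).

import numpy as np

def drop_low_freq(video, axes, cutoff=0.1):
    """High-pass filter a (T, H, W) clip along the given axes.

    Zeroes frequency components whose normalized distance from the
    spectrum center is below `cutoff`, then transforms back."""
    spec = np.fft.fftshift(np.fft.fftn(video, axes=axes), axes=axes)
    # Radial mask over the selected axes; False near the spectrum center.
    grids = np.meshgrid(*[np.linspace(-0.5, 0.5, video.shape[a]) for a in axes],
                        indexing="ij")
    dist = np.sqrt(sum(g ** 2 for g in grids))
    shape = [video.shape[a] if a in axes else 1 for a in range(video.ndim)]
    keep = (dist >= cutoff).reshape(shape)
    out = np.fft.ifftn(np.fft.ifftshift(spec * keep, axes=axes), axes=axes)
    return np.real(out)

clip = np.random.rand(8, 32, 32)               # toy (T, H, W) clip
spatial_hp = drop_low_freq(clip, axes=(1, 2))  # drop spatial low frequencies
temporal_hp = drop_low_freq(clip, axes=(0,))   # drop temporal low frequencies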
Improving Performance of Semi-Supervised Learning by Adversarial Attacks
Semi-supervised learning (SSL) is a setup built upon the realistic
assumption that access to a large amount of labeled data is difficult. In this
study, we present a generalized framework, named SCAR (Selecting Clean samples
with Adversarial Robustness), for improving the performance of recent SSL
algorithms. By adversarially attacking pre-trained models with
semi-supervision, our framework shows substantial advances in classifying
images. We describe how adversarial attacks can successfully select
high-confidence unlabeled data to be labeled with the current predictions. On
CIFAR-10, three recent SSL algorithms combined with SCAR achieve significantly
improved image classification.
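The abstract does not spell out the selection rule, but one plausible reading, kept here as a hedged sketch, is to pseudo-label only those unlabeled samples whose confident prediction survives an adversarial attack; the one-step FGSM attack, epsilon, and threshold tau below are my assumptions, not details given by the paper.

import torch
import torch.nn.functional as F

def select_robust_pseudo_labels(model, x_unlabeled, eps=8 / 255, tau=0.95):
    """Keep unlabeled samples whose confident prediction is stable under a
    one-step FGSM attack (sketch; SCAR's actual rule may differ)."""
    model.eval()
    x = x_unlabeled.clone().requires_grad_(True)
    logits = model(x)
    conf, pseudo = F.softmax(logits, dim=1).max(dim=1)

    # Attack the model's own current predictions.
    loss = F.cross_entropy(logits, pseudo)
    grad, = torch.autograd.grad(loss, x)
    x_adv = (x + eps * grad.sign()).clamp(0, 1)

    with torch.no_grad():
        adv_pred = model(x_adv).argmax(dim=1)

    keep = (conf >= tau) & (adv_pred == pseudo)  # confident and attack-stable
    return x_unlabeled[keep], pseudo[keep]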
You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos
We introduce You Only Train Once (YOTO), a dynamic human generation
framework, which performs free-viewpoint rendering of different human
identities with distinct motions, via only one-time training from monocular
videos. Most prior works on this task require individualized optimization for
each input video containing a distinct human identity, which demands
significant time and resources for deployment and thereby impedes the
scalability and overall application potential of such systems. In this
paper, we tackle this problem by proposing a set of learnable identity codes to
expand the capability of the framework for multi-identity free-viewpoint
rendering, and an effective pose-conditioned code query mechanism to finely
model the pose-dependent non-rigid motions. YOTO optimizes neural radiance
fields (NeRF) by utilizing designed identity codes to condition the model for
learning various canonical T-pose appearances in a single shared volumetric
representation. Moreover, our joint learning of multiple identities within a
unified model incidentally enables flexible motion transfer in high-quality
photo-realistic renderings for all learned appearances. This capability expands
its potential use in important applications, including Virtual Reality. We
present extensive experimental results on ZJU-MoCap and PeopleSnapshot to
clearly demonstrate the effectiveness of our proposed model. YOTO shows
state-of-the-art performance on all evaluation metrics while showing
significant benefits in training and inference efficiency as well as rendering
quality. The code and model will be made publicly available soon.
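A minimal sketch of the identity-code idea, under the assumption that a learnable embedding per identity is concatenated with each positionally encoded query point of a shared NeRF MLP; the dimensions are illustrative, and YOTO's pose-conditioned code query and non-rigid motion modeling are omitted.

import torch
import torch.nn as nn

class IdentityConditionedField(nn.Module):
    """Shared radiance field conditioned on a learnable per-person code."""

    def __init__(self, n_identities, code_dim=64, pe_dim=63):
        super().__init__()
        self.codes = nn.Embedding(n_identities, code_dim)  # one code per identity
        self.mlp = nn.Sequential(
            nn.Linear(pe_dim + code_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 4),  # RGB + density
        )

    def forward(self, encoded_pts, identity):
        # encoded_pts: (N, pe_dim) positionally encoded sample points
        # identity:    (N,) integer identity indices
        return self.mlp(torch.cat([encoded_pts, self.codes(identity)], dim=-1))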
Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization and Knowledge Distillation
Adversarial robustness is a research area that has recently received a lot of
attention in the quest for trustworthy artificial intelligence. However, recent
works on adversarial robustness have focused on supervised learning where it is
assumed that labeled data is plentiful. In this paper, we investigate
semi-supervised adversarial training where labeled data is scarce. We derive
two upper bounds for the robust risk and propose a regularization term for
unlabeled data motivated by these two upper bounds. Then, we develop a
semi-supervised adversarial training algorithm that combines the proposed
regularization term with knowledge distillation using a semi-supervised teacher
(i.e., a teacher model trained using a semi-supervised learning algorithm). Our
experiments show that our proposed algorithm achieves state-of-the-art
performance with significant margins compared to existing algorithms. In
particular, even when the amount of labeled data is very small, the
performance of our algorithm is not much worse than that of supervised
learning algorithms. For example, with only 8% labeled data, our algorithm is
comparable to supervised adversarial training algorithms that use all labeled
data, in terms of both standard and robust accuracies on CIFAR-10. (Accepted
to ICCV 2023)
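As a hedged sketch of how such an objective could be assembled, the function below combines supervised cross-entropy, a clean-versus-adversarial consistency regularizer on unlabeled data, and KL distillation from a semi-supervised teacher; the exact regularizer derived from the paper's upper bounds, and the weights lam, beta, and temp, are assumptions.

import torch
import torch.nn.functional as F

def semisup_robust_loss(student, teacher, x_lab, y_lab, x_unlab, x_unlab_adv,
                        lam=1.0, beta=1.0, temp=4.0):
    """Supervised CE + adversarial consistency on unlabeled data +
    distillation from a semi-supervised teacher (illustrative sketch)."""
    # Cross-entropy on the scarce labeled data.
    loss_sup = F.cross_entropy(student(x_lab), y_lab)

    # Regularizer: keep adversarial predictions close to clean ones.
    p_clean = F.softmax(student(x_unlab), dim=1).detach()
    logp_adv = F.log_softmax(student(x_unlab_adv), dim=1)
    loss_reg = F.kl_div(logp_adv, p_clean, reduction="batchmean")

    # Distill soft targets from the semi-supervised teacher.
    with torch.no_grad():
        t = F.softmax(teacher(x_unlab) / temp, dim=1)
    s = F.log_softmax(student(x_unlab) / temp, dim=1)
    loss_kd = F.kl_div(s, t, reduction="batchmean") * temp ** 2

    return loss_sup + lam * loss_reg + beta * loss_kd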
Masked Autoencoder for Unsupervised Video Summarization
Summarizing a video requires a diverse understanding of the video, ranging
from recognizing scenes to judging how essential each frame is for inclusion
in a summary. Self-supervised learning (SSL) is acknowledged for its
robustness and flexibility across multiple downstream tasks, but video SSL has
not yet shown its value for dense understanding tasks like video
summarization. We claim that an unsupervised autoencoder with sufficient
self-supervised learning can be utilized as a video summarization model
without any extra downstream architecture design or fine-tuning of weights.
The proposed method evaluates the importance score of each frame by taking
advantage of the reconstruction score of the autoencoder's decoder. We
evaluate the method on major unsupervised video summarization benchmarks to
show its effectiveness under various experimental settings.
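A minimal sketch of reconstruction-based frame scoring, assuming a masked autoencoder mae_model with a mask_ratio argument (a hypothetical interface) and per-frame mean squared error; whether higher error maps to higher importance is a design choice the abstract leaves open.

import torch

def frame_importance(mae_model, frames, mask_ratio=0.75):
    """Per-frame reconstruction error of a masked autoencoder.

    frames: (T, C, H, W). `mae_model(frames, mask_ratio=...)` returning the
    reconstruction is an assumed signature, not a known API."""
    with torch.no_grad():
        recon = mae_model(frames, mask_ratio=mask_ratio)
        return ((recon - frames) ** 2).mean(dim=(1, 2, 3))  # (T,) MSE per frame

def top_k_summary(scores, k):
    """Select the k highest-scoring frames, returned in temporal order."""
    return torch.topk(scores, k).indices.sort().values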
Match me if you can: Semantic Correspondence Learning with Unpaired Images
Recent approaches for semantic correspondence have focused on obtaining
high-quality correspondences using a complicated network, refining the
ambiguous or noisy matching points. Despite their performance improvements,
they remain constrained by the limited training pairs due to costly point-level
annotations. This paper proposes a simple yet effective method that performs
training with unlabeled pairs to complement both limited image pairs and sparse
point pairs, requiring neither extra labeled keypoints nor trainable modules.
We fundamentally extend the data quantity and variety by augmenting new
unannotated pairs that are not originally provided as training pairs in the
benchmarks. Using a simple teacher-student framework, we offer reliable pseudo
correspondences to the student network via machine supervision. Finally, the
performance of our network is steadily improved by the proposed iterative
training, which puts the student back as a teacher to generate refined labels
and trains a new student repeatedly. Our models outperform the milestone
baselines, including state-of-the-art methods, on semantic correspondence
benchmarks.
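The iterative loop might be organized as below; train_fn, predict_matches, the confidence threshold, and the round count are hypothetical placeholders, not the paper's API.

import copy

def iterative_training(teacher, student, labeled_pairs, unlabeled_pairs,
                       train_fn, predict_matches, conf_thresh=0.9, rounds=3):
    """Iterative pseudo-labeling for correspondence (sketch): the teacher
    supervises the student on unlabeled pairs, then the trained student is
    promoted to teacher for the next round."""
    for _ in range(rounds):
        pseudo = []
        for pair in unlabeled_pairs:
            matches, conf = predict_matches(teacher, pair)
            pseudo.append((pair, matches[conf >= conf_thresh]))  # reliable only
        train_fn(student, labeled_pairs + pseudo)  # machine supervision
        teacher = copy.deepcopy(student)           # student becomes the teacher
    return student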
The Devil is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation
In this paper, we introduce a novel learning scheme named weakly
semi-supervised instance segmentation (WSSIS) with point labels for
budget-efficient and high-performance instance segmentation. Namely, we
consider a dataset setting consisting of a few fully-labeled images and a lot
of point-labeled images. Motivated by the observation that the main challenge
of semi-supervised approaches derives from the trade-off between
false-negative and false-positive instance proposals, we propose a method for WSSIS that can
effectively leverage the budget-friendly point labels as a powerful weak
supervision source to resolve the challenge. Furthermore, to deal with the hard
case where the amount of fully-labeled data is extremely limited, we propose a
MaskRefineNet that refines noise in rough masks. We conduct extensive
experiments on COCO and BDD100K datasets, and the proposed method achieves
promising results comparable to those of the fully-supervised model, even with
50% of the fully labeled COCO data (38.8% vs. 39.7%). Moreover, when using as
little as 5% of fully labeled COCO data, our method shows significantly
superior performance over the state-of-the-art semi-supervised learning method
(33.7% vs. 24.9%). The code is available at
https://github.com/clovaai/PointWSSIS (Accepted to CVPR 2023)
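One hedged reading of using point labels as weak supervision is to keep only the instance proposals consistent with the annotated points; the sketch below filters mask proposals by whether a labeled point of the matching class falls inside them. The data layout is an assumption, and the MaskRefineNet stage is omitted.

import numpy as np

def filter_proposals_by_points(masks, classes, points):
    """Keep proposals containing an annotated point of the same class.

    masks:   list of (H, W) boolean arrays, one per proposal
    classes: class id per proposal
    points:  iterable of (y, x, class_id) point annotations"""
    kept = []
    for mask, cls in zip(masks, classes):
        if any(p_cls == cls and mask[y, x] for y, x, p_cls in points):
            kept.append((mask, cls))  # point-consistent proposals survive
    return kept

masks = [np.zeros((4, 4), bool)]
masks[0][1, 2] = True
print(filter_proposals_by_points(masks, [3], [(1, 2, 3)]))  # proposal kept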