Learning Symmetrization for Equivariance with Orbit Distance Minimization
We present a general framework for symmetrizing an arbitrary neural-network
architecture and making it equivariant with respect to a given group. We build
upon the symmetrization proposals of Kim et al. (2023) and Kaba et al. (2023),
and improve them by replacing their conversion of neural features into group
representations with an optimization whose loss intuitively measures the
distance between group orbits. This change makes our approach applicable to a
broader range of matrix groups than those two proposals, including the Lorentz
group O(1, 3). We experimentally show that our method is competitive on the
SO(2) image classification task and demonstrate its increased generality on the
same task under O(1, 3). Our implementation will be made accessible at
https://github.com/tiendatnguyen-vision/Orbit-symmetrize.
Comment: 16 pages, 1 figure
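To make the symmetrization idea concrete, the following is a minimal PyTorch sketch of group-averaging symmetrization for a finite group (the four planar rotations of C4) acting on images. It only illustrates the generic symmetrization recipe the abstract builds on; the paper's orbit-distance optimization for continuous groups such as O(1, 3) is not reproduced here, and the function and variable names are illustrative assumptions.

# Minimal sketch: symmetrize an arbitrary (non-equivariant) network over the
# finite group C4 of 90-degree image rotations, assuming a PyTorch base model.
import torch


def c4_symmetrize(base_model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Average the base model over the C4 orbit of the input.

    x: image batch of shape (B, C, H, W).
    Returns an output that is invariant to 90-degree rotations of x,
    suitable for an invariant task such as image classification.
    """
    outputs = []
    for k in range(4):
        # Act on the input with the group element (rotation by 90*k degrees)...
        x_rot = torch.rot90(x, k, dims=(-2, -1))
        # ...and evaluate the unconstrained base model on the transformed input.
        outputs.append(base_model(x_rot))
    # Averaging over the group yields an invariant function of x.
    return torch.stack(outputs, dim=0).mean(dim=0)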
Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching
Dense prediction tasks are a fundamental class of problems in computer
vision. Because supervised methods suffer from high pixel-wise labeling costs,
a few-shot learning solution that can learn any dense prediction task from a
few labeled images is desirable. Yet current few-shot learning methods target a
restricted set of tasks, such as semantic segmentation, presumably because of
the challenge of designing a general and unified model that can flexibly and
efficiently adapt to arbitrary tasks of unseen semantics. We propose Visual
Token Matching (VTM), a universal few-shot learner for arbitrary dense
prediction tasks. It employs non-parametric matching on patch-level embedded
tokens of images and labels, a formulation that encapsulates all dense
prediction tasks. VTM also adapts flexibly to any task with a small set of
task-specific parameters that modulate the matching algorithm. We implement VTM
as a hierarchical encoder-decoder architecture with ViT backbones, in which
token matching is performed at multiple feature hierarchies. We evaluate VTM on
a challenging variant of the Taskonomy dataset and observe that it robustly
few-shot learns various unseen dense prediction tasks. Surprisingly, it is
competitive with fully supervised baselines using only 10 labeled examples of a
novel task (0.004% of full supervision) and sometimes outperforms them using
0.1% of full supervision. Code is available at
https://github.com/GitGyun/visual_token_matching
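As a rough illustration of the non-parametric token matching described above (not the actual VTM implementation), the following PyTorch sketch predicts label tokens for a query image as similarity-weighted averages of the support set's label tokens; the tensor shapes, the cosine-similarity choice, the temperature, and all names are assumptions for illustration.

# Minimal sketch of non-parametric matching on patch-level embedded tokens.
import torch
import torch.nn.functional as F


def token_matching(
    query_img_tokens: torch.Tensor,    # (Nq, D) embedded tokens of the query image
    support_img_tokens: torch.Tensor,  # (Ns, D) embedded tokens of the support images
    support_lbl_tokens: torch.Tensor,  # (Ns, D) embedded tokens of the support labels
    temperature: float = 1.0,
) -> torch.Tensor:
    """Predict label tokens for the query by matching against the support set."""
    # Cosine similarity between every query token and every support image token.
    sim = F.normalize(query_img_tokens, dim=-1) @ F.normalize(support_img_tokens, dim=-1).T
    # Soft matching weights over the support tokens.
    weights = torch.softmax(sim / temperature, dim=-1)  # (Nq, Ns)
    # Each predicted label token is a weighted combination of support label tokens;
    # a decoder would then map these tokens back to a dense prediction.
    return weights @ support_lbl_tokens                 # (Nq, D)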
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
Autoregressive transformers have shown remarkable success in video
generation. However, these transformers are prevented from directly learning
long-term dependencies in videos by the quadratic complexity of self-attention,
and they inherently suffer from slow inference and error propagation due to the
autoregressive process. In this paper, we propose the Memory-efficient
Bidirectional Transformer (MeBT) for end-to-end learning of long-term
dependencies in videos and fast inference. Building on recent advances in
bidirectional transformers, our method learns to decode the entire
spatio-temporal volume of a video in parallel from partially observed patches.
The proposed transformer achieves linear time complexity in both encoding and
decoding by projecting observable context tokens into a fixed number of latent
tokens and conditioning on them to decode the masked tokens through
cross-attention. Empowered by linear complexity and bidirectional modeling, our
method demonstrates significant improvements over autoregressive transformers
in both quality and speed when generating moderately long videos.
Videos and code are available at https://sites.google.com/view/mebt-cvpr2023
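The latent-bottleneck decoding described in the abstract can be sketched as follows in PyTorch: observable context tokens are summarized into a fixed number of learned latent tokens, and all masked positions are decoded in parallel by cross-attending to those latents. The layer sizes, module layout, and names are illustrative assumptions, not the actual MeBT code.

# Minimal sketch of linear-complexity decoding via a fixed set of latent tokens.
import torch
import torch.nn as nn


class LatentBottleneckDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_latents: int = 64, num_heads: int = 4):
        super().__init__()
        # Fixed-size set of learnable latent tokens (independent of video length).
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Latents attend to the observed context tokens: cost O(num_latents * T_ctx).
        self.encode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Masked-token queries attend to the latents: cost O(T_mask * num_latents).
        self.decode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context_tokens: torch.Tensor, masked_queries: torch.Tensor) -> torch.Tensor:
        # context_tokens: (B, T_ctx, dim) embeddings of the observed patches.
        # masked_queries: (B, T_mask, dim) positional queries for the masked patches.
        b = context_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Compress the (possibly very long) context into the fixed latent set.
        latents, _ = self.encode_attn(latents, context_tokens, context_tokens)
        # Decode all masked tokens in parallel by conditioning on the latents.
        decoded, _ = self.decode_attn(masked_queries, latents, latents)
        return decoded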