Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching
Dense prediction tasks are a fundamental class of problems in computer
vision. As supervised methods suffer from high pixel-wise labeling cost, a
few-shot learning solution that can learn any dense task from a few labeled
images is desired. Yet, current few-shot learning methods target a restricted
set of tasks such as semantic segmentation, presumably due to challenges in
designing a general and unified model that is able to flexibly and efficiently
adapt to arbitrary tasks of unseen semantics. We propose Visual Token Matching
(VTM), a universal few-shot learner for arbitrary dense prediction tasks. It
employs non-parametric matching on patch-level embedded tokens of images and
labels, which encapsulates all dense prediction tasks in a single framework. VTM also
adapts flexibly to any task with a small number of task-specific parameters that modulate the matching algorithm.
We implement VTM as a powerful hierarchical encoder-decoder architecture
involving ViT backbones where token matching is performed at multiple feature
hierarchies. We evaluate VTM on a challenging variant of the Taskonomy dataset
and observe that it robustly learns a variety of unseen dense prediction
tasks from only a few examples. Surprisingly, it is competitive with fully supervised baselines using
only 10 labeled examples of a novel task (0.004% of full supervision) and
sometimes outperforms them using 0.1% of full supervision. Code is available at
https://github.com/GitGyun/visual_token_matching
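To make the matching idea concrete, here is a minimal sketch assuming a single feature level and a single similarity head; the function name, shapes, and temperature are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of patch-level token matching (not the authors' code).
# Query labels are predicted as a similarity-weighted combination of the
# support label tokens, matched against the support image tokens.
import torch
import torch.nn.functional as F

def token_matching(query_img_tokens, support_img_tokens, support_lbl_tokens, temperature=0.1):
    """query_img_tokens:   (Nq, D) embedded patches of the query image
       support_img_tokens: (Ns, D) embedded patches of the few labeled images
       support_lbl_tokens: (Ns, D) embedded patches of their labels
       returns:            (Nq, D) predicted label tokens for the query"""
    q = F.normalize(query_img_tokens, dim=-1)
    k = F.normalize(support_img_tokens, dim=-1)
    sim = q @ k.t() / temperature          # (Nq, Ns) patch-level similarities
    weights = sim.softmax(dim=-1)          # non-parametric matching weights
    return weights @ support_lbl_tokens    # aggregate support label tokens
```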
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
Autoregressive transformers have shown remarkable success in video
generation. However, these transformers are prevented from directly learning
long-term dependencies in videos due to the quadratic complexity of
self-attention, and they inherently suffer from slow inference and error
propagation due to the autoregressive process. In this paper, we propose
Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of
long-term dependency in videos and fast inference. Based on recent advances in
bidirectional transformers, our method learns to decode the entire
spatio-temporal volume of a video in parallel from partially observed patches.
The proposed transformer achieves linear time complexity in both encoding and
decoding by projecting observable context tokens into a fixed number of latent
tokens and conditioning on them to decode the masked tokens through
cross-attention. Empowered by linear complexity and bidirectional modeling, our
method demonstrates significant improvements over autoregressive
transformers in both quality and speed when generating moderately long videos.
Videos and code are available at https://sites.google.com/view/mebt-cvpr2023
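A minimal sketch of the latent-bottleneck decoding idea, assuming a Perceiver-style cross-attention layout; module names and sizes below are illustrative assumptions and not taken from the paper.

```python
# Sketch: observed context tokens are compressed into a fixed number of
# latent tokens via cross-attention, and masked tokens then cross-attend
# to those latents, so no step is quadratic in the video length.
import torch
import torch.nn as nn

class LatentBottleneckDecoder(nn.Module):
    def __init__(self, dim=512, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # fixed-size bottleneck
        self.encode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.decode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context_tokens, masked_queries):
        # context_tokens: (B, N_ctx, D) observed patches; masked_queries: (B, N_mask, D)
        B = context_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        # 1) compress context into the latents: cost O(N_ctx * num_latents)
        latents, _ = self.encode_attn(latents, context_tokens, context_tokens)
        # 2) decode all masked tokens in parallel from the latents: cost O(N_mask * num_latents)
        decoded, _ = self.decode_attn(masked_queries, latents, latents)
        return decoded
```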
Dynamic human resource selection for business process exceptions
A key capability of today's organizations is to react flexibly and effectively to unexpected events. A critical case of an unexpected event is the sudden unavailability of human resources, which has not been properly addressed by existing resource allocation approaches. This paper proposes a systematic approach that analyzes event logs to select suitable substitutes when the initially assigned human resources become unavailable. The approach uses process mining and social network analysis to derive a metric called the degree of substitution, which measures how much the work experiences of two human resources overlap, from two perspectives: task execution and transfer of work. Along with the metric, suitable substitutes are identified. A simulation demonstrates that the approach identifies suitable substitutes more effectively and accurately than existing allocation methods such as role-based allocation or random allocation. The proposed approach will increase the effectiveness of dynamic allocation of human resources, especially in exceptional situations.
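As an illustration only (the abstract does not give the formula), one plausible reading of the degree of substitution is an overlap score over executed tasks and handover partners extracted from the event log; the helper below and its equal weighting are assumptions, not the paper's definition.

```python
# Hypothetical sketch: overlap of two resources' experience as Jaccard
# similarity over (a) the tasks they executed and (b) the resources they
# handed work over to, both mined from an event log.
from collections import defaultdict

def degree_of_substitution(event_log, resource_a, resource_b):
    """event_log: iterable of dicts with keys 'case', 'task', 'resource', ordered by time."""
    tasks = defaultdict(set)       # resource -> executed tasks
    handovers = defaultdict(set)   # resource -> resources work was transferred to
    by_case = defaultdict(list)
    for ev in event_log:
        tasks[ev["resource"]].add(ev["task"])
        by_case[ev["case"]].append(ev["resource"])
    for trace in by_case.values():
        for src, dst in zip(trace, trace[1:]):
            if src != dst:
                handovers[src].add(dst)

    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 0.0

    task_overlap = jaccard(tasks[resource_a], tasks[resource_b])
    handover_overlap = jaccard(handovers[resource_a], handovers[resource_b])
    return 0.5 * (task_overlap + handover_overlap)   # assumed equal weighting
```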
Learning Symmetrization for Equivariance with Orbit Distance Minimization
We present a general framework for symmetrizing an arbitrary neural-network
architecture and making it equivariant with respect to a given group. We build
upon the symmetrization proposals of Kim et al. (2023) and Kaba et al. (2023),
and improve them by replacing their conversion of neural features into group
representations with an optimization whose loss intuitively measures the
distance between group orbits. This change makes our approach applicable to a
broader range of matrix groups, such as the Lorentz group O(1, 3), than these
two proposals. We experimentally show our method's competitiveness on the SO(2)
image classification task, and its increased generality on a task with
O(1, 3). Our implementation will be made accessible at
https://github.com/tiendatnguyen-vision/Orbit-symmetrize
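As a rough illustration of the idea, the sketch below approximates the distance between group orbits by minimizing over sampled group elements; the SO(2) sampling, shapes, and function names are assumptions and not the authors' loss.

```python
# Sketch of an orbit-distance objective: for a group G acting on feature
# vectors, the distance between the orbits of x and y is approximated by
# the minimum distance over a finite sample of group elements.
import torch

def sample_so2(num_samples):
    """Return (num_samples, 2, 2) rotation matrices sampled uniformly on SO(2)."""
    theta = torch.rand(num_samples) * 2 * torch.pi
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s], -1), torch.stack([s, c], -1)], -2)

def orbit_distance(x, y, group_elems):
    """x, y: (..., 2) feature vectors; group_elems: (S, 2, 2) sampled group actions.

    Approximates min_{g in G} ||g.x - y||, which is zero iff y lies on the orbit of x."""
    gx = torch.einsum("sij,...j->s...i", group_elems, x)   # apply each sampled g to x
    d = (gx - y.unsqueeze(0)).norm(dim=-1)                 # distance per group sample
    return d.min(dim=0).values                             # smallest distance over samples

# A training loss would minimize this orbit distance instead of mapping
# features to explicit group representations, as described above.
```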
Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost
To overcome the quadratic cost of self-attention, recent works have proposed
various sparse attention modules, most of which fall under one of two groups:
1) sparse attention with hand-crafted patterns and 2) full attention
followed by a sparse variant of softmax such as α-entmax. Unfortunately,
the first group lacks adaptability to data while the second still requires
quadratic cost in training. In this work, we propose SBM-Transformer, a model
that resolves both problems by endowing each attention head with a
mixed-membership Stochastic Block Model (SBM). Then, each attention head
data-adaptively samples a bipartite graph, the adjacency of which is used as an
attention mask for each input. During backpropagation, a straight-through
estimator is used to propagate gradients through the discrete sampling step and
adjust the probabilities of sampled edges based on the predictive loss. The
forward and backward costs are thus linear in the number of edges, which each
attention head can also choose flexibly based on the input. By assessing the
distribution of sampled graphs, we theoretically show that SBM-Transformer is a
universal approximator for arbitrary sequence-to-sequence functions in
expectation. Empirical evaluations on the LRA and GLUE benchmarks
demonstrate that our model outperforms previous efficient variants as well as
the original Transformer with full attention. Our implementation can be found
at https://github.com/sc782/SBM-Transformer
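A minimal sketch of the masking mechanism, assuming mixed-membership vectors and a block-interaction matrix per head; the names, shapes, and the particular straight-through formulation shown here are illustrative assumptions rather than the released implementation.

```python
# Sketch: edge (i, j) of the bipartite attention graph is sampled with
# probability q_i^T B k_j; the sampled adjacency masks attention, and a
# straight-through estimator lets gradients reach the edge probabilities.
import torch

def sbm_attention_mask(query_membership, key_membership, block_matrix):
    """query_membership: (N, K), key_membership: (M, K) rows on the simplex;
    block_matrix: (K, K) with entries in [0, 1]. Returns a (N, M) 0/1 mask."""
    probs = query_membership @ block_matrix @ key_membership.t()   # edge probabilities
    probs = probs.clamp(0.0, 1.0)
    mask = torch.bernoulli(probs)                                  # discrete bipartite graph
    # straight-through: forward uses the hard mask, backward flows through probs
    return mask + probs - probs.detach()

# Usage sketch: the mask gates the attention weights so that only sampled
# edges participate, making cost proportional to the number of edges.
```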