Auxiliary Learning as an Asymmetric Bargaining Game
Auxiliary learning is an effective method for enhancing the generalization
capabilities of trained models, particularly when dealing with small datasets.
However, this approach can present several difficulties: (i) optimizing
multiple objectives can be more challenging than single-task training, and
(ii) it is unclear how to balance the auxiliary tasks to best assist the main
task. In this work, we
propose a novel approach, named AuxiNash, for balancing tasks in auxiliary
learning by formalizing the problem as a generalized bargaining game with
asymmetric task bargaining power. Furthermore, we describe an efficient
procedure for learning the bargaining power of tasks based on their
contribution to the performance of the main task and derive theoretical
guarantees for its convergence. Finally, we evaluate AuxiNash on multiple
multi-task benchmarks and find that it consistently outperforms competing
methods.
Comment: ICML 202
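The core idea, a bargaining game in which tasks with larger bargaining power pull the joint update direction more strongly toward their own gradient, can be sketched as follows. This is a minimal illustration, not the paper's actual optimization procedure: it assumes the weighted bargaining direction is d = Gᵀα with α > 0 solving (G Gᵀ α)ᵢ = pᵢ/αᵢ elementwise, solved here by a damped fixed-point iteration.

```python
import numpy as np

def auxinash_direction(grads, powers, n_iter=50):
    """Joint update direction from an asymmetric bargaining game.

    grads:  (K, D) array, one task gradient per row.
    powers: (K,) positive bargaining powers p_i (larger = more influence).

    Sketch assumption: d = G^T alpha, where alpha > 0 solves the
    elementwise system (G G^T alpha)_i = p_i / alpha_i.
    """
    G = np.asarray(grads, dtype=float)
    p = np.asarray(powers, dtype=float)
    M = G @ G.T                      # Gram matrix of task gradients
    alpha = np.ones(len(p))
    for _ in range(n_iter):
        # geometric-mean damping keeps alpha positive and the iteration stable
        alpha = np.sqrt(alpha * p / np.maximum(M @ alpha, 1e-12))
    return G.T @ alpha               # weighted joint update direction
```

With equal powers this reduces to the symmetric bargaining direction; raising a task's power scales up its contribution to the agreed-upon update.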
Equivariant Architectures for Learning in Deep Weight Spaces
Designing machine learning architectures for processing neural networks in
their raw weight matrix form is a newly introduced research direction.
Unfortunately, the unique symmetry structure of deep weight spaces makes this
design very challenging. If successful, such architectures would be capable of
performing a wide range of intriguing tasks, from adapting a pre-trained
network to a new domain to editing objects represented as functions (INRs or
NeRFs). As a first step towards this goal, we present here a novel network
architecture for learning in deep weight spaces. It takes as input a
concatenation of weights and biases of a pre-trained MLP and processes it using
a composition of layers that are equivariant to the natural permutation
symmetry of the MLP's weights: Changing the order of neurons in intermediate
layers of the MLP does not affect the function it represents. We provide a full
characterization of all affine equivariant and invariant layers for these
symmetries and show how these layers can be implemented using three basic
operations: pooling, broadcasting, and fully connected layers applied to the
input in an appropriate manner. We demonstrate the effectiveness of our
architecture and its advantages over natural baselines in a variety of learning
tasks.
Comment: ICML 202
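The three basic operations named above, pooling, broadcasting, and linear maps, suffice to build a permutation-equivariant layer. The sketch below is a simplified single-matrix version (the paper's layers also handle biases and couple adjacent weight matrices): an affine map on a matrix W that commutes with independent row and column permutations.

```python
import numpy as np

class EquivariantMatrixLayer:
    """Affine layer on a matrix W (n x m) that is equivariant to
    independent permutations of its rows and columns.

    Built from the three basic operations named in the abstract:
    pooling, broadcasting, and per-entry linear combination.
    """
    def __init__(self, a=1.0, b=0.5, c=-0.3, d=0.2, bias=0.1):
        self.a, self.b, self.c, self.d, self.bias = a, b, c, d, bias

    def __call__(self, W):
        row = W.mean(axis=1, keepdims=True)   # pool over columns -> per-row stat
        col = W.mean(axis=0, keepdims=True)   # pool over rows -> per-column stat
        glob = W.mean()                       # pool over all entries
        # broadcasting restores the pooled statistics to shape (n, m)
        return self.a * W + self.b * row + self.c * col + self.d * glob + self.bias
```

Equivariance here means layer(P W Q) = P layer(W) Q for any permutation matrices P, Q: reordering neurons before or after the layer gives the same result.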
DisCLIP: Open-Vocabulary Referring Expression Generation
Referring Expressions Generation (REG) aims to produce textual descriptions
that unambiguously identify specific objects within a visual scene.
Traditionally, this has been achieved through supervised learning methods,
which perform well on specific data distributions but often struggle to
generalize to new images and concepts. To address this issue, we present a
novel approach for REG, named DisCLIP, short for discriminative CLIP. We build
on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a
contextual description of a target concept in an image while avoiding other
distracting concepts. Notably, this optimization happens at inference time and
does not require additional training or tuning of learned parameters. We
measure the quality of the generated text by evaluating the capability of a
receiver model to accurately identify the described object within the scene. To
achieve this, we use a frozen zero-shot comprehension module as a critic of
our generated referring expressions. We evaluate DisCLIP on multiple referring
expression benchmarks through human evaluation and show that it significantly
outperforms previous methods on out-of-domain datasets. Our results highlight
the potential of using pre-trained visual-semantic models for generating
high-quality contextual descriptions.
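The inference-time recipe, scoring candidate descriptions by how confidently a frozen listener resolves them to the target rather than to distractors, can be sketched as follows. Here `sim` is a hypothetical similarity callable standing in for a CLIP-style image-text score; no parameters are trained or tuned.

```python
import math

def discriminative_score(sim, text, target, distractors, temperature=0.07):
    """Softmax probability that a CLIP-style listener assigns to the
    target region given `text`. `sim(region, text)` is a hypothetical
    stand-in for a pretrained visual-semantic similarity function."""
    regions = [target] + list(distractors)
    logits = [sim(r, text) / temperature for r in regions]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return exps[0] / sum(exps)

def pick_referring_expression(sim, candidates, target, distractors):
    """Inference-time selection: choose the candidate description the
    frozen comprehension module resolves to the target most confidently."""
    return max(candidates,
               key=lambda t: discriminative_score(sim, t, target, distractors))
```

In the toy usage below, regions are attribute sets and `sim` counts word overlap; "red ball" wins because it matches the target strictly better than any distractor, while "ball" is ambiguous.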
COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality
Group Activity Recognition detects the activity collectively performed by a
group of actors, which requires compositional reasoning of actors and objects.
We approach the task by modeling the video as tokens that represent the
multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale
Transformer based architecture that performs attention-based reasoning over
tokens at each scale and learns group activity compositionally. In addition,
prior works suffer from scene biases and raise privacy and ethical concerns. We
use only the keypoint modality, which reduces scene biases and avoids acquiring
detailed visual data that may contain private or biased information about users.
We improve the multiscale representations in COMPOSER by clustering the
intermediate scale representations, while maintaining consistent cluster
assignments between scales. Finally, we use techniques such as auxiliary
prediction and data augmentations tailored to the keypoint signals to aid model
training. We demonstrate the model's strength and interpretability on two
widely-used datasets (Volleyball and Collective Activity). COMPOSER achieves up
to +5.4% improvement with just the keypoint modality. Code is available at
https://github.com/hongluzhou/composer
Comment: ECCV 202
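The keypoint-only, multi-scale tokenization can be illustrated with a toy coarse-to-fine hierarchy: joint tokens, per-person tokens, and a single group token. This is only a schematic (each coarser token here is just the mean of its children; COMPOSER itself learns these representations with a multiscale transformer), but it shows the compositional structure the model reasons over.

```python
import numpy as np

def build_multiscale_tokens(keypoints):
    """Toy multi-scale tokens from keypoints alone.

    keypoints: (P, J, 2) array of P actors with J 2-D joints each.
    Returns tokens at three scales: every joint, one token per person,
    and one token for the whole group. Mean pooling is a placeholder
    for the learned representations in the actual model.
    """
    joints = keypoints.reshape(-1, 2)    # finest scale: every joint
    persons = keypoints.mean(axis=1)     # mid scale: one token per actor
    group = persons.mean(axis=0)         # coarsest scale: the whole group
    return joints, persons, group
```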
Data Augmentations in Deep Weight Spaces
Learning in weight spaces, where neural networks process the weights of other
deep neural networks, has emerged as a promising research direction with
applications in various fields, from analyzing and editing neural fields and
implicit neural representations, to network pruning and quantization. Recent
works designed architectures for effective learning in that space, which take
into account its unique permutation-equivariant structure. Unfortunately, so
far these architectures suffer from severe overfitting and were shown to
benefit from large datasets. This poses a significant challenge because
generating data for this learning setup is laborious and time-consuming since
each data sample is a full set of network weights that has to be trained. In
this paper, we address this difficulty by investigating data augmentations for
weight spaces, a set of techniques that enable generating new data examples on
the fly without having to train additional input weight space elements. We
first review several recently proposed data augmentation schemes and divide
them into categories. We then introduce a novel
augmentation scheme based on the Mixup method. We evaluate the performance of
these techniques on existing benchmarks as well as new benchmarks we generate,
which can be valuable for future studies.
Comment: Accepted to NeurIPS 2023 Workshop on Symmetry and Geometry in Neural
Representation
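Two of the augmentation families discussed above can be sketched concretely: neuron-permutation augmentation, which produces a new weight-space sample representing exactly the same function, and a Mixup-style interpolation of two input networks' parameters. This is a minimal sketch on a two-layer MLP; the paper's Mixup scheme adapts the idea to weight-space structure rather than interpolating naively.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, params):
    """Forward pass of a two-layer MLP; params = ((W1, b1), (W2, b2))."""
    (W1, b1), (W2, b2) = params
    return relu(x @ W1 + b1) @ W2 + b2

def permute_hidden(params, perm):
    """Neuron-permutation augmentation: reorder the hidden units.
    The resulting weight-space sample computes the same function,
    so it comes for free, without training any new network."""
    (W1, b1), (W2, b2) = params
    return (W1[:, perm], b1[perm]), (W2[perm, :], b2)

def weight_space_mixup(params_a, params_b, lam):
    """Mixup in weight space: convexly interpolate two input networks'
    parameters to synthesize a new training example on the fly."""
    return [tuple(lam * wa + (1 - lam) * wb for wa, wb in zip(la, lb))
            for la, lb in zip(params_a, params_b)]
```

Permuting hidden units leaves the represented function unchanged, which is exactly why it is a valid label-preserving augmentation for weight-space learners.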
Multi-Task Learning as a Bargaining Game
In multi-task learning (MTL), a joint model is trained to simultaneously make
predictions for several tasks. Joint training reduces computation costs and
improves data efficiency; however, since the gradients of these different tasks
may conflict, training a joint model for MTL often yields lower performance
than its corresponding single-task counterparts. A common method for
alleviating this issue is to combine per-task gradients into a joint update
direction using a particular heuristic. In this paper, we propose viewing the
gradients combination step as a bargaining game, where tasks negotiate to reach
an agreement on a joint direction of parameter update. Under certain
assumptions, the bargaining problem has a unique solution, known as the Nash
Bargaining Solution, which we propose to use as a principled approach to
multi-task learning. We describe a new MTL optimization procedure, Nash-MTL,
and derive theoretical guarantees for its convergence. Empirically, we show
that Nash-MTL achieves state-of-the-art results on multiple MTL benchmarks in
various domains.
Comment: ICML 202
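The gradient-combination step can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes the Nash bargaining direction is d = Gᵀα with α > 0 solving the elementwise system (G Gᵀ α)ᵢ = 1/αᵢ, and solves it with a damped fixed-point iteration.

```python
import numpy as np

def nash_mtl_direction(grads, n_iter=50):
    """Combine per-task gradients via the Nash Bargaining Solution.

    grads: (K, D) array with one task gradient per row.
    Sketch assumption: the bargaining direction is d = G^T alpha,
    where alpha > 0 solves (G G^T alpha)_i = 1 / alpha_i elementwise.
    """
    G = np.asarray(grads, dtype=float)
    M = G @ G.T                      # Gram matrix of task gradients
    alpha = np.ones(M.shape[0])
    for _ in range(n_iter):
        # geometric-mean damping keeps alpha positive and the iteration stable
        alpha = np.sqrt(alpha / np.maximum(M @ alpha, 1e-12))
    return G.T @ alpha               # agreed-upon joint update direction
```

One appealing consequence of this formulation is invariance to per-task gradient scale: a task whose gradient is twice as long does not get twice the influence on the joint direction.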