Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning
Parameter-efficient fine-tuning (PEFT) has shown its effectiveness in
adapting pre-trained language models to downstream tasks while only
updating a small number of parameters. Despite the success, most existing
methods independently adapt to each task without considering knowledge transfer
between tasks and perform poorly in low-data regimes. To overcome this issue, we
propose Prototype-based HyperAdapter (PHA), a novel framework built on the
adapter-tuning and hypernetwork. It introduces an instance-dense retriever and
a prototypical hypernetwork to generate the conditional modules in a
sample-efficient manner. This yields performance comparable to existing PEFT
methods on multi-task learning and few-shot transfer
learning. More importantly, when the available data size gets smaller, our
method outperforms other strong baselines by a large margin. Based on our
extensive empirical experiments across various datasets, we demonstrate that
PHA strikes a better trade-off between trainable parameters, accuracy on
downstream tasks, and sample efficiency.
Comment: Accepted by EMNLP 2023
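As a rough illustration of the idea above, the sketch below pairs an instance-dense retriever (matching each instance embedding to a task prototype) with a hypernetwork that generates bottleneck-adapter weights from the selected prototype. All class names, dimensions, and the retrieval rule are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a prototype-conditioned hypernetwork generating adapter weights.
# Names, dimensions, and the nearest-prototype rule are illustrative assumptions.
import torch
import torch.nn as nn


class PrototypeRetriever(nn.Module):
    """Matches an instance embedding to its nearest task prototype."""

    def __init__(self, num_tasks: int, dim: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_tasks, dim))

    def forward(self, instance_emb: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each instance and every task prototype.
        sims = nn.functional.cosine_similarity(
            instance_emb.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )
        idx = sims.argmax(dim=-1)          # closest prototype per instance
        return self.prototypes[idx]        # (batch, dim)


class AdapterHyperNetwork(nn.Module):
    """Generates per-instance adapter weights from a prototype embedding."""

    def __init__(self, proto_dim: int, hidden: int, bottleneck: int):
        super().__init__()
        self.hidden, self.bottleneck = hidden, bottleneck
        # One linear head emits the flattened down- and up-projection matrices.
        self.weight_gen = nn.Linear(proto_dim, 2 * hidden * bottleneck)

    def forward(self, proto: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # proto: (batch, proto_dim), x: (batch, hidden) hidden states to adapt.
        flat = self.weight_gen(proto)
        w_down, w_up = flat.split(self.hidden * self.bottleneck, dim=-1)
        w_down = w_down.view(-1, self.hidden, self.bottleneck)
        w_up = w_up.view(-1, self.bottleneck, self.hidden)
        # Bottleneck adapter with a residual connection, applied per instance.
        h = torch.relu(torch.bmm(x.unsqueeze(1), w_down))
        return x + torch.bmm(h, w_up).squeeze(1)
```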
Pluralistic Aging Diffusion Autoencoder
Face aging is an ill-posed problem because multiple plausible aging patterns
may correspond to a given input, yet most existing methods produce only a
single deterministic estimate. This paper proposes a novel CLIP-driven Pluralistic
Aging Diffusion Autoencoder (PADA) to enhance the diversity of aging patterns.
First, we employ diffusion models to generate diverse low-level aging details
via a sequential denoising reverse process. Second, we present Probabilistic
Aging Embedding (PAE) to capture diverse high-level aging patterns, which
represents age information as probabilistic distributions in the common CLIP
latent space. A text-guided KL-divergence loss is designed to guide this
learning. Our method can achieve pluralistic face aging conditioned on
open-world aging texts and arbitrary unseen face images. Qualitative and
quantitative experiments demonstrate that our method can generate more diverse
and high-quality plausible aging results.
Comment: Accepted by ICCV 2023
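The sketch below illustrates one plausible reading of the Probabilistic Aging Embedding: age information is modeled as a Gaussian in a CLIP-like latent space and pulled toward a text-anchored prior with a KL term. The Gaussian parameterization, fixed-variance prior, and all names are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a probabilistic aging embedding with a text-guided KL loss.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class ProbabilisticAgingEmbedding(nn.Module):
    """Predicts a Gaussian over aging codes in a shared (CLIP-like) latent space."""

    def __init__(self, clip_dim: int = 512):
        super().__init__()
        self.mu_head = nn.Linear(clip_dim, clip_dim)
        self.logvar_head = nn.Linear(clip_dim, clip_dim)

    def forward(self, image_feat: torch.Tensor) -> Normal:
        mu = self.mu_head(image_feat)
        std = torch.exp(0.5 * self.logvar_head(image_feat))
        return Normal(mu, std)


def text_guided_kl_loss(pred: Normal, text_feat: torch.Tensor,
                        prior_std: float = 1.0) -> torch.Tensor:
    """KL between the predicted aging distribution and a text-anchored prior.

    The prior is a Gaussian centered at the CLIP embedding of an aging prompt
    (e.g. "a photo of a 70 year old person"), with a fixed variance (assumed).
    """
    prior = Normal(text_feat, torch.full_like(text_feat, prior_std))
    return kl_divergence(pred, prior).mean()


# Usage sketch: sampling a different embedding per forward pass yields a different
# plausible aging pattern, which is what makes the output pluralistic.
# aging_dist = pae(clip_image_encoder(face))
# z_age = aging_dist.rsample()          # condition the diffusion decoder on z_age
# loss = text_guided_kl_loss(aging_dist, clip_text_encoder(age_prompt))
```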
Towards Spatio-temporal Sea Surface Temperature Forecasting via Static and Dynamic Learnable Personalized Graph Convolution Network
Sea surface temperature (SST) is uniquely important to the Earth's atmosphere
since its dynamics are a major force in shaping local and global climate and
profoundly affect our ecosystems. Accurate forecasting of SST has significant
economic and social implications, for example enabling better preparation for
extreme weather such as severe droughts or tropical cyclones months ahead.
However, such a task faces unique challenges due to the intrinsic complexity
and uncertainty of ocean systems. Recently, deep learning techniques, such as
graph neural networks (GNNs), have been applied to this task. Although these
methods have achieved some success, they often struggle to capture the dynamic
spatiotemporal dependencies between signals. To address this problem, this
paper proposes a novel static and dynamic
learnable personalized graph convolution network (SD-LPGC). Specifically, two
graph learning layers are first constructed to respectively model the stable
long-term and short-term evolutionary patterns hidden in the multivariate SST
signals. Then, a learnable personalized convolution layer is designed to fuse
this information. Our experiments on real SST datasets demonstrate the
state-of-the-art performance of the proposed approach on the forecasting task.
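A minimal sketch of the static/dynamic graph idea described above: a static adjacency learned from node embeddings (long-term relations), a dynamic adjacency computed from the current signal (short-term relations), and a convolution that fuses both. The normalization and fusion choices here are assumptions, not the paper's exact design.

```python
# Rough sketch of static and dynamic learnable graphs over SST grid nodes,
# fused by a graph convolution. All layer names and choices are illustrative.
import torch
import torch.nn as nn


class StaticDynamicGraphConv(nn.Module):
    def __init__(self, num_nodes: int, in_dim: int, out_dim: int, emb_dim: int = 32):
        super().__init__()
        # Static graph: learned node embeddings capture stable long-term relations.
        self.node_emb = nn.Parameter(torch.randn(num_nodes, emb_dim))
        # Dynamic graph: projections of the current signal capture short-term relations.
        self.q_proj = nn.Linear(in_dim, emb_dim)
        self.k_proj = nn.Linear(in_dim, emb_dim)
        self.theta = nn.Linear(2 * in_dim, out_dim)   # fuses static and dynamic messages

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, in_dim) SST features for one time step.
        a_static = torch.softmax(self.node_emb @ self.node_emb.t(), dim=-1)
        q, k = self.q_proj(x), self.k_proj(x)
        a_dynamic = torch.softmax(q @ k.transpose(1, 2), dim=-1)
        msg_static = torch.einsum("ij,bjd->bid", a_static, x)
        msg_dynamic = torch.bmm(a_dynamic, x)
        return torch.relu(self.theta(torch.cat([msg_static, msg_dynamic], dim=-1)))
```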
Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification
We present a novel language-driven ordering alignment method for ordinal
classification. The labels in ordinal classification carry additional ordering
relations, which makes models prone to overfitting when they rely solely on
training data. Recent developments in pre-trained vision-language models
inspire us to leverage the rich ordinal priors in human language by converting
the original task into a vision-language alignment task. Consequently, we
propose L2RCLIP, which fully utilizes the language priors from two
perspectives. First, we introduce a complementary prompt tuning technique
called RankFormer, designed to enhance the ordering relation of original rank
prompts. It employs token-level attention with residual-style prompt blending
in the word embedding space. Second, to further incorporate language priors, we
revisit the approximate bound optimization of vanilla cross-entropy loss and
restructure it within the cross-modal embedding space. Consequently, we propose
a cross-modal ordinal pairwise loss to refine the CLIP feature space, where
texts and images maintain both semantic alignment and ordering alignment.
Extensive experiments on three ordinal classification tasks, including facial
age estimation, historical color image (HCI) classification, and aesthetic
assessment, demonstrate its promising performance. The code is available at
https://github.com/raywang335/L2RCLIP.
Comment: Accepted by NeurIPS 2023
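To make the cross-modal ordinal pairwise loss concrete, the sketch below penalizes cases where an image is closer to a wrong rank's text embedding than to its own, with a margin that grows with rank distance. This is an assumption-level reconstruction for illustration, not the released L2RCLIP loss.

```python
# Illustrative cross-modal ordinal pairwise loss in a CLIP-like embedding space.
import torch
import torch.nn.functional as F


def cross_modal_ordinal_pairwise_loss(img_feats: torch.Tensor,
                                      rank_text_feats: torch.Tensor,
                                      labels: torch.Tensor,
                                      margin_scale: float = 0.1) -> torch.Tensor:
    """img_feats: (B, D) image embeddings; rank_text_feats: (R, D) one text
    embedding per rank prompt; labels: (B,) integer ranks."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(rank_text_feats, dim=-1)
    sims = img @ txt.t()                              # (B, R) cosine similarities
    pos = sims.gather(1, labels.unsqueeze(1))         # similarity to the true rank
    # Rank distance between each sample's label and every candidate rank.
    ranks = torch.arange(txt.size(0), device=labels.device)
    dist = (labels.unsqueeze(1) - ranks.unsqueeze(0)).abs().float()
    # Hinge: the true-rank similarity should exceed others by a distance-scaled margin.
    violation = F.relu(sims - pos + margin_scale * dist)
    mask = dist > 0
    return (violation * mask).sum() / mask.sum().clamp(min=1)
```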
CHATEDIT: Towards Multi-turn Interactive Facial Image Editing via Dialogue
This paper explores interactive facial image editing via dialogue and
introduces the ChatEdit benchmark dataset for evaluating image editing and
conversation abilities in this context. ChatEdit is constructed from the
CelebA-HQ dataset, incorporating annotated multi-turn dialogues corresponding
to user edit requests on the images. The dataset is challenging, as it requires
the system to dynamically track user requests, edit images, and generate
appropriate responses. Accordingly, we propose three benchmark tasks: (i) user
edit request tracking, (ii) image editing, and (iii) response generation. We
present a novel baseline framework that integrates a dialogue module for both
tracking user requests and generating responses and an image editing module for
image editing. Unlike previous approaches, our framework directly tracks user
edit requests from the entire dialogue history up to the current turn and
modifies the original image rather than adjusting the previous turn's output,
thereby reducing error accumulation and preventing attribute forgetting.
Extensive experiments on the ChatEdit dataset underline our framework's
superior performance against prior models, while also highlighting potential
room for further research. We will release the code and data publicly to
facilitate advancements in complex interactive facial image editing.
Comment: Accepted to EMNLP 2023 (Main Conference)
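The per-turn control flow described above can be summarized in a few lines; the module interfaces (request_tracker, image_editor, response_generator) are hypothetical placeholders, not the released code.

```python
# Schematic sketch of one dialogue turn: the tracker reads the whole history and
# edits are always applied to the original image, so per-turn errors do not accumulate.
def chat_edit_turn(original_image, dialogue_history, user_utterance,
                   request_tracker, image_editor, response_generator):
    dialogue_history.append(("user", user_utterance))
    # (i) Track the cumulative edit request from the entire history, not just this turn.
    edit_request = request_tracker(dialogue_history)
    # (ii) Edit the ORIGINAL image with the cumulative request (no chained re-editing).
    edited_image = image_editor(original_image, edit_request)
    # (iii) Generate a system response grounded in the history and the applied edit.
    reply = response_generator(dialogue_history, edit_request)
    dialogue_history.append(("system", reply))
    return edited_image, reply
```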
MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces
Drawing upon the intuition that aligning different modalities to the same
semantic embedding space would allow models to understand states and actions
more easily, we propose a new perspective on the offline reinforcement learning
(RL) challenge. More concretely, we transform it into a supervised learning
task by integrating multimodal and pre-trained language models. Our approach
incorporates state information derived from images and action-related data
obtained from text, thereby bolstering RL training performance and promoting
long-term strategic thinking. We emphasize the contextual understanding of
language and demonstrate how decision-making in RL can benefit from aligning
the representations of states and actions with those of language. Our method
significantly outperforms current baselines as evidenced by evaluations
conducted on Atari and OpenAI Gym environments. This contributes to advancing
offline RL performance and efficiency while providing a novel perspective on
offline RL. Our code and data are available at
https://github.com/Zheng0428/MORE_
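A conceptual sketch of casting offline RL as supervised learning in a shared semantic space: image states and per-action text embeddings are projected into one space, and action logits are their similarities, trained with plain cross-entropy on logged trajectories. Encoder choices and dimensions are assumptions, not the authors' released architecture.

```python
# Minimal sketch: image states and text-described actions share one embedding space,
# and offline RL is reduced to supervised classification over logged trajectories.
import torch
import torch.nn as nn


class SharedSpacePolicy(nn.Module):
    def __init__(self, state_encoder: nn.Module, action_text_emb: torch.Tensor,
                 state_dim: int, shared_dim: int = 256):
        super().__init__()
        self.state_encoder = state_encoder          # e.g. a pretrained image encoder
        self.state_proj = nn.Linear(state_dim, shared_dim)
        # One fixed text embedding per discrete action, from a pretrained language model.
        self.register_buffer("action_emb", action_text_emb)   # (num_actions, txt_dim)
        self.action_proj = nn.Linear(action_text_emb.size(-1), shared_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # state_encoder is assumed to return (batch, state_dim) features.
        s = self.state_proj(self.state_encoder(frames))        # (batch, shared_dim)
        a = self.action_proj(self.action_emb)                  # (actions, shared_dim)
        return s @ a.t()                                        # (batch, actions) logits


# Supervised "offline RL" step on a logged batch (frames, actions):
# logits = policy(frames)
# loss = nn.functional.cross_entropy(logits, actions)
```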
Deep Reinforcement Learning with Multitask Episodic Memory Based on Task-Conditioned Hypernetwork
Deep reinforcement learning algorithms are usually impeded by sample
inefficiency, relying heavily on numerous interactions with the environment
to acquire accurate decision-making capabilities. In contrast, humans rely on
their hippocampus to retrieve relevant information from past experiences of
related tasks, which guides their decision-making when learning a new task,
rather than exclusively depending on environmental interactions. Nevertheless,
designing a hippocampus-like module for an agent to incorporate past
experiences into established reinforcement learning algorithms presents two
challenges. The first challenge involves selecting the most relevant past
experiences for the current task, and the second challenge is integrating such
experiences into the decision network. To address these challenges, we propose
a novel method that utilizes a retrieval network based on a task-conditioned
hypernetwork, which adapts the retrieval network's parameters depending on the
task. At the same time, a dynamic modification mechanism enhances the
collaborative efforts between the retrieval and decision networks. We evaluate
the proposed method on the MiniGrid environment. The experimental results
demonstrate that our proposed method significantly outperforms strong
baselines.
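A simplified sketch of the task-conditioned retrieval idea: a hypernetwork maps a task embedding to the retrieval network's weights, which embed the current state as a query over an episodic memory; the retrieved content is then fed to the decision network. Sizes, the scoring rule, and the way retrieval modulates the decision network are illustrative assumptions.

```python
# Simplified sketch of a task-conditioned hypernetwork parameterizing a retrieval
# network over episodic memory. All interfaces and dimensions are assumptions.
import torch
import torch.nn as nn


class TaskConditionedRetriever(nn.Module):
    def __init__(self, task_dim: int, state_dim: int, key_dim: int):
        super().__init__()
        self.state_dim, self.key_dim = state_dim, key_dim
        # Hypernetwork: maps a task embedding to the retrieval network's weight matrix.
        self.hyper = nn.Linear(task_dim, state_dim * key_dim)

    def forward(self, task_emb: torch.Tensor, state: torch.Tensor,
                memory_keys: torch.Tensor, memory_values: torch.Tensor) -> torch.Tensor:
        # task_emb: (task_dim,) one embedding per task; state: (batch, state_dim);
        # memory_keys: (mem, key_dim); memory_values: (mem, value_dim).
        w = self.hyper(task_emb).view(self.state_dim, self.key_dim)
        query = state @ w                                          # (batch, key_dim)
        scores = torch.softmax(query @ memory_keys.t(), dim=-1)    # (batch, mem)
        retrieved = scores @ memory_values                         # (batch, value_dim)
        return retrieved                    # concatenated with state features downstream


# decision_input = torch.cat([state_features, retrieved], dim=-1)
```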