BadPrompt: Backdoor Attacks on Continuous Prompts
The prompt-based learning paradigm has gained much research attention
recently. It has achieved state-of-the-art performance on several NLP tasks,
especially in the few-shot scenarios. Despite its success in steering downstream
tasks, few works have investigated the security problems of
prompt-based models. In this paper, we conduct the first study on the
vulnerability of the continuous prompt learning algorithm to backdoor attacks.
We observe that the few-shot scenarios have posed a great challenge to backdoor
attacks on the prompt-based models, limiting the usability of existing NLP
backdoor methods. To address this challenge, we propose BadPrompt, a
lightweight and task-adaptive algorithm, to backdoor continuous prompts.
Specifically, BadPrompt first generates candidate triggers which are indicative
of the targeted label and dissimilar to the samples of the
non-targeted labels. Then, it automatically selects the most effective and
invisible trigger for each sample with an adaptive trigger optimization
algorithm. We evaluate the performance of BadPrompt on five datasets and two
continuous prompt models. The results exhibit the abilities of BadPrompt to
effectively attack continuous prompts while maintaining high performance on the
clean test sets, outperforming the baseline models by a large margin. The
source code of BadPrompt is publicly available at
https://github.com/papersPapers/BadPrompt.
Comment: Accepted at NeurIPS 2022
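As a rough illustration of the candidate-trigger idea described above, the sketch below ranks candidate triggers by how strongly they push a prompt-based classifier toward the target label while staying dissimilar to non-targeted samples. This is not the authors' implementation: the embed and target_prob callables, the cosine-similarity measure, and the alpha trade-off are assumptions standing in for BadPrompt's actual trigger generation and adaptive trigger optimization.

```python
import torch
import torch.nn.functional as F

def rank_candidate_triggers(candidates, non_target_texts, embed, target_prob, alpha=0.5):
    """Rank candidate backdoor triggers for a prompt-based classifier (illustrative).

    candidates:       list[str], candidate trigger phrases
    non_target_texts: list[str], clean samples from non-target classes
    embed:            callable, maps list[str] -> FloatTensor [n, d]  (assumed helper)
    target_prob:      callable, maps a trigger str -> probability of the target label (assumed helper)
    alpha:            trade-off between effectiveness and invisibility
    """
    trig_emb = F.normalize(embed(candidates), dim=-1)          # [T, d]
    clean_emb = F.normalize(embed(non_target_texts), dim=-1)   # [N, d]

    # Effectiveness: how strongly the trigger alone pushes the model toward the target label.
    effectiveness = torch.tensor([target_prob(t) for t in candidates])   # [T]

    # Invisibility: low mean cosine similarity to non-target samples.
    similarity = (trig_emb @ clean_emb.T).mean(dim=1)           # [T]

    scores = alpha * effectiveness - (1.0 - alpha) * similarity
    order = torch.argsort(scores, descending=True)
    return [candidates[i] for i in order.tolist()]
```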
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
As the most fundamental tasks of computer vision, object detection and
segmentation have made tremendous progress in the deep learning era. Due to the
expensive manual labeling, the annotated categories in existing datasets are
often small-scale and pre-defined, i.e., state-of-the-art detectors and
segmentors fail to generalize beyond the closed vocabulary. To resolve this
limitation, the last few years have witnessed increasing attention toward
Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we
provide a comprehensive review on the past and recent development of OVD and
OVS. To this end, we develop a taxonomy according to the type of task and
methodology. We find that whether and how weak supervision signals are used
cleanly distinguishes the different methodologies, including: visual-semantic space
mapping, novel visual feature synthesis, region-aware training,
pseudo-labeling, knowledge distillation-based, and transfer learning-based. The
proposed taxonomy is universal across different tasks, covering object
detection, semantic/instance/panoptic segmentation, 3D scene and video
understanding. In each category, its main principles, key challenges,
development routes, strengths, and weaknesses are thoroughly discussed. In
addition, we benchmark each task along with the vital components of each
method. Finally, several promising directions are provided to stimulate future
research.
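To make the first methodology in the taxonomy above concrete, here is a minimal, illustrative sketch of visual-semantic space mapping for open-vocabulary classification of region features. The linear projection, the temperature, and the use of CLIP-style class-name embeddings are assumptions, not a description of any specific surveyed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabHead(nn.Module):
    """Classify region features against text embeddings of arbitrary class names."""

    def __init__(self, region_dim: int, text_dim: int, temperature: float = 0.01):
        super().__init__()
        self.proj = nn.Linear(region_dim, text_dim)  # map visual features into the text space
        self.temperature = temperature

    def forward(self, region_feats: torch.Tensor, class_text_embeds: torch.Tensor):
        # region_feats:      [num_regions, region_dim]
        # class_text_embeds: [num_classes, text_dim]; may include classes never seen in training
        v = F.normalize(self.proj(region_feats), dim=-1)
        t = F.normalize(class_text_embeds, dim=-1)
        return (v @ t.T) / self.temperature          # [num_regions, num_classes] logits

# At test time, novel categories are handled by simply swapping in their text embeddings.
head = OpenVocabHead(region_dim=1024, text_dim=512)
logits = head(torch.randn(5, 1024), torch.randn(20, 512))
```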
Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data
Few-shot keyword spotting (FS-KWS) models usually require large-scale
annotated datasets to generalize to unseen target keywords. However, existing
KWS datasets are limited in scale, and gathering keyword-like labeled data is a
costly undertaking. To mitigate this issue, we propose a framework that uses
easily collectible, unlabeled reading speech data as an auxiliary source.
Self-supervised learning has been widely adopted for learning representations
from unlabeled data; however, it is known to be suitable for large models with
enough capacity and is not practical for training a small footprint FS-KWS
model. Instead, we automatically annotate and filter the data to construct a
keyword-like dataset, LibriWord, enabling supervision on auxiliary data. We
then adopt multi-task learning that helps the model to enhance the
representation power from out-of-domain auxiliary data. Our method notably
improves the performance over competitive methods in the FS-KWS benchmark.
Comment: Interspeech 202
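A minimal sketch of the multi-task setup described above, under assumptions: a small shared encoder with one head for the few-shot target keywords and one for auxiliary keyword-like labels (such as those in LibriWord), trained with a weighted sum of two cross-entropy losses. The architecture and the loss weight are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskKWS(nn.Module):
    def __init__(self, feat_dim, num_target_kw, num_aux_kw, hidden=64):
        super().__init__()
        # Small shared encoder, sized for a small-footprint KWS model.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.target_head = nn.Linear(hidden, num_target_kw)  # few-shot target keywords
        self.aux_head = nn.Linear(hidden, num_aux_kw)        # auxiliary (LibriWord-style) labels

    def forward(self, x):
        h = self.encoder(x)
        return self.target_head(h), self.aux_head(h)

def multitask_loss(model, target_batch, aux_batch, aux_weight=0.3):
    # Weighted sum of the target-keyword loss and the auxiliary-supervision loss.
    ce = nn.CrossEntropyLoss()
    xt, yt = target_batch
    xa, ya = aux_batch
    logits_t, _ = model(xt)
    _, logits_a = model(xa)
    return ce(logits_t, yt) + aux_weight * ce(logits_a, ya)
```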
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models
Large pre-trained vision-language models have shown great prominence in
transferring pre-acquired knowledge to various domains and downstream tasks
with appropriate prompting or tuning. Existing prevalent tuning methods can be
generally categorized into three genres: 1) prompt engineering by creating
suitable prompt texts, which is time-consuming and requires domain expertise;
2) fine-tuning the whole model, which is extremely inefficient; and 3)
prompt tuning through parameterized prompt embeddings with the text encoder.
Nevertheless, all methods rely on the text encoder for bridging the modality
gap between vision and language. In this work, we question the necessity of the
cumbersome text encoder for a more lightweight and efficient tuning paradigm as
well as more representative prompt embeddings closer to the image
representations. To achieve this, we propose a Concept Embedding Search (ConES)
approach by optimizing prompt embeddings -- without the need of the text
encoder -- to capture the 'concept' of the image modality through a variety of
task objectives. By dropping the text encoder, we are able to significantly
speed up the learning process, e.g., from about an hour to just ten minutes in
our experiments for personalized text-to-image generation without impairing the
generation quality. Moreover, our proposed approach is orthogonal to
existing tuning methods since the searched concept embeddings can be further
utilized in the next stage of fine-tuning the pre-trained large models for
boosting performance. Extensive experiments show that our approach can beat the
prompt tuning and textual inversion methods in a variety of downstream tasks
including object detection, instance segmentation, and image generation. Our
approach also shows better generalization capability for unseen concepts in
specialized domains, such as the medical domain.
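The core idea of dropping the text encoder can be sketched as follows: a small table of learnable "concept" embeddings is optimized directly against the downstream task loss while the large pre-trained model stays frozen. The task_loss and downstream_model arguments, the optimizer, and the hyperparameters are placeholders, not the paper's implementation.

```python
import torch

def search_concept_embeddings(downstream_model, task_loss, data_loader,
                              num_tokens=4, embed_dim=512, steps=1000, lr=1e-3):
    # Learnable concept embeddings stand in for text-encoder outputs.
    prompt = torch.nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)
    optimizer = torch.optim.AdamW([prompt], lr=lr)

    # The large pre-trained model stays frozen; only the embeddings are tuned.
    downstream_model.eval()
    for p in downstream_model.parameters():
        p.requires_grad_(False)

    it = iter(data_loader)
    for _ in range(steps):
        try:
            batch = next(it)
        except StopIteration:
            it = iter(data_loader)
            batch = next(it)
        loss = task_loss(downstream_model, prompt, batch)  # assumed task-specific objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return prompt.detach()
```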
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
When trained at a sufficient scale, self-supervised learning has exhibited a
notable ability to solve a wide range of visual or language understanding
tasks. In this paper, we investigate simple, yet effective approaches for
adapting the pre-trained foundation models to the downstream task of interest,
namely, open-vocabulary semantic segmentation. To this end, we make the
following contributions: (i) we introduce Fusioner, with a lightweight,
transformer-based fusion module, that pairs the frozen visual representation
with language concept through a handful of image segmentation data. As a
consequence, the model gains the capability of zero-shot transfer to segment
novel categories; (ii) without loss of generality, we experiment on a broad
range of self-supervised models that have been pre-trained with different
schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT),
visual-language model (CLIP), and show that the proposed fusion approach is
effective for any pair of visual and language models, even those pre-trained on
a corpus of uni-modal data; (iii) we conduct thorough ablation studies to
analyze the critical components in our proposed Fusioner, while evaluating on
standard benchmarks, e.g., PASCAL-5i and COCO-20i, it surpasses existing
state-of-the-art models by a large margin, despite only being trained on frozen
visual and language features; (iv) to measure the model's robustness on
learning visual-language correspondence, we further evaluate on synthetic
dataset, named Mosaic-4, where images are constructed by mosaicking the samples
from FSS-1000. Fusioner demonstrates superior performance over previous models.
Comment: BMVC 2022 Oral
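A minimal sketch of a lightweight, transformer-based fusion module in the spirit of the one described above: frozen visual patch tokens and frozen class-name embeddings are mixed by a small transformer, and per-class segmentation logits are obtained by comparing the fused tokens. Dimensions, depth, and the dot-product mask head are assumptions rather than Fusioner's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_feats, class_feats):
        # patch_feats: [B, P, dim] frozen visual tokens; class_feats: [B, C, dim] frozen text tokens
        B, P, _ = patch_feats.shape
        tokens = torch.cat([patch_feats, class_feats], dim=1)   # [B, P + C, dim]
        fused = self.fusion(tokens)                             # cross-modal mixing
        v = F.normalize(fused[:, :P], dim=-1)
        t = F.normalize(fused[:, P:], dim=-1)
        # Per-class, per-patch logits from which segmentation masks can be decoded.
        return torch.einsum("bpd,bcd->bcp", v, t)               # [B, C, P]
```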
CT-BERT: Learning Better Tabular Representations Through Cross-Table Pre-training
Tabular data -- also known as structured data -- is one of the most common
data forms in existence, thanks to the stable development and scaled deployment
of database systems in the last few decades. At present, however, despite the
breakthroughs brought by large pre-trained models in other domains such as ChatGPT or
SAM, how to extract common knowledge across tables at a scale that may
eventually lead to generalizable representations for tabular data remains largely
unexplored. A few works have touched on this topic, but most (if not all)
of them are limited to the scope of a single table or a fixed form of schema.
In this work, we first identify the crucial research challenges behind tabular
data pre-training, particularly towards the cross-table scenario. We position
the contribution of this work in two folds: (i)-we collect and curate nearly 2k
high-quality tabular datasets, each of which is guaranteed to possess clear
semantics, clean labels, and other necessary meta information. (ii)-we propose
a novel framework that allows cross-table pre-training dubbed as CT-BERT.
Notably, in light of pioneering scaled cross-table training, CT-BERT is
fully compatible with both supervised and self-supervised schemes, where the
specific instantiation of CT-BERT is very much dependent on the downstream
tasks. We further propose and implement a contrastive-learning-based and masked
table modeling (MTM) objective in CT-BERT, inspired by the computer
vision and natural language processing communities but carefully tailored
to tables. The extensive empirical results on 15 datasets demonstrate CT-BERT's
state-of-the-art performance, where both its supervised and self-supervised
setups significantly outperform the prior approaches.
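As an illustration of a masked table modeling (MTM) style objective, the sketch below masks random cells of a serialized table and trains an encoder to reconstruct them. Representing each cell by a single token id and the specific masking ratio are simplifying assumptions; CT-BERT's actual objective and cell tokenization may differ.

```python
import torch
import torch.nn as nn

class MaskedTableModel(nn.Module):
    def __init__(self, vocab_size, dim=128, depth=2, heads=4, mask_ratio=0.15):
        super().__init__()
        self.cell_embed = nn.Embedding(vocab_size, dim)      # one token id per (discretized) cell
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.Linear(dim, vocab_size)
        self.mask_ratio = mask_ratio

    def forward(self, cell_ids):
        # cell_ids: [B, num_cells] integer ids of table cells (rows flattened into a sequence)
        x = self.cell_embed(cell_ids)
        mask = torch.rand(cell_ids.shape, device=cell_ids.device) < self.mask_ratio
        # Replace masked cells with the learned [MASK] embedding.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        logits = self.decoder(self.encoder(x))                # [B, num_cells, vocab_size]
        # Reconstruction loss only on the masked positions.
        return nn.functional.cross_entropy(logits[mask], cell_ids[mask])
```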
PRE: Vision-Language Prompt Learning with Reparameterization Encoder
Large pre-trained vision-language models such as CLIP have demonstrated great
potential in zero-shot transferability to downstream tasks. However, to attain
optimal performance, the manual selection of prompts is necessary to improve
alignment between the downstream image distribution and the textual class
descriptions. This manual prompt engineering is the major challenge for
deploying such models in practice since it requires domain expertise and is
extremely time-consuming. To avoid non-trivial prompt engineering, recent work
Context Optimization (CoOp) introduced the concept of prompt learning to the
vision domain using learnable textual tokens. While CoOp can achieve
substantial improvements over manual prompts, its learned context generalizes
poorly to wider unseen classes within the same dataset. In this work, we
present Prompt Learning with Reparameterization Encoder (PRE) - a simple and
efficient method that enhances the generalization ability of the learnable
prompt to unseen classes while maintaining the capacity to learn Base classes.
Instead of directly optimizing the prompts, PRE employs a prompt encoder to
reparameterize the input prompt embeddings, enhancing the exploration of
task-specific knowledge from few-shot samples. Experiments and extensive
ablation studies on 8 benchmarks demonstrate that our approach is an efficient
method for prompt learning. Specifically, PRE achieves a notable enhancement of
5.60% in average accuracy on New classes and 3% in Harmonic mean compared to
CoOp in the 16-shot setting, all within a reasonable training time.
Comment: 8 pages excluding References and Appendix
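The reparameterization idea can be sketched as follows: instead of optimizing the context tokens directly (as CoOp does), gradients flow through a small encoder that transforms them before they are concatenated with the class-name token embeddings and passed to the frozen text encoder. The residual MLP encoder and the dimensions here are assumptions, not necessarily PRE's exact architecture.

```python
import torch
import torch.nn as nn

class ReparamPrompt(nn.Module):
    def __init__(self, n_ctx=16, dim=512, hidden=256):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # learnable context tokens
        self.encoder = nn.Sequential(                             # reparameterization encoder
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, class_token_embeds):
        # class_token_embeds: [num_classes, n_cls_tokens, dim] embeddings of class-name tokens
        ctx = self.encoder(self.ctx) + self.ctx                    # residual reparameterization
        ctx = ctx.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1)
        # Full prompts to be fed to the frozen text encoder of the vision-language model.
        return torch.cat([ctx, class_token_embeds], dim=1)
```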
SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
Learned representations of scientific documents can serve as valuable input
features for downstream tasks, without the need for further fine-tuning.
However, existing benchmarks for evaluating these representations fail to
capture the diversity of relevant tasks. In response, we introduce SciRepEval,
the first comprehensive benchmark for training and evaluating scientific
document representations. It includes 25 challenging and realistic tasks, 11 of
which are new, across four formats: classification, regression, ranking and
search. We then use the benchmark to study and improve the generalization
ability of scientific document representation models. We show how
state-of-the-art models struggle to generalize across task formats, and that
simple multi-task training fails to improve them. However, a new approach that
learns multiple embeddings per document, each tailored to a different format,
can improve performance. We experiment with task-format-specific control codes
and adapters in a multi-task setting and find that they outperform the existing
single-embedding state-of-the-art by up to 1.5 points absolute.
Comment: 21 pages, 2 figures, 9 tables. For associated code, see
https://github.com/allenai/scirepeva
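A minimal sketch of the control-code idea, under assumptions: a format-specific control token is prepended to the title and abstract, and the encoder's output is pooled into a per-format document embedding. The token names, the mean pooling, and the HuggingFace-style model/tokenizer interface are hypothetical; in practice the control tokens would also need to be added to the tokenizer's vocabulary.

```python
import torch

# Hypothetical control-code tokens, one per task format.
CONTROL_CODES = {"classification": "[CLF]", "regression": "[RGN]",
                 "ranking": "[PRX]", "search": "[QRY]"}

def embed_document(model, tokenizer, title, abstract, task_format):
    # Prepend the control code so the encoder can specialize its representation per format.
    text = f"{CONTROL_CODES[task_format]} {title} {abstract}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # [1, seq_len, dim]
    # Mean pooling is one simple choice; pooling the control token's position is another.
    return hidden.mean(dim=1)                            # [1, dim] format-specific embedding
```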