One Fits All: Power General Time Series Analysis by Pretrained LM
Although we have witnessed great success of pre-trained models in natural
language processing (NLP) and computer vision (CV), limited progress has been
made for general time series analysis. Unlike NLP and CV where a unified model
can be used to perform different tasks, specially designed approaches still
dominate in each time series analysis task such as classification, anomaly
detection, forecasting, and few-shot learning. The main challenge that blocks
the development of a pre-trained model for time series analysis is the lack of a
large amount of data for training. In this work, we address this challenge by
leveraging language or CV models, pre-trained on billions of tokens, for time
series analysis. Specifically, we refrain from altering the self-attention and
feedforward layers of the residual blocks in the pre-trained language or image
model. This model, known as the Frozen Pretrained Transformer (FPT), is
evaluated through fine-tuning on all major types of tasks involving time
series. Our results demonstrate that pre-trained models on natural language or
images can lead to a comparable or state-of-the-art performance in all main
time series analysis tasks, as illustrated in Figure 1. We also found, both
theoretically and empirically, that the self-attention module behaves
similarly to principal component analysis (PCA), an observation that helps
explain how the transformer bridges the domain gap and is a crucial step towards
understanding the universality of a pre-trained transformer. The code is
publicly available at https://github.com/DAMO-DI-ML/One_Fits_All. Comment: NeurIPS 2023 Spotlight
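
A minimal sketch of the freezing scheme described above, assuming a GPT-2 backbone from the Hugging Face transformers library; the patch length, forecast horizon, and projection heads are illustrative assumptions, not the paper's exact configuration:

import torch.nn as nn
from transformers import GPT2Model

# Assumed GPT-2 backbone; the paper also considers pre-trained image models.
backbone = GPT2Model.from_pretrained("gpt2")

# Freeze the self-attention and feed-forward layers of every residual block,
# leaving the remaining parameters (e.g. layer norms, embeddings) trainable.
for block in backbone.h:
    for p in block.attn.parameters():
        p.requires_grad = False
    for p in block.mlp.parameters():
        p.requires_grad = False

# Hypothetical task-specific heads: project time-series patches into the
# transformer's hidden space and map hidden states to a forecast horizon.
patch_len, horizon = 16, 96
to_hidden = nn.Linear(patch_len, backbone.config.n_embd)
to_forecast = nn.Linear(backbone.config.n_embd, horizon)
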
Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture
Supporting the current trend in the AI community, we present the AI Journey
2021 Challenge called Fusion Brain, the first competition aimed at building a
universal architecture that can process different modalities (in
this case, images, texts, and code) and solve multiple tasks for vision and
language. The Fusion Brain Challenge combines the following specific tasks:
Code2code Translation, Handwritten Text Recognition, Zero-shot Object
Detection, and Visual Question Answering. We have created datasets for each
task to test the participants' submissions on them. Moreover, we have collected
and made publicly available a new handwritten dataset in both English and
Russian, which consists of 94,128 pairs of images and texts. We also propose a
multimodal and multitask architecture as a baseline solution, centered on a
frozen foundation model and trained in Fusion mode alongside Single-task mode.
The proposed Fusion approach proves to be competitive and more energy-efficient
than the task-specific one.
Pre-trained transformer for adversarial purification
With more and more deep neural networks being deployed as various daily
services, their reliability is essential. It is concerning that deep neural
networks are vulnerable and sensitive to adversarial attacks, the most common
of which against online services are evasion attacks. Recent works usually strengthen
robustness through adversarial training or by leveraging knowledge of a large
amount of clean data. However, in practical terms, retraining and redeploying the
model require a large computational budget, leading to heavy losses for the online
service. In addition, when adversarial examples of a certain attack are
detected, only limited adversarial examples are available for the service
provider, while much clean data may not be accessible. Given the mentioned
problems, we propose a new scenario, RaPiD (Rapid Plug-in Defender), in which
the frozen original service model must be rapidly defended against a certain
attack using only a few clean and adversarial examples. Motivated by the
generalization and the universal computation ability of pre-trained transformer
models, we propose a new defender method, CeTaD, which stands for
Considering Pre-trained Transformers as Defenders. In particular, we evaluate
the effectiveness and the transferability of CeTaD in the case of one-shot
adversarial examples and explore the impact of different parts of CeTaD as well
as training data conditions. CeTaD is flexible, able to be embedded into an
arbitrary differentiable model, and suitable for various types of attacks.
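
A minimal sketch of the plug-in-defender idea as the abstract describes it, with hypothetical module names; the purifier here stands in for an adapted pre-trained transformer and is not the paper's exact CeTaD design:

import torch.nn as nn

class PluginDefender(nn.Module):
    # Hypothetical wrapper: a small trainable purifier is prepended to the
    # frozen, already-deployed service model.
    def __init__(self, purifier: nn.Module, service_model: nn.Module):
        super().__init__()
        self.purifier = purifier
        self.service_model = service_model
        for p in self.service_model.parameters():
            p.requires_grad = False  # the original service model stays frozen

    def forward(self, x):
        # Purify the (possibly adversarial) input, then run the unchanged service model.
        return self.service_model(self.purifier(x))
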
Pretrain on just structure: Understanding linguistic inductive biases using transfer learning
Both humans and transformer language models are able to learn language
without explicit structural supervision. What inductive learning biases make
this learning possible? In this study, we examine the effect of different
inductive learning biases by predisposing language models with structural
biases through pretraining on artificial structured data, and then evaluating
by fine-tuning on English. Our experimental setup gives us the ability to
actively control the inductive bias of language models. With our experiments,
we investigate the comparative success of three types of inductive bias: 1) an
inductive bias for recursive, hierarchical processing; 2) an inductive bias for
unrestricted token-token dependencies that cannot be modeled by context-free
grammars; and 3) an inductive bias for a Zipfian power-law vocabulary
distribution. We show that complex token-token interactions form the best
inductive biases, and that this effect is strongest in the non-context-free case. We
also show that a Zipfian vocabulary distribution forms a good inductive bias
independently from grammatical structure. Our study leverages the capabilities
of transformer models to run controlled language learning experiments that are
not possible to run in humans, and surfaces hypotheses about the structures
that facilitate language learning in both humans and machines.
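
As an illustration of the third bias, a minimal sketch of sampling artificial token sequences whose unigram frequencies follow a Zipfian power law; the vocabulary size, exponent, and sequence length are arbitrary assumptions, not the paper's settings:

import numpy as np

vocab_size, exponent, seq_len = 1000, 1.0, 128
ranks = np.arange(1, vocab_size + 1)
probs = ranks ** -exponent
probs /= probs.sum()                      # Zipfian unigram distribution over token ids

rng = np.random.default_rng(0)
sequence = rng.choice(vocab_size, size=seq_len, p=probs)  # one artificial training sequence
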
Linearly Mapping from Image to Text Space
The extent to which text-only language models (LMs) learn to represent the
physical, non-linguistic world is an open question. Prior work has shown that
pretrained LMs can be taught to "understand" visual inputs when the models'
parameters are updated on image captioning tasks. We test a stronger
hypothesis: that the conceptual representations learned by text-only models are
functionally equivalent (up to a linear transformation) to those learned by
models trained on vision tasks. Specifically, we show that the image
representations from vision models can be transferred as continuous prompts to
frozen LMs by training only a single linear projection. Using these to prompt
the LM achieves competitive performance on captioning and visual question
answering tasks compared to models that tune both the image encoder and text
decoder (such as the MAGMA model). We compare three image encoders with
increasing amounts of linguistic supervision seen during pretraining: BEIT (no
linguistic information), NF-ResNET (lexical category information), and CLIP
(full natural language descriptions). We find that all three encoders perform
equally well at transferring visual property information to the language model
(e.g., whether an animal is large or small), but that image encoders pretrained
with linguistic supervision more saliently encode category information (e.g.,
distinguishing hippo vs. elephant) and thus perform significantly better on
benchmark language-and-vision tasks. Our results indicate that LMs encode
conceptual information structurally similarly to vision-based models, even
those that are solely trained on images.
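
A minimal sketch of the single trained component the abstract describes: a linear projection from frozen image-encoder features to a sequence of continuous prompt vectors in the frozen LM's embedding space. The feature dimensions and prompt length are illustrative assumptions:

import torch
import torch.nn as nn

img_dim, lm_dim, prompt_len = 2048, 4096, 4         # assumed sizes, not the paper's exact values
project = nn.Linear(img_dim, lm_dim * prompt_len)   # the only parameters that are trained

def image_to_prompt(img_features: torch.Tensor) -> torch.Tensor:
    # Map a batch of image features to prompt_len soft-prompt vectors that are
    # prepended to the frozen language model's input embeddings.
    batch = img_features.shape[0]
    return project(img_features).view(batch, prompt_len, lm_dim)
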
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
Building artificial intelligence (AI) systems on top of a set of foundation
models (FMs) is becoming a new paradigm in AI research. Their representational
and generative abilities, learned from vast amounts of data, can be easily adapted
and transferred to a wide range of downstream tasks without extra training from
scratch. However, leveraging FMs in cross-modal generation remains
under-researched when the audio modality is involved. On the other hand,
automatically generating semantically-relevant sound from visual input is an
important problem in cross-modal generation studies. To solve this
vision-to-audio (V2A) generation problem, existing methods tend to design and
build complex systems from scratch using modestly sized datasets. In this
paper, we propose a lightweight solution to this problem by leveraging
foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate
the domain gap between the latent space of the visual CLIP and the auditory
CLAP models. Then we propose a simple yet effective mapper mechanism
(V2A-Mapper) to bridge the domain gap by translating the visual input between
CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained
audio generative FM AudioLDM is adopted to produce high-fidelity and
visually-aligned sound. Compared to previous approaches, our method only
requires a quick training of the V2A-Mapper. We further analyze and conduct
extensive experiments on the choice of the V2A-Mapper and show that a
generative mapper is better at fidelity and variability (FD) while a regression
mapper is slightly better at relevance (CS). Both objective and subjective
evaluation on two V2A datasets demonstrate the superiority of our proposed
method compared to current state-of-the-art approaches: trained with 86% fewer
parameters, it achieves 53% and 19% improvements in FD and CS, respectively. Comment: 13 pages, 10 figures. Demo page: https://v2a-mapper.github.io
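
A minimal sketch of a regression-style mapper between the two embedding spaces; the 512-dimensional CLIP/CLAP embeddings and the hidden width are assumptions for illustration, not the paper's exact architecture:

import torch.nn as nn

clip_dim, clap_dim, hidden = 512, 512, 1024     # assumed embedding sizes
v2a_mapper = nn.Sequential(
    nn.Linear(clip_dim, hidden),
    nn.GELU(),
    nn.Linear(hidden, clap_dim),  # translated embedding conditions the pretrained audio FM
)
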
Grounding Language Models to Images for Multimodal Inputs and Outputs
We propose an efficient method to ground pretrained text-only language models
to the visual domain, enabling them to process arbitrarily interleaved
image-and-text data, and generate text interleaved with retrieved images. Our
method leverages the abilities of language models learned from large-scale
text-only pretraining, such as in-context learning and free-form text
generation. We keep the language model frozen, and finetune input and output
linear layers to enable cross-modality interactions. This allows our model to
process arbitrarily interleaved image-and-text inputs, and generate free-form
text interleaved with retrieved images. We achieve strong zero-shot performance
on grounded tasks such as contextual image retrieval and multimodal dialogue,
and showcase compelling interactive abilities. Our approach works with any
off-the-shelf language model and paves the way towards an effective, general
solution for leveraging pretrained language models in visually grounded
settings. Comment: Published in ICML 2023. Project page: https://jykoh.com/fromag
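
A minimal sketch of the training setup the abstract describes: the language model is kept frozen and only linear input/output layers receive gradients. The OPT-125M checkpoint, the feature dimensions, and the retrieval-space size are stand-in assumptions, not the paper's configuration:

import itertools
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

lm = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # assumed stand-in LM
for p in lm.parameters():
    p.requires_grad = False                  # the language model stays frozen

lm_dim = lm.config.hidden_size
visual_in = nn.Linear(1024, lm_dim)          # maps image features into the LM input space
retrieval_out = nn.Linear(lm_dim, 256)       # maps LM hidden states to an image-retrieval space

# Only the two linear adapters are optimized.
optimizer = torch.optim.AdamW(
    itertools.chain(visual_in.parameters(), retrieval_out.parameters()), lr=1e-4
)
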
Merging Decision Transformers: Weight Averaging for Forming Multi-Task Policies
Recent work has shown the promise of creating generalist, transformer-based,
models for language, vision, and sequential decision-making problems. To create
such models, we generally require centralized training objectives, data, and
compute. It is of interest whether we can more flexibly create generalist policies
by merging together multiple task-specific, individually trained policies. In
this work, we take a preliminary step in this direction by merging, or
averaging in parameter space, subsets of Decision Transformers trained on
different MuJoCo locomotion problems, forming multi-task models without
centralized training. We also demonstrate the importance of various
methodological choices when merging policies, such as utilizing common
pre-trained initializations, increasing model capacity, and utilizing Fisher
information for weighting parameter importance. In general, we believe research
in this direction could help democratize and distribute the process that forms
multi-task robotics policies. Our implementation is available at
https://github.com/daniellawson9999/merging-decision-transformers
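
A minimal sketch of parameter-space merging as described above: uniform averaging of same-architecture checkpoints, with an optional Fisher-information weighting when per-parameter importance estimates are available. The helper name and interface are hypothetical:

import torch

def merge_state_dicts(state_dicts, fisher_weights=None):
    # Merge same-architecture checkpoints parameter by parameter.
    merged = {}
    for name in state_dicts[0]:
        params = torch.stack([sd[name].float() for sd in state_dicts])
        if fisher_weights is None:
            merged[name] = params.mean(dim=0)            # plain weight averaging
        else:
            w = torch.stack([fw[name].float() for fw in fisher_weights])
            merged[name] = (w * params).sum(dim=0) / w.sum(dim=0).clamp_min(1e-8)
    return merged
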
- …