HoME: a Household Multimodal Environment
We introduce HoME: a Household Multimodal Environment for artificial agents
to learn from vision, audio, semantics, physics, and interaction with objects
and other agents, all within a realistic context. HoME integrates over 45,000
diverse 3D house layouts based on the SUNCG dataset, a scale which may
facilitate learning, generalization, and transfer. HoME is an open-source,
OpenAI Gym-compatible platform extensible to tasks in reinforcement learning,
language grounding, sound-based navigation, robotics, multi-agent learning, and
more. We hope HoME better enables artificial agents to learn as humans do: in
an interactive, multimodal, and richly contextualized setting.
Comment: Presented at NIPS 2017's Visually-Grounded Interaction and Language Workshop.
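The "OpenAI Gym-compatible" claim can be pictured with a minimal sketch of the interface such a platform exposes. The environment class, observation keys, and reward scheme below are illustrative assumptions for this sketch, not HoME's actual API:

```python
# Minimal Gym-style environment with multimodal observations (vision,
# audio, semantics), loosely modeled on what a household platform could
# return. All names and values here are hypothetical.
import random

class ToyHouseholdEnv:
    """Gym-compatible toy environment: reach the last room of a house."""

    ACTIONS = ["forward", "turn_left", "turn_right"]

    def __init__(self, num_rooms=3, seed=0):
        self.num_rooms = num_rooms
        self.rng = random.Random(seed)
        self.room = 0

    def _observe(self):
        # One multimodal observation: fake image, audio, and semantic label.
        return {
            "vision": [[self.rng.random() for _ in range(4)] for _ in range(4)],
            "audio": [self.rng.random() for _ in range(8)],
            "semantics": f"room_{self.room}",
        }

    def reset(self):
        self.room = 0
        return self._observe()

    def step(self, action):
        assert action in self.ACTIONS
        if action == "forward":
            self.room = min(self.room + 1, self.num_rooms - 1)
        reward = 1.0 if self.room == self.num_rooms - 1 else 0.0
        done = reward > 0.0
        return self._observe(), reward, done, {}

env = ToyHouseholdEnv()
obs = env.reset()
obs, reward, done, info = env.step("forward")
```

An agent interacts through the standard `reset`/`step` loop, which is what makes such environments drop-in compatible with reinforcement learning toolkits.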
CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks
Current state-of-the-art vision-and-language models are evaluated on tasks
either individually or in a multi-task setting, overlooking the challenges of
continual learning (CL) of tasks as they arrive. Existing CL benchmarks have
facilitated research on task adaptation and mitigating "catastrophic
forgetting", but are limited to vision-only and language-only tasks. We present
CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL
setting, and to systematically evaluate how upstream continual learning can
rapidly generalize to new multimodal and unimodal tasks. CLiMB includes
implementations of several CL algorithms and a modified Vision-Language
Transformer (ViLT) model that can be deployed on both multimodal and unimodal
tasks. We find that common CL methods can help mitigate forgetting during
multimodal task learning, but do not enable cross-task knowledge transfer. We
envision that CLiMB will facilitate research on a new class of CL algorithms
for this challenging multimodal setting.
Comment: Accepted to the NeurIPS 2022 Datasets and Benchmarks track.
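The "catastrophic forgetting" a benchmark like CLiMB measures can be sketched with a standard forgetting metric over an accuracy matrix, where entry `acc[i][j]` is the accuracy on task `j` after training on tasks `0..i`. The numbers below are illustrative only:

```python
# Average forgetting: for each earlier task, the drop from its best
# accuracy during the sequence to its final accuracy. Values are made up
# for illustration; this is a generic CL metric, not CLiMB's exact code.

def forgetting(acc):
    """Mean drop from each task's best accuracy to its final accuracy."""
    final = acc[-1]
    drops = []
    for j in range(len(final) - 1):   # the last task cannot be forgotten yet
        best = max(acc[i][j] for i in range(j, len(acc)))
        drops.append(best - final[j])
    return sum(drops) / len(drops)

# Accuracy on 3 multimodal tasks learned in sequence (illustrative).
acc = [
    [0.80, 0.00, 0.00],
    [0.62, 0.75, 0.00],
    [0.55, 0.64, 0.78],
]
print(round(forgetting(acc), 4))  # 0.18
```

A CL method "mitigates forgetting" when this number shrinks; "cross-task transfer" would show up as the diagonal entries rising as more tasks are learned.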
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
We present CoDi-2, a versatile and interactive Multimodal Large Language
Model (MLLM) that can follow complex multimodal interleaved instructions,
conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any
input-output modality paradigm. By aligning modalities with language for both
encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not
only understand complex modality-interleaved instructions and in-context
examples, but also autoregressively generate grounded and coherent multimodal
outputs in the continuous feature space. To train CoDi-2, we build a
large-scale generation dataset encompassing in-context multimodal instructions
across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot
capabilities for multimodal generation, such as in-context learning, reasoning,
and compositionality of any-to-any modality generation through multi-round
interactive conversation. CoDi-2 surpasses previous domain-specific models on
tasks such as subject-driven image generation, vision transformation, and audio
editing. CoDi-2 signifies a substantial breakthrough in developing a
comprehensive multimodal foundation model adept at interpreting in-context
language-vision-audio interleaved instructions and producing multimodal
outputs.
Comment: Project page: https://codi-2.github.io
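The "multimodal interleaved instructions" such a model consumes can be pictured as a sequence mixing text spans with embedded non-text segments. The segment structure and tag format below are assumptions for illustration, not CoDi-2's actual input encoding:

```python
# A toy representation of an interleaved, any-to-any instruction: text
# segments alternate with placeholder embeddings for other modalities.
from dataclasses import dataclass

@dataclass
class Segment:
    modality: str    # "text", "image", or "audio"
    content: object  # raw text, or a feature vector for non-text modalities

def to_prompt(segments):
    """Flatten an interleaved instruction into a tagged prompt string."""
    parts = []
    for seg in segments:
        if seg.modality == "text":
            parts.append(seg.content)
        else:
            parts.append(f"<{seg.modality}:{len(seg.content)}-dim>")
    return " ".join(parts)

instruction = [
    Segment("text", "Make the dog in"),
    Segment("image", [0.1] * 512),   # placeholder image embedding
    Segment("text", "bark like"),
    Segment("audio", [0.2] * 128),   # placeholder audio embedding
]
print(to_prompt(instruction))
# Make the dog in <image:512-dim> bark like <audio:128-dim>
```

Aligning every modality with language, as the abstract describes, is what lets a single LLM backbone consume such a sequence and emit outputs in any modality.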
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Multimodal representation learning has shown promising improvements on
various vision-language tasks. Most existing methods excel at building
global-level alignment between vision and language while lacking effective
fine-grained image-text interaction. In this paper, we propose a jointly masked
multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both
implicit and explicit targets for the masked signals to recover. The implicit
target provides a unified and debiased objective for vision and language, where
the model predicts latent multimodal representations of the unmasked input. The
explicit target further enriches the multimodal representations by recovering
high-level and semantically meaningful information: momentum visual features of
image patches and concepts of word tokens. Through such a masked modeling
process, our model not only learns fine-grained multimodal interaction, but
also avoids the semantic gap between high-level representations and low- or
mid-level prediction targets (e.g., image pixels), thus producing semantically
rich multimodal representations that perform well in both zero-shot and
fine-tuned settings. Our pre-trained model (named MAMO) achieves
state-of-the-art performance on various downstream vision-language tasks,
including image-text retrieval, visual question answering, visual reasoning,
and weakly-supervised visual grounding.
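The interplay of the implicit and explicit targets can be sketched numerically: mask part of the joint input, predict with an online encoder, and score against a momentum encoder's features. The single-weight "encoders" and the loss weighting below are stand-ins for illustration, not the paper's architecture:

```python
# Toy joint masked modeling: an implicit target (match the momentum
# encoder's latent of the *unmasked* input) plus an explicit target
# (recover momentum features at the masked positions only).
import random

def encode(x, w):
    """Stand-in encoder: scales each feature by a single weight."""
    return [w * v for v in x]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

rng = random.Random(0)
x = [rng.random() for _ in range(8)]              # joint image-text features
mask = [i < 3 for i in range(8)]                  # mask the first 3 positions
x_masked = [0.0 if m else v for v, m in zip(x, mask)]

w_online, w_momentum = 1.0, 0.95                  # online and momentum encoders

prediction = encode(x_masked, w_online)           # model sees masked input
momentum_latent = encode(x, w_momentum)           # momentum encoder sees full input

# Implicit target: predict the momentum latent of the unmasked input.
loss_implicit = mse(prediction, momentum_latent)

# Explicit target: recover momentum features at the masked positions.
loss_explicit = mse(
    [p for p, m in zip(prediction, mask) if m],
    [t for t, m in zip(momentum_latent, mask) if m],
)
total_loss = loss_implicit + loss_explicit
```

The point of the momentum target, as the abstract argues, is that the model regresses high-level features rather than low-level pixels, avoiding the semantic gap.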
Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks
Transformer neural networks are increasingly replacing prior architectures in
a wide range of applications in different data modalities. The increasing size
and computational demands of fine-tuning large pre-trained transformer neural
networks pose significant challenges for the widespread adoption of these
models for applications that demand on-edge computing. To tackle this
challenge, continual learning (CL) emerges as a solution by facilitating the
transfer of knowledge across tasks that arrive sequentially for an autonomously
learning agent. However, current CL methods mainly focus on learning tasks that
are exclusively vision-based or language-based. We propose a transformer-based
CL framework focusing on learning tasks that involve both vision and language,
known as Vision-and-Language (VaL) tasks. Due to the success of transformers in
other modalities, our architecture has the potential to be used in multimodal
learning settings. In our framework, we benefit from introducing extra
parameters to a base transformer to specialize the network for each task. As a
result, we enable dynamic model expansion to learn several tasks in a sequence.
We also use knowledge distillation to benefit from relevant past experiences to
learn the current task more efficiently. Our proposed method, Task Attentive
Multimodal Continual Learning (TAM-CL), allows for the exchange of information
between tasks while mitigating the problem of catastrophic forgetting. Notably,
our approach is scalable, incurring minimal memory and time overhead. TAM-CL
achieves state-of-the-art (SOTA) performance on challenging multimodal tasks.
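The two mechanisms the abstract combines, dynamic expansion and distillation, can be sketched together. The scalar "model", per-task heads, and loss below are illustrative assumptions, not TAM-CL's actual architecture:

```python
# Dynamic expansion: a frozen shared base plus a small task-specific
# parameter block per task. Distillation: penalize drift from a snapshot
# of the previous model's outputs on old tasks.

class ExpandingModel:
    def __init__(self):
        self.base_weight = 2.0            # shared, pretrained backbone (frozen)
        self.task_heads = {}              # task-specific extra parameters

    def add_task(self, task_id):
        self.task_heads[task_id] = 0.0    # freshly initialized head

    def forward(self, x, task_id):
        return self.base_weight * x + self.task_heads[task_id]

def distill_loss(student_out, teacher_out):
    return (student_out - teacher_out) ** 2

model = ExpandingModel()
model.add_task("vqa")
teacher_out = model.forward(1.5, "vqa")   # snapshot before the new task

model.add_task("retrieval")               # expand for the next task
model.task_heads["retrieval"] = 0.3       # pretend training updated the head

# Distillation keeps outputs for the old task close to the snapshot.
loss = distill_loss(model.forward(1.5, "vqa"), teacher_out)
print(loss)  # 0.0 — the old head is untouched, so no forgetting
```

Because each new task only adds a small head, memory grows slowly with the task count, which is the scalability property the abstract claims.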
Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting
We study multimodal few-shot object detection (FSOD) in this paper, using
both few-shot visual examples and class semantic information for detection.
Most previous works focus on either few-shot or zero-shot object detection,
ignoring the complementarity of visual and semantic information. We first show
that meta-learning and prompt-based learning, the most commonly-used methods
for few-shot learning and zero-shot transferring from pre-trained
vision-language models to downstream tasks, are conceptually similar: both
reformulate the objective of downstream tasks to match the pre-training
tasks, mostly without tuning the parameters of the pre-trained models. Based on
this observation, we propose to combine meta-learning with prompt-based
learning for multimodal FSOD without fine-tuning, by learning transferable
class-agnostic multimodal FSOD models over many-shot base classes.
Specifically, to better exploit the pre-trained vision-language models, the
meta-learning based cross-modal prompting is proposed to generate soft prompts
and further used to extract the semantic prototype, conditioned on the few-shot
visual examples. Then, the extracted semantic prototype and few-shot visual
prototype are fused to generate the multimodal prototype for detection. Our
models can efficiently fuse the visual and semantic information at both
token-level and feature-level. We comprehensively evaluate the proposed
multimodal FSOD models on multiple few-shot object detection benchmarks,
achieving promising results.
Comment: 22 pages.
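The prototype-fusion step can be sketched directly: average a visual prototype (from the few-shot supports) with a semantic prototype (from the class prompt), then score proposals by similarity. The simple weighted-mean fusion and cosine scoring below are assumptions for illustration, not the paper's exact fusion module:

```python
# Fuse a few-shot visual prototype with a prompt-derived semantic
# prototype into one multimodal prototype, then rank region proposals
# by cosine similarity to it. All vectors are illustrative.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse(visual_proto, semantic_proto, alpha=0.5):
    """Weighted mean of the two prototypes (alpha = visual weight)."""
    return [alpha * v + (1 - alpha) * s
            for v, s in zip(visual_proto, semantic_proto)]

visual_proto = [0.9, 0.1, 0.2]     # mean of few-shot support features
semantic_proto = [0.8, 0.3, 0.1]   # embedding from a learned soft prompt
multimodal_proto = fuse(visual_proto, semantic_proto)

proposals = {"box_a": [0.85, 0.2, 0.15], "box_b": [0.1, 0.9, 0.8]}
scores = {k: cosine(v, multimodal_proto) for k, v in proposals.items()}
best = max(scores, key=scores.get)
print(best)  # box_a
```

When only a few visual examples are available, the semantic term stabilizes the prototype; with zero shots, detection can fall back on the semantic prototype alone, which is the complementarity the abstract highlights.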
Deep Hypernetworks for Learning from Dynamic Multimodal Data
Doctoral dissertation (Ph.D.), Department of Electrical and Computer Engineering, Seoul National University Graduate School, February 2015. Advisor: Byoung-Tak Zhang.
Recent advances in information and communication technology have led to an explosive increase in data. Unlike traditional data, which are structured and unimodal, recent data generated from dynamic environments are characterized by high dimensionality, multimodality, and lack of structure, as well as huge scale. Learning from non-stationary multimodal data is essential for solving many difficult problems in artificial intelligence. However, despite many successful reports, existing machine learning methods have mainly focused on solving practical problems represented by large-scale but static databases, such as image classification, tagging, and retrieval.
Hypernetworks are probabilistic graphical models that represent an empirical distribution using a hypergraph structure: a large collection of hyperedges encoding the associations among variables. This representation makes the model suitable for characterizing complex relationships between features with a population of building blocks. However, since a hypernetwork spans a huge combinatorial feature space, the model requires a large number of hyperedges to handle multimodal large-scale data and thus faces a scalability problem.
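The idea of a hypernetwork as a population of hyperedges can be sketched concretely: each hyperedge records the values of a few variables from one observed instance, and the population as a whole approximates the empirical distribution. Variable counts, hyperedge order, and the scoring rule below are illustrative choices, not the dissertation's exact parameters:

```python
# Build a population of low-order hyperedges sampled from observed
# binary data, then score a partial query by how many hyperedges are
# consistent with it. All sizes are illustrative.
import random

rng = random.Random(42)

# Observed instances: each is an assignment to 6 binary variables.
data = [[rng.randint(0, 1) for _ in range(6)] for _ in range(20)]

def sample_hyperedge(instance, order, rng):
    """Pick `order` variables and record their values from one instance."""
    idx = rng.sample(range(len(instance)), order)
    return frozenset((i, instance[i]) for i in idx)

# The population: many order-3 hyperedges drawn from random instances.
population = [sample_hyperedge(rng.choice(data), 3, rng)
              for _ in range(200)]

def score(query, population):
    """Count hyperedges consistent with a partial variable assignment."""
    q = dict(query)
    return sum(all(q.get(i, v) == v for i, v in edge) for edge in population)

# Score a partial pattern over variables 0 and 1.
print(score({0: 1, 1: 0}, population))
```

The scalability problem the text describes shows up here directly: covering higher-order associations over many variables requires the population of hyperedges to grow combinatorially.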
In this dissertation, we propose a deep architecture of hypernetworks, i.e., deep hypernetworks, to address this scalability issue when learning from multimodal data with non-stationary properties, such as videos. Deep hypernetworks handle the issue through abstraction at multiple levels, using a hierarchy of multiple hypergraphs. We use a stochastic method based on Monte-Carlo simulation, called graph MC, to efficiently construct hypergraphs representing the empirical distribution of the observed data. The structure of a deep hypernetwork continuously changes as learning proceeds, and this flexibility contrasts with other deep learning models. The proposed model learns incrementally from the data, thus handling non-stationary properties such as concept drift. The abstract representations in the learned models serve as multimodal knowledge about the data, which is used for content-aware crossmodal transformation, including vision-language conversion. We view vision-language conversion as machine translation and thus formulate it in terms of statistical machine translation. Since knowledge of the video stories is used for translation, we call this story-aware vision-language translation.
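The statistical machine translation view can be sketched with the classic noisy-channel decision rule: choose the sentence t maximizing P(v | t) · P(t), where P(v | t) is a channel model scoring how well t explains the scene v, and P(t) is a language model. The candidate sentences and probabilities below are illustrative, not the dissertation's models:

```python
# Noisy-channel decoding for vision-to-language translation: rank
# candidate sentences by channel score times language-model score.
# All probabilities here are made-up toy values.

candidates = ["the robot waves", "the cat sleeps", "sleeps cat the"]

language_model = {           # P(t): fluency of each sentence
    "the robot waves": 0.4,
    "the cat sleeps": 0.5,
    "sleeps cat the": 0.1,
}
channel_model = {            # P(v | t): how well t explains the scene v
    "the robot waves": 0.1,
    "the cat sleeps": 0.8,
    "sleeps cat the": 0.8,
}

def translate(candidates):
    return max(candidates,
               key=lambda t: channel_model[t] * language_model[t])

print(translate(candidates))  # the cat sleeps
```

In the story-aware setting described above, the story knowledge learned by the deep hypernetwork would condition these scores, favoring sentences consistent with the ongoing narrative rather than the scene alone.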
We evaluate deep hypernetworks on large-scale vision-language multimodal data, including benchmark datasets and cartoon video series. The experimental results show that deep hypernetworks effectively represent visual-linguistic information abstracted at multiple levels of the data contents, as well as the associations between vision and language. We explain how the introduction of a hierarchy deals with the scalability and non-stationarity properties. In addition, we present story-aware vision-language translation on cartoon videos by generating scene images from sentences and descriptive subtitles from scene images. Furthermore, we discuss the meaning of our model for lifelong learning and directions for improvement toward achieving human-level artificial intelligence.
1 Introduction
1.1 Background and Motivation
1.2 Problems to be Addressed
1.3 The Proposed Approach and its Contribution
1.4 Organization of the Dissertation
2 Related Work
2.1 Multimodal Learning
2.2 Models for Learning from Multimodal Data
2.2.1 Topic Model-Based Multimodal Learning
2.2.2 Deep Network-Based Multimodal Learning
2.3 Higher-Order Graphical Models
2.3.1 Hypernetwork Models
2.3.2 Bayesian Evolutionary Learning of Hypernetworks
3 Multimodal Hypernetworks for Text-to-Image Retrievals
3.1 Overview
3.2 Hypernetworks for Multimodal Associations
3.2.1 Multimodal Hypernetworks
3.2.2 Incremental Learning of Multimodal Hypernetworks
3.3 Text-to-Image Crossmodal Inference
3.3.1 Representation of Textual-Visual Data
3.3.2 Text-to-Image Query Expansion
3.4 Text-to-Image Retrieval via Multimodal Hypernetworks
3.4.1 Data and Experimental Settings
3.4.2 Text-to-Image Retrieval Performance
3.4.3 Incremental Learning for Text-to-Image Retrieval
3.5 Summary
4 Deep Hypernetworks for Multimodal Concept Learning from Cartoon Videos
4.1 Overview
4.2 Visual-Linguistic Concept Representation of Cartoon Videos
4.3 Deep Hypernetworks for Modeling Visual-Linguistic Concepts
4.3.1 Sparse Population Coding
4.3.2 Deep Hypernetworks for Concept Hierarchies
4.3.3 Implication of Deep Hypernetworks on Cognitive Modeling
4.4 Learning of Deep Hypernetworks
4.4.1 Problem Space of Deep Hypernetworks
4.4.2 Graph Monte-Carlo Simulation
4.4.3 Learning of Concept Layers
4.4.4 Incremental Concept Construction
4.5 Incremental Concept Construction from Cartoon Videos
4.5.1 Data Description and Parameter Setup
4.5.2 Concept Representation and Development
4.5.3 Character Classification via Concept Learning
4.5.4 Vision-Language Conversion via Concept Learning
4.6 Summary
5 Story-aware Vision-Language Translation using Deep Concept Hierarchies
5.1 Overview
5.2 Vision-Language Conversion as a Machine Translation
5.2.1 Statistical Machine Translation
5.2.2 Vision-Language Translation
5.3 Story-aware Vision-Language Translation using Deep Concept Hierarchies
5.3.1 Story-aware Vision-Language Translation
5.3.2 Vision-to-Language Translation
5.3.3 Language-to-Vision Translation
5.4 Story-aware Vision-Language Translation on Cartoon Videos
5.4.1 Data and Experimental Setting
5.4.2 Scene-to-Sentence Generation
5.4.3 Sentence-to-Scene Generation
5.4.4 Visual-Linguistic Story Summarization of Cartoon Videos
5.5 Summary
6 Concluding Remarks
6.1 Summary of the Dissertation
6.2 Directions for Further Research
Bibliography
Abstract in Korean