Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning
Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts
formed by known states and objects during training. Existing methods either
learn the combined state-object representation, challenging the generalization
of unseen compositions, or design two classifiers to identify state and object
separately from image features, ignoring the intrinsic relationship between
them. To jointly eliminate the above issues and construct a more robust CZSL
system, we propose a novel framework termed Decomposed Fusion with Soft Prompt
(DFSP), by involving vision-language models (VLMs) for unseen composition
recognition. Specifically, DFSP constructs a vector combination of learnable
soft prompts with state and object to establish the joint representation of
them. In addition, a cross-modal decomposed fusion module is designed between
the language and image branches, which decomposes state and object among
language features instead of image features. Notably, being fused with the
decomposed features, the image features can be more expressive for learning the
relationship with states and objects, respectively, to improve the response of
unseen compositions in the pair space, hence narrowing the domain gap between
seen and unseen sets. Experimental results on three challenging benchmarks
demonstrate that our approach significantly outperforms other state-of-the-art
methods by large margins.
Comment: 10 pages included reference, conferenc
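The pair-space recognition described in the abstract above can be illustrated with a toy sketch. This is not the DFSP implementation: the `compose` function is a hypothetical stand-in for a VLM text encoder (it only mean-pools the prompt tokens), the context vectors stand in for learnable soft prompts, and all embeddings are invented for illustration. It only shows how an image feature can be scored against every state-object composition, including pairs never seen together in training.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compose(context, state_vec, object_vec):
    """Toy text encoder (assumption): mean-pool the prompt [context..., state, object]."""
    tokens = context + [state_vec, object_vec]
    dim = len(state_vec)
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def predict_pair(image_feat, context, states, objects):
    """Score the image feature against every (state, object) pair and return the best one."""
    scores = {
        (s, o): cosine(image_feat, compose(context, states[s], objects[o]))
        for s, o in product(states, objects)
    }
    return max(scores, key=scores.get), scores


# Hypothetical embeddings; in a real system these would be learned.
context = [[0.0, 0.0, 0.0, 0.0]]  # placeholder for soft-prompt context tokens
states = {"wet": [1.0, 0.0, 0.0, 0.0], "dry": [0.0, 1.0, 0.0, 0.0]}
objects = {"dog": [0.0, 0.0, 1.0, 0.0], "cat": [0.0, 0.0, 0.0, 1.0]}

best, scores = predict_pair([0.9, 0.1, 0.1, 0.8], context, states, objects)
print(best)
```

Because every composition is scored in the same pair space, a pair such as ("wet", "cat") can be predicted even if only "wet dog" and "dry cat" appeared during training.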
Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture
Supporting the current trend in the AI community, we present the AI Journey
2021 Challenge called Fusion Brain, the first competition aimed at creating
a universal architecture that can process different modalities (in
this case, images, texts, and code) and solve multiple tasks for vision and
language. The Fusion Brain Challenge combines the following specific tasks:
Code2code Translation, Handwritten Text recognition, Zero-shot Object
Detection, and Visual Question Answering. We have created datasets for each
task to test the participants' submissions on it. Moreover, we have collected
and made publicly available a new handwritten dataset in both English and
Russian, which consists of 94,128 pairs of images and texts. We also propose a
multimodal, multitask architecture as a baseline solution. At its center
is a frozen foundation model, and it has been trained in Fusion mode
along with Single-task mode. The proposed Fusion approach proves to be
competitive and more energy-efficient compared to the task-specific one.
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
Strategies for Searching Video Content with Text Queries or Video Examples
The large number of user-generated videos uploaded to the Internet
every day has led to many commercial video search engines, which mainly rely on
text metadata for search. However, metadata is often lacking for user-generated
videos, so these videos are unsearchable by current search engines.
Therefore, content-based video retrieval (CBVR) tackles this metadata-scarcity
problem by directly analyzing the visual and audio streams of each video. CBVR
encompasses multiple research topics, including low-level feature design,
feature fusion, semantic detector training and video search/reranking. We
present novel strategies in these topics to enhance CBVR in both accuracy and
speed under different query inputs, including pure textual queries and query by
video examples. Our proposed strategies have been incorporated into our
submission for the TRECVID 2014 Multimedia Event Detection evaluation, where
our system outperformed other submissions in both text queries and video
example queries, thus demonstrating the effectiveness of our proposed
approaches.
Structure propagation for zero-shot learning
The key to zero-shot learning (ZSL) is finding an information transfer
model that bridges the gap between images and semantic information (texts or
attributes). Existing ZSL methods usually construct the compatibility function
between images and class labels by considering the relevance among the
semantic classes (the manifold structure of semantic classes). However, the
relationship of image classes (the manifold structure of image classes) is also
very important for the compatibility model construction. It is difficult to
capture the relationship among image classes because of unseen classes, so the
manifold structure of image classes is often ignored in ZSL. To make the
manifold structure of image classes and that of semantic classes complement
each other, we propose structure propagation (SP) for improving the
performance of ZSL for classification. SP can jointly consider the manifold
structure of image classes and that of semantic classes to approximate
the intrinsic structure of object classes. Moreover, SP can describe the
constraint condition between the compatibility function and these manifold
structures for balancing the influence of the structure propagation iteration.
The SP solution provides not only unseen class labels but also the relationship
of two manifold structures that encode the positive transfer in structure
propagation. Experimental results demonstrate that SP attains promising
results on the AwA, CUB, Dogs and SUN databases.
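The compatibility function between images and class labels mentioned in the abstract above can be sketched in a common bilinear form, F(x, y) = x^T W a_y, where a_y is the attribute vector of class y. This is a generic illustration of that idea only, not the structure propagation method itself: the matrix `W`, the attribute vectors, and the image feature are all made-up values, and in practice `W` would be learned on seen classes.

```python
def compatibility(x, W, a):
    """Bilinear compatibility score x^T W a between image feature x and class attributes a."""
    Wa = [sum(W[i][j] * a[j] for j in range(len(a))) for i in range(len(x))]
    return sum(x[i] * Wa[i] for i in range(len(x)))


def predict(x, W, class_attributes):
    """Return the class whose attribute vector is most compatible with x."""
    return max(class_attributes, key=lambda c: compatibility(x, W, class_attributes[c]))


# Hypothetical attribute space: [striped, four_legged, aquatic].
unseen_classes = {
    "zebra": [1.0, 1.0, 0.0],
    "whale": [0.0, 0.0, 1.0],
}

# Hypothetical 2x3 projection from attribute space to a 2-D image feature space.
W = [
    [1.0, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

x = [0.9, 0.1]  # made-up image feature
print(predict(x, W, unseen_classes))
```

Since the score depends on a class only through its attribute vector, classes with no training images can still be ranked at test time, which is the transfer mechanism that ZSL methods of this kind rely on.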