Auto-Encoding Scene Graphs for Image Captioning
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language
inductive bias into the encoder-decoder image captioning framework for more
human-like captions. Intuitively, we humans use the inductive bias to compose
collocations and contextual inference in discourse. For example, when we see
the relation `person on bike', it is natural to replace `on' with `ride' and
infer `person riding bike on a road' even though the `road' is not evident.
Therefore, exploiting such bias as a language prior is expected to make the
conventional encoder-decoder models less likely to overfit to the dataset bias
and more focused on
reasoning. Specifically, we use the scene graph --- a directed graph
($\mathcal{G}$) where an object node is connected by adjective nodes and
relationship nodes --- to represent the complex structural layout of both image
($\mathcal{I}$) and sentence ($\mathcal{S}$). In the textual domain, we use
SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences
in the $\mathcal{S} \rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow
\mathcal{S}$ pipeline, where $\mathcal{D}$ encodes the desired language prior;
in the vision-language domain, we use the shared $\mathcal{D}$ to guide the
encoder-decoder in the $\mathcal{I} \rightarrow \mathcal{G} \rightarrow
\mathcal{D} \rightarrow \mathcal{S}$ pipeline. Thanks to the scene graph
representation and shared dictionary, the inductive bias is transferred across
domains in principle. We validate the effectiveness of SGAE on the challenging
MS-COCO image captioning benchmark, e.g., our SGAE-based single model achieves
a new state-of-the-art CIDEr-D score on the Karpathy split, and a competitive
CIDEr-D (c40) score on the official server even compared to other ensemble
models.
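
A minimal PyTorch sketch of the shared-dictionary re-encoding idea described
above, not the authors' released code: scene-graph node features are
re-embedded as soft lookups over a learned dictionary $\mathcal{D}$, and the
same module serves both the sentence-reconstruction and the captioning
pipelines. The module name, feature sizes, and dictionary size are
illustrative assumptions.

    # Illustrative sketch only; sizes and names are assumptions, not SGAE's code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DictionaryReEncoder(nn.Module):
        def __init__(self, num_atoms=1000, dim=512):
            super().__init__()
            # D: shared dictionary of "language prior" atoms
            self.D = nn.Parameter(torch.randn(num_atoms, dim) * 0.02)

        def forward(self, node_feats):                          # (batch, nodes, dim)
            attn = F.softmax(node_feats @ self.D.t(), dim=-1)   # soft lookup weights
            return attn @ self.D                                # re-encoded node features

    reenc = DictionaryReEncoder()
    sent_nodes = torch.randn(2, 10, 512)   # nodes of a sentence scene graph
    img_nodes = torch.randn(2, 10, 512)    # nodes of an image scene graph
    recon_input = reenc(sent_nodes)        # feeds a sentence decoder (SGAE pre-training)
    caption_input = reenc(img_nodes)       # feeds the captioning decoder (shared dictionary)
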
Learning to Guide Decoding for Image Captioning
Recently, much progress has been made in image captioning, and an
encoder-decoder framework has achieved outstanding performance for this task.
In this paper, we propose an extension of the encoder-decoder framework by
adding a component called guiding network. The guiding network models the
attribute properties of input images, and its output is leveraged to compose
the input of the decoder at each time step. The guiding network can be plugged
into the current encoder-decoder framework and trained in an end-to-end manner.
Hence, the guiding vector can be adaptively learned according to the signal
from the decoder, enabling it to embed information from both image and
language. Additionally, discriminative supervision can be employed to further
improve the quality of guidance. The advantages of our proposed approach are
verified by experiments carried out on the MS COCO dataset. Comment: AAAI-1
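
As a rough illustration of the described extension (a hypothetical sketch, not
the paper's implementation), the guiding network below maps image features to
a guiding vector that is concatenated with the word embedding to form the
decoder input at every time step; an optional attribute head stands in for the
discriminative supervision. All layer sizes and names are assumptions.

    # Hypothetical sketch; layer sizes and the attribute head are assumptions.
    import torch
    import torch.nn as nn

    class GuidedDecoder(nn.Module):
        def __init__(self, vocab=10000, img_dim=2048, emb_dim=512, guide_dim=512, hid=512):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb_dim)
            self.guide = nn.Sequential(nn.Linear(img_dim, guide_dim), nn.ReLU())  # guiding network
            self.lstm = nn.LSTMCell(emb_dim + guide_dim, hid)
            self.out = nn.Linear(hid, vocab)
            self.attr_head = nn.Linear(guide_dim, 1000)   # optional discriminative supervision

        def forward(self, img_feat, words):               # img_feat: (B, img_dim), words: (B, T)
            g = self.guide(img_feat)                      # guiding vector, learned end-to-end
            h = c = img_feat.new_zeros(img_feat.size(0), self.lstm.hidden_size)
            logits = []
            for t in range(words.size(1)):
                x = torch.cat([self.embed(words[:, t]), g], dim=-1)  # compose decoder input with g
                h, c = self.lstm(x, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1), self.attr_head(g)
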
Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks
We propose a novel framework called Semantics-Preserving Adversarial
Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test
images and their classes are both unseen during training. SP-AEN aims to tackle
the inherent problem --- semantic loss --- in the prevailing family of
embedding-based ZSL, where some semantics would be discarded during training if
they are non-discriminative for training classes, but could become critical for
recognizing test classes. Specifically, SP-AEN prevents the semantic loss by
introducing an independent visual-to-semantic space embedder which disentangles
the semantic space into two subspaces for the two arguably conflicting
objectives: classification and reconstruction. Through adversarial learning of
the two subspaces, SP-AEN can transfer the semantics from the reconstructive
subspace to the discriminative one, accomplishing the improved zero-shot
recognition of unseen classes. Compared with prior works, SP-AEN can not only
improve classification but also generate photo-realistic images, demonstrating
the effectiveness of semantic preservation. On four popular benchmarks: CUB,
AWA, SUN and aPY, SP-AEN considerably outperforms other state-of-the-art
methods by an absolute performance difference of 12.2%, 9.3%, 4.0%, and
3.6% in terms of harmonic mean value.
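
A schematic PyTorch sketch of the two-subspace idea, assuming class semantics
are given as attribute vectors: one embedder is trained for classification, an
independent embedder is trained so the input can be reconstructed from its
output, and an adversarial term aligns the two so that preserved semantics can
transfer into the discriminative subspace. The networks, loss weights, and the
feature-space reconstructor are placeholders, not SP-AEN's actual architecture.

    # Schematic only; modules, dimensions, and loss weights are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    feat_dim, sem_dim, n_cls = 2048, 85, 10
    E = nn.Linear(feat_dim, sem_dim)      # discriminative visual-to-semantic embedder
    R = nn.Linear(feat_dim, sem_dim)      # reconstructive embedder (preserves semantics)
    G = nn.Linear(sem_dim, feat_dim)      # stand-in reconstructor back to feature space
    D = nn.Linear(sem_dim, 1)             # discriminator between the two subspaces

    x = torch.randn(8, feat_dim)          # visual features
    attrs = torch.randn(n_cls, sem_dim)   # class attribute vectors (seen classes)
    y = torch.randint(0, n_cls, (8,))

    loss_cls = F.cross_entropy(E(x) @ attrs.t(), y)   # embedding-based classification
    loss_rec = F.mse_loss(G(R(x)), x)                 # keeps non-discriminative semantics
    loss_adv = F.binary_cross_entropy_with_logits(    # push the discriminative embedding
        D(E(x)), torch.ones(8, 1))                    # toward the reconstructive subspace
    (loss_cls + loss_rec + 0.1 * loss_adv).backward()
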
Stochastic Dynamics for Video Infilling
In this paper, we introduce a stochastic dynamics video infilling (SDVI)
framework to generate frames between long intervals in a video. Our task
differs from video interpolation, which aims to produce transitional frames
for a short interval between every two frames and thereby increase the
temporal resolution; video infilling instead aims to fill long intervals with
plausible frame sequences. Our framework models the infilling as a constrained
stochastic generation process and sequentially samples dynamics from the
inferred distribution. SDVI consists of two parts: (1) a bi-directional
constraint propagation module to guarantee the spatial-temporal coherence among
frames, (2) a stochastic sampling process to generate dynamics from the
inferred distributions. Experimental results show that SDVI can generate clear
frame sequences with varying contents. Moreover, motions in the generated
sequence are realistic and transition smoothly from the given start frame
to the terminal frame. Our project site is
https://xharlie.github.io/projects/project_sites/SDVI/video_results.html
Comment: Winter Conference on Applications of Computer Vision (WACV 2020)
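
A toy PyTorch sketch of the two components named above, with fully connected
stand-ins for the real convolutional encoders and decoders: recurrent cells
carry the constraints from the given start and end frames, a per-step latent is
sampled from an inferred Gaussian, and each missing frame is decoded from the
fused state plus that sample. Shapes, module choices, and the single-step
backward pass are simplifying assumptions.

    # Toy sketch; not the SDVI architecture, just the two-part structure it describes.
    import torch
    import torch.nn as nn

    class TinyInfiller(nn.Module):
        def __init__(self, frame_dim=256, hid=256, z_dim=32):
            super().__init__()
            self.hid = hid
            self.fwd = nn.GRUCell(frame_dim, hid)        # carries the start-frame constraint forward
            self.bwd = nn.GRUCell(frame_dim, hid)        # carries the end-frame constraint backward
            self.prior = nn.Linear(2 * hid, 2 * z_dim)   # infers the per-step latent distribution
            self.dec = nn.Linear(2 * hid + z_dim, frame_dim)

        def forward(self, start, end, steps=8):
            h0 = start.new_zeros(start.size(0), self.hid)
            hf, hb = self.fwd(start, h0), self.bwd(end, h0)   # backward pass collapsed to one step
            frames = []
            for _ in range(steps):
                h = torch.cat([hf, hb], dim=-1)
                mu, logvar = self.prior(h).chunk(2, dim=-1)
                z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # stochastic dynamics sample
                frame = self.dec(torch.cat([h, z], dim=-1))
                frames.append(frame)
                hf = self.fwd(frame, hf)                 # roll the forward state with the new frame
            return torch.stack(frames, dim=1)            # (batch, steps, frame_dim)

    clip = TinyInfiller()(torch.randn(2, 256), torch.randn(2, 256))   # sampled infilled sequence
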
Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation
Domain Adaptation (DA) is always challenged by the spurious correlation
between domain-invariant features (e.g., class identity) and domain-specific
features (e.g., environment) that does not generalize to the target domain.
Unfortunately, even when enriched with additional unlabeled target-domain
data, existing Unsupervised DA (UDA) methods still suffer from it. This is because
the source domain supervision only considers the target domain samples as
auxiliary data (e.g., by pseudo-labeling), yet the inherent distribution in the
target domain -- where the valuable de-correlation clues hide -- is
disregarded. We propose to make the U in UDA matter by giving equal status to
the two domains. Specifically, we learn an invariant classifier whose
prediction is simultaneously consistent with the labels in the source domain
and clusters in the target domain; hence the spurious correlation, which is
inconsistent in the target domain, is removed. We dub our approach "Invariant CONsistency
learning" (ICON). Extensive experiments show that ICON achieves the
state-of-the-art performance on the classic UDA benchmarks: Office-Home and
VisDA-2017, and outperforms all the conventional methods on the challenging
WILDS 2.0 benchmark. Code is available at https://github.com/yue-zhongqi/ICON.
Comment: Accepted by NeurIPS 202
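
The consistency objective can be pictured with a small PyTorch sketch (an
interpretation of the abstract, not the released code): the same classifier is
fit both to ground-truth labels on source features and to cluster assignments
discovered in the target features, so the target distribution gets equal
status instead of serving only as pseudo-labeled auxiliary data. The crude
clustering routine, the unweighted sum of losses, and the assumption that
cluster indices already align with class indices are all simplifications.

    # Simplified sketch; clustering, loss weighting, and cluster-to-class matching are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def kmeans_assign(x, k=10, iters=10):
        # crude k-means to expose the target domain's own cluster structure
        centers = x[torch.randperm(x.size(0))[:k]]
        for _ in range(iters):
            assign = torch.cdist(x, centers).argmin(dim=1)
            for j in range(k):
                if (assign == j).any():
                    centers[j] = x[assign == j].mean(dim=0)
        return assign

    clf = nn.Linear(256, 10)                              # shared (invariant) classifier
    src_x, src_y = torch.randn(64, 256), torch.randint(0, 10, (64,))
    tgt_x = torch.randn(64, 256)                          # unlabeled target features

    tgt_cluster = kmeans_assign(tgt_x)                    # target-side "labels" from its own clusters
    loss = (F.cross_entropy(clf(src_x), src_y)            # consistent with source labels
            + F.cross_entropy(clf(tgt_x), tgt_cluster))   # consistent with target clusters
    loss.backward()
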