Unpaired Image Captioning via Scene Graph Alignments
Most current image captioning models heavily rely on paired image-caption
datasets. However, collecting large-scale paired image-caption data is
labor-intensive and time-consuming. In this paper, we present a scene
graph-based approach for unpaired image captioning. Our framework comprises an
image scene graph generator, a sentence scene graph generator, a scene graph
encoder, and a sentence decoder. Specifically, we first train the scene graph
encoder and the sentence decoder on the text modality. To align the scene
graphs between images and sentences, we propose an unsupervised feature
alignment method that maps the scene graph features from the image to the
sentence modality. Experimental results show that our proposed model can
generate quite promising results without using any image-caption training
pairs, outperforming existing methods by a wide margin.Comment: Accepted in ICCV 201
Scene Graph Generation with External Knowledge and Image Reconstruction
Scene graph generation has received growing attention with the advancements
in image understanding tasks such as object detection, attributes and
relationship prediction, etc. However, existing datasets are biased in terms
of object and relationship labels, or often come with noisy and missing
annotations, which makes the development of a reliable scene graph prediction
model very challenging. In this paper, we propose a novel scene graph
generation algorithm with external knowledge and image reconstruction loss to
overcome these dataset issues. In particular, we extract commonsense knowledge
from the external knowledge base to refine object and phrase features for
improving generalizability in scene graph generation. To address the bias of
noisy object annotations, we introduce an auxiliary image reconstruction path
to regularize the scene graph generation network. Extensive experiments show
that our framework can generate better scene graphs, achieving state-of-the-art
performance on two benchmark datasets: Visual Relationship Detection and Visual
Genome.
Comment: 10 pages, 5 figures, Accepted in CVPR 2019
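The sketch below is a rough illustration of the two ideas, not the paper's
implementation: gate an external-knowledge embedding (e.g., a ConceptNet-style
vector looked up per detected label) into each object feature, and add an
auxiliary image-reconstruction loss as a regularizer. All module shapes and the
loss weight are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, kb_dim = 1024, 300  # assumed visual-feature and knowledge-embedding sizes

class KnowledgeRefine(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(kb_dim, feat_dim)        # map the KB embedding into feature space
        self.gate = nn.Linear(2 * feat_dim, feat_dim)  # learn how much knowledge to inject

    def forward(self, obj_feats, kb_embeds):
        # obj_feats: (B, N, feat_dim) detected objects; kb_embeds: (B, N, kb_dim) retrieved knowledge
        k = self.proj(kb_embeds)
        g = torch.sigmoid(self.gate(torch.cat([obj_feats, k], dim=-1)))
        return obj_feats + g * k                       # knowledge-refined object/phrase features

recon_head = nn.Sequential(nn.Linear(feat_dim, 3 * 64 * 64), nn.Sigmoid())  # toy reconstruction decoder

def total_loss(sg_loss, obj_feats, images_64):
    # The auxiliary reconstruction path regularizes the scene graph generation network.
    recon = recon_head(obj_feats.mean(dim=1)).view(-1, 3, 64, 64)
    return sg_loss + 0.1 * F.mse_loss(recon, images_64)   # the 0.1 weight is an assumption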
Exploiting Semantic Embedding And Visual Feature For Facial Action Unit Detection
Recent studies on detecting facial action units (AUs) have utilized auxiliary information (i.e., facial landmarks, relationships among AUs and expressions, web facial images, etc.) to improve AU detection performance. However, no semantic information about AUs has yet been explored for this task. AU semantic descriptions provide much more information than the binary AU labels alone, so we propose to exploit Semantic Embedding and Visual features (SEV-Net) for AU detection. More specifically, AU semantic embeddings are obtained through both Intra-AU and Inter-AU attention modules, where the Intra-AU attention module captures the relations among the words within each sentence describing an individual AU, and the Inter-AU attention module focuses on the relations among those sentences. The learned AU semantic embeddings are then used as guidance for generating attention maps through a cross-modality attention network. The generated cross-modality attention maps are further used as weights for the aggregated features. Our proposed method is unique in that it is the first of its kind to exploit such semantic features. The approach has been evaluated on three public AU-coded facial expression databases and achieves superior performance compared with state-of-the-art peer methods.
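The following sketch is one possible reading of the attention stages described
above, not the released SEV-Net code: Intra-AU attention over the words of each AU
description, Inter-AU attention over the resulting sentence embeddings, and a
cross-modality step that turns the AU semantic embeddings into spatial attention
maps used to weight the visual features. The dimensions and pooling choices are
assumptions.

import torch
import torch.nn as nn

d = 256  # assumed embedding size

intra_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # words within one AU description
inter_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # across AU descriptions

def au_semantic_embeddings(word_embeds):
    # word_embeds: (num_AUs, num_words, d) token embeddings of each AU description
    intra, _ = intra_attn(word_embeds, word_embeds, word_embeds)
    sent = intra.mean(dim=1, keepdim=True).transpose(0, 1)   # (1, num_AUs, d) sentence embeddings
    inter, _ = inter_attn(sent, sent, sent)
    return inter.squeeze(0)                                  # (num_AUs, d) AU semantic embeddings

def cross_modality_maps(au_embeds, visual_feats):
    # visual_feats: (B, HW, d) flattened spatial features; au_embeds: (num_AUs, d)
    attn = torch.softmax(visual_feats @ au_embeds.t(), dim=1)  # (B, HW, num_AUs) attention maps
    weighted = attn.transpose(1, 2) @ visual_feats             # (B, num_AUs, d) aggregated features
    return attn, weighted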
Unsupervised Cross-lingual Image Captioning
Most recent image captioning works are conducted in English as the majority
of image-caption datasets are in English. However, there is a large number of
non-native English speakers worldwide. Generating image captions in different
languages is worth exploring. In this paper, we present a novel unsupervised
method to generate image captions without using any caption corpus. Our method
relies on 1) cross-lingual auto-encoding, which learns the scene graph
mapping function along with the scene graph encoders and sentence decoders on
machine translation parallel corpora, and 2) unsupervised feature mapping, which
seeks to map the encoded scene graph features from the image modality to the
sentence modality. By leveraging cross-lingual auto-encoding, cross-modal
feature mapping, and adversarial learning, our method can learn an image
captioner to generate captions in different languages. We verify the
effectiveness of our proposed method on Chinese image caption generation.
The comparisons against several baseline methods demonstrate the effectiveness
of our approach.
Comment: 8 pages
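As a rough illustration of the cross-lingual auto-encoding component (assumed
modules and shapes, not the authors' pipeline), the sketch below encodes a
source-language scene graph, applies the learned mapping function, and decodes a
target-language sentence with teacher forcing on a machine-translation parallel
corpus.

import torch
import torch.nn as nn

d, vocab_tgt = 512, 10000  # assumed feature size and target-language vocabulary size

graph_encoder = nn.GRU(d, d, batch_first=True)  # stand-in for the scene graph encoder
mapping = nn.Linear(d, d)                       # scene graph mapping function
decoder_cell = nn.GRUCell(d, d)                 # stand-in for the sentence decoder
out_proj = nn.Linear(d, vocab_tgt)
criterion = nn.CrossEntropyLoss()

def autoencode_loss(src_graph_nodes, tgt_token_embeds, tgt_token_ids):
    # src_graph_nodes: (B, N, d); tgt_token_embeds: (B, T, d); tgt_token_ids: (B, T) long
    _, h = graph_encoder(src_graph_nodes)       # (1, B, d) pooled source scene-graph feature
    h = mapping(h.squeeze(0))                   # map into the target-language space
    loss, steps = 0.0, tgt_token_ids.size(1)
    for t in range(steps):                      # simple teacher-forced decoding
        h = decoder_cell(tgt_token_embeds[:, t], h)
        loss = loss + criterion(out_proj(h), tgt_token_ids[:, t])
    return loss / steps

The unsupervised feature mapping that bridges image and sentence modalities can
then reuse an adversarial alignment of the kind sketched for the first paper above.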
Learning the Visualness of Text Using Large Vision-Language Models
Visual text evokes an image in a person's mind, while non-visual text fails
to do so. A method to automatically detect visualness in text will enable
text-to-image retrieval and generation models to augment text with relevant
images. This is particularly challenging with long-form text as text-to-image
generation and retrieval models are often triggered for text that is designed
to be explicitly visual in nature, whereas long-form text could contain many
non-visual sentences. To this end, we curate a dataset of 3,620 English
sentences and their visualness scores provided by multiple human annotators. We
also propose a fine-tuning strategy that adapts large vision-language models
like CLIP by modifying the model's contrastive learning objective to map text
identified as non-visual to a common NULL image while matching visual text to
their corresponding images in the document. We evaluate the proposed approach
on its ability to (i) classify visual and non-visual text accurately, and (ii)
attend over words that are identified as visual in psycholinguistic studies.
Empirical evaluation indicates that our approach performs better than several
heuristics and baseline models for the proposed task. Furthermore, to highlight
the importance of modeling the visualness of text, we conduct qualitative
analyses of text-to-image generation systems like DALL-E. Project webpage:
https://gaurav22verma.github.io/text-visualness/
Comment: Accepted at EMNLP 2023 (Main, long); 9 pages, 5 figures
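A simplified sketch of the fine-tuning idea follows; it is a cosine-alignment
approximation of the modified contrastive objective, not the released code.
Non-visual sentences are pulled toward a single learnable NULL image embedding,
while visual sentences are pulled toward their own document image; text and image
embeddings are assumed to be precomputed, L2-normalized CLIP features.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512                                                  # assumed CLIP embedding size
null_image = nn.Parameter(torch.randn(dim) / dim ** 0.5)   # shared learnable NULL image embedding

def visualness_loss(text_feats, image_feats, is_visual):
    # text_feats, image_feats: (B, dim) L2-normalized; is_visual: (B,) boolean labels
    null = F.normalize(null_image, dim=0).expand_as(image_feats)
    target = torch.where(is_visual.unsqueeze(1), image_feats, null)
    # Pull each sentence toward its target: its document image if visual, the NULL image otherwise.
    return (1 - (text_feats * target).sum(dim=-1)).mean()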
Delving into Out-of-Distribution Detection with Vision-Language Representations
Recognizing out-of-distribution (OOD) samples is critical for machine
learning systems deployed in the open world. The vast majority of OOD detection
methods are driven by a single modality (e.g., either vision or language),
leaving the rich information in multi-modal representations untapped. Inspired
by the recent success of vision-language pre-training, this paper enriches the
landscape of OOD detection from a single-modal to a multi-modal regime.
Particularly, we propose Maximum Concept Matching (MCM), a simple yet effective
zero-shot OOD detection method based on aligning visual features with textual
concepts. We contribute in-depth analysis and theoretical insights to
understand the effectiveness of MCM. Extensive experiments demonstrate that MCM
achieves superior performance on a wide variety of real-world tasks. MCM with
vision-language features outperforms a common baseline with pure visual
features on a hard OOD task with semantically similar classes by 13.1% (AUROC).
Code is available at https://github.com/deeplearning-wisc/MCM.
Comment: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
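A minimal sketch of the MCM score as described above: the score of an image is
the maximum temperature-scaled softmax over cosine similarities between its
feature and the text features of the in-distribution class prompts; the encoders
and the decision threshold are assumed to be supplied by the caller.

import torch
import torch.nn.functional as F

def mcm_score(image_feat, concept_feats, temperature=1.0):
    # image_feat: (d,); concept_feats: (K, d), e.g., CLIP text features of "a photo of a <class k>"
    img = F.normalize(image_feat, dim=-1)
    txt = F.normalize(concept_feats, dim=-1)
    sims = txt @ img                                  # (K,) cosine similarities to each concept
    return torch.softmax(sims / temperature, dim=-1).max().item()

def is_ood(image_feat, concept_feats, threshold, temperature=1.0):
    # A low maximum concept-matching score indicates a likely out-of-distribution input.
    return mcm_score(image_feat, concept_feats, temperature) < threshold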