Clue: Cross-modal Coherence Modeling for Caption Generation
We use coherence relations inspired by computational models of discourse to
study the information needs and goals of image captioning. Using an annotation
protocol specifically devised for capturing image--caption coherence relations,
we annotate 10,000 instances from publicly-available image--caption pairs. We
introduce a new task for learning inferences in imagery and text, coherence
relation prediction, and show that these coherence annotations can be exploited
to learn relation classifiers as an intermediary step, and also train
coherence-aware, controllable image captioning models. The results show a
dramatic improvement in the consistency and quality of the generated captions
with respect to information needs specified via coherence relations.
Comment: Accepted as a long paper to ACL 2020
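The coherence relation prediction task described above amounts to classifying an image--caption pair into one of a small set of discourse relations. As a loose illustration only (the label set and the linear fusion below are assumptions, not taken from the abstract), a minimal relation classifier over pre-computed image and text embeddings might look like:

```python
import numpy as np

# Assumed label set for illustration; the paper defines its own inventory
# of image--caption coherence relations.
RELATIONS = ["Visible", "Subjective", "Action", "Story", "Meta"]

def predict_relation(image_emb, text_emb, W, b):
    """Linear coherence-relation classifier over a fused image--text
    embedding: scores = W @ [image; text] + b, argmax over relations.
    W has shape (len(RELATIONS), d_img + d_txt); b has shape (len(RELATIONS),)."""
    x = np.concatenate([image_emb, text_emb])
    scores = W @ x + b
    return RELATIONS[int(np.argmax(scores))]
```

A trained model would learn `W` and `b` from the 10,000 annotated instances; here they are placeholders to show the shape of the task.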
GOOD: Towards Domain Generalized Oriented Object Detection
Oriented object detection has been rapidly developed in the past few years,
but most of these methods assume the training and testing images are under the
same statistical distribution, which is far from reality. In this paper, we
propose the task of domain generalized oriented object detection, which intends
to explore the generalization of oriented object detectors on arbitrary unseen
target domains. Learning domain generalized oriented object detectors is
particularly challenging, as the cross-domain style variation not only
negatively impacts the content representation, but also leads to unreliable
orientation predictions. To address these challenges, we propose a generalized
oriented object detector (GOOD). Built on style hallucination driven by the
emerging contrastive language-image pre-training (CLIP), it consists of two key
components, namely rotation-aware content consistency learning (RAC) and style
consistency learning (SEC). The proposed RAC allows the oriented object
detector to learn stable orientation representation from style-diversified
samples. The proposed SEC further stabilizes the generalization ability of
content representation from different image styles. Extensive experiments on
multiple cross-domain settings show the state-of-the-art performance of GOOD.
Source code will be publicly available.
Comment: 8 pages, 6 figures
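The abstract describes RAC and SEC only at a high level. As a loose illustration of the general idea of consistency learning across styles (not the paper's actual losses), one can penalize divergence between the features of an image and those of its style-hallucinated counterpart:

```python
import numpy as np

def consistency_loss(f_content, f_styled):
    """Mean-squared consistency between the feature map of an original
    image and that of its style-hallucinated counterpart, encouraging
    content (and, for RAC, orientation) representations to stay stable
    across styles. Illustrative stand-in, not the paper's exact formulation."""
    return float(np.mean((f_content - f_styled) ** 2))
```

In the paper's setting this kind of term would be added to the detector's training objective alongside the usual detection losses.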
Compare More Nuanced: Pairwise Alignment Bilinear Network For Few-shot Fine-grained Learning
The recognition ability of human beings is developed in a progressive way.
Usually, children learn to discriminate various objects from coarse to
fine-grained with limited supervision. Inspired by this learning process, we
propose a simple yet effective model for the Few-Shot Fine-Grained (FSFG)
recognition, which tries to tackle the challenging fine-grained recognition
task using meta-learning. The proposed method, named Pairwise Alignment
Bilinear Network (PABN), is an end-to-end deep neural network. Unlike
traditional deep bilinear networks for fine-grained classification, which adopt
the self-bilinear pooling to capture the subtle features of images, the
proposed model uses a novel pairwise bilinear pooling to compare the nuanced
differences between base images and query images for learning a deep distance
metric. In order to match base image features with query image features, we
design feature alignment losses before the proposed pairwise bilinear pooling.
Experiment results on four fine-grained classification datasets and one generic
few-shot dataset demonstrate that the proposed model outperforms both the
state-of-the-art few-shot fine-grained and general few-shot methods.
Comment: ICME 2019 Oral
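The core operation above, pairwise bilinear pooling, compares two feature maps by their second-order (channel-by-channel) statistics rather than pooling each image with itself. A minimal NumPy sketch of that idea, assuming C x H x W feature maps and omitting the paper's alignment losses and learned metric:

```python
import numpy as np

def pairwise_bilinear_pool(base, query):
    """Pairwise bilinear pooling of a base-image feature map and a
    query-image feature map (each of shape C x H x W): the channel-wise
    outer product of the two maps, averaged over spatial positions, then
    signed-sqrt and L2 normalized (a standard bilinear-feature recipe).
    Illustrative sketch, not the PABN architecture itself."""
    c, h, w = base.shape
    b = base.reshape(c, h * w)           # C x N spatial descriptors
    q = query.reshape(c, h * w)          # C x N spatial descriptors
    pooled = (b @ q.T) / (h * w)         # C x C cross-image statistic
    feat = np.sign(pooled) * np.sqrt(np.abs(pooled))
    return feat.ravel() / (np.linalg.norm(feat) + 1e-12)
```

In a metric-learning setup such as the one the abstract describes, the resulting pairwise feature for each (base, query) pair would feed a small network that scores whether the two images share a fine-grained class.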