Image embedding and user multi-preference modeling for data collection sampling
This work proposes an end-to-end user-centric sampling method that selects, from an image collection, the images that maximize the information perceived by a given user. As main contributions, we first introduce novel metrics that assess the amount of perceived information retained by the user when experiencing a set of images. Given the actual information present in a set of images, which is the volume spanned by the set in the corresponding latent space, we show how to take the user’s preferences into account in this volume calculation to build a user-centric metric for the perceived information. Finally, we propose a sampling strategy that seeks the minimum set of images that maximizes the information perceived by a given user. Experiments using the COCO dataset show the ability of the proposed approach to accurately integrate user preferences while keeping a reasonable diversity in the sampled image set.
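As a rough illustration of the idea of a preference-weighted volume, the sketch below scales each embedding by a per-image preference score and measures the volume spanned by the scaled vectors through the determinant of their Gram matrix, then picks images greedily. This is not the paper's metric or sampling strategy: the scaling rule, the greedy selection, and the names `gram_det`, `perceived_volume`, and `greedy_sample` are all assumptions made for this example.

```python
import math


def gram_det(vectors):
    """Determinant of the Gram matrix of `vectors`.

    This equals the squared volume of the parallelotope they span.
    """
    n = len(vectors)
    g = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    det = 1.0
    for c in range(n):
        # Gaussian elimination with partial pivoting.
        p = max(range(c, n), key=lambda r: abs(g[r][c]))
        if abs(g[p][c]) < 1e-12:
            return 0.0  # linearly dependent vectors span zero volume
        if p != c:
            g[c], g[p] = g[p], g[c]
            det = -det
        det *= g[c][c]
        for r in range(c + 1, n):
            f = g[r][c] / g[c][c]
            for k in range(c, n):
                g[r][k] -= f * g[c][k]
    return det


def perceived_volume(embeddings, prefs, subset):
    """Volume spanned by the chosen embeddings after preference scaling.

    Assumption: each embedding is multiplied by its preference score.
    """
    scaled = [[prefs[i] * x for x in embeddings[i]] for i in subset]
    return math.sqrt(max(gram_det(scaled), 0.0))


def greedy_sample(embeddings, prefs, k):
    """Greedily add the image that most increases the perceived volume."""
    chosen = []
    for _ in range(k):
        rest = [i for i in range(len(embeddings)) if i not in chosen]
        best = max(
            rest, key=lambda i: perceived_volume(embeddings, prefs, chosen + [i])
        )
        chosen.append(best)
    return chosen


# Toy data: images 0 and 1 are near-duplicates; image 2 is different but less liked.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
prefs = [1.0, 0.9, 0.5]
selected = greedy_sample(embeddings, prefs, k=2)
```

In this toy case the second pick is the less-preferred but different image, because a duplicate of the first adds no volume; this is the trade-off between preference and diversity that the abstract describes.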
Visual Commonsense based Heterogeneous Graph Contrastive Learning
How to select relevant key objects and how to reason about the complex relationships
across the vision and language domains are two key issues in many multi-modality
applications such as visual question answering (VQA). In this work, we
incorporate the visual commonsense information and propose a heterogeneous
graph contrastive learning method to better perform the visual reasoning task.
Our method is designed in a plug-and-play manner, so that it can be quickly and
easily combined with a wide range of representative methods. Specifically, our
model contains two key components: the Commonsense-based Contrastive Learning
and the Graph Relation Network. Using contrastive learning, we guide the model
to concentrate more on discriminative objects and relevant visual commonsense
attributes. Besides, thanks to the introduction of the Graph Relation Network,
the model reasons about the correlations between homogeneous edges and the
similarities between heterogeneous edges, which makes information transmission
more effective. Extensive experiments on four benchmarks show that our method
greatly improves seven representative VQA models, demonstrating its
effectiveness and generalizability.
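For readers unfamiliar with contrastive objectives, the following is a minimal sketch of a generic InfoNCE-style loss, which pulls an anchor embedding toward a positive and pushes it away from negatives. The abstract does not specify the loss used by the Commonsense-based Contrastive Learning component, so this stands in only as a common example; the function name `info_nce` and the `temperature` default are assumptions.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (illustrative, not the paper's loss)."""

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Logit 0 is the positive pair; the rest are negatives.
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))

    # Cross-entropy with the positive as the target class.
    return log_sum - logits[0]


# An aligned positive gives a low loss; a misaligned one gives a high loss.
low = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
high = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```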
Rethinking the Reference-based Distinctive Image Captioning
Distinctive Image Captioning (DIC) -- generating distinctive captions that
describe the unique details of a target image -- has received considerable
attention over the last few years. A recent DIC work proposes to generate
distinctive captions by comparing the target image with a set of
semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims
to generate captions that can tell apart the target and reference images.
Unfortunately, reference images used by existing Ref-DIC works are easy to
distinguish: these reference images only resemble the target image at
scene-level and have few common objects, such that a Ref-DIC model can
trivially generate distinctive captions even without considering the reference
images. To ensure Ref-DIC models really perceive the unique objects (or
attributes) in target images, we first propose two new Ref-DIC benchmarks.
Specifically, we design a two-stage matching mechanism, which strictly controls
the similarity between the target and reference images at object-/attribute-
level (vs. scene-level). Secondly, to generate distinctive captions, we develop
a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only
extracts visual features from the target image, but also encodes the
differences between objects in the target and reference images. Finally, for
more trustworthy benchmarking, we propose a new evaluation metric named
DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of
the generated captions. Experimental results demonstrate that our TransDIC can
generate distinctive captions. Besides, it outperforms several state-of-the-art
models on the two new benchmarks over different metrics.
Comment: ACM MM 202