Self-supervised Cross-view Representation Reconstruction for Change Captioning
Change captioning aims to describe the difference between a pair of similar
images. Its key challenge is how to learn a stable difference representation
under pseudo changes caused by viewpoint change. In this paper, we address this
by proposing a self-supervised cross-view representation reconstruction
(SCORER) network. Concretely, we first design a multi-head token-wise matching
module to model relationships between cross-view features from similar/dissimilar
images. Then, by maximizing cross-view contrastive alignment of two similar
images, SCORER learns two view-invariant image representations in a
self-supervised way. Based on these, we reconstruct the representations of
unchanged objects by cross-attention, thus learning a stable difference
representation for caption generation. Further, we devise a cross-modal
backward reasoning module to improve caption quality. This module reversely
models a ``hallucination'' representation with the caption and ``before''
representation. By pushing it closer to the ``after'' representation, we
enforce the caption to be informative about the difference in a self-supervised
manner. Extensive experiments show that our method achieves state-of-the-art
results on four datasets. The code is available at
https://github.com/tuyunbin/SCORER.
Comment: Accepted by ICCV 2023
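For a concrete picture of the cross-view contrastive alignment described above, the sketch below shows a symmetric InfoNCE-style loss between the "before" and "after" view features in PyTorch. The class name, projection head, temperature, and mean-pooling of tokens are illustrative assumptions, not SCORER's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewAlign(nn.Module):
    """Toy symmetric InfoNCE alignment between two image views (an assumption)."""
    def __init__(self, dim=512, temperature=0.07):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # hypothetical projection head
        self.temperature = temperature

    def forward(self, feats_before, feats_after):
        # feats_*: (batch, num_tokens, dim) token features of the two views.
        # Mean-pool tokens into one vector per image (a simplification; the
        # paper instead uses a multi-head token-wise matching module).
        z1 = F.normalize(self.proj(feats_before.mean(dim=1)), dim=-1)
        z2 = F.normalize(self.proj(feats_after.mean(dim=1)), dim=-1)
        logits = z1 @ z2.t() / self.temperature  # (batch, batch) similarities
        targets = torch.arange(z1.size(0), device=z1.device)
        # Matching pairs (i, i) are the positives in both directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Usage sketch: loss = CrossViewAlign()(feats_before, feats_after)  # both (B, N, 512)
```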
Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning
As advanced image manipulation techniques emerge, detecting the manipulation
becomes increasingly important. Despite the success of recent learning-based
approaches for image manipulation detection, they typically require expensive
pixel-level annotations to train, while exhibiting degraded performance when
testing on images that are differently manipulated compared with training
images. To address these limitations, we propose weakly-supervised image
manipulation detection, such that only binary image-level labels (authentic or
tampered with) are required for training purposes. Such a weakly-supervised
setting can leverage more training images and has the potential to adapt
quickly to new manipulation techniques. To improve the generalization ability,
we propose weakly-supervised self-consistency learning (WSCL) to leverage the
weakly annotated images. Specifically, two consistency properties are learned:
multi-source consistency (MSC) and inter-patch consistency (IPC). MSC exploits
different content-agnostic information and enables cross-source learning via an
online pseudo label generation and refinement process. IPC performs global
pair-wise patch-patch relationship reasoning to discover a complete region of
manipulation. Extensive experiments validate that our WSCL, despite being
weakly supervised, achieves performance competitive with its fully-supervised
counterpart under both in-distribution and out-of-distribution evaluations,
as well as reasonable manipulation localization ability.
Comment: Accepted to ICCV 2023, code: https://github.com/yhZhai/WSCL
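To make the inter-patch consistency (IPC) idea more tangible, here is a minimal sketch of pairwise patch-patch consistency reasoning trained from image-level labels only. The embedding head, the similarity-to-inconsistency mapping, and the max-pooling aggregation are assumptions for illustration, not WSCL's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterPatchConsistency(nn.Module):
    """Toy pairwise patch consistency trained from image-level labels."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Linear(dim, dim)  # hypothetical embedding head

    def forward(self, patch_feats, image_labels):
        # patch_feats: (batch, num_patches, dim)
        # image_labels: (batch,) with 0 = authentic, 1 = tampered
        z = F.normalize(self.embed(patch_feats), dim=-1)
        sim = z @ z.transpose(1, 2)        # (B, P, P) patch-patch similarity
        inconsistency = (1.0 - sim) / 2.0  # map [-1, 1] similarity to [0, 1]
        # Weak supervision: call an image "tampered" if any patch pair is
        # inconsistent; the max acts as a simple proxy for "any".
        image_score = inconsistency.flatten(1).max(dim=1).values
        return F.binary_cross_entropy(image_score, image_labels.float())
```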
Rethinking the Reference-based Distinctive Image Captioning
Distinctive Image Captioning (DIC) -- generating distinctive captions that
describe the unique details of a target image -- has received considerable
attention over the last few years. A recent DIC work proposes to generate
distinctive captions by comparing the target image with a set of
semantically similar reference images, i.e., reference-based DIC (Ref-DIC),
aiming to ensure that the generated captions can tell apart the target and
reference images.
Unfortunately, reference images used by existing Ref-DIC works are easy to
distinguish: these reference images only resemble the target image at
scene-level and have few common objects, such that a Ref-DIC model can
trivially generate distinctive captions even without considering the reference
images. To ensure Ref-DIC models really perceive the unique objects (or
attributes) in target images, we first propose two new Ref-DIC benchmarks.
Specifically, we design a two-stage matching mechanism, which strictly controls
the similarity between the target and reference images at object-/attribute-
level (vs. scene-level). Secondly, to generate distinctive captions, we develop
a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only
extracts visual features from the target image, but also encodes the
differences between objects in the target and reference images. Finally, for
more trustworthy benchmarking, we propose a new evaluation metric named
DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of
the generated captions. Experimental results demonstrate that our TransDIC can
generate distinctive captions. Moreover, it outperforms several
state-of-the-art models on the two new benchmarks across different metrics.
Comment: ACM MM 2022
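As a rough intuition for a metric that scores both accuracy and distinctiveness, the toy sketch below rewards overlap with the target image's ground-truth captions while penalizing overlap with captions of the reference (distractor) images. It uses simple unigram F1 as a stand-in scorer; the actual DisCIDEr builds on CIDEr, and this combination rule is an assumption.

```python
from collections import Counter

def f1_overlap(candidate: str, reference: str) -> float:
    # Unigram F1 between two captions (a stand-in for CIDEr).
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def distinctive_score(caption, target_refs, distractor_caps, lam=0.5):
    # Reward similarity to the target's ground-truth captions and penalize
    # similarity to captions of the reference (distractor) images.
    accuracy = max(f1_overlap(caption, ref) for ref in target_refs)
    confusion = max(f1_overlap(caption, d) for d in distractor_caps)
    return accuracy - lam * confusion

print(distinctive_score(
    "a red car parked near a tree",
    ["a red car parked by a tree"],
    ["a blue car parked near a tree"]))
```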
FixMyPose: Pose Correctional Captioning and Retrieval
Interest in physical therapy and individual exercises such as yoga/dance has
increased alongside the well-being trend. However, such exercises are hard to
follow without expert guidance, which cannot scale to provide personalized
feedback to every trainee remotely. Thus, automated pose correction systems
are required more than ever, and we introduce a new captioning dataset named
FixMyPose to address this need. We collect descriptions of correcting a
"current" pose to look like a "target" pose (in both English and Hindi). The
collected descriptions have interesting linguistic properties such as
egocentric relations to environment objects, analogous references, etc.,
requiring an understanding of spatial relations and commonsense knowledge about
postures. Further, to avoid ML biases, we maintain a balance across characters
with diverse demographics, who perform a variety of movements in several
interior environments (e.g., homes, offices). From our dataset, we introduce
the pose-correctional-captioning task and its reverse target-pose-retrieval
task. During the correctional-captioning task, models must generate
descriptions of how to move from the current to target pose image, whereas in
the retrieval task, models should select the correct target pose given the
initial pose and correctional description. We present strong cross-attention
baseline models (uni/multimodal, RL, multilingual) and also show that our
baselines are competitive with other models when evaluated on other
image-difference datasets. We also propose new task-specific metrics
(object-match, body-part-match, direction-match) and conduct human studies
for more reliable evaluation; the large human-model performance gap we observe
suggests ample room for future work. To verify the
sim-to-real transfer of our FixMyPose dataset, we collect a set of real images
and show promising performance on these images.
Comment: AAAI 2021 (18 pages, 16 figures; webpage: https://fixmypose-unc.github.io/)
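To illustrate the target-pose-retrieval task, the sketch below scores each candidate target image against a query fused from the initial-pose features and the correctional description, then picks the argmax. The projection layers and the simple concatenation-based fusion are assumptions; the paper's baselines use cross-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseRetriever(nn.Module):
    """Toy retrieval model: pick the target pose matching (image, instruction)."""
    def __init__(self, img_dim=512, txt_dim=300, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)  # hypothetical fusion layer

    def forward(self, current_img, instruction, candidates):
        # current_img: (B, img_dim), instruction: (B, txt_dim),
        # candidates: (B, K, img_dim) candidate target-pose features.
        q = torch.cat([self.img_proj(current_img),
                       self.txt_proj(instruction)], dim=-1)
        q = F.normalize(self.fuse(q), dim=-1)               # (B, dim) query
        c = F.normalize(self.img_proj(candidates), dim=-1)  # (B, K, dim)
        scores = torch.einsum('bd,bkd->bk', q, c)           # per-candidate similarity
        return scores.argmax(dim=-1)                        # predicted target index
```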