Visual Commonsense R-CNN
We present a novel unsupervised feature representation learning method,
Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to
serve as an improved visual region encoder for high-level tasks such as
captioning and VQA. Given a set of detected object regions in an image (e.g.,
using Faster R-CNN), like any other unsupervised feature learning method
(e.g., word2vec), the proxy training objective of VC R-CNN is to predict the
contextual objects of a region. However, they are fundamentally different: VC
R-CNN makes its prediction using causal intervention, P(Y|do(X)), while
others use the conventional likelihood, P(Y|X). This is also the core reason
why VC R-CNN can learn "sense-making" knowledge, such as a chair can be sat
on, rather than just "common" co-occurrences, such as a chair is likely to
exist if a table is observed. We extensively apply VC R-CNN features in
prevailing models of three popular tasks: Image Captioning, VQA, and VCR, and
observe consistent performance boosts across them, achieving many new
state-of-the-art results. Code and features are available at
https://github.com/Wangt-CN/VC-R-CNN.
Comment: Accepted by CVPR 2020
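To make the contrast concrete (a standard textbook illustration, not text
from the paper): if Z is a confounder set satisfying the backdoor criterion,
the conventional likelihood mixes the confounder in through P(z|X), whereas
the backdoor-adjusted intervention weights it by its prior P(z):

    P(Y \mid X) = \sum_{z} P(Y \mid X, z)\, P(z \mid X)
    P(Y \mid \mathrm{do}(X)) = \sum_{z} P(Y \mid X, z)\, P(z)

Intuitively, do(X) cuts the confounder's dependence on X, so frequently
co-occurring context (e.g., a table next to a chair) no longer dominates the
prediction.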
Learning Causal Features for Incremental Object Detection
Object detectors are limited to the categories seen during the training
phase and therefore cannot cover all objects of interest to users. To meet
this practical necessity, the incremental learning ability of a detector
becomes a critical factor for real-world applications. Unfortunately, neural
networks unavoidably suffer from catastrophic forgetting when trained on a
new task. Consequently, many incremental object detection models preserve the
knowledge of previous tasks by replaying samples or distilling from previous
models. However, they ignore an important factor: the performance of a model
mostly depends on its features. These models try to revive the memory of the
neural network with previous samples rather than prevent forgetting in the
first place. In this paper, we therefore propose an incremental causal object
detection (ICOD) model that learns causal features, which can adapt to more
tasks. Traditional object detection models unavoidably depend on data-bias or
data-specific features to obtain their detection results, and such features
cannot adapt to a new task. Under incremental learning, the data-bias
information is not beneficial to the new task, and incremental training may
eliminate these features and lead to forgetting. ICOD is therefore designed
to learn causal features, rather than data-bias features, when training the
detector. Thus, when the model is applied to a new task, the causal features
of the old task aid the incremental learning process and alleviate the
catastrophic forgetting problem. Experiments in several settings show that
causal features free of data bias allow the model to adapt to new tasks
better.
Keywords: object detection, incremental learning, causal feature
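The abstract stays at a high level, so the following is only a generic,
hypothetical scaffold for incremental detector training with feature
preservation (the distillation baseline the abstract mentions), not ICOD's
causal-feature objective; every name in it is assumed.

    # Hypothetical sketch: incremental training with feature preservation.
    # This mirrors the distillation baseline mentioned above, not ICOD itself.
    import torch
    import torch.nn.functional as F

    def incremental_step(backbone, old_backbone, images, detection_loss, lam=1.0):
        feats = backbone(images)              # features for the new task
        with torch.no_grad():
            old_feats = old_backbone(images)  # frozen features of the old-task detector
        # Keep previously learned features stable to reduce catastrophic forgetting.
        preserve = F.mse_loss(feats, old_feats)
        return detection_loss(feats) + lam * preserve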
ComCLIP: Training-Free Compositional Image and Text Matching
Contrastive Language-Image Pretraining (CLIP) has demonstrated great
zero-shot performance for image-text matching because of its holistic use of
natural language supervision that covers large-scale, open-world visual
concepts. However, it remains challenging to adapt CLIP to compositional
image and text matching -- a more demanding image and text matching task that
requires the model to understand compositional word concepts and visual
components.
Towards better compositional generalization in zero-shot image and text
matching, in this paper, we study the problem from a causal perspective: the
erroneous semantics of individual entities are essentially confounders that
cause the matching failure. Therefore, we propose a novel training-free
compositional CLIP model (ComCLIP). ComCLIP disentangles input images into
subjects, objects, and action sub-images and composes CLIP's vision encoder and
text encoder to perform evolving matching over compositional text embedding and
sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations
introduced by the pretrained CLIP models and dynamically assess the
contribution of each entity when performing image and text matching.
Experiments on compositional image-text matching on SVO and ComVG and general
image-text retrieval on Flickr8K demonstrate the effectiveness of our
plug-and-play method, which boosts the zero-shot inference ability of CLIP even
without further training or fine-tuning of CLIP.
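A minimal sketch of the training-free idea, assuming the subject, object, and
action sub-images are already extracted and using the open-source OpenAI CLIP
package; how ComCLIP actually obtains sub-images and weights their
contributions is not shown here, and the simple score combination below is an
assumption.

    # Illustrative only: score the whole image and its sub-images against the
    # caption with a frozen CLIP, then combine the similarities.
    import torch
    import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def compositional_score(full_image, sub_images, caption):
        # full_image: PIL.Image; sub_images: list of PIL.Image crops
        images = torch.stack([preprocess(im) for im in [full_image, *sub_images]]).to(device)
        text = clip.tokenize([caption]).to(device)
        with torch.no_grad():
            img_emb = model.encode_image(images)
            txt_emb = model.encode_text(text)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        sims = (img_emb @ txt_emb.T).squeeze(-1)  # per-(sub-)image similarity to the caption
        # Assumed combination: global match plus the mean entity-level contribution.
        return sims[0] + sims[1:].mean()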
VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
Existing visual question reasoning methods usually fail to explicitly
discover the inherent causal mechanism and neglect to jointly model cross-modal
event temporality and causality. In this paper, we propose a visual question
reasoning framework named Cross-Modal Question Reasoning (CMQR), to discover
temporal causal structure and mitigate visual spurious correlation by causal
intervention. To explicitly discover visual causal structure, the Visual
Causality Discovery (VCD) architecture is proposed to find the
question-critical scene temporally and to disentangle visual spurious
correlations with an attention-based front-door causal intervention module,
the Local-Global Causal Attention Module (LGCAM). To align the fine-grained
interactions between
linguistic semantics and spatial-temporal representations, we construct an
Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal
co-occurrence interactions between visual and linguistic content. Extensive
experiments on four datasets demonstrate the superiority of CMQR for
discovering visual causal structures and achieving robust question reasoning.
Comment: 12 pages, 6 figures. arXiv admin note: substantial text overlap with
arXiv:2207.1264
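For reference (standard causal-inference notation, not a formula quoted from
the paper): if the attended question-critical features M act as a mediator
satisfying the front-door criterion between the visual input X and the answer
Y, the front-door adjustment that such an intervention module approximates is

    P(Y \mid \mathrm{do}(X)) = \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid x', m)\, P(x')

One way to read the Local-Global Causal Attention Module is as an
attention-based estimate of these two expectations; the exact
parameterization is defined in the paper rather than here.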
Context De-confounded Emotion Recognition
Context-Aware Emotion Recognition (CAER) is a crucial and challenging task
that aims to perceive the emotional states of the target person with contextual
information. Recent approaches invariably focus on designing sophisticated
architectures or mechanisms to extract seemingly meaningful representations
from subjects and contexts. However, a long-overlooked issue is that a context
bias in existing datasets leads to a significantly unbalanced distribution of
emotional states among different context scenarios. Concretely, the harmful
bias is a confounder that misleads existing models to learn spurious
correlations based on conventional likelihood estimation, significantly
limiting the models' performance. To tackle the issue, this paper provides a
causality-based perspective to disentangle the models from the impact of such
bias, and formulates the causalities among variables in the CAER task via a
tailored causal graph. Then, we propose a Contextual Causal Intervention Module
(CCIM) based on the backdoor adjustment to de-confound the confounder and
exploit the true causal effect for model training. CCIM is a plug-and-play,
model-agnostic module that improves diverse state-of-the-art approaches by
considerable margins. Extensive experiments on three benchmark datasets
demonstrate the effectiveness of our CCIM and the significance of causal
insight.
Comment: Accepted by CVPR 2023. CCIM is available at
https://github.com/ydk122024/CCI
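One common way such backdoor-style intervention modules are realized in
practice is to approximate the sum over the confounder with a learned
dictionary of context prototypes reweighted by their prior; whether CCIM
follows exactly this design is not stated in the abstract, so the sketch
below, including all names and the fusion step, is an assumption.

    # Hedged sketch: approximate the backdoor sum over a confounder with a
    # learned dictionary of context prototypes, weighted by attention and P(z).
    import torch
    import torch.nn as nn

    class ConfounderDictionaryIntervention(nn.Module):
        def __init__(self, feat_dim, num_prototypes, prior=None):
            super().__init__()
            # Prototypes summarizing recurring context scenarios (confounder values z_k).
            self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
            # P(z): prior over confounder values; uniform when unknown.
            if prior is None:
                prior = torch.full((num_prototypes,), 1.0 / num_prototypes)
            self.register_buffer("prior", prior)
            self.query = nn.Linear(feat_dim, feat_dim)

        def forward(self, x):
            # x: (batch, feat_dim) subject/context features.
            # Attention over prototypes, reweighted by the prior P(z).
            attn = torch.softmax(self.query(x) @ self.prototypes.T, dim=-1) * self.prior
            attn = attn / attn.sum(dim=-1, keepdim=True)
            z = attn @ self.prototypes   # expected confounder under the adjusted weights
            return x + z                 # fused feature for the downstream emotion classifier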