Visual Compositional Learning for Human-Object Interaction Detection
Human-Object Interaction (HOI) detection aims to localize and infer relationships between humans and objects in an image. It is challenging because the enormous number of possible combinations of objects and verb types forms a long-tail distribution. We devise a deep Visual Compositional Learning (VCL) framework, a simple yet efficient approach that effectively addresses this problem. VCL first decomposes an HOI representation into object-specific and verb-specific features, and then composes new interaction samples in the feature space by stitching the decomposed features. The integration of decomposition and composition enables VCL to share object and verb features among different HOI samples and images, and to generate new interaction samples and new types of HOI, thus largely alleviating the long-tail distribution problem and benefiting low-shot and zero-shot HOI detection. Extensive experiments demonstrate that the proposed VCL effectively improves the generalization of HOI detection on HICO-DET and V-COCO and outperforms recent state-of-the-art methods on HICO-DET. Code is available at https://github.com/zhihou7/VCL.
Comment: Accepted in ECCV 2020
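The decompose-and-compose step at the heart of VCL lends itself to a short sketch. The following PyTorch snippet is a minimal illustration of the idea, not the authors' implementation; all module names and feature dimensions are made up:

```python
import torch
import torch.nn as nn

class VCLComposer(nn.Module):
    """Toy decompose-and-compose step in the spirit of VCL (all names
    and dimensions are illustrative, not the authors' code)."""
    def __init__(self, hoi_dim=2048, part_dim=1024, num_hoi_classes=600):
        super().__init__()
        # Decompose an HOI feature into verb- and object-specific branches.
        self.verb_head = nn.Linear(hoi_dim, part_dim)
        self.obj_head = nn.Linear(hoi_dim, part_dim)
        # Classify the (re)composed verb-object pair.
        self.classifier = nn.Linear(2 * part_dim, num_hoi_classes)

    def forward(self, hoi_feats):
        verb = self.verb_head(hoi_feats)   # (B, part_dim)
        obj = self.obj_head(hoi_feats)     # (B, part_dim)
        # Compose new interactions by stitching each verb feature to an
        # object feature taken from a different sample in the batch.
        perm = torch.randperm(obj.size(0))
        real = self.classifier(torch.cat([verb, obj], dim=-1))
        composed = self.classifier(torch.cat([verb, obj[perm]], dim=-1))
        return real, composed, perm  # perm lets the caller derive new labels

logits, composed_logits, perm = VCLComposer()(torch.randn(4, 2048))
```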
DecAug: Augmenting HOI Detection via Decomposition
Human-object interaction (HOI) detection requires a large amount of annotated data, and current algorithms suffer from insufficient training samples and category imbalance within datasets. To increase data efficiency, in this paper we propose DecAug, an efficient and effective data augmentation method for HOI detection. Based on our proposed object state similarity metric, object patterns across different HOIs are shared to augment local object appearance features without changing their state. Further, we shift the spatial correlation between humans and objects to other feasible configurations with the aid of a pose-guided Gaussian Mixture Model while preserving their interactions. Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets for two advanced models; in particular, interactions with fewer samples enjoy more notable improvements. Our method can be easily integrated into various HOI detection models with negligible extra computational cost. Our code will be made publicly available.
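The pose-guided spatial shift can be pictured with off-the-shelf tools. The sketch below is a loose, hypothetical rendering of the idea using scikit-learn's GaussianMixture on synthetic offset data; it is not the DecAug code, and the [x0, y0, x1, y1] box convention and offset normalization are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data: relative human->object center offsets (normalized by the
# human box size) observed for one interaction class. Purely synthetic here.
rng = np.random.default_rng(0)
offsets = rng.normal(loc=[0.4, 0.1], scale=0.05, size=(200, 2))

gmm = GaussianMixture(n_components=3, random_state=0).fit(offsets)

def shift_object_box(human_box, obj_box):
    """Relocate obj_box to a new plausible position relative to human_box.
    Boxes are assumed to be [x0, y0, x1, y1]."""
    dx, dy = gmm.sample(1)[0][0]            # draw one feasible offset
    hw = human_box[2] - human_box[0]
    hh = human_box[3] - human_box[1]
    ow = obj_box[2] - obj_box[0]
    oh = obj_box[3] - obj_box[1]
    x0 = human_box[0] + dx * hw
    y0 = human_box[1] + dy * hh
    return [x0, y0, x0 + ow, y0 + oh]

new_box = shift_object_box([10, 10, 60, 160], [70, 20, 90, 40])
```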
Class-level Structural Relation Modelling and Smoothing for Visual Representation Learning
Representation learning for images has been advanced by recent progress in more complex neural models, such as Vision Transformers, and new learning theories, such as structural causal models. However, these models mainly rely on the classification loss to implicitly regularize the class-level data distributions, and they may face difficulties when handling classes with diverse visual patterns. We argue that incorporating structural information between data samples may improve this situation. To achieve this goal, this paper presents a framework termed Class-level Structural Relation Modelling and Smoothing for Visual Representation Learning (CSRMS), which includes Class-level Relation Modelling, Class-aware Graph Sampling, and Relational Graph-Guided Representation Learning modules that model a relational graph of the entire dataset and perform class-aware smoothing and regularization to alleviate the issues of intra-class visual diversity and inter-class similarity. Specifically, the Class-level Relation Modelling module uses a clustering algorithm to learn the data distributions in the feature space and identify three types of class-level sample relations for the training set; the Class-aware Graph Sampling module extends the typical training batch construction process with three strategies for sampling dataset-level sub-graphs; and the Relational Graph-Guided Representation Learning module employs a graph convolution network with knowledge-guided smoothing operations to ease the projection from different visual patterns to the same class. Experiments demonstrate the effectiveness of structured knowledge modelling for enhanced representation learning and show that CSRMS can be incorporated into any state-of-the-art visual representation learning model for performance gains. The source code and demos have been released at https://github.com/czt117/CSRMS.
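The knowledge-guided smoothing step can be viewed as one round of normalized graph propagation over a sampled sub-graph. The toy snippet below sketches that operation under assumed shapes and a random stand-in adjacency; the actual CSRMS modules are more involved:

```python
import torch
import torch.nn as nn

class RelationalSmoothing(nn.Module):
    """One round of normalized graph propagation over a sampled sub-graph
    of sample features (toy version; dimensions illustrative)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # Add self-loops, row-normalize, then average related samples'
        # features so visually diverse members of a class move closer.
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.proj((adj / deg) @ feats))

feats = torch.randn(8, 512)              # features of one sampled sub-graph
adj = (torch.rand(8, 8) > 0.7).float()   # stand-in class-level relations
smoothed = RelationalSmoothing()(feats, adj)
```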
Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection
Human-Object Interaction (HOI) detection is a core task for high-level image understanding. Recently, Detection Transformer (DETR)-based HOI detectors have become popular due to their superior performance and efficient structure. However, these approaches typically adopt fixed HOI queries for all testing images, which makes them vulnerable to changes in object locations within a specific image. Accordingly, in this paper we propose to enhance DETR's robustness by mining hard-positive queries, which are forced to make correct predictions using partial visual cues. First, we explicitly compose hard-positive queries according to the ground-truth (GT) positions of labeled human-object pairs in each training image. Specifically, we shift the GT bounding boxes of each labeled human-object pair so that the shifted boxes cover only a certain portion of the GT ones, and we encode the coordinates of the shifted boxes for each labeled human-object pair into an HOI query. Second, we implicitly construct another set of hard-positive queries by masking the top scores in the cross-attention maps of the decoder layers; the masked attention maps then cover only part of the important cues for HOI prediction. Finally, an alternating strategy is proposed that efficiently combines both types of hard queries: in each iteration, both DETR's learnable queries and one selected type of hard-positive queries are adopted for loss computation. Experimental results show that our proposed approach can be widely applied to existing DETR-based HOI detectors. Moreover, we consistently achieve state-of-the-art performance on three benchmarks: HICO-DET, V-COCO, and HOI-A. Code is available at https://github.com/MuchHair/HQM.
Comment: Accepted by ECCV 2022
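The explicit hard-positive query construction reduces to shifting GT boxes and encoding them. The following sketch, with an assumed normalized (cx, cy, w, h) box format and a made-up shift rule, illustrates the idea rather than reproducing the HQM code:

```python
import torch

def make_hard_positive_query(gt_box, keep_ratio=0.6):
    """Shift a normalized (cx, cy, w, h) GT box so the shifted box keeps
    only partial overlap with the original (made-up shift rule)."""
    cx, cy, w, h = gt_box
    dx = (torch.rand(()) * 2 - 1) * (1 - keep_ratio) * w
    dy = (torch.rand(()) * 2 - 1) * (1 - keep_ratio) * h
    return torch.stack([cx + dx, cy + dy, w, h])

human_q = make_hard_positive_query(torch.tensor([0.5, 0.5, 0.2, 0.6]))
object_q = make_hard_positive_query(torch.tensor([0.6, 0.4, 0.1, 0.1]))
# The shifted human/object boxes would then be embedded (e.g., by an MLP
# or a sinusoidal positional encoding) into a single HOI query vector.
hoi_query = torch.cat([human_q, object_q])
```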
Visual Semantic Parsing: From Images to Abstract Meaning Representation
The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., an image) into a structured representation in which entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. Moreover, these formalisms remain limited in the nature of the entities and relations they can capture. In this paper, we propose to leverage a widely used meaning representation from the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from the visual input. Moreover, they allow us to generate meta-AMR graphs that unify the information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.
Comment: Published in CoNLL 2022
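Re-purposing a text-to-AMR parser can be tried with open-source tooling. The sketch below uses the amrlib package to parse an image caption into an AMR graph, assuming amrlib and one of its pretrained sentence-to-graph models are installed; the paper's actual pipeline differs:

```python
# Assumes the amrlib package and one of its pretrained sentence-to-graph
# models are installed; this is not the paper's pipeline.
import amrlib

stog = amrlib.load_stog_model()
captions = ["A man is riding a horse on the beach."]
graphs = stog.parse_sents(captions)
print(graphs[0])  # Penman-notation AMR graph for the caption
```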
Discovering Human-Object Interaction Concepts via Self-Compositional Learning
A comprehensive understanding of human-object interaction (HOI) requires detecting not only a small set of predefined HOI concepts (or categories) but also other reasonable HOI concepts, whereas current approaches usually fail to explore a huge portion of unknown HOI concepts (i.e., unknown but reasonable combinations of verbs and objects). In this paper, 1) we introduce a novel and challenging task for comprehensive HOI understanding, termed HOI Concept Discovery, and 2) we devise a self-compositional learning framework (SCL) for HOI concept discovery. Specifically, we maintain an online-updated concept confidence matrix during training: 1) we assign pseudo-labels to all composite HOI instances according to the concept confidence matrix for self-training, and 2) we update the concept confidence matrix using the predictions on all composite HOI instances. The proposed method therefore enables learning on both known and unknown HOI concepts. We perform extensive experiments on several popular HOI datasets to demonstrate the effectiveness of the proposed method for HOI concept discovery, object affordance recognition, and HOI detection. For example, the proposed self-compositional learning framework improves the performance of 1) HOI concept discovery by over 10% on HICO-DET and over 3% on V-COCO, respectively; 2) object affordance recognition by over 9% mAP on MS-COCO and HICO-DET; and 3) rare-first and non-rare-first unknown HOI detection by over 30% and 20% relative improvement, respectively. Code and models will be made publicly available at https://github.com/zhihou7/HOI-CL.
Comment: Technical Report
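The online concept confidence matrix admits a compact sketch. The snippet below is a toy version in which the verb/object counts, the momentum update, and the threshold are all assumptions; it only illustrates the self-training loop described above:

```python
import torch

# Toy version of the online concept confidence matrix (verb/object counts,
# the momentum update, and the threshold are all assumptions).
num_verbs, num_objects = 117, 80
concept_conf = torch.zeros(num_verbs, num_objects)
momentum = 0.99

def update_and_pseudo_label(verb_ids, obj_ids, scores, thresh=0.5):
    """Update concept confidences from composite-instance predictions and
    return pseudo-labels for self-training."""
    for v, o, s in zip(verb_ids, obj_ids, scores):
        concept_conf[v, o] = momentum * concept_conf[v, o] + (1 - momentum) * s
    # Composite instances whose verb-object concept is confident enough
    # become positives for the next round of self-training.
    return torch.stack([concept_conf[v, o] > thresh
                        for v, o in zip(verb_ids, obj_ids)])

labels = update_and_pseudo_label(torch.tensor([4, 12]),
                                 torch.tensor([7, 7]),
                                 torch.tensor([0.9, 0.2]))
```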
Mining Cross-Person Cues for Body-Part Interactiveness Learning in HOI Detection
Human-Object Interaction (HOI) detection plays a crucial role in activity understanding. Though significant progress has been made, interactiveness learning remains a challenging problem in HOI detection: existing methods usually generate redundant negative human-object (H-O) pair proposals and fail to effectively extract interactive pairs. Though interactiveness has been studied at both the whole-body and part levels and facilitates H-O pairing, previous works focus only on the target person (i.e., a local perspective) and overlook the information from other persons. In this paper, we argue that simultaneously comparing the body-parts of multiple persons can provide more useful and supplementary interactiveness cues. That is, we learn body-part interactiveness from a global perspective: when classifying a target person's body-part interactiveness, visual cues are explored not only from that person but also from the other persons in the image. We construct body-part saliency maps based on self-attention to mine cross-person informative cues and learn the holistic relationships among all body-parts. We evaluate the proposed method on the widely used HICO-DET and V-COCO benchmarks. With our new perspective, the holistic global-local body-part interactiveness learning achieves significant improvements over the state-of-the-art. Our code is available at https://github.com/enlighten0707/Body-Part-Map-for-Interactiveness.
Comment: To appear in ECCV 2022
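The cross-person cue mining can be viewed as self-attention over the body-part features of every person in the image. The snippet below is a minimal, hypothetical rendering with invented dimensions, not the released model:

```python
import torch
import torch.nn as nn

# Minimal rendering of cross-person body-part attention: the part features
# of every person in the image attend to one another, so one person's
# part interactiveness can borrow cues from others (dimensions invented).
num_persons, num_parts, dim = 3, 6, 256
part_feats = torch.randn(1, num_persons * num_parts, dim)  # (batch, tokens, dim)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
global_parts, attn_weights = attn(part_feats, part_feats, part_feats)

# Per-part interactiveness scores from the globally contextualized features.
interactiveness = torch.sigmoid(nn.Linear(dim, 1)(global_parts))
```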
Panoptic Scene Graph Generation
Existing research addresses scene graph generation (SGG) -- a critical technology for scene understanding in images -- from a detection perspective, i.e., objects are detected using bounding boxes, followed by prediction of their pairwise relationships. We argue that such a paradigm causes several problems that impede the progress of the field. For instance, bounding box-based labels in current datasets usually contain redundant classes like hairs and leave out background information that is crucial to the understanding of context. In this work, we introduce panoptic scene graph generation (PSG), a new problem task that requires the model to generate a more comprehensive scene graph representation based on panoptic segmentations rather than rigid bounding boxes. A high-quality PSG dataset, containing 49k well-annotated overlapping images from COCO and Visual Genome, is created for the community to keep track of its progress. For benchmarking, we build four two-stage baselines, modified from classic SGG methods, and two one-stage baselines, PSGTR and PSGFormer, both based on the efficient Transformer-based detector DETR. While PSGTR uses a set of queries to directly learn triplets, PSGFormer separately models objects and relations in the form of queries from two Transformer decoders, followed by a prompting-like relation-object matching mechanism. In the end, we share insights on open challenges and future directions.
Comment: Accepted to ECCV'22 (Paper ID #222, Final Score 2222). Project Page: https://psgdataset.org/. OpenPSG Codebase: https://github.com/Jingkang50/OpenPSG
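PSGTR's query-to-triplet decoding can be summarized in a few lines. The sketch below shows an assumed triplet classification head over decoder queries; the sizes are illustrative and the segmentation branches are omitted, so refer to the OpenPSG codebase for the real models:

```python
import torch
import torch.nn as nn

class TripletQueryHead(nn.Module):
    """Sketch of a PSGTR-style head where each query decodes directly into
    a (subject, relation, object) triplet. Sizes are illustrative and the
    segmentation branches are omitted; see OpenPSG for the real models."""
    def __init__(self, dim=256, num_classes=133, num_relations=56):
        super().__init__()
        self.sub_cls = nn.Linear(dim, num_classes)
        self.obj_cls = nn.Linear(dim, num_classes)
        self.rel_cls = nn.Linear(dim, num_relations)

    def forward(self, queries):  # queries: (B, num_queries, dim)
        return (self.sub_cls(queries),
                self.rel_cls(queries),
                self.obj_cls(queries))

subj, rel, obj = TripletQueryHead()(torch.randn(2, 100, 256))
```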
Learning Transferable Representations for Hierarchical Relationship Exploration
Visual scenes are composed of basic elements such as objects, parts, and other semantic regions. It is well acknowledged that humans perceive the world in a compositional and hierarchical way, treating a visual scene as a layout of distinct semantic objects, attributes, and parts linked together by different relationships, both visual and semantic. Notably, the parts, attributes, and objects shared among visual concepts (objects and visual relationships) are transferable across concepts: humans can easily imagine a new composite concept from the shared parts of existing ones. In contrast, an important shortcoming of current deep neural networks is their weak compositional perception ability, which is one reason they require large amounts of data to optimize.

From this perspective of compositional perception, this thesis argues that one limitation of typical neural networks is that their factor representations are not sharable and transferable among different concepts. The thesis therefore introduces several techniques, including a compositional learning framework, compositional invariant learning, and the BatchFormer module, to make the factor representations of deep neural networks sharable and transferable among different concepts for hierarchical relationship exploration, covering human-object interaction, 3D human-object interaction, and sample relationships.
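Of the techniques listed, the BatchFormer module has a particularly compact core: a transformer applied along the batch dimension so that samples in a mini-batch can exchange information. The snippet below is a minimal sketch under assumed dimensions, not the thesis implementation:

```python
import torch
import torch.nn as nn

class BatchFormer(nn.Module):
    """Minimal sketch of the BatchFormer idea: a transformer encoder runs
    along the batch dimension so samples in a mini-batch can exchange
    information (sizes assumed; not the thesis implementation)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, feats):  # feats: (B, dim)
        # Treat the batch as a sequence of length B (one token per sample).
        return self.encoder(feats.unsqueeze(1)).squeeze(1)

feats = torch.randn(16, 512)
relational_feats = BatchFormer()(feats)  # same shape, batch-aware features
```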