Detailed 2D-3D Joint Representation for Human-Object Interaction
Human-Object Interaction (HOI) detection lies at the core of action understanding. Besides 2D information such as human/object appearance and locations, 3D pose is also often utilized in HOI learning because of its view-independence. However, coarse 3D body joints carry only sparse body information and are not sufficient for understanding complex interactions; detailed 3D body shape is needed to go further. Meanwhile, the interacted object in 3D has also not been fully studied in HOI learning. In light of these issues, we propose a detailed 2D-3D joint representation learning method. First, we utilize a single-view human body capture method to obtain detailed 3D body, face and hand shapes. Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors. Finally, a joint learning framework and cross-modal consistency tasks are proposed to learn the joint HOI representation. To better evaluate the 2D ambiguity processing capacity of models, we propose a new benchmark named Ambiguous-HOI consisting of hard ambiguous images. Extensive experiments on a large-scale HOI benchmark and Ambiguous-HOI demonstrate the effectiveness of our method. Code and data are available at https://github.com/DirtyHarryLYL/DJ-RN.
Comment: Accepted to CVPR 2020, supplementary materials included, code available: https://github.com/DirtyHarryLYL/DJ-RN
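The 3D object estimation step can be illustrated with simple pinhole-camera back-projection: given a prior physical size for the object category and its 2D box, depth follows from similar triangles. Below is a minimal sketch; the focal length, principal point, prior sizes, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical category-size priors in meters (illustrative values only).
CATEGORY_SIZE_PRIOR = {"bicycle": 1.1, "cup": 0.1, "laptop": 0.35}

def estimate_3d_object(box_2d, category, focal_px=1000.0, principal=(640.0, 360.0)):
    """Back-project a 2D box to a rough 3D location and size (pinhole model).

    box_2d: (x1, y1, x2, y2) in pixels. Depth follows from similar
    triangles: z = focal * real_size / pixel_size.
    """
    x1, y1, x2, y2 = box_2d
    pixel_size = max(x2 - x1, y2 - y1)         # apparent object size in pixels
    real_size = CATEGORY_SIZE_PRIOR[category]  # prior physical size (meters)
    z = focal_px * real_size / pixel_size      # depth along the optical axis
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # 2D box center
    x = (cx - principal[0]) * z / focal_px     # back-project to camera frame
    y = (cy - principal[1]) * z / focal_px
    return np.array([x, y, z]), real_size

center, size = estimate_3d_object((500, 300, 700, 500), "bicycle")
```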
Rb-PaStaNet: A Few-Shot Human-Object Interaction Detection Based on Rules and Part States
Existing Human-Object Interaction (HOI) Detection approaches have achieved great progress on non-rare classes, while rare HOI classes are still not well detected. In this paper, we intend to apply human prior knowledge to the existing work. We therefore add human-labeled rules to PaStaNet and propose Rb-PaStaNet, aimed at improving the detection of rare HOI classes. Our results show a certain improvement on the rare classes, while the improvement on the non-rare classes and overall is more considerable.
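As a rough illustration of how human-labeled rules over part states could boost rare HOI scores, here is a minimal sketch; the rule set, class names, and score-adjustment scheme are hypothetical, not the paper's actual rules.

```python
# Hypothetical rules: an HOI class is boosted when all of its required
# body part states (PaSta) are confidently detected.
RULES = {
    "ride_bicycle": ["hip_sit_on", "hand_hold", "foot_step_on"],
    "feed_horse":   ["hand_hold", "hand_reach_for"],
}

def apply_rules(hoi_scores, pasta_probs, boost=0.2, threshold=0.5):
    """hoi_scores / pasta_probs: dicts mapping class names to probabilities."""
    adjusted = dict(hoi_scores)
    for hoi, required_states in RULES.items():
        if all(pasta_probs.get(s, 0.0) > threshold for s in required_states):
            adjusted[hoi] = min(1.0, adjusted.get(hoi, 0.0) + boost)
    return adjusted
```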
Reformulating HOI Detection as Adaptive Set Prediction
Determining which image regions to concentrate on is critical for Human-Object Interaction (HOI) detection. Conventional HOI detectors focus on either detected human-object pairs or pre-defined interaction locations, which limits the learning of effective features. In this paper, we reformulate HOI detection as an adaptive set prediction problem. With this novel formulation, we propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instance and interaction branches. To attain this, we map a trainable interaction query set to an interaction prediction set with a transformer. Each query adaptively aggregates interaction-relevant features from global contexts through multi-head co-attention. In addition, the training process is supervised adaptively by matching each ground truth with an interaction prediction. Furthermore, we design an effective instance-aware attention module to introduce instructive features from the instance branch into the interaction branch. Our method outperforms previous state-of-the-art methods, without any extra human pose or language features, on three challenging HOI detection datasets. In particular, we achieve a considerable relative improvement on the large-scale HICO-DET dataset. Code is available at https://github.com/yoyomimi/AS-Net.
Comment: Accepted to CVPR 2021
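The core set-prediction idea, i.e. a trainable query set decoded by a transformer and matched one-to-one to ground truth, can be sketched as follows. Dimensions, the decoder configuration, and the matching cost (the paper's cost also involves instance boxes) are assumptions for illustration.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class InteractionDecoder(nn.Module):
    """Sketch: map a trainable interaction query set to an interaction
    prediction set with a transformer decoder."""
    def __init__(self, num_queries=64, dim=256, num_actions=117):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # trainable query set
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, memory):  # memory: (B, HW, dim) flattened image features
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)   # queries aggregate global context
        return self.action_head(hs)    # (B, num_queries, num_actions)

def match_predictions(pred_logits, gt_labels):
    """One-to-one assignment of queries to ground-truth interactions
    (classification cost only, for simplicity)."""
    cost = -pred_logits.sigmoid()[:, gt_labels]  # (num_queries, num_gt)
    return linear_sum_assignment(cost.detach().cpu().numpy())
```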
Detecting Human-Object Interactions with Action Co-occurrence Priors
A common problem in the human-object interaction (HOI) detection task is that numerous HOI classes have only a small number of labeled examples, resulting in training sets with a long-tailed distribution. The lack of positive labels can lead to low classification accuracy for these classes. To address this issue, we observe that there exist natural correlations and anti-correlations among human-object interactions. In this paper, we model the correlations as action co-occurrence matrices and present techniques to learn these priors and leverage them for more effective training, especially of rare classes. The utility of our approach is demonstrated experimentally: its performance exceeds state-of-the-art methods on both of the leading HOI detection benchmark datasets, HICO-DET and V-COCO.
Comment: ECCV 2020. Source code: https://github.com/Dong-JinKim/ActionCooccurrencePriors
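The idea of learning co-occurrence priors can be illustrated with a simple frequency-based construction; the blending scheme and hyperparameters below are illustrative assumptions, not the paper's exact technique.

```python
import numpy as np

def build_cooccurrence(annotations, num_actions):
    """Estimate P(action j | action i) from per-instance action label sets."""
    counts = np.zeros((num_actions, num_actions))
    for labels in annotations:  # labels: set of action indices on one instance
        for i in labels:
            for j in labels:
                counts[i, j] += 1
    return counts / np.maximum(counts.diagonal()[:, None], 1)  # row i ~ P(.|i)

def propagate_scores(scores, prior, alpha=0.7):
    """Blend raw action scores with co-occurrence-propagated scores, so a
    confident frequent action can lift its correlated rare actions."""
    return alpha * scores + (1 - alpha) * scores @ prior

prior = build_cooccurrence([{0, 2}, {0, 2}, {1}, {0}], num_actions=3)
refined = propagate_scores(np.array([0.9, 0.1, 0.2]), prior)
```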
PGT: A Progressive Method for Training Models on Long Videos
Convolutional video models have an order of magnitude larger computational complexity than their image-level counterparts. Constrained by computational resources, no existing model or training method can train on long video sequences end-to-end. Currently, the mainstream method is to split a raw video into clips, leading to an incomplete and fragmentary temporal information flow. Inspired by natural language processing techniques for dealing with long sentences, we propose to treat videos as serial fragments satisfying the Markov property, and to train on a whole video by progressively propagating information through the temporal dimension in multiple steps. This progressive training (PGT) method is able to train long videos end-to-end with limited resources and ensures the effective transmission of information. As a general and robust training method, we empirically demonstrate that it yields significant performance improvements on different models and datasets. As an illustrative example, the proposed method improves the SlowOnly network by 3.7 mAP on Charades and 1.9 points of top-1 accuracy on Kinetics with negligible parameter and computation overhead. Code is available at https://github.com/BoPang1996/PGT.
Comment: CVPR 2021, Oral
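The progressive scheme can be sketched as follows: gradients stay within each clip, while a detached state carries temporal information forward, so memory cost is bounded by one clip. The model interface (accepting and returning a single hidden-state tensor) is an assumption for illustration.

```python
import torch

def progressive_train_step(model, optimizer, criterion, video, target, num_steps):
    """Train on one long video as serial fragments with the Markov property.

    video: (B, C, T, H, W); the model is assumed to accept and return a
    hidden state tensor, i.e. logits, state = model(clip, state).
    """
    total_loss, state = 0.0, None
    for clip in torch.chunk(video, num_steps, dim=2):  # split the time axis
        optimizer.zero_grad()
        logits, state = model(clip, state)
        loss = criterion(logits, target)
        loss.backward()                 # gradients stay inside this clip
        optimizer.step()
        state = state.detach()          # propagate information, not gradients
        total_loss += loss.item()
    return total_loss / num_steps
```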
Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection
Modern human-object interaction (HOI) detection approaches can be divided into one-stage methods and two-stage ones. One-stage models are more efficient due to their straightforward architectures, but two-stage models still have an advantage in accuracy. Existing one-stage models usually begin by detecting predefined interaction areas or points, and then attend only to these areas for interaction prediction; therefore, they lack reasoning steps that dynamically search for discriminative cues. In this paper, we propose a novel one-stage method, namely the Glance and Gaze Network (GGNet), which adaptively models a set of action-aware points (ActPoints) via glance and gaze steps. The glance step quickly determines whether each pixel in the feature maps is an interaction point. The gaze step leverages feature maps produced by the glance step to adaptively infer ActPoints around each pixel in a progressive manner. Features of the refined ActPoints are aggregated for interaction prediction. Moreover, we design an action-aware approach that effectively matches each detected interaction with its associated human-object pair, along with a novel hard negative attentive loss to improve the optimization of GGNet. All of the above operations are conducted simultaneously and efficiently for all pixels in the feature maps. Finally, GGNet outperforms state-of-the-art methods by significant margins on both the V-COCO and HICO-DET benchmarks. Code of GGNet is available at https://github.com/SherlockHolmes221/GGNet.
Comment: Accepted to CVPR 2021
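A minimal sketch of the glance and gaze steps over a feature map: the glance branch scores every pixel as an interaction point, and the gaze branch predicts offsets to ActPoints whose sampled features are aggregated for per-pixel action prediction. All dimensions, the single gaze iteration, and the bilinear sampling scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlanceGaze(nn.Module):
    """Sketch of glance (per-pixel point scores) and gaze (offset-based
    ActPoint sampling) over a backbone feature map."""
    def __init__(self, dim=256, num_points=9, num_actions=117):
        super().__init__()
        self.num_points = num_points
        self.glance = nn.Conv2d(dim, 1, 1)             # interaction-point score
        self.gaze = nn.Conv2d(dim, 2 * num_points, 1)  # ActPoint (x, y) offsets
        self.head = nn.Linear(dim * num_points, num_actions)

    def forward(self, feat):                           # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        point_map = self.glance(feat).sigmoid()        # (B, 1, H, W)
        K = self.num_points
        off = self.gaze(feat).view(B, K, 2, H, W).permute(0, 1, 3, 4, 2)
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).float().to(feat.device)  # (H, W, 2)
        pts = base + off                               # ActPoints per pixel
        norm = torch.tensor([W - 1, H - 1], dtype=torch.float, device=feat.device)
        grid = (2 * pts / norm - 1).reshape(B, K * H, W, 2)   # to [-1, 1]
        sampled = F.grid_sample(feat, grid, align_corners=True)  # (B, C, K*H, W)
        sampled = sampled.view(B, C, K, H, W).permute(0, 3, 4, 2, 1).reshape(B, H, W, K * C)
        return point_map, self.head(sampled)           # per-pixel action logits
```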
Affordance Transfer Learning for Human-Object Interaction Detection
Reasoning about human-object interactions (HOI) is essential for deeper scene understanding, while object affordances (or functionalities) are of great importance for humans to discover unseen HOIs with novel objects. Inspired by this, we introduce an affordance transfer learning approach to jointly detect HOIs with novel objects and recognize affordances. Specifically, HOI representations can be decoupled into a combination of affordance and object representations, making it possible to compose novel interactions by combining affordance representations with novel object representations from additional images, i.e., transferring the affordance to novel objects. With the proposed affordance transfer learning, the model is also capable of inferring the affordances of novel objects from known affordance representations. The proposed method can thus be used to 1) improve the performance of HOI detection, especially for HOIs with unseen objects, and 2) infer the affordances of novel objects. Experimental results on two datasets, HICO-DET and HOI-COCO (from V-COCO), demonstrate significant improvements over recent state-of-the-art methods for HOI detection and object affordance detection. Code is available at https://github.com/zhihou7/HOI-CL.
Comment: Accepted to CVPR 2021; adds a new but important ablation experiment in the appendix (union-box verb representation)
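The decoupling-and-composition idea can be sketched as follows: an affordance (verb) representation from a labeled HOI is concatenated with an object representation, possibly extracted from an extra object-only image, to synthesize a training example for an unseen combination. The dimensions and classifier head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AffordanceCompose(nn.Module):
    """Sketch: compose affordance and object representations into an HOI
    representation, enabling novel verb-object combinations."""
    def __init__(self, dim=512, num_verbs=117):
        super().__init__()
        self.verb_head = nn.Linear(2 * dim, num_verbs)

    def forward(self, affordance_feat, object_feat):
        # Affordance features may come from labeled HOIs; object features may
        # come from novel objects in additional, interaction-free images.
        return self.verb_head(torch.cat([affordance_feat, object_feat], dim=-1))

model = AffordanceCompose()
aff = torch.randn(8, 512)  # affordance reps from known HOIs
obj = torch.randn(8, 512)  # object reps from novel-object images
logits = model(aff, obj)   # (8, 117) verb scores for the composed pairs
```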
Transferable Interactiveness Knowledge for Human-Object Interaction Detection
Human-Object Interaction (HOI) detection is an important problem for understanding how humans interact with objects. In this paper, we explore interactiveness knowledge, which indicates whether a human and an object interact with each other or not. We found that interactiveness knowledge can be learned across HOI datasets and can bridge the gap between diverse HOI category settings. Our core idea is to exploit an interactiveness network to learn general interactiveness knowledge from multiple HOI datasets and perform Non-Interaction Suppression (NIS) before HOI classification at inference. Owing to the generalization ability of interactiveness, the interactiveness network is a transferable knowledge learner and can cooperate with any HOI detection model to achieve desirable results. We utilize human instance and body part features together to learn interactiveness in a hierarchical paradigm, i.e., instance-level and body part-level interactiveness. Thereafter, a consistency task is proposed to guide the learning and extract deeper interactive visual clues. We extensively evaluate the proposed method on HICO-DET, V-COCO, and a newly constructed PaStaNet-HOI dataset. With the learned interactiveness, our method outperforms state-of-the-art HOI detection methods, verifying its efficacy and flexibility. Code is available at https://github.com/DirtyHarryLYL/Transferable-Interactiveness-Network.
Comment: TPAMI version of our CVPR 2019 paper with a new benchmark, PaStaNet-HOI. Code: https://github.com/DirtyHarryLYL/Transferable-Interactiveness-Network. arXiv admin note: substantial text overlap with arXiv:1811.0826
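Non-Interaction Suppression reduces, at inference, the many non-interactive human-object pairs fed to the HOI classifier. A minimal sketch, with the threshold and rescaling scheme as illustrative assumptions:

```python
def non_interaction_suppression(pairs, interactiveness, hoi_scores, thresh=0.1):
    """Drop pairs the interactiveness network judges non-interactive, then
    rescale the surviving HOI scores by the interactiveness score."""
    kept = []
    for pair, inter_s, scores in zip(pairs, interactiveness, hoi_scores):
        if inter_s < thresh:  # likely a non-interactive pair: suppress it
            continue
        kept.append((pair, [s * inter_s for s in scores]))
    return kept
```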
Polysemy Deciphering Network for Robust Human-Object Interaction Detection
Human-Object Interaction (HOI) detection is important for human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics in different HOI categories, an approach that ignores the diverse semantic meanings of the verb. To address this issue, in this paper we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we refine features for HOI detection to be polysemy-aware through the use of two novel modules: Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights the important elements in human and object appearance features for each HOI category to be identified; moreover, LPFA augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module (PAMF), which guides PD-Net to make decisions based on the feature types deemed more important according to the language priors. Third, we propose to relieve the verb polysemy problem by sharing verb classifiers across semantically similar HOI categories. Furthermore, to expedite research on the verb polysemy problem, we build a new benchmark dataset named HOI-VerbPolysemy (HOI-VP), which includes common verbs (predicates) that have diverse semantic meanings in the real world. Finally, through deciphering the visual polysemy of verbs, our approach is demonstrated to outperform state-of-the-art methods by significant margins on the HICO-DET, V-COCO, and HOI-VP databases. Code and data are available at https://github.com/MuchHair/PD-Net.
Comment: IJCV version, extended significantly from our ECCV 2020 conference paper
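The LPCA idea, i.e. using a language prior to decide which visual channels matter for a given HOI category, can be sketched as a word-embedding-driven channel gate; the dimensions and the two-layer gate below are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class LanguageChannelAttention(nn.Module):
    """Sketch of language prior-guided channel attention: the word embedding
    of a candidate HOI category gates the visual feature channels."""
    def __init__(self, word_dim=300, feat_dim=1024):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(word_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.Sigmoid(),  # per-channel weights
        )

    def forward(self, visual_feat, word_embedding):
        return visual_feat * self.gate(word_embedding)  # reweight channels

# Usage: the gated features then feed verb classifiers that can be shared by
# semantically similar categories, e.g. "ride bicycle" and "ride horse"
# sharing one "ride" head.
attn = LanguageChannelAttention()
feat = torch.randn(4, 1024)  # appearance features for 4 candidate pairs
emb = torch.randn(4, 300)    # language embeddings of the HOI categories
gated = attn(feat, emb)
```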