47 research outputs found
Detecting Human-Object Contact in Images
Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons for the 2D image areas where contact takes place. We also annotate which body part is involved in each contact. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability.
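To make the detector's structure concrete, here is a minimal PyTorch sketch of an attention-guided contact detector in the spirit described above. All module names, layer sizes, and the number of body parts (NUM_PARTS) are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

NUM_PARTS = 17  # assumed number of annotated body parts (hypothetical)

class ContactDetector(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared image encoder producing a spatial feature map (assumed).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Part-attention branch: one spatial attention map per body part.
        self.part_attn = nn.Conv2d(feat_dim, NUM_PARTS, 1)
        # Contact branch: per-part contact heatmaps, gated by attention.
        self.contact_head = nn.Conv2d(feat_dim, NUM_PARTS, 1)

    def forward(self, img):
        f = self.backbone(img)                       # B x C x H x W
        attn = torch.sigmoid(self.part_attn(f))      # B x P x H x W
        heat = torch.sigmoid(self.contact_head(f))   # B x P x H x W
        contact = heat * attn                        # attention-guided contact
        # A part counts as "in contact" if its heatmap responds anywhere.
        part_scores = contact.flatten(2).max(dim=2).values  # B x P
        return contact, part_scores

heatmaps, parts = ContactDetector()(torch.randn(1, 3, 256, 256))

The design choice mirrored here is that contact heatmaps are modulated by per-part attention, so each contact prediction is conditioned on where the relevant body part appears in the image.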
CHORE: Contact, Human and Object REconstruction from a single RGB image
While most works in computer vision and learning have focused on perceiving
3D humans from single images in isolation, in this work we focus on capturing
3D humans interacting with objects. The problem is extremely challenging due to
heavy occlusions between human and object, diverse interaction types and depth
ambiguity. In this paper, we introduce CHORE, a novel method that learns to
jointly reconstruct human and object from a single image. CHORE takes
inspiration from recent advances in implicit surface learning and classical
model-based fitting. We compute a neural reconstruction of human and object
represented implicitly with two unsigned distance fields, and additionally
predict a correspondence field to a parametric body as well as an object pose
field. This allows us to robustly fit a parametric body model and a 3D object
template, while reasoning about interactions. Furthermore, prior pixel-aligned
implicit learning methods use synthetic data and make assumptions that are not
met in real data. We propose a simple yet effective depth-aware scaling that
allows more efficient shape learning on real data. Our experiments show that
our joint reconstruction learned with the proposed strategy significantly
outperforms the SOTA. Our code and models will be released to foster future
research in this direction.
Comment: 19 pages, 7 figures
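For readers who want the shape of the idea in code, here is a rough sketch of what such joint neural fields could look like: for each query 3D point, a pixel-aligned feature is decoded into two unsigned distances (human and object surfaces), body-correspondence logits, and an offset toward the object center. All dimensions and heads are assumptions; the model-fitting stage and the depth-aware scaling are omitted.

import torch
import torch.nn as nn

class ChoreFields(nn.Module):
    def __init__(self, feat_dim=256, n_parts=14):  # sizes are assumptions
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.udf = nn.Linear(512, 2)         # unsigned distances: human, object
        self.corr = nn.Linear(512, n_parts)  # correspondence to body parts
        self.pose = nn.Linear(512, 3)        # offset toward the object center

    def forward(self, pix_feat, points):
        # pix_feat: B x N x feat_dim pixel-aligned features for N query points
        h = self.mlp(torch.cat([pix_feat, points], dim=-1))
        return torch.abs(self.udf(h)), self.corr(h), self.pose(h)

fields = ChoreFields()
udf, corr, offset = fields(torch.randn(1, 1024, 256), torch.randn(1, 1024, 3))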
A Study on Hand Manipulation Analysis from First-Person Videos
Degree type: Doctorate (course-based). Dissertation committee: (Chair) Professor Shin'ichi Satoh, National Institute of Informatics; Professor Yoichi Sato, The University of Tokyo; Professor Kiyoharu Aizawa, The University of Tokyo; Associate Professor Toshihiko Yamasaki, The University of Tokyo; Associate Professor Takeshi Oishi, The University of Tokyo. University of Tokyo (東京大学)
Future Person Localization in First-Person Videos
We present a new task that predicts future locations of people observed in
first-person videos. Consider a first-person video stream continuously recorded
by a wearable camera. Given a short clip of a person that is extracted from the
complete stream, we aim to predict that person's location in future frames. To
facilitate this future person localization ability, we make the following three
key observations: a) First-person videos typically involve significant
ego-motion which greatly affects the location of the target person in future
frames; b) Scales of the target person act as a salient cue to estimate a
perspective effect in first-person videos; c) First-person videos often capture
people up-close, making it easier to leverage target poses (e.g., where they
look) for predicting their future locations. We incorporate these three
observations into a prediction framework with a multi-stream
convolution-deconvolution architecture. Experimental results reveal our method
to be effective on our new dataset as well as on a public social interaction
dataset.
Comment: Accepted to CVPR 2018
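The multi-stream idea can be sketched compactly: one temporal convolutional stream per cue (location-scale, ego-motion, pose), concatenated and decoded with transposed convolutions into future locations. Channel counts below (for example, a 6-channel ego-motion input and an 18-joint 2D pose) are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

def stream(in_ch):  # temporal 1D-conv encoder over the observed frames
    return nn.Sequential(
        nn.Conv1d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv1d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    )

class FutureLocalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.loc_scale = stream(3)   # x, y, scale per observed frame
        self.ego = stream(6)         # assumed 6-channel camera ego-motion
        self.pose = stream(36)       # assumed 18 joints x 2D coordinates
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(64 * 3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 2, 3, padding=1),  # future (x, y) per frame
        )

    def forward(self, loc, ego, pose):  # each input: B x C x T_obs
        z = torch.cat([self.loc_scale(loc), self.ego(ego), self.pose(pose)], 1)
        return self.decoder(z)           # B x 2 x T_future

model = FutureLocalizer()
future_xy = model(torch.randn(1, 3, 10), torch.randn(1, 6, 10),
                  torch.randn(1, 36, 10))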
Cross-view and Cross-pose Completion for 3D Human Understanding
Human perception and understanding is a major domain of computer vision
which, like many other vision subdomains recently, stands to gain from the use
of large models pre-trained on large datasets. We hypothesize that the most
common pre-training strategy of relying on general-purpose, object-centric
image datasets such as ImageNet is limited by an important domain shift. On
the other hand, collecting domain-specific ground truth such as 2D or 3D labels
does not scale well. Therefore, we propose a pre-training approach based on
self-supervised learning that works on human-centric data using only images.
Our method uses pairs of images of humans: the first is partially masked and
the model is trained to reconstruct the masked parts given the visible ones and
a second image. It relies on both stereoscopic (cross-view) pairs, and temporal
(cross-pose) pairs taken from videos, in order to learn priors about 3D as well
as human motion. We pre-train a model for body-centric tasks and one for
hand-centric tasks. With a generic transformer architecture, these models
outperform existing self-supervised pre-training methods on a wide set of
human-centric downstream tasks, and obtain state-of-the-art performance, for
instance, when fine-tuning for model-based and model-free human mesh recovery.
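A minimal sketch of the cross-view completion objective described above, assuming a ViT-style patch encoder: patches of the first image are masked, and a decoder reconstructs them by cross-attending to the visible second view (the stereoscopic or temporal pair). All sizes and the pixel-regression head are illustrative choices, not the paper's exact model.

import torch
import torch.nn as nn

PATCH, DIM = 16, 256  # assumed patch size and embedding width

class CrossViewCompletion(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Conv2d(3, DIM, PATCH, stride=PATCH)  # patchify
        enc = nn.TransformerEncoderLayer(DIM, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, 4)
        dec = nn.TransformerDecoderLayer(DIM, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, 4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.head = nn.Linear(DIM, 3 * PATCH * PATCH)  # pixel reconstruction

    def forward(self, img1, img2, mask):  # mask: B x N bool, True = masked
        t1 = self.embed(img1).flatten(2).transpose(1, 2)  # B x N x DIM
        t2 = self.embed(img2).flatten(2).transpose(1, 2)
        t1 = torch.where(mask[..., None], self.mask_token, t1)
        ctx = self.encoder(t2)       # second view stays fully visible
        out = self.decoder(t1, ctx)  # cross-attend to the other view
        return self.head(out)        # predicted pixels per masked patch

model = CrossViewCompletion()
mask = torch.rand(1, (224 // PATCH) ** 2) < 0.75  # assumed masking ratio
pred = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224), mask)

Because the masked view can only be completed by borrowing appearance and geometry from the other view, the encoder is pushed to learn the 3D and motion priors the abstract describes.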