47 research outputs found
Detecting Human-Object Contact in Images
Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons for the 2D image areas where contact takes place. We also annotate which body part is involved in each contact. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability.
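To make the detector's structure concrete, here is a minimal PyTorch sketch of an attention-guided contact detector in the spirit described above. All module names, layer sizes, and the number of body parts (NUM_PARTS) are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

NUM_PARTS = 17  # assumed number of annotated body parts (hypothetical)

class ContactDetector(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared image encoder producing a spatial feature map (assumed).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Part-attention branch: one spatial attention map per body part.
        self.part_attn = nn.Conv2d(feat_dim, NUM_PARTS, 1)
        # Contact branch: per-part contact heatmaps, gated by attention.
        self.contact_head = nn.Conv2d(feat_dim, NUM_PARTS, 1)

    def forward(self, img):
        f = self.backbone(img)                       # B x C x H x W
        attn = torch.sigmoid(self.part_attn(f))      # B x P x H x W
        heat = torch.sigmoid(self.contact_head(f))   # B x P x H x W
        contact = heat * attn                        # attention-guided contact
        # A part counts as "in contact" if its heatmap responds anywhere.
        part_scores = contact.flatten(2).max(dim=2).values  # B x P
        return contact, part_scores

heatmaps, parts = ContactDetector()(torch.randn(1, 3, 256, 256))

The design choice mirrored here is that contact heatmaps are modulated by per-part attention, so each contact prediction is conditioned on where the relevant body part appears in the image.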
CHORE: Contact, Human and Object REconstruction from a single RGB image
While most works in computer vision and learning have focused on perceiving
3D humans from single images in isolation, in this work we focus on capturing
3D humans interacting with objects. The problem is extremely challenging due to
heavy occlusions between human and object, diverse interaction types and depth
ambiguity. In this paper, we introduce CHORE, a novel method that learns to
jointly reconstruct human and object from a single image. CHORE takes
inspiration from recent advances in implicit surface learning and classical
model-based fitting. We compute a neural reconstruction of human and object
represented implicitly with two unsigned distance fields, and additionally
predict a correspondence field to a parametric body as well as an object pose
field. This allows us to robustly fit a parametric body model and a 3D object
template, while reasoning about interactions. Furthermore, prior pixel-aligned
implicit learning methods use synthetic data and make assumptions that are not
met in real data. We propose a simple yet effective depth-aware scaling that
allows more efficient shape learning on real data. Our experiments show that
our joint reconstruction learned with the proposed strategy significantly
outperforms the SOTA. Our code and models will be released to foster future
research in this direction.
Comment: 19 pages, 7 figures
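For readers who want the shape of the idea in code, here is a rough sketch of what such joint neural fields could look like: for each query 3D point, a pixel-aligned feature is decoded into two unsigned distances (human and object surfaces), body-correspondence logits, and an offset toward the object center. All dimensions and heads are assumptions; the model-fitting stage and the depth-aware scaling are omitted.

import torch
import torch.nn as nn

class ChoreFields(nn.Module):
    def __init__(self, feat_dim=256, n_parts=14):  # sizes are assumptions
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.udf = nn.Linear(512, 2)         # unsigned distances: human, object
        self.corr = nn.Linear(512, n_parts)  # correspondence to body parts
        self.pose = nn.Linear(512, 3)        # offset toward the object center

    def forward(self, pix_feat, points):
        # pix_feat: B x N x feat_dim pixel-aligned features for N query points
        h = self.mlp(torch.cat([pix_feat, points], dim=-1))
        return torch.abs(self.udf(h)), self.corr(h), self.pose(h)

fields = ChoreFields()
udf, corr, offset = fields(torch.randn(1, 1024, 256), torch.randn(1, 1024, 3))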
A Study on Hand Manipulation Analysis from First-Person Videos
Degree type: Doctorate (course-based). Dissertation committee: (Chair) Professor Shin'ichi Satoh, National Institute of Informatics; Professor Yoichi Sato, The University of Tokyo; Professor Kiyoharu Aizawa, The University of Tokyo; Associate Professor Toshihiko Yamasaki, The University of Tokyo; Associate Professor Takeshi Oishi, The University of Tokyo. University of Tokyo (東京大学)
Future Person Localization in First-Person Videos
We present a new task that predicts future locations of people observed in
first-person videos. Consider a first-person video stream continuously recorded
by a wearable camera. Given a short clip of a person that is extracted from the
complete stream, we aim to predict that person's location in future frames. To
facilitate this future person localization ability, we make the following three
key observations: a) First-person videos typically involve significant
ego-motion which greatly affects the location of the target person in future
frames; b) Scales of the target person act as a salient cue to estimate a
perspective effect in first-person videos; c) First-person videos often capture
people up-close, making it easier to leverage target poses (e.g., where they
look) for predicting their future locations. We incorporate these three
observations into a prediction framework with a multi-stream
convolution-deconvolution architecture. Experimental results reveal our method
to be effective on our new dataset as well as on a public social interaction
dataset.
Comment: Accepted to CVPR 2018
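The multi-stream idea can be sketched compactly: one temporal convolutional stream per cue (location-scale, ego-motion, pose), concatenated and decoded with transposed convolutions into future locations. Channel counts below (for example, a 6-channel ego-motion input and an 18-joint 2D pose) are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

def stream(in_ch):  # temporal 1D-conv encoder over the observed frames
    return nn.Sequential(
        nn.Conv1d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv1d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    )

class FutureLocalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.loc_scale = stream(3)   # x, y, scale per observed frame
        self.ego = stream(6)         # assumed 6-channel camera ego-motion
        self.pose = stream(36)       # assumed 18 joints x 2D coordinates
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(64 * 3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 2, 3, padding=1),  # future (x, y) per frame
        )

    def forward(self, loc, ego, pose):  # each input: B x C x T_obs
        z = torch.cat([self.loc_scale(loc), self.ego(ego), self.pose(pose)], 1)
        return self.decoder(z)           # B x 2 x T_future

model = FutureLocalizer()
future_xy = model(torch.randn(1, 3, 10), torch.randn(1, 6, 10),
                  torch.randn(1, 36, 10))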
Cross-view and Cross-pose Completion for 3D Human Understanding
Human perception and understanding is a major domain of computer vision
which, like many other vision subdomains recently, stands to gain from the use
of large models pre-trained on large datasets. We hypothesize that the most
common pre-training strategy of relying on general-purpose, object-centric
image datasets such as ImageNet is limited by an important domain shift. On
the other hand, collecting domain-specific ground truth such as 2D or 3D labels
does not scale well. Therefore, we propose a pre-training approach based on
self-supervised learning that works on human-centric data using only images.
Our method uses pairs of images of humans: the first is partially masked and
the model is trained to reconstruct the masked parts given the visible ones and
a second image. It relies on both stereoscopic (cross-view) pairs, and temporal
(cross-pose) pairs taken from videos, in order to learn priors about 3D as well
as human motion. We pre-train a model for body-centric tasks and one for
hand-centric tasks. With a generic transformer architecture, these models
outperform existing self-supervised pre-training methods on a wide set of
human-centric downstream tasks, and obtain state-of-the-art performance, for
instance, when fine-tuning for model-based and model-free human mesh recovery.
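A minimal sketch of the cross-view completion objective described above, assuming a ViT-style patch encoder: patches of the first image are masked, and a decoder reconstructs them by cross-attending to the visible second view (the stereoscopic or temporal pair). All sizes and the pixel-regression head are illustrative choices, not the paper's exact model.

import torch
import torch.nn as nn

PATCH, DIM = 16, 256  # assumed patch size and embedding width

class CrossViewCompletion(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Conv2d(3, DIM, PATCH, stride=PATCH)  # patchify
        enc = nn.TransformerEncoderLayer(DIM, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, 4)
        dec = nn.TransformerDecoderLayer(DIM, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, 4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.head = nn.Linear(DIM, 3 * PATCH * PATCH)  # pixel reconstruction

    def forward(self, img1, img2, mask):  # mask: B x N bool, True = masked
        t1 = self.embed(img1).flatten(2).transpose(1, 2)  # B x N x DIM
        t2 = self.embed(img2).flatten(2).transpose(1, 2)
        t1 = torch.where(mask[..., None], self.mask_token, t1)
        ctx = self.encoder(t2)       # second view stays fully visible
        out = self.decoder(t1, ctx)  # cross-attend to the other view
        return self.head(out)        # predicted pixels per masked patch

model = CrossViewCompletion()
mask = torch.rand(1, (224 // PATCH) ** 2) < 0.75  # assumed masking ratio
pred = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224), mask)

Because the masked view can only be completed by borrowing appearance and geometry from the other view, the encoder is pushed to learn the 3D and motion priors the abstract describes.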