20 research outputs found
Text2Action: Generative Adversarial Synthesis from Language to Action
In this paper, we propose a generative model which learns the relationship
between language and human action in order to generate a human action sequence
given a sentence describing human behavior. The proposed generative model is a
generative adversarial network (GAN), which is based on the sequence to
sequence (SEQ2SEQ) model. Using the proposed generative network, we can
synthesize various actions for a robot or a virtual agent using a text encoder
recurrent neural network (RNN) and an action decoder RNN. The proposed
generative network is trained from 29,770 pairs of actions and sentence
annotations extracted from MSR-Video-to-Text (MSR-VTT), a large-scale video
dataset. We demonstrate that the network can generate human-like actions which
can be transferred to a Baxter robot, such that the robot performs an action
based on a provided sentence. Results show that the proposed generative network
correctly models the relationship between language and action and can generate
a diverse set of actions from the same sentence.
Comment: 8 pages, 10 figures
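To make the architecture concrete, here is a minimal PyTorch sketch of such a text-to-action generator, assuming arbitrary dimensions, a simplified decoding loop, and an illustrative way of injecting the GAN noise; the paper's actual network details may differ.

    # Minimal sketch (not the authors' code): a seq2seq GAN generator that
    # encodes a sentence with a text-encoder RNN and decodes a pose sequence
    # with an action-decoder RNN.  Dimensions and noise injection are
    # illustrative assumptions.
    import torch
    import torch.nn as nn

    class TextToActionGenerator(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256,
                     pose_dim=24, action_len=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.text_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.action_decoder = nn.GRU(pose_dim + hidden_dim, hidden_dim,
                                         batch_first=True)
            self.to_pose = nn.Linear(hidden_dim, pose_dim)
            self.action_len, self.pose_dim = action_len, pose_dim

        def forward(self, tokens, noise):
            # tokens: (B, T_text) word ids; noise: (B, hidden_dim) GAN noise
            _, h = self.text_encoder(self.embed(tokens))   # h: (1, B, H)
            h = h + noise.unsqueeze(0)                     # inject noise
            ctx = h.transpose(0, 1)                        # (B, 1, H) context
            pose = torch.zeros(tokens.size(0), 1, self.pose_dim)
            frames = []
            for _ in range(self.action_len):
                out, h = self.action_decoder(torch.cat([pose, ctx], -1), h)
                pose = self.to_pose(out)                   # next pose frame
                frames.append(pose)
            return torch.cat(frames, dim=1)                # (B, L, pose_dim)

    gen = TextToActionGenerator()
    actions = gen(torch.randint(0, 10000, (2, 12)), torch.randn(2, 256))
    print(actions.shape)  # torch.Size([2, 32, 24])

A discriminator scoring (sentence, action) pairs would complete the adversarial objective; it is omitted from this sketch.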
Forecasting Human Dynamics from Static Images
This paper presents the first study on forecasting human dynamics from static
images. The problem is to input a single RGB image and generate a sequence of
upcoming human body poses in 3D. To address the problem, we propose the 3D Pose
Forecasting Network (3D-PFNet). Our 3D-PFNet integrates recent advances on
single-image human pose estimation and sequence prediction, and converts the 2D
predictions into 3D space. We train our 3D-PFNet using a three-step training
strategy to leverage a diverse source of training data, including image and
video based human pose datasets and 3D motion capture (MoCap) data. We
demonstrate competitive performance of our 3D-PFNet on 2D pose forecasting and
3D pose recovery through quantitative and qualitative results.
Comment: Accepted in CVPR 2017
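As an illustration of the overall pipeline only (the actual 3D-PFNet builds on an hourglass pose estimator and different decoding details), the following PyTorch sketch unrolls a recurrent predictor over a single image feature to forecast a sequence of 2D poses and lifts each frame to 3D with a small per-frame network; all dimensions are placeholders.

    # Illustrative pipeline, not the authors' architecture: forecast 2D poses
    # from one image feature, then lift each forecast frame into 3D.
    import torch
    import torch.nn as nn

    class PoseForecaster(nn.Module):
        def __init__(self, feat_dim=512, hidden_dim=256, joints=17, horizon=16):
            super().__init__()
            self.horizon, self.hidden_dim = horizon, hidden_dim
            self.rnn = nn.GRUCell(feat_dim, hidden_dim)
            self.to_2d = nn.Linear(hidden_dim, joints * 2)
            self.lift = nn.Sequential(nn.Linear(joints * 2, 256), nn.ReLU(),
                                      nn.Linear(256, joints * 3))

        def forward(self, img_feat):
            h = torch.zeros(img_feat.size(0), self.hidden_dim)
            poses_2d, poses_3d = [], []
            for _ in range(self.horizon):
                h = self.rnn(img_feat, h)          # advance one time step
                p2d = self.to_2d(h)                # forecast 2D joint layout
                poses_2d.append(p2d)
                poses_3d.append(self.lift(p2d))    # per-frame 2D-to-3D lifting
            return torch.stack(poses_2d, 1), torch.stack(poses_3d, 1)

    model = PoseForecaster()
    p2d, p3d = model(torch.randn(4, 512))
    print(p2d.shape, p3d.shape)  # (4, 16, 34) and (4, 16, 51)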
Hierarchical Planning and Control for Box Loco-Manipulation
Humans perform everyday tasks using a combination of locomotion and
manipulation skills. Building a system that can handle both skills is essential
to creating virtual humans. We present a physically-simulated human capable of
solving box rearrangement tasks, which requires a combination of both skills.
We propose a hierarchical control architecture, where each level solves the
task at a different level of abstraction, and the result is a physics-based
simulated virtual human capable of rearranging boxes in a cluttered
environment. The control architecture integrates a planner, diffusion models,
and physics-based motion imitation of sparse motion clips using deep
reinforcement learning. Boxes can vary in size, weight, shape, and placement
height. Code and trained control policies are provided.
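A hypothetical skeleton of such a three-level loop is sketched below: a task planner emits box subgoals, a kinematic motion generator (a diffusion model in the paper, a stub here) proposes reference poses, and a low-level tracking policy (trained with reinforcement learning in the paper) outputs joint torques at every physics step. All function bodies are placeholders, not the authors' implementation.

    # Hypothetical hierarchy skeleton; every body is a stub.
    def plan_subgoals(scene):
        """High level: which box to move next and where to place it."""
        return [(box, box["target"]) for box in scene["boxes"]]

    def generate_reference_motion(state, subgoal, horizon=30):
        """Mid level: a short horizon of reference character poses (stub)."""
        return [state["pose"]] * horizon

    def tracking_policy(state, reference_pose):
        """Low level: joint torques that imitate the reference pose (stub)."""
        return [0.0] * state["num_dofs"]

    def control_loop(scene, state, steps_per_frame=2):
        for box, target in plan_subgoals(scene):
            for ref_pose in generate_reference_motion(state, (box, target)):
                for _ in range(steps_per_frame):
                    torques = tracking_policy(state, ref_pose)
                    # state = physics_step(state, torques)  # simulator call

    scene = {"boxes": [{"id": 0, "target": (1.0, 0.0, 0.5)}]}
    state = {"pose": [0.0] * 30, "num_dofs": 30}
    control_loop(scene, state)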
C-CROC: Continuous and Convex Resolution of Centroidal Dynamic Trajectories for Legged Robots in Multicontact Scenarios
Synthesizing legged locomotion requires planning one or several steps ahead (literally): when, where, and with which effector should the next contact(s) be created between the robot and the environment? Validating a contact candidate implies, at a minimum, the resolution of a slow, non-linear optimization problem to demonstrate that a Center of Mass (COM) trajectory compatible with the contact transition constraints exists. We propose a conservative reformulation of this trajectory generation problem as a convex 3D linear program, CROC. It results from the observation that if the COM trajectory is a polynomial with only one free variable coefficient, the non-linearity of the problem disappears. This has two consequences. On the positive side, in terms of computation times CROC outperforms the state of the art by at least one order of magnitude and makes interactive applications possible (with a planning time roughly equal to the motion time). On the negative side, in our experiments our approach finds a majority of the feasible trajectories found by a non-linear solver, but not all of them. Still, we demonstrate that the solution space covered by CROC is large enough to achieve the automated planning of a large variety of locomotion tasks for different robots, demonstrated in simulation and on the real HRP-2 robot, several of which were rarely seen before. Another significant contribution is the introduction of a Bezier curve representation of the problem, which guarantees that the constraints on the COM trajectory are verified continuously, and not only at discrete points as traditionally done. This formulation is lossless and results in more robust trajectories. It is not restricted to CROC, but could be integrated with any method from the state of the art.
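The following toy example illustrates the core convexity argument rather than the paper's actual formulation: with a cubic Bezier COM path whose endpoints are fixed and whose interior control points depend affinely on a single free point y, the convex-hull property of Bezier curves means that constraining the control points to a polytope constrains the entire curve, so curve feasibility reduces to a small linear program (solved here with SciPy; centroidal dynamics, contact forces, and the affine tying of the control points are illustrative assumptions).

    # Toy feasibility check, not the paper's formulation.
    import numpy as np
    from scipy.optimize import linprog

    p0 = np.array([0.0, 0.0])                  # initial COM position (fixed)
    p3 = np.array([0.5, 0.1])                  # final COM position (fixed)
    # Interior control points, affine in the free point y (illustrative):
    #   p1 = y,   p2 = 0.5 * (y + p3)
    A = np.array([[ 1.0, 0.0], [-1.0, 0.0],    # polytope: 0 <= x <= 0.6,
                  [ 0.0, 1.0], [ 0.0, -1.0]])  #           -0.2 <= z <= 0.3
    b = np.array([0.6, 0.0, 0.3, 0.2])

    rows, rhs = [], []
    for coeff, offset in [(1.0, np.zeros(2)),      # p1 = 1.0 * y + 0
                          (0.5, 0.5 * p3)]:        # p2 = 0.5 * y + 0.5 * p3
        rows.append(coeff * A)                     # A (coeff*y + offset) <= b
        rhs.append(b - A @ offset)
    res = linprog(c=np.zeros(2), A_ub=np.vstack(rows), b_ub=np.hstack(rhs),
                  bounds=[(None, None)] * 2)
    print("feasible free control point:", res.x if res.success else None)

Because the free point appears linearly in every control point, every such constraint stays linear, which is the property that lets CROC replace a non-linear solve with a linear program.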
FixMyPose: Pose Correctional Captioning and Retrieval
Interest in physical therapy and individual exercises such as yoga/dance has
increased alongside the well-being trend. However, such exercises are hard to
follow without expert guidance (which is impossible to scale for personalized
feedback to every trainee remotely). Thus, automated pose correction systems
are required more than ever, and we introduce a new captioning dataset named
FixMyPose to address this need. We collect descriptions of correcting a
"current" pose to look like a "target" pose (in both English and Hindi). The
collected descriptions have interesting linguistic properties such as
egocentric relations to environment objects, analogous references, etc.,
requiring an understanding of spatial relations and commonsense knowledge about
postures. Further, to avoid ML biases, we maintain a balance across characters
with diverse demographics, who perform a variety of movements in several
interior environments (e.g., homes, offices). From our dataset, we introduce
the pose-correctional-captioning task and its reverse target-pose-retrieval
task. During the correctional-captioning task, models must generate
descriptions of how to move from the current to target pose image, whereas in
the retrieval task, models should select the correct target pose given the
initial pose and correctional description. We present strong cross-attention
baseline models (uni/multimodal, RL, multilingual) and also show that our
baselines are competitive with other models when evaluated on other
image-difference datasets. We also propose new task-specific metrics
(object-match, body-part-match, direction-match) and conduct human evaluation
for more reliable evaluation, and we demonstrate a large human-model
performance gap suggesting room for promising future work. To verify the
sim-to-real transfer of our FixMyPose dataset, we collect a set of real images
and show promising performance on these images.
Comment: AAAI 2021 (18 pages, 16 figures; webpage: https://fixmypose-unc.github.io/)
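As a concrete (purely hypothetical) framing of the target-pose-retrieval task, the sketch below scores each candidate target image against the pair of current image and correction text and returns the argmax; the linear encoders and feature dimensions are stand-ins, not the paper's cross-attention baselines.

    # Hypothetical retrieval sketch: rank candidate target-pose images.
    import torch
    import torch.nn as nn

    class RetrievalScorer(nn.Module):
        def __init__(self, img_dim=512, txt_dim=300, joint_dim=256):
            super().__init__()
            self.query = nn.Linear(img_dim + txt_dim, joint_dim)
            self.cand = nn.Linear(img_dim, joint_dim)

        def forward(self, current_img, text, candidate_imgs):
            # current_img: (B, img_dim), text: (B, txt_dim),
            # candidate_imgs: (B, K, img_dim) -> similarity scores (B, K)
            q = self.query(torch.cat([current_img, text], dim=-1))
            c = self.cand(candidate_imgs)
            return torch.einsum("bd,bkd->bk", q, c)

    scorer = RetrievalScorer()
    scores = scorer(torch.randn(2, 512), torch.randn(2, 300),
                    torch.randn(2, 8, 512))
    print(scores.argmax(dim=-1))  # predicted target-pose index per example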
Unsupervised Human Activity Analysis for Intelligent Mobile Robots
The success of intelligent mobile robots in daily living environments depends on their ability to understand human movements and behaviours. One goal of recent research is to understand human activities performed in real human environments from long term observation. We consider a human activity to be a temporally dynamic configuration of a person interacting with key objects within the environment that provide some functionality. This can be a motion
trajectory made of a sequence of 2-dimensional points representing a person's position, as well as more detailed sequences of high-dimensional body poses, a collection of 3-dimensional points representing body joint positions, as estimated from the point of view of the robot. The limited field of view of the robot, restricted by the limitations of its sensory modalities, poses the challenge of understanding human activities from obscured, incomplete and noisy observations.
As an embedded system it also has perceptual limitations which restrict the resolution of the human activity representations it can hope to achieve. In this thesis an approach for unsupervised learning of activities implemented on an autonomous mobile robot is presented. This research makes the following novel contributions:
1) A qualitative spatial-temporal vector space encoding of human activities as observed by an
autonomous mobile robot.
2) Methods for learning a low dimensional representation of common and repeated patterns
from multiple encoded visual observations.
In order to handle the perceptual challenges, multiple abstractions are applied to the robot's perception data. The human observations are first encoded using a leg-detector, an upper-body image classifier, and a convolutional neural network for pose estimation, while objects within
the environment are automatically segmented from a 3-dimensional point cloud representation. Central to the success of the presented framework is mapping these encodings into an abstract qualitative space in order to generalise patterns invariant to exact quantitative positions within the real world. This is performed using a number of qualitative spatial-temporal representations
which capture different aspects of the relations between the human subject and the objects in the environment. The framework auto-generates a vocabulary of discrete spatial-temporal descriptors extracted from the video sequences and each observation is represented as a vector over this vocabulary. Analogously to information retrieval on text corpora we use generative probabilistic techniques to recover latent, semantically meaningful, concepts in the encoded observations in an unsupervised manner. The relatively small number of concepts discovered are defined as multinomial distributions over the vocabulary and considered as human activity classes, granting the robot a high-level understanding of visually observed complex scenes.
We validate the framework using: 1) a dataset collected from a physical robot autonomously patrolling and performing tasks in an office environment during a six-week deployment, and 2) a high-dimensional "full body pose" dataset captured over multiple days by a mobile robot observing a kitchen area of an office environment from multiple viewpoints. We show that the emergent categories from our framework align well with how humans interpret behaviours and simple activities. Our presented framework models each extended observation as a probabilistic mixture over the learned activities, meaning it can learn human activity models even when embedded in continuous video sequences without the need for manual temporal segmentation, which can be time consuming and costly. Finally, we present methods for learning such human activity models in an incremental and continuous setting using variational inference methods to update the activity distribution online. This allows the mobile robot to efficiently learn and update its models of human activity over time while discarding the raw data, allowing for life-long learning.
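As a rough illustration of the topic-modelling step described above (not the thesis implementation), the sketch below treats each observation as a count vector over a descriptor vocabulary and fits online variational LDA with scikit-learn, updating the activity distributions incrementally across batches; the data here are random placeholders.

    # Rough illustration of unsupervised activity discovery via online LDA.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    vocab_size, n_activities = 200, 5
    rng = np.random.default_rng(0)
    observations = rng.poisson(0.3, size=(120, vocab_size))  # fake descriptor counts

    lda = LatentDirichletAllocation(n_components=n_activities,
                                    learning_method="online", random_state=0)
    lda.partial_fit(observations[:60])   # first batch of encoded observations
    lda.partial_fit(observations[60:])   # later batch: incremental update
    mixture = lda.transform(observations[:1])
    print(np.round(mixture, 2))          # activity mixture for one observation

Each recovered topic is a multinomial over the descriptor vocabulary and plays the role of an activity class; the per-observation mixture corresponds to modelling an extended observation as a blend of activities, which is what removes the need for manual temporal segmentation.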