Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis
We present a real-time method for synthesizing highly complex human motions
using a novel training regime we call the auto-conditioned Recurrent Neural
Network (acRNN). Recently, researchers have attempted to synthesize new motion
by using autoregressive techniques, but existing methods tend to freeze or
diverge after a couple of seconds due to an accumulation of errors that are fed
back into the network. Furthermore, such methods have only been shown to be
reliable for relatively simple human motions, such as walking or running. In
contrast, our approach can synthesize arbitrary motions with highly complex
styles, including dances or martial arts in addition to locomotion. The acRNN
is able to accomplish this by explicitly accommodating for autoregressive noise
accumulation during training. Our work is the first to our knowledge that
demonstrates the ability to generate over 18,000 continuous frames (300
seconds) of new complex human motion across different styles.
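As a concrete illustration of the auto-conditioning idea, a minimal PyTorch sketch is given below: during training the recurrent model alternates between consuming ground-truth frames and its own predictions, so it learns to recover from accumulated autoregressive noise. The layer sizes, the fixed alternation schedule and the pose dimensionality are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AutoConditionedRNN(nn.Module):
    """LSTM motion model trained with alternating ground-truth and
    self-generated inputs, so the network learns to recover from its
    own accumulated prediction noise (a sketch; sizes are assumptions)."""

    def __init__(self, pose_dim=63, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, poses, condition_len=5, self_len=5):
        # poses: (batch, T, pose_dim) ground-truth motion
        batch, T, _ = poses.shape
        state = None
        prev = poses[:, 0:1]            # first frame is always ground truth
        preds = []
        for t in range(1, T):
            # alternate: feed ground truth for `condition_len` steps,
            # then feed the network's own output for `self_len` steps
            cycle = (t - 1) % (condition_len + self_len)
            inp = poses[:, t - 1:t] if cycle < condition_len else prev
            h, state = self.lstm(inp, state)
            prev = self.out(h)
            preds.append(prev)
        return torch.cat(preds, dim=1)   # (batch, T-1, pose_dim)


# usage sketch: standard MSE training against the ground-truth next frames
model = AutoConditionedRNN()
poses = torch.randn(8, 40, 63)           # dummy batch of motion clips
loss = nn.functional.mse_loss(model(poses), poses[:, 1:])
loss.backward()
```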
Auto-conditioned Recurrent Mixture Density Networks for Learning Generalizable Robot Skills
Personal robots assisting humans must perform complex manipulation tasks that
are typically difficult to specify in traditional motion planning pipelines,
where multiple objectives must be met and the high-level context taken into
consideration. Learning from demonstration (LfD) provides a promising way to
learn these kinds of complex manipulation skills even from non-technical users.
However, it is challenging for existing LfD methods to efficiently learn skills
that can generalize to task specifications that are not covered by
demonstrations. In this paper, we introduce a state transition model (STM) that
generates joint-space trajectories by imitating motions from expert behavior.
Given a few demonstrations, we show in real robot experiments that the learned
STM can quickly generalize to unseen tasks and synthesize motions having longer
time horizons than the expert trajectories. Compared to conventional motion
planners, our approach enables the robot to accomplish complex behaviors from
high-level instructions without laborious hand-engineering of planning
objectives, while being able to adapt to changing goals during the skill
execution. In conjunction with a trajectory optimizer, our STM can construct a
high-quality skeleton of a trajectory that can be further improved in
smoothness and precision. In combination with a learned inverse dynamics model,
we additionally present results where the STM is used as a high-level planner.
A video of our experiments is available at https://youtu.be/85DX9Ojq-90
Comment: Submitted to IROS 201
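The abstract does not detail the state transition model beyond naming it a recurrent mixture density network; the sketch below shows the kind of mixture-density output head and negative log-likelihood such a model could use. The number of components, the dimensions and the diagonal-Gaussian parameterization are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Mixture-density output head: maps an RNN hidden state to the
    parameters of a Gaussian mixture over the next joint-space state
    (sketch; K and dimensions are illustrative assumptions)."""

    def __init__(self, hidden_dim=256, out_dim=7, n_components=5):
        super().__init__()
        self.K, self.D = n_components, out_dim
        self.pi = nn.Linear(hidden_dim, n_components)                   # mixture weights
        self.mu = nn.Linear(hidden_dim, n_components * out_dim)         # component means
        self.log_sigma = nn.Linear(hidden_dim, n_components * out_dim)  # component scales

    def forward(self, h):
        B = h.shape[0]
        log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(B, self.K, self.D)
        sigma = torch.exp(self.log_sigma(h)).view(B, self.K, self.D)
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of the target under the predicted mixture."""
    dist = torch.distributions.Normal(mu, sigma)
    # sum log-probs over output dims, then log-sum-exp over components
    log_prob = dist.log_prob(target.unsqueeze(1)).sum(-1) + log_pi
    return -torch.logsumexp(log_prob, dim=-1).mean()

# usage sketch: h could be the hidden state of a recurrent state transition model
head = MDNHead()
log_pi, mu, sigma = head(torch.randn(4, 256))
loss = mdn_nll(log_pi, mu, sigma, torch.randn(4, 7))
```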
MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics
Long-term human motion can be represented as a series of motion
modes---motion sequences that capture short-term temporal dynamics---with
transitions between them. We leverage this structure and present a novel Motion
Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence
generation. Our model jointly learns a feature embedding for motion modes (from
which the motion sequence can be reconstructed) and a feature transformation
that represents the transition from one motion mode to the next. Our
model is able to generate multiple diverse and plausible motion sequences in
the future from the same input. We apply our approach to both facial and full
body motion, and demonstrate applications like analogy-based motion transfer
and video synthesis.
Comment: Published at ECCV 201
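A minimal sketch of the mode-embedding-plus-transformation idea is given below, assuming GRU encoders and decoders, a Gaussian latent transformation and an additive combination in embedding space; the published MT-VAE architecture likely differs in its encoders, decoder conditioning and losses.

```python
import torch
import torch.nn as nn

class MotionTransformationVAE(nn.Module):
    """Sketch: embed the current motion mode, sample a latent transformation
    conditioned on (current, next) modes, and decode the next mode from the
    transformed embedding. Sizes and modules are assumptions."""

    def __init__(self, pose_dim=60, emb_dim=128, z_dim=32):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, emb_dim, batch_first=True)
        self.to_z = nn.Linear(2 * emb_dim, 2 * z_dim)        # mean and log-variance
        self.transform = nn.Linear(emb_dim + z_dim, emb_dim)
        self.decoder = nn.GRU(emb_dim, pose_dim, batch_first=True)

    def embed(self, seq):
        _, h = self.encoder(seq)
        return h[-1]                                          # (batch, emb_dim)

    def forward(self, cur_mode, next_mode):
        e_cur, e_next = self.embed(cur_mode), self.embed(next_mode)
        mu, logvar = self.to_z(torch.cat([e_cur, e_next], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        e_pred = self.transform(torch.cat([e_cur, z], -1))         # transformed mode embedding
        T = next_mode.shape[1]
        recon, _ = self.decoder(e_pred.unsqueeze(1).repeat(1, T, 1))
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kld

# usage sketch with dummy motion modes
model = MotionTransformationVAE()
cur, nxt = torch.randn(4, 16, 60), torch.randn(4, 16, 60)
recon, kld = model(cur, nxt)
loss = nn.functional.mse_loss(recon, nxt) + 0.1 * kld
```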
Unsupervised Feature Learning of Human Actions as Trajectories in Pose Embedding Manifold
An unsupervised human action modeling framework can provide a useful
pose-sequence representation, which can be utilized in a variety of pose
analysis applications. In this work we propose a novel temporal pose-sequence
modeling framework, which can embed the dynamics of 3D human-skeleton joints into
a continuous latent space in an efficient manner. In contrast to the end-to-end
frameworks explored by previous works, we disentangle the task of individual
pose representation learning from the task of learning actions as a trajectory
in pose embedding space. In order to realize a continuous pose embedding
manifold with improved reconstructions, we propose an unsupervised, manifold
learning procedure named Encoder GAN (or EnGAN). Further, we use the pose
embeddings generated by EnGAN to model human actions using a bidirectional RNN
auto-encoder architecture, PoseRNN. We introduce a first-order gradient loss to
explicitly enforce temporal regularity in the predicted motion sequence. A
hierarchical feature fusion technique is also investigated for simultaneous
modeling of local skeleton joints along with global pose variations. We
demonstrate state-of-the-art transferability of the learned representation
against other supervised and unsupervised motion embeddings for the
task of fine-grained action recognition on the SBU interaction dataset. Further, we
show the qualitative strengths of the proposed framework by visualizing
skeleton pose reconstructions and interpolations in pose-embedding space, and
low dimensional principal component projections of the reconstructed pose
trajectories.
Comment: Accepted at WACV 201
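The first-order gradient loss lends itself to a one-function sketch: it penalizes the mismatch between consecutive-frame differences of the predicted and ground-truth sequences. The exact norm and weighting used in the paper are assumptions here.

```python
import torch

def first_order_gradient_loss(pred, target):
    """Penalize mismatch in frame-to-frame velocities so the predicted motion
    is temporally regular (a plausible reading of the abstract's first-order
    gradient loss; the exact formulation may differ).

    pred, target: (batch, T, dim) sequences of pose embeddings."""
    pred_vel = pred[:, 1:] - pred[:, :-1]        # first-order temporal difference
    target_vel = target[:, 1:] - target[:, :-1]
    return torch.mean(torch.abs(pred_vel - target_vel))

# typically combined with a reconstruction term, e.g.
# loss = mse_loss(pred, target) + lambda_vel * first_order_gradient_loss(pred, target)
```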
Learning Bidirectional LSTM Networks for Synthesizing 3D Mesh Animation Sequences
In this paper, we present a novel method for learning to synthesize 3D mesh
animation sequences with long short-term memory (LSTM) blocks and mesh-based
convolutional neural networks (CNNs). Synthesizing realistic 3D mesh animation
sequences is a challenging and important task in computer animation. To achieve
this, researchers have long been focusing on shape analysis to develop new
interpolation and extrapolation techniques. However, such techniques have
limited learning capabilities and therefore can produce unrealistic animation.
Deep architectures that operate directly on mesh sequences remain unexplored,
due to the following major barriers: meshes with irregular triangles, sequences
containing rich temporal information and flexible deformations. To address
these, we utilize convolutional neural networks defined on triangular meshes
along with a shape deformation representation to extract useful features,
followed by LSTM cells that iteratively process the features. To allow
completion of a missing mesh sequence from given endpoints, we propose a new
weight-shared bidirectional structure. The bidirectional generation loss also
helps mitigate error accumulation over iterations. Benefiting from all these
technical advances, our approach outperforms existing methods in sequence
prediction and completion both qualitatively and quantitatively. Moreover, this
network can also generate follow-up frames conditioned on initial shapes and
improve the accuracy as more bootstrap models are provided, which other works
in the geometry processing domain cannot achieve.
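One plausible reading of the weight-shared bidirectional structure and its generation loss is sketched below: a single LSTM generator is rolled out forward from the start shape and backward from the end shape, and a consistency term ties the two roll-outs together. The per-frame deformation features, sizes and the averaging of the two roll-outs are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BidirectionalCompletion(nn.Module):
    """Sketch of weight-shared bidirectional sequence completion: one LSTM
    generator is rolled out from both endpoints and a consistency loss
    (the bidirectional generation loss) ties the roll-outs together."""

    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)    # shared by both directions
        self.out = nn.Linear(hidden_dim, feat_dim)

    def rollout(self, start_feat, steps):
        h = start_feat.new_zeros(start_feat.shape[0], self.hidden_dim)
        c = torch.zeros_like(h)
        frame, frames = start_feat, []
        for _ in range(steps):
            h, c = self.cell(frame, (h, c))
            frame = self.out(h)
            frames.append(frame)
        return torch.stack(frames, dim=1)                # (batch, steps, feat_dim)

    def forward(self, start_feat, end_feat, steps):
        fwd = self.rollout(start_feat, steps)            # left-to-right generation
        bwd = self.rollout(end_feat, steps).flip(1)      # right-to-left, reversed in time
        consistency = torch.mean((fwd - bwd) ** 2)       # bidirectional generation loss
        return 0.5 * (fwd + bwd), consistency
```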
Recurrent Transition Networks for Character Locomotion
Manually authoring transition animations for a complete locomotion system can
be a tedious and time-consuming task, especially for large games that allow
complex and constrained locomotion movements, where the number of transitions
grows exponentially with the number of states. In this paper, we present a
novel approach, based on deep recurrent neural networks, to automatically
generate such transitions given a past context of a few frames and a target
character state to reach. We present the Recurrent Transition Network (RTN),
based on a modified version of the Long Short-Term Memory (LSTM) network,
designed specifically for transition generation and trained without any gait,
phase, contact or action labels. We further propose a simple yet principled way
to initialize the hidden states of the LSTM layer for a given sequence, which
improves performance and generalization to new motions. We both
quantitatively and qualitatively evaluate our system and show that making the
network terrain-aware by adding a local terrain representation to the input
yields better performance for rough-terrain navigation on long transitions. Our
system produces realistic and fluid transitions that rival the quality of
Motion Capture-based ground-truth motions, even before applying any
inverse-kinematics postprocess. Direct benefits of our approach could be to
accelerate the creation of transition variations for large coverage, or even to
entirely replace transition nodes in an animation graph. We further explore
applications of this model in an animation super-resolution setting where we
temporally decompress animations saved at 1 frame per second and show that the
network is able to reconstruct motions that are hard to distinguish from
uncompressed locomotion sequences.
Comment: revision fixes: clarity issues in Section 4.4 (text and equations)
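The learned hidden-state initialization can be illustrated with a short sketch: a small network maps a summary of the past context frames to the LSTM's initial hidden and cell states instead of starting from zeros. The summary function and sizes below are assumptions, not the RTN's exact design.

```python
import torch
import torch.nn as nn

class LearnedStateInit(nn.Module):
    """Sketch of a learned LSTM state initializer: an MLP maps a summary of
    the past context frames to the initial hidden and cell states (inputs
    and sizes are illustrative assumptions)."""

    def __init__(self, pose_dim=73, hidden_dim=512):
        super().__init__()
        self.init_net = nn.Sequential(
            nn.Linear(pose_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * hidden_dim),
        )
        self.lstm = nn.LSTM(pose_dim, hidden_dim, batch_first=True)

    def forward(self, context):
        # context: (batch, T_past, pose_dim) past frames
        h0, c0 = self.init_net(context.mean(dim=1)).chunk(2, dim=-1)
        state = (h0.unsqueeze(0).contiguous(), c0.unsqueeze(0).contiguous())
        out, state = self.lstm(context, state)
        return out, state
```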
An Evaluation of Trajectory Prediction Approaches and Notes on the TrajNet Benchmark
In recent years, there has been a shift from modeling the tracking problem with
Bayesian formulations towards using deep neural networks. To this end, in
this paper the effectiveness of various deep neural networks for predicting
future pedestrian paths is evaluated. The analyzed deep networks solely rely,
like in the traditional approaches, on observed tracklets without human-human
interaction information. The evaluation is done on the publicly available
TrajNet benchmark dataset, which builds up a repository of several
popular datasets for trajectory-based activity forecasting. We show that a
Recurrent-Encoder with a Dense layer stacked on top, referred to as
RED-predictor, is able to achieve results competitive with more elaborate
models in such scenarios. Further, we investigate failure cases, give
explanations for the observed phenomena, and offer recommendations for
overcoming the demonstrated shortcomings.
Comment: Accepted at the ECCV Workshop on Anticipating Human Behavior under an
adapted title: RED: A simple but effective Baseline Predictor for the TrajNet
Benchmark
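A recurrent encoder with a dense layer on top is simple enough to sketch directly; the version below encodes observed position offsets with an LSTM and regresses all future offsets with one linear layer, following the common 8-step-in / 12-step-out TrajNet setup as an assumption.

```python
import torch
import torch.nn as nn

class REDPredictor(nn.Module):
    """Sketch of an RNN-Encoder-Dense (RED) predictor: an LSTM encodes the
    observed tracklet as per-step offsets and a linear layer regresses all
    future offsets at once. Sizes and horizons are assumptions."""

    def __init__(self, hidden_dim=64, pred_len=12):
        super().__init__()
        self.encoder = nn.LSTM(2, hidden_dim, batch_first=True)
        self.dense = nn.Linear(hidden_dim, pred_len * 2)
        self.pred_len = pred_len

    def forward(self, obs_xy):
        # obs_xy: (batch, obs_len, 2) observed positions
        offsets = obs_xy[:, 1:] - obs_xy[:, :-1]           # work on relative motion
        _, (h, _) = self.encoder(offsets)
        future_offsets = self.dense(h[-1]).view(-1, self.pred_len, 2)
        # integrate the predicted offsets from the last observed position
        return obs_xy[:, -1:] + torch.cumsum(future_offsets, dim=1)

# usage sketch: 8 observed steps in, 12 predicted positions out
pred = REDPredictor()(torch.randn(4, 8, 2))    # shape (4, 12, 2)
```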
Towards 3D Dance Motion Synthesis and Control
3D human dance motion is a cooperative and elegant social movement. Unlike
regular simple locomotion, artistic dance motion is challenging to synthesize
due to its irregularity, kinematic complexity and diversity. The synthesized
dance must be realistic, diverse and controllable. In this
paper, we propose a novel generative motion model based on temporal convolution
and LSTM, called TC-LSTM, to synthesize realistic and diverse dance motion. We
introduce a unique control signal, the dance melody line, to heighten
controllability. Hence, our model, with its switchable control signals, supports
a variety of applications: random dance synthesis, music-to-dance, user
control, and more. Our experiments demonstrate that our model can synthesize
artistic dance motion in various dance types. Compared with existing methods,
our method achieves state-of-the-art results.
Comment: 9 pages
Audio to Body Dynamics
We present a method that takes as input audio of violin or piano playing and
outputs a video of skeleton predictions, which are further used to animate an
avatar. The key idea is to create an animation of an avatar that moves its
hands similarly to how a pianist or violinist would, just from audio. Fully
detailed and correct arm and finger motion is the ultimate goal; however, it is
not yet clear whether body movement can be predicted from music at all. In this
paper, we present the first result showing that natural body dynamics can
indeed be predicted. We build an LSTM network trained on violin and piano
recital videos uploaded to the Internet. The predicted points are applied to a
rigged avatar to create the animation.
Comment: Link with videos: https://arviolin.github.io/AudioBodyDynamics
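A minimal sketch of such an audio-to-keypoint LSTM is shown below, assuming MFCC-like per-frame audio features and a fixed set of 2D keypoints; the paper's exact features, keypoint set and network depth may differ.

```python
import torch
import torch.nn as nn

class AudioToKeypoints(nn.Module):
    """Sketch of an audio-to-body-dynamics model: per-frame audio features
    (e.g. MFCCs) are fed through an LSTM, and a linear layer regresses 2D
    upper-body and hand keypoints for each video frame. The feature size,
    keypoint count and single LSTM layer are illustrative assumptions."""

    def __init__(self, audio_dim=28, hidden_dim=200, n_keypoints=50):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_keypoints * 2)

    def forward(self, audio_feats):
        # audio_feats: (batch, T, audio_dim), time-aligned with the video frames
        h, _ = self.lstm(audio_feats)
        B, T, _ = audio_feats.shape
        return self.out(h).view(B, T, self.n_keypoints, 2)
```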
To Create What You Tell: Generating Videos from Captions
We are creating multimedia content every day and everywhere. While automatic
content generation has been a fundamental challenge for the multimedia community
for decades, recent advances in deep learning have made this problem feasible.
For example, Generative Adversarial Networks (GANs) are a rewarding approach
to synthesizing images. Nevertheless, it is not trivial when capitalizing on GANs
to generate videos. The difficulty originates from the intrinsic structure
where a video is a sequence of visually coherent and semantically dependent
frames. This motivates us to explore semantic and temporal coherence in
designing GANs to generate videos. In this paper, we present novel Temporal
GANs conditioning on Captions, namely TGANs-C, in which the input to the
generator network is a concatenation of a latent noise vector and a caption
embedding, which is then transformed into a frame sequence with 3D
spatio-temporal convolutions. Unlike the naive discriminator which only judges
pairs as fake or real, our discriminator additionally notes whether the video
matches the correct caption. In particular, the discriminator network consists
of three discriminators: a video discriminator that classifies realistic videos
against generated ones and optimizes video-caption matching; a frame
discriminator that discriminates between real and fake frames and aligns frames
with the conditioning caption; and a motion discriminator that emphasizes the
philosophy that adjacent frames in the generated videos should be smoothly
connected, as in real ones. We qualitatively demonstrate the capability of our
TGANs-C to generate plausible videos conditioned on the given captions on two synthetic
datasets (SBMG and TBMG) and one real-world dataset (MSVD). Moreover,
quantitative experiments on MSVD are performed to validate our proposal via
the Generative Adversarial Metric and a human study.
Comment: ACM MM 2017 Brave New Idea
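A sketch of a caption-conditioned video generator in this spirit is given below: the latent noise vector and caption embedding are concatenated and expanded into a short frame volume with 3D transposed convolutions. The channel counts, the 16-frame 32x32 output size and the caption encoder are illustrative assumptions rather than the TGANs-C configuration.

```python
import torch
import torch.nn as nn

class CaptionConditionedVideoGenerator(nn.Module):
    """Sketch of a TGANs-C-style generator: a latent noise vector is
    concatenated with a caption embedding and expanded into a frame
    sequence with 3D transposed convolutions (sizes are assumptions)."""

    def __init__(self, z_dim=100, cap_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            # input: (batch, z_dim + cap_dim, 1, 1, 1)
            nn.ConvTranspose3d(z_dim + cap_dim, 512, kernel_size=(2, 4, 4)),
            nn.BatchNorm3d(512), nn.ReLU(),
            nn.ConvTranspose3d(512, 256, 4, stride=2, padding=1),
            nn.BatchNorm3d(256), nn.ReLU(),
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm3d(128), nn.ReLU(),
            nn.ConvTranspose3d(128, 3, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, z, caption_emb):
        # concatenate noise and caption embedding, then reshape to a 1x1x1 volume
        x = torch.cat([z, caption_emb], dim=1)[:, :, None, None, None]
        return self.net(x)   # (batch, 3, frames, height, width)

# usage sketch with dummy noise and caption embeddings
video = CaptionConditionedVideoGenerator()(torch.randn(2, 100), torch.randn(2, 256))
# video.shape == (2, 3, 16, 32, 32): 16 RGB frames at 32x32 resolution
```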