AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Owing to its wide range of
applications and the demonstrated potential of recent works, AIGC has
attracted considerable attention, and methods have been developed for
various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have been many
significant developments in cross-modality AIGC methods, where generative
models receive conditioning input in one modality and produce outputs in
another. Examples include generation from various input modalities to the
image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion
(skeleton and avatar), and audio modalities. In this paper, we provide a
comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. Moreover, we discuss the challenges and
potential future research directions.
A Two-part Transformer Network for Controllable Motion Synthesis
Although part-based motion synthesis networks have been investigated to
reduce the complexity of modeling heterogeneous human motions, their
computational cost remains prohibitive in interactive applications. To this
end, we propose a novel two-part transformer network that aims to achieve
high-quality, controllable motion synthesis in real time. Our network
separates the skeleton into the upper and lower body parts, reducing the
expensive cross-part fusion operations, and models the motions of each part
separately through two streams of auto-regressive modules formed by multi-head
attention layers. However, such a design might not sufficiently capture the
correlations between the parts. We thus intentionally let the two parts share
the features of the root joint and design a consistency loss that penalizes
differences between the root features and motions estimated by the two
auto-regressive modules, significantly improving the quality of the synthesized
motions. After training on our motion dataset, our network can synthesize a
wide range of heterogeneous motions, like cartwheels and twists. Experimental
and user study results demonstrate that our network is superior to
state-of-the-art human motion synthesis networks in the quality of generated
motions.
Comment: 16 pages, 26 figures
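The coupling mechanism described above lends itself to a short illustration. Below is a minimal PyTorch sketch of the two-stream design with a shared-root consistency loss; the module names, feature sizes, and exact loss form are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PartStream(nn.Module):
    """One auto-regressive stream of multi-head attention layers (illustrative)."""
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.body = nn.TransformerEncoder(block, num_layers=layers)
        self.pose_head = nn.Linear(dim, dim)  # part-specific pose features
        self.root_head = nn.Linear(dim, dim)  # this stream's estimate of root features

    def forward(self, x, causal_mask):
        h = self.body(x, mask=causal_mask)    # causal mask keeps the stream auto-regressive
        return self.pose_head(h), self.root_head(h)

def root_consistency_loss(root_upper, root_lower):
    # Penalize disagreement between the two streams' root estimates, which
    # couples the parts without expensive cross-part fusion operations.
    return torch.mean((root_upper - root_lower) ** 2)

# Usage: feed each stream its part's features (both include the shared root
# joint) and add the consistency term to the usual reconstruction loss.
T, dim = 32, 128
mask = nn.Transformer.generate_square_subsequent_mask(T)
upper, lower = PartStream(dim), PartStream(dim)
pose_u, root_u = upper(torch.randn(1, T, dim), mask)
pose_l, root_l = lower(torch.randn(1, T, dim), mask)
loss = root_consistency_loss(root_u, root_l)
```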
UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons
Automatic co-speech gesture generation has drawn much attention in computer
animation. Previous works designed network structures for individual datasets,
which limits data volume and generalizability across different motion-capture
standards. In addition, the task is challenging due to the weak
correlation between speech and gestures. To address these problems, we present
UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis
approach, trained on multiple gesture datasets with different skeletons.
Specifically, we first present a retargeting network to learn latent
homeomorphic graphs for different motion capture standards, unifying the
representations of various gestures while extending the dataset. We then
capture the correlation between speech and gestures based on a diffusion model
architecture using cross-local attention and self-attention to generate better
speech-matched and realistic gestures. To further align speech and gesture and
increase diversity, we incorporate reinforcement learning on the discrete
gesture units with a learned reward function. Extensive experiments show that
UnifiedGesture outperforms recent approaches on speech-driven gesture
generation in terms of CCA, FGD, and human-likeness. All code, pre-trained
models, databases, and demos are available to the public at
https://github.com/YoungSeng/UnifiedGesture.
Comment: 16 pages, 11 figures, ACM MM 2023
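The conditioning pattern described above can be sketched briefly. The PyTorch snippet below shows one denoiser block that combines self-attention over noisy gesture latents with cross attention to speech features restricted to a local temporal window; the dimensions, window size, module names, and the assumption of time-aligned speech features are all illustrative, and the real UnifiedGesture architecture differs in detail.

```python
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    def __init__(self, dim=256, heads=4, window=8):
        super().__init__()
        self.window = window
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def local_mask(self, T, device):
        # Boolean mask restricting cross attention to a local window around each
        # frame, so gestures attend only to temporally nearby speech features.
        idx = torch.arange(T, device=device)
        return (idx[None, :] - idx[:, None]).abs() > self.window  # True = masked

    def forward(self, x_noisy, speech):
        T = x_noisy.size(1)
        h, _ = self.self_attn(x_noisy, x_noisy, x_noisy)           # self-attention
        x = x_noisy + h
        h, _ = self.cross_attn(x, speech, speech,                  # cross-local attention
                               attn_mask=self.local_mask(T, x.device))
        x = x + h
        return x + self.ff(x)

# One denoising step: the block refines noisy gesture latents conditioned on
# time-aligned speech features (random tensors stand in for real inputs).
block = DenoiserBlock()
gesture_noisy = torch.randn(1, 64, 256)   # noisy gesture latents at step t
speech_feats = torch.randn(1, 64, 256)    # aligned speech features
refined = block(gesture_noisy, speech_feats)
```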
Skeleton-based Human Action Recognition using Basis Vectors
Automatic human action recognition is a research topic that has attracted significant attention lately, mainly due to advances in sensing technologies and the growing power of computational systems. However, the complexity of human movements, input-device noise, and person-specific pattern variability impose a series of challenges that remain to be overcome. In the proposed work, a novel human action recognition method using Microsoft Kinect depth sensing technology is presented to handle the above-mentioned issues. Each action is represented as a basis vector, and spectral analysis is performed on an affinity matrix of new action feature vectors. By using simple kernel regressors to compute the affinity matrix, complexity is reduced and robust low-dimensional representations are achieved. The proposed scheme relaxes the accuracy demands on action detection, and it can be extended to accommodate multiple modalities in a dynamic fashion.
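The affinity-plus-spectral pipeline described above maps to a few lines of NumPy. In the sketch below, the Gaussian kernel, the normalized-Laplacian embedding, and the nearest-basis-vector classifier are illustrative assumptions standing in for the paper's kernel regressors and basis construction.

```python
import numpy as np

def affinity_matrix(features, sigma=1.0):
    """Gaussian-kernel affinities between action feature vectors (assumed kernel)."""
    sq_dists = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def spectral_embedding(affinity, dim=3):
    """Embed actions via the leading eigenvectors of the normalized Laplacian."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(affinity.sum(axis=1)))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs[:, 1:dim + 1]  # skip the trivial constant eigenvector

def classify(embedding, basis_vectors):
    """Assign each embedded action to its nearest class basis vector."""
    dists = np.linalg.norm(embedding[:, None, :] - basis_vectors[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Usage on random stand-ins for Kinect-derived action features:
feats = np.random.randn(20, 60)           # 20 action instances, 60-D features
emb = spectral_embedding(affinity_matrix(feats), dim=3)
bases = emb[:5]                           # e.g., one basis vector per class
labels = classify(emb, bases)
```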