Exploring Latent Cross-Channel Embedding for Accurate 3D Human Pose Reconstruction in a Diffusion Framework
Monocular 3D human pose estimation poses significant challenges due to the
inherent depth ambiguities that arise during the reprojection process from 2D
to 3D. Conventional approaches that rely on estimating an overfitted projection
matrix struggle to effectively address these challenges and often result in
noisy outputs. Recent advancements in diffusion models have shown promise in
incorporating structural priors to address reprojection ambiguities. However,
there is still ample room for improvement as these methods often overlook the
exploration of correlation between the 2D and 3D joint-level features. In this
study, we propose a novel cross-channel embedding framework that aims to fully
explore the correlation between joint-level features of 3D coordinates and
their 2D projections. In addition, we introduce a context guidance mechanism to
facilitate the propagation of joint graph attention across latent channels
during the iterative diffusion process. To evaluate the effectiveness of our
proposed method, we conduct experiments on two benchmark datasets, namely
Human3.6M and MPI-INF-3DHP. Our results demonstrate a significant improvement
in terms of reconstruction accuracy compared to state-of-the-art methods. The
code for our method will be made available online for further reference.
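As a rough illustration of what exploring cross-channel correlation between 2D and 3D joint-level features might look like, here is a toy softmax-weighted channel mixing. All names and shapes are assumptions for the sketch, not the paper's implementation:

```python
import math

def cross_channel_mixing(feat2d, feat3d):
    """Toy cross-channel correlation between 2D and 3D joint-level
    features. Each input is a list of latent channels, where a channel
    is a vector with one value per joint. For every 3D channel we
    compute its dot-product similarity to each 2D channel, softmax the
    similarities, and return the similarity-weighted mix of 2D channels."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    mixed = []
    for c3 in feat3d:
        sims = [dot(c3, c2) for c2 in feat2d]
        m = max(sims)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in sims]
        z = sum(w)
        w = [x / z for x in w]
        n_joints = len(feat2d[0])
        mixed.append([sum(wi * c2[j] for wi, c2 in zip(w, feat2d))
                      for j in range(n_joints)])
    return mixed
```

In the actual method this correlation would be learned and propagated through the iterative diffusion steps; the sketch only shows the channel-wise similarity idea.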
A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion
Multi-person motion capture can be challenging due to ambiguities caused by
severe occlusion, fast body movement, and complex interactions. Existing
frameworks build on 2D pose estimations and triangulate to 3D coordinates via
reasoning the appearance, trajectory, and geometric consistencies among
multi-camera observations. However, 2D joint detections are usually incomplete
and carry wrong identity assignments due to limited observation angles, which
leads to noisy 3D triangulation results. To overcome this issue, we propose to
explore the short-range autoregressive characteristics of skeletal motion using
a transformer. First, we propose an adaptive, identity-aware triangulation module
to reconstruct 3D joints and identify the missing joints for each identity. To
generate complete 3D skeletal motion, we then propose a Dual-Masked
Auto-Encoder (D-MAE) which encodes the joint status with both
skeletal-structural and temporal position encoding for trajectory completion.
D-MAE's flexible masking and encoding mechanism enables arbitrary skeleton
definitions to be conveniently deployed under the same framework. In order to
demonstrate the proposed model's capability in dealing with severe data loss
scenarios, we contribute a high-accuracy and challenging motion capture dataset
of multi-person interactions with severe occlusion. Evaluations on both
benchmark and our new dataset demonstrate the efficiency of our proposed model,
as well as its advantage over other state-of-the-art methods.
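The dual positional encoding D-MAE describes (skeletal-structural plus temporal) can be sketched as the sum of two sinusoidal encodings, one indexed by joint and one by frame. This is an illustrative reading of the abstract, not the paper's code:

```python
import math

def sinusoidal(pos, dim):
    """Standard sinusoidal positional encoding for a single position."""
    return [math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
            for i in range(dim)]

def dual_position_encoding(joint_id, frame_id, dim):
    """Encode a skeletal token by summing a skeletal-structural
    encoding (which joint it is) and a temporal encoding (which frame
    it belongs to), so a masked token's place in both the skeleton
    and the trajectory is recoverable during completion."""
    structural = sinusoidal(joint_id, dim)
    temporal = sinusoidal(frame_id, dim)
    return [s + t for s, t in zip(structural, temporal)]
```

Because the encoding depends only on (joint_id, frame_id), any skeleton definition maps onto the same token scheme, which is the flexibility the abstract claims.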
Towards Direct Simultaneous Speech Translation
Simultaneous speech translation (SimulST) is widely useful in many cross-lingual communication scenarios, including multinational conferences and international travel. Text-based simultaneous machine translation (SimulMT) has achieved great success in recent years. The conventional cascaded approach for SimulST uses a pipeline of streaming ASR followed by simultaneous MT, but it suffers from error propagation and extra latency. Recent efforts attempt to translate the source speech directly into the target text or speech simultaneously, but this is much harder due to the combination of separate tasks. In this dissertation, we focus on improving the simultaneous translation model, enabling it to handle speech input and directly generate translated text in the target language. First, we investigate how to improve simultaneous translation by incorporating more monotonic generated pseudo references in training. These pseudo references, which contain fewer reorderings, cause less anticipation and can substantially improve simultaneous translation quality. Then, we propose an ASR-assisted direct SimulST framework. The model can translate directly from the given speech with a wait-k policy guided by a synchronized streaming ASR. However, speech translation tasks suffer from data scarcity. To alleviate this issue, we next introduce a Fused Acoustic and Text Masked Language Model (FAT-MLM), which jointly learns a unified representation for both acoustic and text input from various types of corpora, including parallel data for speech recognition and machine translation, and even pure speech and text data. By finetuning from FAT-MLM, the speech translation model can be substantially improved. Beyond that, we further extend FAT-MLM to cross-lingual speech synthesis. Our proposed model can clone the voice of the source speaker and generate the corresponding speech in the target language.
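The wait-k policy mentioned above follows a fixed read/write schedule: read k source tokens before the first write, then alternate one write per read until the source is exhausted. A minimal token-level simulation of that schedule (illustrative only; it abstracts away the actual speech segmentation and the ASR guidance):

```python
def wait_k_actions(k, src_len, tgt_len):
    """Return the READ/WRITE action sequence of a wait-k policy:
    read k source tokens before the first write, then alternate,
    and finish writing once the source is exhausted."""
    actions = []
    read, written = 0, 0
    while written < tgt_len:
        if read < src_len and read < written + k:
            actions.append("READ")  # still fewer than k tokens ahead
            read += 1
        else:
            actions.append("WRITE")  # k tokens ahead (or source done)
            written += 1
    return actions
```

For example, `wait_k_actions(2, 4, 4)` reads two tokens, then alternates writes and reads, then flushes the remaining target tokens; larger k trades latency for more context per write.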
ChoreoGraph: Music-conditioned Automatic Dance Choreography over a Style and Tempo Consistent Dynamic Graph
Generating dance that temporally and aesthetically matches the music is a
challenging problem, as several factors need to be considered. First, the
aesthetic styles and messages conveyed by the motion and music should be
consistent. Second, the beats of the generated motion should be locally aligned
to the musical features. And finally, basic choreomusical rules should be
observed, and the motion generated should be diverse. To address these
challenges, we propose ChoreoGraph, which choreographs high-quality dance
motion for a given piece of music over a Dynamic Graph. A data-driven learning
strategy is proposed to evaluate the aesthetic style and rhythmic connections
between music and motion in a progressively learned cross-modality embedding
space. The motion sequences will be beat-aligned based on the music segments
and then incorporated as nodes of a Dynamic Motion Graph. Compatibility factors
such as the style and tempo consistency, motion context connection, action
completeness, and transition smoothness are comprehensively evaluated to
determine the node transition in the graph. We demonstrate that our
repertoire-based framework can generate motions that are aesthetically
consistent and robustly extensible in diversity. Both quantitative and
qualitative experimental results show that our proposed model outperforms
other baseline models.
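The node-transition step over the Dynamic Motion Graph, as described, amounts to scoring candidate edges by a weighted combination of compatibility factors and taking the best one. A minimal sketch, where the factor names follow the abstract but the weights and data layout are illustrative assumptions:

```python
def transition_score(factors, weights):
    """Weighted sum of compatibility factors for one candidate edge.
    Factor keys (style, tempo, context, completeness, smoothness)
    mirror the abstract; the weights here are made up for the demo."""
    return sum(weights[name] * factors[name] for name in weights)

def next_node(candidates, weights):
    """Pick the candidate motion node whose edge from the current
    node maximizes the combined compatibility score."""
    return max(candidates, key=lambda c: transition_score(c["factors"], weights))
```

In the real framework these factors would be evaluated by the learned cross-modality embedding rather than hand-set numbers; the sketch only shows the greedy graph-walk structure.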