
    Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

    The rising demand for lifelike avatars in the digital realm has increased the need for generating high-quality human videos guided by textual descriptions and poses. We propose Dancing Avatar, which synthesizes human motion videos driven by poses and textual cues. Our approach employs a pretrained text-to-image (T2I) diffusion model to generate each video frame in an autoregressive fashion; the core innovation lies in using the T2I diffusion model to produce frames successively while preserving contextual relevance. This requires maintaining human character and clothing consistency across varying poses, as well as background continuity amidst diverse human movements. To ensure consistent human appearance across the entire video, we devise an intra-frame alignment module that incorporates text-guided knowledge of the synthesized human character into the pretrained T2I diffusion model, drawing on insights from ChatGPT. To preserve background continuity, we put forth a background alignment pipeline combining Segment Anything with image inpainting techniques. Furthermore, we propose an inter-frame alignment module, inspired by auto-regressive pipelines, that strengthens temporal consistency between adjacent frames: the preceding frame guides the synthesis of the current frame. Comparisons with state-of-the-art methods show that Dancing Avatar generates human videos of markedly superior quality in terms of human and background fidelity as well as temporal coherence.
    Comment: 11 pages, 3 figures
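    The abstract gives no implementation details, but the autoregressive, pose-conditioned frame synthesis it describes can be illustrated with a minimal sketch using the open-source diffusers library (not the authors' code; the model IDs, strength value, and ControlNet pose conditioning are all assumptions for illustration):

    # Hypothetical sketch of pose-guided, autoregressive frame synthesis.
    # Not the Dancing Avatar implementation; model choices are placeholders.
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    def synthesize_video(prompt, pose_maps, first_frame):
        """Generate frames one at a time; each frame is initialized from the
        previous one so appearance drifts less between neighbors."""
        frames, prev = [first_frame], first_frame
        for pose in pose_maps:
            frame = pipe(
                prompt=prompt,
                image=prev,          # the preceding frame guides the current one
                control_image=pose,  # per-frame pose conditioning
                strength=0.6,        # how far to depart from the previous frame
                num_inference_steps=30,
            ).images[0]
            frames.append(frame)
            prev = frame
        return frames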

    FED-NeRF: Achieve High 3D Consistency and Temporal Coherence for Face Video Editing on Dynamic NeRF

    The success of the GAN-NeRF structure has enabled face editing on NeRF to maintain 3D view consistency. However, simultaneously achieving multi-view consistency and temporal coherence while editing video sequences remains a formidable challenge. This paper proposes a novel face video editing architecture built upon the dynamic face GAN-NeRF structure, which effectively utilizes video sequences to restore the latent code and 3D face geometry. Editing the latent code ensures multi-view consistent edits of the face, as validated by multi-view stereo reconstruction on the resulting edited images in our dynamic NeRF. Because face geometry is estimated on a frame-by-frame basis, jitter may be introduced; we propose a stabilizer that maintains temporal coherence by preserving smooth changes of face expression in consecutive frames. Quantitative and qualitative analyses reveal that our method, as the pioneering 4D face video editor, achieves state-of-the-art performance compared to existing 2D- and 3D-based approaches that address identity and motion independently. Codes will be released.
    Comment: Our code will be available at: https://github.com/ZHANG1023/FED-NeR
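    The stabilizer itself is not specified in the abstract; a generic way to damp frame-to-frame jitter in per-frame expression estimates is to low-pass filter them along the time axis, as in the sketch below (a minimal stand-in, not FED-NeRF's method):

    # Generic temporal smoother for per-frame expression codes
    # (illustrative stand-in; not FED-NeRF's stabilizer).
    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def stabilize(codes: np.ndarray, sigma: float = 1.5) -> np.ndarray:
        """codes: (T, D) array of per-frame expression coefficients.
        Smooths each coefficient over time, suppressing estimation jitter
        while preserving gradual expression changes."""
        return gaussian_filter1d(codes, sigma=sigma, axis=0)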

    Opportunities and challenges for using automatic human affect analysis in consumer research

    The ability to automatically assess emotional responses via contact-free video recording taps into a rapidly growing market aimed at predicting consumer choices. If consumer attention and engagement are measurable in a reliable and accessible manner, relevant marketing decisions could be informed by objective data. Although significant advances have been made in automatic human affect analysis (AHAA), several practical and theoretical issues remain largely unresolved. These concern the lack of cross-system validation, a historical emphasis on posed over spontaneous expressions, and more fundamental issues regarding the weak association between subjective experience and facial expressions. To address these limitations, the present paper argues that extant commercial and free facial expression classifiers should be rigorously validated in cross-system research. Furthermore, academics and practitioners must better leverage fine-grained emotional response dynamics, with a stronger emphasis on understanding naturally occurring spontaneous expressions in naturalistic choice settings. We posit that applied consumer research may be better situated to examine facial behavior in socio-emotional contexts than in decontextualized laboratory studies, and we highlight how AHAA can be successfully employed in this context. Facial activity should also be considered less as a single outcome variable and more as a starting point for further analyses. Implications of this approach and potential obstacles that need to be overcome are discussed within the context of consumer research.
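    As a concrete example of the cross-system validation the authors call for, the frame-level outputs of two classifiers run on the same video can be compared directly; a minimal sketch (the score format and metrics are assumptions, not the paper's protocol):

    # Minimal cross-system agreement check between two affect classifiers
    # (hypothetical inputs; a real validation would span many videos/systems).
    import numpy as np
    from scipy.stats import pearsonr

    def cross_system_agreement(scores_a: np.ndarray, scores_b: np.ndarray) -> dict:
        """scores_a, scores_b: per-frame affect scores (e.g., valence)
        from two different classifiers run on the same video."""
        r, p = pearsonr(scores_a, scores_b)
        mad = float(np.mean(np.abs(scores_a - scores_b)))  # mean absolute disagreement
        return {"pearson_r": r, "p_value": p, "mean_abs_diff": mad}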

    Learning Explainable Facial Features from Noisy Unconstrained Visual Data

    Attributes are semantic features of objects, people, and activities. They allow computers to describe people and things the way humans would, which makes them very useful for recognition. Facial attributes (gender, hair color, makeup, eye color, etc.) are useful for a variety of tasks, including face verification and recognition, user interface applications, and surveillance, to name a few. The problem of predicting facial attributes is still relatively new in computer vision, and because it has not been studied for long, a lack of publicly available data is a major challenge. As with many problems in computer vision, a large portion of facial attribute research is dedicated to improving performance on benchmark datasets. However, it has been shown that progress on a benchmark dataset does not necessarily translate to a genuine solution for the problem.

    This dissertation focuses on learning models for facial attributes that are robust to changes in data, i.e., models that perform well on unseen data. We take cues from human recognition and translate these ideas into deep learning techniques for robust facial attribute recognition. Towards this goal, we introduce several techniques for learning from noisy unconstrained visual data: utilizing relationships among attributes, a selective learning approach for multi-label balancing, a temporal coherence constraint and a motion-attention mechanism for recognizing attributes in video, and parsing faces according to attributes for improved localization.

    Facial attributes are related, e.g., heavy makeup and wearing lipstick, or male and goatee, and humans are capable of recognizing and exploiting these relationships. For example, if a subject's face is occluded but facial hair can be seen, the likelihood that the subject is male should increase. We introduce several methods for implicitly and explicitly utilizing attribute relationships for improved prediction.

    Some attributes are more common than others in the real world, e.g., male vs. bald. These disparities are even more pronounced in datasets consisting of posed celebrities on the red carpet (there are very few celebrities not wearing makeup). Such imbalances can cause a facial attribute model to learn the bias in the dataset rather than a true representation of the attribute. To alleviate this problem, we introduce selective learning, a method of balancing each batch in a deep learning algorithm per attribute given a target distribution. Selective learning lets a deep learning algorithm learn from a balanced set of data at each iteration during training, removing the bias caused by label imbalance (see the sketch after this abstract).

    Learning a facial attribute model from image data and testing on video data gives unexpected results (e.g., gender changing between frames). When working with video, it is important to account for the temporal and motion aspects of the data. To stabilize attribute predictions in video, we utilize weakly-labeled data and introduce temporal coherence and motion-attention constraints in the model learning process; these constraints are what make weakly-labeled data usable, which is essential when working with video.
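    The temporal coherence constraint is described only at a high level; one common formulation penalizes abrupt changes in attribute predictions between adjacent frames, sketched here in PyTorch (an assumed loss form, not the dissertation's code):

    # Sketch of a temporal coherence penalty on per-frame attribute predictions
    # (one common formulation; the dissertation's exact loss may differ).
    import torch

    def temporal_coherence_loss(preds: torch.Tensor) -> torch.Tensor:
        """preds: (T, A) per-frame sigmoid outputs for A attributes.
        Penalizes abrupt frame-to-frame changes (e.g., gender flipping),
        letting weakly-labeled video regularize the model."""
        return (preds[1:] - preds[:-1]).pow(2).mean()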
    Framing the problem of facial attribute recognition as one of semantic segmentation, where the goal is to predict attributes at each pixel, we are able to reduce the effect of unwanted relationships between attributes (e.g., high cheekbones and smiling). Robust facial attribute recognition algorithms are necessary for improving the applications that use these attributes. Given limited data for training, we develop several methods for learning explainable facial features from noisy unconstrained visual data, introduce several new datasets labeled with facial attributes, and improve over the state of the art.
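    Selective learning, referenced above, is one of the more concrete techniques here; a rough sketch of one way to realize it is to reweight each attribute's loss so a batch behaves as if drawn from the target distribution (the dissertation's exact scheme may differ, and per-batch resampling is an equally plausible reading):

    # Rough sketch of selective learning as per-attribute batch rebalancing
    # toward a target label distribution (illustrative approximation).
    import torch
    import torch.nn.functional as F

    def selective_bce(logits, labels, target_pos_rate=0.5, eps=1e-6):
        """logits, labels: (B, A) tensors for B samples, A binary attributes.
        Reweights positive/negative terms per attribute so the batch matches
        the target positive rate, countering dataset imbalance
        (e.g., far more 'male' than 'bald' examples)."""
        labels = labels.float()
        pos_rate = labels.mean(dim=0).clamp(eps, 1 - eps)  # (A,) observed rate
        w_pos = target_pos_rate / pos_rate                 # up-weight rare positives
        w_neg = (1 - target_pos_rate) / (1 - pos_rate)     # down-weight common negatives
        weights = labels * w_pos + (1 - labels) * w_neg    # (B, A) per-element weights
        return F.binary_cross_entropy_with_logits(
            logits, labels, weight=weights, reduction="mean"
        )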

    Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis

    Deep person generation has attracted extensive research attention due to its wide applications in virtual agents, video conferencing, online shopping, and art/movie production. With the advancement of deep learning, the visual appearance (face, pose, cloth) of a person image can be easily generated or manipulated on demand. In this survey, we first summarize the scope of person generation, and then systematically review recent progress and technical trends in deep person generation, covering three major tasks: talking-head generation (face), pose-guided person generation (pose), and garment-oriented person generation (cloth). More than two hundred papers are covered for a thorough overview, and milestone works are highlighted to mark the major technical breakthroughs. Based on these fundamental tasks, a number of applications are investigated, e.g., virtual fitting, digital humans, and generative data augmentation. We hope this survey can shed some light on the future prospects of deep person generation and provide a helpful foundation for fully fledged applications of digital humans.