131 research outputs found

    Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis

    Full text link
    Deep person generation has attracted extensive research attention due to its wide applications in virtual agents, video conferencing, online shopping and art/movie production. With the advancement of deep learning, visual appearances (face, pose, cloth) of a person image can be easily generated or manipulated on demand. In this survey, we first summarize the scope of person generation, and then systematically review recent progress and technical trends in deep person generation, covering three major tasks: talking-head generation (face), pose-guided person generation (pose) and garment-oriented person generation (cloth). More than two hundred papers are covered for a thorough overview, and the milestone works are highlighted to witness the major technical breakthrough. Based on these fundamental tasks, a number of applications are investigated, e.g., virtual fitting, digital human, generative data augmentation. We hope this survey could shed some light on the future prospects of deep person generation, and provide a helpful foundation for full applications towards digital human

    SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

    Full text link
    We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans and multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per vertex displacements w.r.t. the SMPL model. Next, we train a geometry conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies. We will release the codebase for research purposes

    RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

    Full text link
    Synthesizing high-fidelity head avatars is a central problem for computer vision and graphics. While head avatar synthesis algorithms have advanced rapidly, the best ones still face great obstacles in real-world scenarios. One of the vital causes is inadequate datasets -- 1) current public datasets can only support researchers to explore high-fidelity head avatars in one or two task directions; 2) these datasets usually contain digital head assets with limited data volume, and narrow distribution over different attributes. In this paper, we present RenderMe-360, a comprehensive 4D human head dataset to drive advance in head avatar research. It contains massive data assets, with 243+ million complete head frames, and over 800k video sequences from 500 different identities captured by synchronized multi-view cameras at 30 FPS. It is a large-scale digital library for head avatars with three key attributes: 1) High Fidelity: all subjects are captured by 60 synchronized, high-resolution 2K cameras in 360 degrees. 2) High Diversity: The collected subjects vary from different ages, eras, ethnicities, and cultures, providing abundant materials with distinctive styles in appearance and geometry. Moreover, each subject is asked to perform various motions, such as expressions and head rotations, which further extend the richness of assets. 3) Rich Annotations: we provide annotations with different granularities: cameras' parameters, matting, scan, 2D/3D facial landmarks, FLAME fitting, and text description. Based on the dataset, we build a comprehensive benchmark for head avatar research, with 16 state-of-the-art methods performed on five main tasks: novel view synthesis, novel expression synthesis, hair rendering, hair editing, and talking head generation. Our experiments uncover the strengths and weaknesses of current methods. RenderMe-360 opens the door for future exploration in head avatars.Comment: Technical Report; Project Page: 36; Github Link: https://github.com/RenderMe-360/RenderMe-36

    Virtual Try-On With Generative Adversarial Networks: A Taxonomical Survey

    Get PDF
    This chapter elaborates on using generative adversarial networks (GAN) for virtual try-on applications. It presents the first comprehensive survey on this topic. Virtual try-on represents a practical application of GANs and pixel translation, which improves on the techniques of virtual try-on prior to these new discoveries. This survey details the importance of virtual try-on systems and the history of virtual try-on; shows how GANs, pixel translation, and perceptual losses have influenced the field; and summarizes the latest research in creating virtual try-on systems. Additionally, the authors present the future directions of research to improve virtual try-on systems by making them usable, faster, more effective. By walking through the steps of virtual try-on from start to finish, the chapter aims to expose readers to key concepts shared by many GAN applications and to give readers a solid foundation to pursue further topics in GANs

    Bridging the gap between reconstruction and synthesis

    Get PDF
    Aplicat embargament des de la data de defensa fins el 15 de gener de 20223D reconstruction and image synthesis are two of the main pillars in computer vision. Early works focused on simple tasks such as multi-view reconstruction and texture synthesis. With the spur of Deep Learning, the field has rapidly progressed, making it possible to achieve more complex and high level tasks. For example, the 3D reconstruction results of traditional multi-view approaches are currently obtained with single view methods. Similarly, early pattern based texture synthesis works have resulted in techniques that allow generating novel high-resolution images. In this thesis we have developed a hierarchy of tools that cover all these range of problems, lying at the intersection of computer vision, graphics and machine learning. We tackle the problem of 3D reconstruction and synthesis in the wild. Importantly, we advocate for a paradigm in which not everything should be learned. Instead of applying Deep Learning naively we propose novel representations, layers and architectures that directly embed prior 3D geometric knowledge for the task of 3D reconstruction and synthesis. We apply these techniques to problems including scene/person reconstruction and photo-realistic rendering. We first address methods to reconstruct a scene and the clothed people in it while estimating the camera position. Then, we tackle image and video synthesis for clothed people in the wild. Finally, we bridge the gap between reconstruction and synthesis under the umbrella of a unique novel formulation. Extensive experiments conducted along this thesis show that the proposed techniques improve the performance of Deep Learning models in terms of the quality of the reconstructed 3D shapes / synthesised images, while reducing the amount of supervision and training data required to train them. In summary, we provide a variety of low, mid and high level algorithms that can be used to incorporate prior knowledge into different stages of the Deep Learning pipeline and improve performance in tasks of 3D reconstruction and image synthesis.La reconstrucció 3D i la síntesi d'imatges són dos dels pilars fonamentals en visió per computador. Els estudis previs es centren en tasques senzilles com la reconstrucció amb informació multi-càmera i la síntesi de textures. Amb l'aparició del "Deep Learning", aquest camp ha progressat ràpidament, fent possible assolir tasques molt més complexes. Per exemple, per obtenir una reconstrucció 3D, tradicionalment s'utilitzaven mètodes multi-càmera, en canvi ara, es poden obtenir a partir d'una sola imatge. De la mateixa manera, els primers treballs de síntesi de textures basats en patrons han donat lloc a tècniques que permeten generar noves imatges completes en alta resolució. En aquesta tesi, hem desenvolupat una sèrie d'eines que cobreixen tot aquest ventall de problemes, situats en la intersecció entre la visió per computador, els gràfics i l'aprenentatge automàtic. Abordem el problema de la reconstrucció i la síntesi 3D en el món real. És important destacar que defensem un paradigma on no tot s'ha d'aprendre. Enlloc d'aplicar el "Deep Learning" de forma naïve, proposem representacions novedoses i arquitectures que incorporen directament els coneixements geomètrics ja existents per a aconseguir la reconstrucció 3D i la síntesi d'imatges. Nosaltres apliquem aquestes tècniques a problemes com ara la reconstrucció d'escenes/persones i a la renderització d'imatges fotorealistes. Primer abordem els mètodes per reconstruir una escena, les persones vestides que hi ha i la posició de la càmera. A continuació, abordem la síntesi d'imatges i vídeos de persones vestides en situacions quotidianes. I finalment, aconseguim, a través d'una nova formulació única, connectar la reconstrucció amb la síntesi. Els experiments realitzats al llarg d'aquesta tesi demostren que les tècniques proposades milloren el rendiment dels models de "Deepp Learning" pel que fa a la qualitat de les reconstruccions i les imatges sintetitzades alhora que redueixen la quantitat de dades necessàries per entrenar-los. En resum, proporcionem una varietat d'algoritmes de baix, mitjà i alt nivell que es poden utilitzar per incorporar els coneixements previs a les diferents etapes del "Deep Learning" i millorar el rendiment en tasques de reconstrucció 3D i síntesi d'imatges.Postprint (published version

    Soft Biometric Analysis: MultiPerson and RealTime Pedestrian Attribute Recognition in Crowded Urban Environments

    Get PDF
    Traditionally, recognition systems were only based on human hard biometrics. However, the ubiquitous CCTV cameras have raised the desire to analyze human biometrics from far distances, without people attendance in the acquisition process. Highresolution face closeshots are rarely available at far distances such that facebased systems cannot provide reliable results in surveillance applications. Human soft biometrics such as body and clothing attributes are believed to be more effective in analyzing human data collected by security cameras. This thesis contributes to the human soft biometric analysis in uncontrolled environments and mainly focuses on two tasks: Pedestrian Attribute Recognition (PAR) and person reidentification (reid). We first review the literature of both tasks and highlight the history of advancements, recent developments, and the existing benchmarks. PAR and person reid difficulties are due to significant distances between intraclass samples, which originate from variations in several factors such as body pose, illumination, background, occlusion, and data resolution. Recent stateoftheart approaches present endtoend models that can extract discriminative and comprehensive feature representations from people. The correlation between different regions of the body and dealing with limited learning data is also the objective of many recent works. Moreover, class imbalance and correlation between human attributes are specific challenges associated with the PAR problem. We collect a large surveillance dataset to train a novel gender recognition model suitable for uncontrolled environments. We propose a deep residual network that extracts several posewise patches from samples and obtains a comprehensive feature representation. In the next step, we develop a model for multiple attribute recognition at once. Considering the correlation between human semantic attributes and class imbalance, we respectively use a multitask model and a weighted loss function. We also propose a multiplication layer on top of the backbone features extraction layers to exclude the background features from the final representation of samples and draw the attention of the model to the foreground area. We address the problem of person reid by implicitly defining the receptive fields of deep learning classification frameworks. The receptive fields of deep learning models determine the most significant regions of the input data for providing correct decisions. Therefore, we synthesize a set of learning data in which the destructive regions (e.g., background) in each pair of instances are interchanged. A segmentation module determines destructive and useful regions in each sample, and the label of synthesized instances are inherited from the sample that shared the useful regions in the synthesized image. The synthesized learning data are then used in the learning phase and help the model rapidly learn that the identity and background regions are not correlated. Meanwhile, the proposed solution could be seen as a data augmentation approach that fully preserves the label information and is compatible with other data augmentation techniques. When reid methods are learned in scenarios where the target person appears with identical garments in the gallery, the visual appearance of clothes is given the most importance in the final feature representation. Clothbased representations are not reliable in the longterm reid settings as people may change their clothes. Therefore, developing solutions that ignore clothing cues and focus on identityrelevant features are in demand. We transform the original data such that the identityrelevant information of people (e.g., face and body shape) are removed, while the identityunrelated cues (i.e., color and texture of clothes) remain unchanged. A learned model on the synthesized dataset predicts the identityunrelated cues (shortterm features). Therefore, we train a second model coupled with the first model and learns the embeddings of the original data such that the similarity between the embeddings of the original and synthesized data is minimized. This way, the second model predicts based on the identityrelated (longterm) representation of people. To evaluate the performance of the proposed models, we use PAR and person reid datasets, namely BIODI, PETA, RAP, Market1501, MSMTV2, PRCC, LTCC, and MIT and compared our experimental results with stateoftheart methods in the field. In conclusion, the data collected from surveillance cameras have low resolution, such that the extraction of hard biometric features is not possible, and facebased approaches produce poor results. In contrast, soft biometrics are robust to variations in data quality. So, we propose approaches both for PAR and person reid to learn discriminative features from each instance and evaluate our proposed solutions on several publicly available benchmarks.This thesis was prepared at the University of Beria Interior, IT Instituto de Telecomunicações, Soft Computing and Image Analysis Laboratory (SOCIA Lab), Covilhã Delegation, and was submitted to the University of Beira Interior for defense in a public examination session

    On Patching Learning Discrepancies in Neural Network Training

    Get PDF
    Neural network\u27s ability to model data patterns proved to be immensely useful in a plethora of practical applications. However, using the physical world\u27s data can be problematic since it is often cluttered, crowded with scattered insignificant patterns, contain unusual compositions, and widely infiltrated with biases and imbalances. Consequently, training a neural network to find meaningful patterns in seas of chaotic data points becomes virtually as hard as finding a needle in a haystack. Specifically, attempting to simulate real-world multi-modal noisy distributions with high precision leads the network to learn an ill-informed inference distribution. In this work, we discuss four techniques to mitigate common discrepancies between real-world representations and the training distribution learned by the network. Namely, we address the techniques of Diverse sampling, objective generalization, domain, and task adaptation being introduced as priors in learning the primary objective. For each of these techniques, we contrast the basic training where no prior is applied to the learning with our proposed method and show the advantage of guiding the training distribution to the critical patterns in real-world data using our suggested approaches. We examine those discrepancy-mitigation techniques on a variety of vision tasks ranging from image generation and retrieval to video summarization and actionness ranking

    LookBook: pioneering Inclusive beauty with artificial intelligence and machine learning algorithms

    Get PDF
    Technology's imperfections and biases inherited from historical norms are crucial to acknowledge. Rapid perpetuation and amplification of these biases necessitate transparency and proactive measures to mitigate their impact. The online visual culture reinforces Eurocentric beauty ideals through prioritized algorithms and augmented reality filters, distorting reality and perpetuating unrealistic standards of beauty. Narrow beauty standards in technology pose a significant challenge to overcome. Algorithms personalize content, creating "filter bubbles" that reinforce these ideals and limit exposure to diverse representations of beauty. This cycle compels individuals to conform, hindering the embrace of their unique features and alternative definitions of beauty. LookBook counters prevalent narrow beauty standards in technology. It promotes inclusivity and representation through self-expression, community engagement, and diverse visibility. LookBook comprises three core sections: Dash, Books, and Community. In Dash, users curate their experience through personalization algorithms. Books allow users to collect curated content for inspiration and creativity, while Community fosters connections with like-minded individuals. Through LookBook, users create a reality aligned with their unique vision. They control consumed content, nurturing individualism through preferences and creativity. This personalization empowers individuals to break free from narrow beauty standards and embrace their distinctiveness. LookBook stands out with its algorithmic training and data representation. It offers transparency on how personalization algorithms operate and ensures a balanced and diverse representation of physicalities and ethnicities. By addressing biases and embracing a wide range of identities, LookBook sparks a conversation for a technology landscape that amplifies all voices, fostering an environment celebrating diversity and prioritizing inclusivity
    corecore