
    3D-Aware Semantic-Guided Generative Model for Human Synthesis

    Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D representations from 2D images, have recently been shown to produce realistic images of rigid and semi-rigid objects, such as human faces or cars. However, they usually struggle to generate high-quality images of non-rigid objects, such as the human body, which is of great interest for many computer graphics applications. This paper proposes a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis, which combines a GNeRF with a texture generator. The former learns an implicit 3D representation of the human body and outputs a set of 2D semantic segmentation masks. The latter transforms these semantic masks into a real image, adding a realistic texture to the human appearance. Without requiring additional 3D information, our model can learn 3D human representations with photo-realistic, controllable generation. Our experiments on the DeepFashion dataset show that 3D-SGAN significantly outperforms the most recent baselines. The code is available at https://github.com/zhangqianhui/3DSGAN.
    Comment: ECCV 2022, 29 pages.
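
    A minimal sketch of the two-stage design described above, assuming PyTorch; the module names, latent size, resolution, and the plain-MLP stand-in for the volume-rendered GNeRF are illustrative assumptions, not the authors' released implementation:

    # Two-stage 3D-SGAN-style pipeline: semantic masks first, texture second.
    # Everything below is an illustrative sketch, not the paper's code.
    import torch
    import torch.nn as nn

    class SemanticGNeRF(nn.Module):
        """Stand-in for the generative NeRF: maps a latent code and camera
        parameters to per-part 2D semantic segmentation masks. The real model
        volume-renders an implicit 3D representation instead of using an MLP."""
        def __init__(self, z_dim=128, n_parts=8, res=64):
            super().__init__()
            self.n_parts, self.res = n_parts, res
            self.mlp = nn.Sequential(
                nn.Linear(z_dim + 3, 256), nn.ReLU(),
                nn.Linear(256, n_parts * res * res),
            )

        def forward(self, z, camera):
            x = self.mlp(torch.cat([z, camera], dim=-1))
            masks = x.view(-1, self.n_parts, self.res, self.res)
            return torch.softmax(masks, dim=1)   # per-pixel part probabilities

    class TextureGenerator(nn.Module):
        """Stand-in for the texture generator: translates semantic masks to RGB."""
        def __init__(self, n_parts=8):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(n_parts, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
            )

        def forward(self, masks):
            return self.net(masks)

    if __name__ == "__main__":
        z = torch.randn(2, 128)              # appearance/shape latent
        camera = torch.randn(2, 3)           # simplified camera parameters
        masks = SemanticGNeRF()(z, camera)   # (2, 8, 64, 64) semantic masks
        image = TextureGenerator()(masks)    # (2, 3, 64, 64) RGB image
        print(masks.shape, image.shape)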

    Human-controllable and structured deep generative models

    Deep generative models are a class of probabilistic models that attempt to learn the underlying data distribution. These models are usually trained in an unsupervised way and thus do not require any labels. Generative models such as Variational Autoencoders and Generative Adversarial Networks have made astounding progress over recent years. These models have several benefits: eased sampling and evaluation, efficient learning of low-dimensional representations for downstream tasks, and better understanding through interpretable representations. However, even though the quality of these models has improved immensely, the ability to control their style and structure is limited. Structured and human-controllable representations of generative models are essential for human-machine interaction and other applications, including fairness, creativity, and entertainment. This thesis investigates learning human-controllable and structured representations with deep generative models. In particular, we focus on generative modelling of 2D images. In the first part, we focus on learning clustered representations. We propose semi-parametric hierarchical variational autoencoders to estimate the intensity of facial action units. The semi-parametric model forms a hybrid generative-discriminative model and leverages both a parametric Variational Autoencoder and a non-parametric Gaussian Process autoencoder. We show superior performance in comparison with existing facial action unit estimation approaches. Based on the results and analysis of the learned representation, we then focus on learning Mixture-of-Gaussians representations in an autoencoding framework. We deviate from the conventional autoencoding framework and consider a regularized objective based on the Cauchy-Schwarz divergence. The Cauchy-Schwarz divergence admits a closed-form solution for Mixture-of-Gaussians distributions and thus allows the autoencoding objective to be optimized efficiently. We show that our model outperforms existing Variational Autoencoders in density estimation, clustering, and semi-supervised facial action detection. In the second part, we focus on learning disentangled representations for conditional generation and fair facial attribute classification. Conditional image generation relies on access to large-scale annotated datasets. Nevertheless, the geometry of visual objects, such as faces, cannot be learned implicitly, which deteriorates image fidelity. We propose incorporating facial landmarks with a statistical shape model and a differentiable piecewise affine transformation to separate the representations for appearance and shape. The goal of incorporating facial landmarks is to make generation controllable and to separate different appearances and geometries. In our last work, we use weak supervision for disentangling groups of variations. Earlier work on learning disentangled representations was done in an unsupervised fashion. However, recent works have shown that learning disentangled representations is not identifiable without any inductive biases. Since then, there has been a shift towards weakly-supervised disentanglement learning. We investigate using regularization based on the Kullback-Leibler divergence to disentangle groups of variations. The goal is to have consistent and separated subspaces for different groups, e.g., for content-style learning. Our evaluation shows improved disentanglement and competitive performance on image clustering and fair facial attribute classification with weak supervision, compared to supervised and semi-supervised approaches.
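
    For the Cauchy-Schwarz regularised autoencoder mentioned above, the key convenience is that the divergence between two Gaussian mixtures can be evaluated in closed form. The sketch below implements that standard identity with NumPy/SciPy; the function names and toy mixtures are illustrative assumptions, not code from the thesis:

    # Closed-form Cauchy-Schwarz divergence between two Gaussian mixtures.
    # Textbook identity, not the thesis implementation.
    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_overlap(mu_a, cov_a, mu_b, cov_b):
        # ∫ N(x; mu_a, cov_a) N(x; mu_b, cov_b) dx = N(mu_a; mu_b, cov_a + cov_b)
        return multivariate_normal(mean=mu_b, cov=cov_a + cov_b).pdf(mu_a)

    def mixture_cross_term(w1, mu1, cov1, w2, mu2, cov2):
        # ∫ p(x) q(x) dx for mixtures p and q, as a double sum of Gaussian overlaps
        return sum(a * b * gaussian_overlap(ma, ca, mb, cb)
                   for a, ma, ca in zip(w1, mu1, cov1)
                   for b, mb, cb in zip(w2, mu2, cov2))

    def cauchy_schwarz_divergence(p, q):
        # D_CS(p, q) = -log ∫pq + 0.5 log ∫p^2 + 0.5 log ∫q^2, each term closed form
        pq, pp, qq = (mixture_cross_term(*p, *q),
                      mixture_cross_term(*p, *p),
                      mixture_cross_term(*q, *q))
        return -np.log(pq) + 0.5 * np.log(pp) + 0.5 * np.log(qq)

    if __name__ == "__main__":
        # Two toy 2-component mixtures in 2D: (weights, means, covariances)
        p = ([0.5, 0.5], [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)])
        q = ([0.3, 0.7], [np.ones(2), 2 * np.ones(2)], [np.eye(2), 0.5 * np.eye(2)])
        print(cauchy_schwarz_divergence(p, q))   # zero iff the two mixtures coincide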

    Deep Learning Methods for Human Activity Recognition using Wearables

    Wearable sensors provide an infrastructure-less multi-modal sensing method. Current trends point to a pervasive integration of wearables into our lives, with these devices providing the basis for wellness and healthcare applications across rehabilitation, caring for a growing older population, and improving human performance. Fundamental to these applications is our ability to automatically and accurately recognise human activities from the often tiny sensors embedded in wearables. In this dissertation, we consider the problem of human activity recognition (HAR) using multi-channel time-series data captured by wearable sensors. Our collective know-how for solving HAR problems with wearables has progressed immensely through the use of deep learning paradigms. Nevertheless, this field still faces unique methodological challenges. As such, this dissertation focuses on developing end-to-end deep learning frameworks that promote HAR application opportunities using wearable sensor technologies and mitigate specific associated challenges. The investigated problems cover a diverse range of HAR challenges and span from fully supervised to unsupervised problem domains. To enhance automatic feature extraction from multi-channel time-series data for HAR, the problem of learning enriched and highly discriminative activity feature representations with deep neural networks is considered. Accordingly, novel end-to-end network elements are designed which: (a) exploit the latent relationships between multi-channel sensor modalities and specific activities, (b) employ effective regularisation through data-agnostic augmentation for multi-modal sensor data streams, and (c) incorporate optimization objectives that encourage minimal intra-class representation differences while maximising inter-class differences, yielding more discriminative features. To promote new opportunities in HAR with emerging battery-less sensing platforms, the problem of learning from irregularly sampled and temporally sparse readings captured by passive sensing modalities is considered. For the first time, an efficient set-based deep learning framework is developed to address this problem; it learns directly from the generated data, bypassing the need for the conventional interpolation pre-processing stage. To address the multi-class window problem and create potential solutions for the challenging task of concurrent human activity recognition, the problem of enabling simultaneous prediction of multiple activities for sensory segments is considered. The flexibility provided by emerging set learning concepts is further leveraged to introduce a novel formulation of HAR that treats it as a set prediction problem and elegantly caters for segments carrying sensor data from multiple activities. To address this set prediction problem, a unified deep HAR architecture is designed that: (a) incorporates a set objective to learn mappings from raw input sensory segments to target activity sets, and (b) precedes the supervised learning phase with unsupervised parameter pre-training to exploit unlabelled data for better generalisation performance. To leverage easily accessible unlabelled activity data streams for downstream classification tasks, the problem of unsupervised representation learning from multi-channel time-series data is considered. For the first time, a novel recurrent generative adversarial network (GAN) framework is developed that explores the GAN's latent feature space to extract highly discriminative activity features in an unsupervised fashion. The superiority of the learned representations is substantiated by their ability to outperform the de facto unsupervised approaches based on autoencoder frameworks, while rivalling the recognition performance of fully supervised models on downstream classification benchmarks. In recognition of the scarcity of large-scale annotated sensor datasets and the tediousness of collecting additional labelled data in this domain, the hitherto unexplored problem of end-to-end clustering of human activities from unlabelled wearable data is considered. To address it, a first study is presented that develops a stand-alone deep learning paradigm to discover semantically meaningful clusters of human actions. In particular, the paradigm is intended to: (a) leverage the inherently sequential nature of sensory data, (b) exploit self-supervision from reconstruction and future prediction tasks, and (c) incorporate clustering-oriented objectives to promote the formation of highly discriminative activity clusters. The systematic investigations in this study create new opportunities for HAR to learn human activities from unlabelled data that can be conveniently and cheaply collected from wearables.
    Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
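
    For context, the sketch below shows the standard sliding-window deep HAR setup that these contributions build on, assuming PyTorch; the window length, six-channel (accelerometer plus gyroscope) input, and layer sizes are assumptions, and none of the thesis's proposed architectures are reproduced here:

    # Generic 1D-CNN baseline: maps a fixed-length window of multi-channel
    # wearable sensor readings to an activity class. Illustrative sketch only.
    import torch
    import torch.nn as nn

    class ConvHAR(nn.Module):
        def __init__(self, n_channels=6, n_classes=6):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),        # global pooling over time
            )
            self.classifier = nn.Linear(64, n_classes)

        def forward(self, x):                   # x: (batch, channels, time)
            h = self.features(x).squeeze(-1)    # (batch, 64)
            return self.classifier(h)           # unnormalised class scores

    if __name__ == "__main__":
        # e.g. a 2.56 s window of 3-axis accelerometer + 3-axis gyroscope at 50 Hz
        window = torch.randn(8, 6, 128)
        logits = ConvHAR()(window)
        print(logits.shape)                     # torch.Size([8, 6])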

    AI-generated Content for Various Data Modalities: A Survey

    AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Due to its wide range of applications and the demonstrated potential of recent works, AIGC has been attracting significant attention, and AIGC methods have been developed for various data modalities, such as image, video, text, 3D shape (as voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human avatar (body and head), 3D motion, and audio -- each presenting different characteristics and challenges. Furthermore, there have also been many significant developments in cross-modality AIGC methods, where generative methods receive conditioning input in one modality and produce outputs in another; examples include going from various modalities to image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar), and audio. In this paper, we provide a comprehensive review of AIGC methods across different data modalities, including both single-modality and cross-modality methods, highlighting the various challenges, representative works, and recent technical directions in each setting. We also survey representative datasets throughout the modalities and present comparative results for various modalities. Moreover, we discuss the challenges and potential future research directions.

    Investigating human-perceptual properties of "shapes" using 3D shapes and 2D fonts

    Shapes are generally used to convey meaning. They are used in video games, films, and other multimedia, in diverse ways. 3D shapes may be destined for virtual scenes or represent objects to be constructed in the real world. Fonts add character to an otherwise plain block of text, allowing the writer to make important points more visually prominent or distinct from other text, and they can indicate the structure of a document at a glance. Rather than studying shapes through traditional geometric shape descriptors, we provide alternative methods to describe and analyse shapes through the lens of human perception. This is done via the concepts of Schelling Points and Image Specificity. Schelling Points are choices people make when they aim to match what they expect others to choose but cannot communicate with others to determine an answer. We study whole-mesh selections in this setting, where Schelling Meshes are the most frequently selected shapes. The key idea behind Image Specificity is that different images evoke different descriptions, but 'specific' images yield more consistent descriptions than others. We apply Specificity to 2D fonts. We show that each concept can be learned and predicted (Specificity for fonts, Schelling Meshes for 3D shapes) using a depth image-based convolutional neural network. Results are shown for a range of fonts and 3D shapes, and we demonstrate that font Specificity and the Schelling Meshes concept are useful for visualisation, clustering, and search applications. Overall, we find that each concept captures similarities between shapes of its respective type, even when there are discontinuities between the shape geometries themselves. The 'context' of these similarities lies in some kind of abstract or subjective meaning that is consistent among different people.
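
    A toy illustration of the Specificity idea, assuming plain Jaccard word overlap as a stand-in for description similarity (the paper instead learns to predict Specificity with a neural network; nothing below comes from it): an item whose human descriptions agree with each other scores higher than one whose descriptions diverge.

    # Specificity as mean pairwise agreement between descriptions of one item.
    # Jaccard overlap here is only an assumed, simplistic similarity measure.
    from itertools import combinations

    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb)

    def specificity(descriptions):
        pairs = list(combinations(descriptions, 2))
        return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

    # A 'specific' font: annotators describe it consistently (higher score).
    print(specificity(["bold serif font", "heavy serif font", "a bold serif font"]))
    # A less specific font: descriptions diverge, so the score is lower.
    print(specificity(["elegant", "hard to read", "looks handwritten"]))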

    Deep Visual Instruments: Realtime Continuous, Meaningful Human Control over Deep Neural Networks for Creative Expression

    In this thesis, we investigate Deep Learning models as an artistic medium for new modes of performative, creative expression. We call these Deep Visual Instruments: realtime interactive generative systems that exploit and leverage the capabilities of state-of-the-art Deep Neural Networks (DNN), while allowing Meaningful Human Control in a Realtime Continuous manner. We characterise Meaningful Human Control in terms of intent, predictability, and accountability; and Realtime Continuous Control with regard to its capacity for performative interaction with immediate feedback, enhancing goal-less exploration. The capability of DNNs that we look to exploit and leverage in this manner is their ability to learn hierarchical representations modelling highly complex, real-world data such as images. Thinking of DNNs as tools that extract useful information from massive amounts of Big Data, we investigate ways in which we can navigate and explore what useful information a DNN has learnt, and how we can meaningfully use such a model in the production of artistic and creative works, in a performative, expressive manner. We present five studies that approach this from different but complementary angles. These include: a collaborative, generative sketching application using MCTS and discriminative CNNs; a system to gesturally conduct the realtime generation of text in different styles using an ensemble of LSTM RNNs; a performative tool that allows for the manipulation of hyperparameters in realtime while a Convolutional VAE trains on a live camera feed; live video feed processing software that allows for digital puppetry and augmented drawing; and a method that allows for long-form storytelling within a generative model's latent space with meaningful control over the narrative. We frame our research with the realtime, performative expression provided by musical instruments as a metaphor, in which we think of these systems not as used by a user, but as played by a performer.
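
    As a small illustration of one ingredient behind latent-space storytelling, the sketch below walks a smooth path through performer-chosen keyframe latents using spherical interpolation; the generator is left abstract, and slerp is a standard latent-interpolation technique rather than the thesis's specific method:

    # Smooth traversal of a generative model's latent space via keyframes.
    # Standard slerp-based interpolation, used here only as an illustration.
    import numpy as np

    def slerp(z0, z1, t):
        # Spherical interpolation between two latent vectors.
        a, b = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
        omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
        if np.isclose(omega, 0.0):
            return (1 - t) * z0 + t * z1
        return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

    def latent_path(keyframes, steps_per_segment=30):
        # Yield interpolated latents that a decoder (VAE/GAN) would render as frames.
        for z0, z1 in zip(keyframes, keyframes[1:]):
            for s in range(steps_per_segment):
                yield slerp(z0, z1, s / steps_per_segment)

    keyframes = [np.random.randn(512) for _ in range(4)]   # chosen by the performer
    frames = list(latent_path(keyframes))
    print(len(frames))                                     # 90 interpolated latents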