23 research outputs found
3D-Aware Semantic-Guided Generative Model for Human Synthesis
Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D
representations from 2D images, have recently been shown to produce realistic
images representing rigid/semi-rigid objects, such as human faces or cars.
However, they usually struggle to generate high-quality images representing
non-rigid objects, such as the human body, which is of great interest for
many computer graphics applications. This paper proposes a 3D-aware
Semantic-Guided Generative Model (3D-SGAN) for human image synthesis, which
combines a GNeRF with a texture generator. The former learns an implicit 3D
representation of the human body and outputs a set of 2D semantic segmentation
masks. The latter transforms these semantic masks into a real image, adding a
realistic texture to the human appearance. Without requiring additional 3D
information, our model can learn 3D human representations with a
photo-realistic, controllable generation. Our experiments on the DeepFashion
dataset show that 3D-SGAN significantly outperforms the most recent baselines.
The code is available at https://github.com/zhangqianhui/3DSGAN
Comment: ECCV 2022, 29 pages
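The two-stage design (a GNeRF branch that renders semantic masks, followed by a texture generator that paints them) can be caricatured in a few lines of plain Python. All function names, shapes, and the "rendering" rule below are illustrative assumptions, not the authors' architecture:

```python
# Caricature of a two-stage 3D-SGAN-style pipeline (pure Python).
# All names, shapes and the "rendering" rule are illustrative assumptions.

def gnerf_semantic_stage(z, camera_pose, h=4, w=4, n_classes=3):
    """Stand-in for the GNeRF branch: latent code + camera pose -> a 2D
    semantic segmentation mask (one class id per pixel)."""
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            # deterministic pseudo-render mixing latent, pose and pixel coords
            row.append((z * (x + 1) + camera_pose * (y + 1)) % n_classes)
        mask.append(row)
    return mask

def texture_stage(mask, palette):
    """Stand-in for the texture generator: semantic classes -> RGB pixels."""
    return [[palette[c] for c in row] for row in mask]

palette = {0: (20, 20, 20), 1: (200, 120, 90), 2: (90, 140, 200)}
mask = gnerf_semantic_stage(z=7, camera_pose=3)
image = texture_stage(mask, palette)
```

The point of the split is visible even in the toy: the first stage controls geometry and pose, and only the second stage decides appearance, so the two can be edited independently.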
Human-controllable and structured deep generative models
Deep generative models are a class of probabilistic models that attempt to learn the underlying data distribution. These models are usually trained in an unsupervised way and thus do not require any labels. Generative models such as Variational Autoencoders and Generative Adversarial Networks have made astounding progress over recent years. These models offer several benefits: easy sampling and evaluation, efficient learning of low-dimensional representations for downstream tasks, and better understanding through interpretable representations. However, even though the quality of these models has improved immensely, the ability to control their style and structure remains limited. Structured and human-controllable representations of generative models are essential for human-machine interaction and other applications, including fairness, creativity, and entertainment.
This thesis investigates learning human-controllable and structured representations with deep generative models. In particular, we focus on generative modelling of 2D images. In the first part, we focus on learning clustered representations. We propose semi-parametric hierarchical variational autoencoders to estimate the intensity of facial action units. The semi-parametric model forms a hybrid generative-discriminative model and leverages both a parametric Variational Autoencoder and a non-parametric Gaussian Process autoencoder. We show superior performance in comparison with existing facial action unit estimation approaches. Based on the results and an analysis of the learned representation, we then focus on learning Mixture-of-Gaussians representations in an autoencoding framework. We deviate from the conventional autoencoding framework and consider a regularized objective based on the Cauchy-Schwarz divergence. The Cauchy-Schwarz divergence admits a closed-form solution for Mixture-of-Gaussians distributions, and thus allows the autoencoding objective to be optimized efficiently.
We show that our model outperforms existing Variational Autoencoders in density estimation, clustering, and semi-supervised facial action detection.
In the second part, we focus on learning disentangled representations for conditional generation and fair facial attribute classification. Conditional image generation relies on access to large-scale annotated datasets. Nevertheless, the geometry of visual objects, such as faces, cannot be learned implicitly, which deteriorates image fidelity. We propose incorporating facial landmarks with a statistical shape model and a differentiable piecewise affine transformation to separate the representations of appearance and shape. The goal of incorporating facial landmarks is that generation is controlled and can separate different appearances and geometries. In our last work, we use weak supervision to disentangle groups of variations. Earlier work on learning disentangled representations was done in an unsupervised fashion. However, recent works have shown that learning disentangled representations is not identifiable without any inductive biases. Since then, there has been a shift towards weakly-supervised disentanglement learning. We investigate using regularization based on the Kullback-Leibler divergence to disentangle groups of variations. The goal is to have consistent and separated subspaces for different groups, e.g., for content-style learning. Our evaluation shows increased disentanglement ability and competitive performance in image clustering and fair facial attribute classification with weak supervision, compared to supervised and semi-supervised approaches.
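The closed-form Cauchy-Schwarz divergence between Gaussian mixtures mentioned above follows from D_CS(p, q) = -log(∫pq / sqrt(∫p² ∫q²)) together with the fact that the integral of a product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means. A minimal 1-D sketch (the (weight, mean, variance) triple parameterisation is our own convention, not the thesis's notation):

```python
import math

def _gauss(d, s2):
    # density of N(0, s2) evaluated at d
    return math.exp(-d * d / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)

def _cross(p, q):
    # closed form: integral of p(x) q(x) dx for two Gaussian mixtures equals
    # sum_ij  w_i v_j N(mu_i - nu_j; 0, sigma_i^2 + tau_j^2)
    return sum(wi * vj * _gauss(mi - mj, si2 + sj2)
               for (wi, mi, si2) in p
               for (vj, mj, sj2) in q)

def cs_divergence(p, q):
    """Cauchy-Schwarz divergence between two 1-D Gaussian mixtures,
    each given as a list of (weight, mean, variance) triples."""
    return -math.log(_cross(p, q) / math.sqrt(_cross(p, p) * _cross(q, q)))

p = [(0.5, -1.0, 0.5), (0.5, 1.0, 0.5)]
q = [(0.3, 0.0, 1.0), (0.7, 2.0, 0.5)]
```

By the Cauchy-Schwarz inequality the ratio inside the log is at most 1, so the divergence is non-negative, symmetric, and exactly zero when the mixtures coincide; crucially, every term is closed-form, which is what makes the regularized autoencoding objective efficient to optimize.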
Deep Learning Methods for Human Activity Recognition using Wearables
Wearable sensors provide an infrastructure-less multi-modal sensing method. Current
trends point to a pervasive integration of wearables into our lives with these devices
providing the basis for wellness and healthcare applications across rehabilitation,
caring for a growing older population, and improving human performance.
Fundamental to these applications is our ability to automatically and accurately
recognise human activities from often tiny sensors embedded in wearables. In this
dissertation, we consider the problem of human activity recognition (HAR) using
multi-channel time-series data captured by wearable sensors.
Our collective know-how regarding the solution of HAR problems with wearables has
progressed immensely through the use of deep learning paradigms. Nevertheless, this
field still faces unique methodological challenges. As such, this dissertation focuses on
developing end-to-end deep learning frameworks to promote HAR application opportunities
using wearable sensor technologies and to mitigate specific associated challenges. In our
efforts, the investigated problems cover a diverse range of HAR challenges and span
from fully supervised to unsupervised problem domains.
In order to enhance automatic feature extraction from multi-channel time-series
data for HAR, the problem of learning enriched and highly discriminative activity
feature representations with deep neural networks is considered. Accordingly, novel
end-to-end network elements are designed which: (a) exploit the latent relationships
between multi-channel sensor modalities and specific activities, (b) employ effective
regularisation through data-agnostic augmentation for multi-modal sensor data
streams, and (c) incorporate optimisation objectives to encourage minimal intra-class
representation differences, while maximising inter-class differences to achieve more
discriminative features.
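Objective (c) above, small intra-class differences with large inter-class differences, can be illustrated with a toy centroid-based loss. The hinge-with-margin form below is one common instantiation and an assumption on our part, not necessarily the exact objective used in the dissertation:

```python
def discriminative_loss(features, labels, margin=1.0):
    """Toy compactness/separation objective: pull samples toward their class
    centroid (intra-class term) and hinge-penalise class centroids that are
    closer than `margin` (inter-class term)."""
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)
    cents = {y: tuple(sum(d) / len(fs) for d in zip(*fs))
             for y, fs in groups.items()}

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    # intra-class: mean distance of each sample to its own class centroid
    intra = sum(dist(f, cents[y]) for f, y in zip(features, labels)) / len(features)
    # inter-class: penalise centroid pairs that sit closer than the margin
    ks = sorted(cents)
    pairs = [(a, b) for i, a in enumerate(ks) for b in ks[i + 1:]]
    inter = sum(max(0.0, margin - dist(cents[a], cents[b])) for a, b in pairs)
    return intra + inter / max(len(pairs), 1)

separated = discriminative_loss([(0, 0), (0.1, 0), (5, 5), (5, 5.1)], [0, 0, 1, 1])
overlapping = discriminative_loss([(0, 0), (0.2, 0), (0.1, 0), (0.3, 0)], [0, 0, 1, 1])
```

Well-separated, tight classes score a much lower loss than overlapping ones, which is exactly the pressure such a term exerts on the learned features during training.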
In order to promote new opportunities in HAR with emerging battery-less sensing
platforms, the problem of learning from irregularly sampled and temporally sparse readings
captured by passive sensing modalities is considered. For the first time, an efficient
set-based deep learning framework is developed to address the problem. This
framework is able to learn directly from the generated data, bypassing the need for
the conventional interpolation pre-processing stage.
In order to address the multi-class window problem and create potential solutions
for the challenging task of concurrent human activity recognition, the problem of
enabling simultaneous prediction of multiple activities for sensory segments is considered.
As such, the flexibility provided by the emerging set learning concepts is further
leveraged to introduce a novel formulation of HAR. This formulation treats HAR
as a set prediction problem and elegantly caters for segments carrying sensor data
from multiple activities. To address this set prediction problem, a unified deep HAR
architecture is designed that: (a) incorporates a set objective to learn mappings from
raw input sensory segments to target activity sets, and (b) precedes the supervised
learning phase with unsupervised parameter pre-training to exploit unlabelled data
for better generalisation performance.
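The set formulation above can be sketched with a Deep-Sets-style sum-pooled encoder (order-invariant over the readings in a segment) and an order-free multi-hot target loss. The activity vocabulary, per-element embedding, and loss form below are illustrative assumptions, not the thesis's actual architecture:

```python
import math

ACTIVITIES = ["walk", "sit", "stand", "run"]  # illustrative label space

def set_encode(readings):
    """Deep-Sets-style segment encoder: embed each (time, value) reading
    independently, then sum-pool, so the encoding is invariant to the order
    in which readings arrive (useful for irregularly sampled data)."""
    pooled = [0.0, 0.0, 0.0]
    for t, v in readings:
        emb = (v, v * v, math.sin(t))  # toy per-element embedding phi
        for i, e in enumerate(emb):
            pooled[i] += e
    return pooled

def set_loss(probs, target_set):
    """Order-free set objective: binary cross-entropy of per-activity
    probabilities against the multi-hot encoding of the target activity set."""
    total = 0.0
    for i, act in enumerate(ACTIVITIES):
        y = 1.0 if act in target_set else 0.0
        pr = min(max(probs[i], 1e-7), 1.0 - 1e-7)
        total -= y * math.log(pr) + (1.0 - y) * math.log(1.0 - pr)
    return total / len(ACTIVITIES)

segment = [(0.0, 1.0), (0.5, -2.0), (1.3, 0.7)]
good = set_loss([0.9, 0.8, 0.1, 0.1], {"walk", "sit"})
bad = set_loss([0.1, 0.2, 0.9, 0.9], {"walk", "sit"})
```

Because the target is a set rather than a single label, a segment carrying both "walk" and "sit" data is handled naturally, which is the essence of the multi-class window formulation.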
In order to leverage the easily accessible unlabelled activity data-streams to serve
downstream classification tasks, the problem of unsupervised representation learning from
multi-channel time-series data is considered. For the first time, a novel recurrent
generative adversarial network (GAN) framework is developed that explores the GAN’s latent
feature space to extract highly discriminating activity features in an unsupervised
fashion. The superiority of the learned representations is substantiated by their
ability to outperform the de facto unsupervised approaches based on autoencoder
frameworks. At the same time, they rival the recognition performance of fully
supervised trained models on downstream classification benchmarks.
In recognition of the scarcity of large-scale annotated sensor datasets and the
tediousness of collecting additional labelled data in this domain, the hitherto unexplored
problem of end-to-end clustering of human activities from unlabelled wearable data is
considered. To address this problem, a first study is presented for the purpose of
developing a stand-alone deep learning paradigm to discover semantically meaningful
clusters of human actions. In particular, the paradigm is intended to: (a) leverage
the inherently sequential nature of sensory data, (b) exploit self-supervision from
reconstruction and future prediction tasks, and (c) incorporate clustering-oriented
objectives to promote the formation of highly discriminative activity clusters. The
systematic investigations in this study create new opportunities for HAR to learn
human activities using unlabelled data that can be conveniently and cheaply collected
from wearables.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
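Clustering-oriented objective (c) is often realised, in the wider deep clustering literature, with Student-t soft assignments sharpened into a target distribution and a KL loss (the deep embedded clustering pattern). The sketch below shows that pattern as an assumption about the flavour of objective meant, not the paradigm's exact formulation:

```python
import math

def soft_assign(zs, centroids):
    """Student-t soft assignment of embeddings to cluster centroids."""
    q = []
    for z in zs:
        row = [1.0 / (1.0 + sum((zi - ci) ** 2 for zi, ci in zip(z, c)))
               for c in centroids]
        s = sum(row)
        q.append([r / s for r in row])
    return q

def target_distribution(q):
    # sharpen assignments: p_ij proportional to q_ij^2 / sum_i q_ij,
    # pushing each point toward the cluster it is already confident about
    f = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        num = [row[j] ** 2 / f[j] for j in range(len(row))]
        s = sum(num)
        p.append([n / s for n in num])
    return p

def kl_clustering_loss(q):
    # KL(P || Q): the clustering-oriented term minimised during training
    p = target_distribution(q)
    return sum(p_ij * math.log(p_ij / q_ij)
               for p_row, q_row in zip(p, q)
               for p_ij, q_ij in zip(p_row, q_row) if p_ij > 0.0)

embeddings = [(0.0, 0.1), (0.1, 0.0), (3.0, 3.1), (3.1, 3.0)]
q = soft_assign(embeddings, [(0.0, 0.0), (3.0, 3.0)])
loss = kl_clustering_loss(q)
```

Minimising this KL term sharpens the soft assignments over training, which is what "promoting the formation of highly discriminative activity clusters" amounts to in practice.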
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Due to their wide range of
applications and the demonstrated potential of recent works, AIGC developments
have recently attracted significant attention, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. Moreover, we discuss the challenges and
potential future research directions.
Investigating human-perceptual properties of "shapes" using 3D shapes and 2D fonts
Shapes are generally used to convey meaning. They are used in video games, films, and other multimedia, in diverse ways. 3D shapes may be destined for virtual scenes or represent objects to be constructed in the real world. Fonts add character to an otherwise plain block of text, allowing the writer to make important points more visually prominent or distinct from other text. They can indicate the structure of a document at a glance. Rather than studying shapes through traditional geometric shape descriptors, we provide alternative methods to describe and analyse shapes through the lens of human perception. This is done via the concepts of Schelling Points and Image Specificity. Schelling Points are choices people make when they aim to match what they expect others to choose but cannot communicate with others to determine an answer. We study whole-mesh selections in this setting, where Schelling Meshes are the most frequently selected shapes. The key idea behind Image Specificity is that different images evoke different descriptions, but ‘specific’ images yield more consistent descriptions than others. We apply Specificity to 2D fonts. We show that each concept can be learned, and we predict them for fonts and 3D shapes, respectively, using a depth image-based convolutional neural network. Results are shown for a range of fonts and 3D shapes, and we demonstrate that font Specificity and the Schelling Meshes concept are useful for visualisation, clustering, and search applications. Overall, we find that each concept captures similarities between its respective type of shape, even when there are discontinuities between the shape geometries themselves. The ‘context’ of these similarities lies in some kind of abstract or subjective meaning that is consistent among different people.
Deep Visual Instruments: Realtime Continuous, Meaningful Human Control over Deep Neural Networks for Creative Expression
In this thesis, we investigate Deep Learning models as an artistic medium for new modes of performative, creative expression. We call these Deep Visual Instruments: realtime interactive generative systems that exploit and leverage the capabilities of state-of-the-art Deep Neural Networks (DNN), while allowing Meaningful Human Control, in a Realtime Continuous manner. We characterise Meaningful Human Control in terms of intent, predictability, and accountability; and Realtime Continuous Control with regards to its capacity for performative interaction with immediate feedback, enhancing goal-less exploration. The capabilities of DNNs that we are looking to exploit and leverage in this manner are their ability to learn hierarchical representations modelling highly complex, real-world data such as images. Thinking of DNNs as tools that extract useful information from massive amounts of Big Data, we investigate ways in which we can navigate and explore what useful information a DNN has learnt, and how we can meaningfully use such a model in the production of artistic and creative works, in a performative, expressive manner. We present five studies that approach this from different but complementary angles. These include: a collaborative, generative sketching application using MCTS and discriminative CNNs; a system to gesturally conduct the realtime generation of text in different styles using an ensemble of LSTM RNNs; a performative tool that allows for the manipulation of hyperparameters in realtime while a Convolutional VAE trains on a live camera feed; a live video feed processing software that allows for digital puppetry and augmented drawing; and a method that allows for long-form storytelling within a generative model's latent space with meaningful control over the narrative.
We frame our research with the realtime, performative expression provided by musical instruments as a metaphor: we think of these systems not as used by a user, but as played by a performer.