520 research outputs found

    Object-centric generative models for robot perception and action

    Robot manipulation involves a pipeline consisting of the perception of objects in the environment and the planning of actions in 3D space. Deep learning approaches are employed to segment scenes into object components and then learn object-centric features to predict actions for downstream tasks. Despite achieving promising performance on several manipulation tasks, supervised approaches lack inductive biases related to general properties of objects. Recent advances show that by encoding and reconstructing scenes in an object-centric fashion, a model can discover object-like entities from raw data without human supervision. Moreover, by reconstructing the discovered objects, the model can learn a variational latent space that captures the various shapes and textures of the objects, regularised by a chosen prior distribution. In this thesis, we investigate the properties of this learned object-centric latent space and develop novel object-centric generative models (OCGMs) that can be applied to real-world robotics scenarios. In the first part of this thesis, we investigate a tool-synthesis task which leverages a learned latent space to optimise a wide range of tools for a reaching task. Given an image that illustrates the obstacles and the reaching target in the scene, an affordance predictor is trained to predict the feasibility of a tool for the given task. To imitate human tool-use experience, feasibility labels are acquired from simulated trial-and-error attempts at the reaching task. We find that by employing an activation-maximisation step, the model can synthesise suitable tools for the given tasks with high accuracy. Moreover, the tool-synthesis process indicates the existence of a task-relevant trajectory in the learned latent space that can be found by a trained affordance predictor. The second part of this thesis focuses on the development of novel OCGMs and their applications to robotic tasks.
We first introduce a 2D OCGM that is deployed to robot manipulation datasets in both simulated and real-world scenarios. Despite the intensive interactions between the robot arm and objects, we find that the model discovers meaningful object entities from raw observations without any human supervision. We next upgrade the 2D OCGM to 3D by leveraging NeRFs as decoders to explicitly model the 3D geometry of objects and the background. To disentangle an object's spatial information from its appearance information, we propose a minimum volume principle for unsupervised 6D pose estimation of objects. To account for occlusion in the scene, we further improve the pose estimation by introducing a shape-completion module that imagines the unobserved parts of objects before the pose-estimation step. Finally, we apply the model in real-world robotics scenarios and compare its performance against several baselines on tasks including 3D reconstruction, object-centric latent representation learning, and 6D pose estimation for object rearrangement. We find that, despite being unsupervised, our model achieves improved performance across a range of real-world tasks.
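The activation-maximisation step described in the tool-synthesis part amounts to gradient ascent on a latent code under a trained affordance predictor. A minimal sketch follows, using a toy differentiable surrogate in place of the real predictor and decoder; `Z_STAR`, `affordance`, and all other names are hypothetical and not from the thesis:

```python
import numpy as np

# Toy stand-in for a trained affordance predictor: it scores how feasible the
# tool decoded from latent code z is for the reaching task. Here, a simple
# differentiable surrogate whose optimum sits at Z_STAR (purely illustrative).
Z_STAR = np.array([0.5, -1.0, 2.0])

def affordance(z):
    """Predicted task feasibility of the tool decoded from z (higher is better)."""
    return -np.sum((z - Z_STAR) ** 2)

def affordance_grad(z):
    """Analytic gradient of the surrogate predictor with respect to z."""
    return -2.0 * (z - Z_STAR)

def activation_maximisation(z0, lr=0.1, steps=200):
    """Gradient-ascent search for a latent code with high predicted feasibility."""
    z = z0.copy()
    for _ in range(steps):
        z += lr * affordance_grad(z)
    return z

z_opt = activation_maximisation(np.zeros(3))
```

In the thesis setting the gradient would come from backpropagating through a learned affordance network rather than an analytic surrogate, but the optimisation loop has the same shape.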

    ObPose: Leveraging Pose for Object-Centric Scene Inference in 3D

    We present ObPose, an unsupervised object-centric inference and generation model which learns 3D-structured latent representations from RGB-D scenes. Inspired by prior art in 2D representation learning, ObPose considers a factorised latent space, separately encoding object location (where) and appearance (what). ObPose further leverages an object's pose (i.e. location and orientation), defined via a minimum volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach to recover the object shape directly from a neural radiance field (NeRF). As a consequence, ObPose models each scene as a composition of NeRFs, richly representing individual objects. To evaluate the quality of the learned representations, ObPose is evaluated quantitatively on the YCB and CLEVR datasets for unsupervised scene segmentation, outperforming the current state of the art in 3D scene inference (ObSuRF) by a significant margin. Generative results provide qualitative demonstration that the same ObPose model can both generate novel scenes and flexibly edit the objects in them. These capabilities again reflect the quality of the learned latents and the benefits of disentangling the where and what components of a scene. Key design choices made in the ObPose encoder are validated with ablations. Comment: 19 pages, 9 figures
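The minimum-volume idea can be illustrated in 2D: recover an occupancy mask by thresholding a density field on a regular grid, then search for the rotation whose axis-aligned bounding box of the rotated points is tightest. This is a minimal sketch under a synthetic density function; all names are illustrative and none come from the ObPose implementation:

```python
import numpy as np

def occupied_points(density_fn, lo=-1.0, hi=1.0, res=32, thresh=0.5):
    """Evaluate the density on a regular 2D grid and keep points above threshold."""
    axis = np.linspace(lo, hi, res)
    xx, yy = np.meshgrid(axis, axis, indexing="ij")
    pts = np.stack([xx.ravel(), yy.ravel()], axis=1)
    return pts[density_fn(pts) > thresh]

def best_rotation(points, n_angles=90):
    """Pick the in-plane rotation minimising the bounding-box area."""
    best_angle, best_area = 0.0, np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles):
        c, s = np.cos(theta), np.sin(theta)
        rotated = points @ np.array([[c, -s], [s, c]])  # rotate cloud by -theta
        extent = rotated.max(axis=0) - rotated.min(axis=0)
        area = extent[0] * extent[1]
        if area < best_area:
            best_angle, best_area = theta, area
    return best_angle, best_area

def bar_density(pts):
    """Synthetic density: an elongated bar tilted 30 degrees off the x-axis."""
    alpha = np.pi / 6
    u = np.array([np.cos(alpha), np.sin(alpha)])    # long axis of the bar
    v = np.array([-np.sin(alpha), np.cos(alpha)])   # short axis of the bar
    inside = (np.abs(pts @ u) < 0.8) & (np.abs(pts @ v) < 0.1)
    return inside.astype(float)

pts = occupied_points(bar_density)
angle, area = best_rotation(pts)
```

The recovered `angle` aligns the bar with the axes, which is the 2D analogue of using a minimum-volume criterion to define object orientation; ObPose itself works in 3D with densities queried from a NeRF.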

    Groupwise non-rigid registration for automatic construction of appearance models of the human craniofacial complex for analysis, synthesis and simulation

    Finally, a novel application of 3D appearance modelling is proposed: a faster-than-real-time algorithm for statistically constrained quasi-mechanical simulation. Experiments demonstrate superior realism, achieved in the proposed method by employing statistical appearance models to drive the simulation, in comparison with comparable state-of-the-art quasi-mechanical approaches.

    Knowledge-Guided Data-Centric AI in Healthcare: Progress, Shortcomings, and Future Directions

    The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of examples of a particular concept or meaning. In the field of medicine, having a diverse set of training data on a particular disease can lead to the development of a model that is able to accurately predict the disease. However, despite the potential benefits, there have not been significant advances in image-based diagnosis due to a lack of high-quality annotated data. This article highlights the importance of using a data-centric approach to improve the quality of data representations, particularly in cases where the available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data: data augmentation, transfer learning, federated learning, and generative adversarial networks (GANs). We also propose the use of knowledge-guided GANs to incorporate domain knowledge in the training-data generation process. With the recent progress in large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can be used to improve the effectiveness of knowledge-guided generative methods. Comment: 21 pages, 13 figures, 4 tables
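Of the four methods listed, data augmentation is the simplest to illustrate: each image yields several label-preserving variants, multiplying the effective size of a small dataset. A minimal sketch on synthetic images follows (flips plus Gaussian pixel noise; all names and parameter values are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, n_variants=4, noise_scale=0.05):
    """Return n_variants label-preserving perturbed copies of one (H, W) image."""
    variants = []
    for i in range(n_variants):
        out = image.copy()
        if i % 2 == 1:
            out = np.fliplr(out)  # horizontal flip on alternating variants
        out = out + rng.normal(0.0, noise_scale, size=out.shape)
        variants.append(np.clip(out, 0.0, 1.0))  # keep pixels in [0, 1]
    return variants

# A toy "small dataset" of 10 images grows to 40 augmented training examples,
# each variant inheriting the label of its source image.
images = [rng.random((8, 8)) for _ in range(10)]
labels = list(range(10))
aug_images = [v for img in images for v in augment(img)]
aug_labels = [y for y in labels for _ in range(4)]
```

Real medical-imaging pipelines would restrict the transforms to ones that provably preserve the diagnostic label, which is exactly where the article's knowledge-guided generation argument comes in.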

    Statistical modelling for facial expression dynamics

    Facial expressions are among the most powerful and fastest means of relaying emotions between humans. The ability to capture, understand, and mimic those emotions and their underlying dynamics in a synthetic counterpart is a challenging task because of the complexity of human emotions, the different ways of conveying them, non-linearities caused by facial-feature and head motion, and the ever-critical eye of the viewer. This thesis sets out to address some of the limitations of existing techniques by investigating three components of an expression modelling and parameterisation framework: (1) feature and expression manifold representation, (2) pose estimation, and (3) expression dynamics modelling and parameterisation for the purpose of driving a synthetic head avatar. First, we introduce a hierarchical representation based on the Point Distribution Model (PDM). Holistic representations imply that non-linearities caused by the motion of facial features, and intra-feature correlations, are implicitly embedded and hence have to be accounted for in the resulting expression space. Such representations also require large training datasets to account for all possible variations. To address those shortcomings, and to provide a basis for learning more subtle, localised variations, our representation consists of a tree-like structure in which a holistic root component is decomposed into leaves containing the jaw outline, each of the eyes and eyebrows, and the mouth. Each of the hierarchical components is modelled according to its intrinsic functionality, rather than the final, holistic expression label. Secondly, we introduce a statistical approach for capturing an underlying low-dimensional expression manifold by utilising components of the previously defined hierarchical representation.
As Principal Component Analysis (PCA)-based approaches cannot reliably capture variations caused by large facial-feature changes because of their linear nature, the underlying dynamics manifold for each of the hierarchical components is modelled using a Hierarchical Latent Variable Model (HLVM) approach. Whilst retaining PCA properties, such a model introduces a probability density model which can deal with missing or incomplete data and allows discovery of internal within-cluster structures. All of the model parameters and the underlying density model are automatically estimated during the training stage. We investigate the usefulness of such a model on larger and unseen datasets. Thirdly, we extend the HLVM concept to pose estimation, to address the non-linear shape deformations and the definition of the plausible pose space caused by large head motion. Since our heads rarely stay still, and their movements are intrinsically connected with the way we perceive and understand expressions, pose information is an integral part of their dynamics. The proposed approach integrates into our existing hierarchical representation model. It is learned from a sparse, discretely sampled training dataset, and generalises to a larger, continuous view-sphere. Finally, we introduce a framework that models and extracts expression dynamics. In existing frameworks, the explicit definition of expression intensity and pose information is often overlooked, although it is usually implicitly embedded in the underlying representation. We investigate modelling of the expression dynamics based on static information only, and focus on its sufficiency for the task at hand. We compare a rule-based method that utilises the existing latent structure and provides a fusion of different components with holistic and Bayesian Network (BN) approaches. An Active Appearance Model (AAM) based tracker is used to extract relevant information from input sequences.
This information is subsequently used to define the parametric structure of the underlying expression dynamics. We demonstrate that it can be utilised to animate a synthetic head avatar.
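The Point Distribution Model at the root of the hierarchical representation rests on PCA over flattened landmark coordinates: shapes become the mean plus a weighted sum of a few linear modes. A minimal sketch on synthetic landmark data follows (all variables illustrative, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training set: 50 shapes of 10 2D landmarks, varying along
# two hidden linear directions around a base shape.
n_shapes, n_landmarks = 50, 10
base = rng.random((n_landmarks, 2))
dirs = rng.normal(size=(2, n_landmarks, 2))      # two true modes of variation
coeffs = rng.normal(size=(n_shapes, 2))
shapes = base + np.einsum("sk,kld->sld", coeffs, dirs)

# PDM-style PCA: flatten landmarks, centre, take the top right singular vectors.
X = shapes.reshape(n_shapes, -1)
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
modes = Vt[:2]                                   # top-2 shape modes

def reconstruct(x):
    """Project a flattened shape onto the modes and map back to shape space."""
    b = (x - mean) @ modes.T                     # per-shape mode weights
    return mean + b @ modes

err = np.max(np.abs(reconstruct(X[0]) - X[0]))
```

Because the synthetic data truly has two modes, two principal components reconstruct it essentially exactly; the thesis's point is that real facial-feature motion breaks this linearity, motivating the hierarchical and HLVM extensions.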

    3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces

    Learning a disentangled, interpretable, and structured latent representation in 3D generative models of faces and bodies is still an open problem. The problem is particularly acute when control over identity features is required. In this paper, we propose an intuitive yet effective self-supervised approach to train a 3D shape variational autoencoder (VAE) which encourages a disentangled latent representation of identity features. Curating the mini-batch generation by swapping arbitrary features across different shapes allows us to define a loss function that leverages known differences and similarities in the latent representations. Experimental results on 3D meshes show that state-of-the-art methods for latent disentanglement are not able to disentangle identity features of faces and bodies. Our proposed method properly decouples the generation of such features while maintaining good representation and reconstruction capabilities.
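The swapping mechanism can be sketched abstractly: designate a latent slice for the swapped feature, exchange it between two shapes, and penalise any mismatch between the known similarities (swapped slice ~ donor, remainder ~ original). The slice layout and all names below are hypothetical, not the paper's actual architecture:

```python
import numpy as np

def swap_feature(z_a, z_b, k):
    """Exchange the first k latent dimensions (the chosen feature) between codes."""
    z_a2, z_b2 = z_a.copy(), z_b.copy()
    z_a2[:k], z_b2[:k] = z_b[:k], z_a[:k]
    return z_a2, z_b2

def swap_consistency_loss(z_swapped, z_donor, z_orig, k):
    """Penalise deviation from known similarities created by the swap:
    the swapped slice should match the donor, the rest should match the original."""
    return (np.sum((z_swapped[:k] - z_donor[:k]) ** 2)
            + np.sum((z_swapped[k:] - z_orig[k:]) ** 2))

z_a = np.array([1.0, 2.0, 3.0, 4.0])
z_b = np.array([9.0, 8.0, 7.0, 6.0])
za_s, zb_s = swap_feature(z_a, z_b, k=2)
```

In the paper the swap happens on mesh features before encoding and the loss is applied to the encoder's latents, so a perfectly disentangled encoder drives this consistency loss to zero; the sketch only shows the bookkeeping of which latent slices are expected to match.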