831 research outputs found

    Visual Imitation Learning with Recurrent Siamese Networks

    It would be desirable for a reinforcement learning (RL) based agent to learn behaviour by merely watching a demonstration. However, defining rewards that facilitate this goal within the RL paradigm remains a challenge. Here we address this problem with Siamese networks, trained to compute distances between observed behaviours and the agent's behaviours. Given a desired motion, such Siamese networks can be used to provide a reward signal to an RL agent via the distance between the desired motion and the agent's motion. We experiment with an RNN-based comparator model that can compute distances in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we have also found that the inclusion of multi-task data and an additional image encoding loss helps enforce temporal consistency. These two components appear to balance reward for matching a specific instance of behaviour versus that behaviour in general. Furthermore, we focus here on a particularly challenging form of this problem where only a single demonstration is provided for a given task: the one-shot learning setting. We demonstrate our approach on humanoid agents in both 2D with 10 degrees of freedom (DoF) and 3D with 38 DoF. Comment: PrePrin
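The distance-as-reward idea in this abstract can be illustrated with a minimal sketch. It assumes the two motion clips have already been mapped to fixed-length embedding vectors; in the paper that mapping is a learned RNN-based Siamese comparator, which is not reproduced here. The function names `euclidean` and `distance_reward` are hypothetical, and plain Euclidean distance stands in for the learned distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_reward(demo_embedding, agent_embedding):
    """Hypothetical reward: negative distance, so closer motions score higher.

    Assumption: both embeddings come from the same (Siamese) encoder.
    """
    return -euclidean(demo_embedding, agent_embedding)


# An agent clip whose embedding is closer to the demonstration earns more reward.
demo = [0.0, 0.0]
near = [0.3, 0.4]
far = [3.0, 4.0]
print(distance_reward(demo, near) > distance_reward(demo, far))
```

An RL policy trained to maximize this reward is, equivalently, trained to minimize the embedding distance to the demonstration.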

    Adversarial content manipulation for analyzing and improving model robustness

    The recent rapid progress in machine learning systems has opened up many real-world applications, from recommendation engines on web platforms to safety-critical systems like autonomous vehicles. A model deployed in the real world will often encounter inputs far from its training distribution. For example, a self-driving car might come across a black stop sign in the wild. To ensure safe operation, it is vital to quantify the robustness of machine learning models to such out-of-distribution data before releasing them into the real world. However, the standard paradigm of benchmarking machine learning models with fixed-size test sets drawn from the same distribution as the training data is insufficient to identify these corner cases efficiently. In principle, if we could generate all valid variations of an input and measure the model response, we could quantify and guarantee model robustness locally. Yet, doing this with real-world data is not scalable. In this thesis, we propose an alternative, using generative models to create synthetic data variations at scale and test the robustness of target models to these variations. We explore methods to generate semantic data variations in a controlled fashion across visual and text modalities. We build generative models capable of performing controlled manipulation of data, such as changing visual context, editing the appearance of an object in images, or changing the writing style of text. Leveraging these generative models, we propose tools to study the robustness of computer vision systems to input variations and systematically identify failure modes. In the text domain, we deploy these generative models to improve the diversity of image captioning systems and perform writing style manipulation to obfuscate private attributes of the user. Our studies quantifying model robustness explore two kinds of input manipulations, model-agnostic and model-targeted.
The model-agnostic manipulations leverage human knowledge to choose the kinds of changes without considering the target model being tested. This includes automatically editing images to remove objects not directly relevant to the task and create variations in visual context. Alternatively, in the model-targeted approach, the input variations performed are directly adversarially guided by the target model. For example, we adversarially manipulate the appearance of an object in the image to fool an object detector, guided by the gradients of the detector. Using these methods, we measure and improve the robustness of various computer vision systems (specifically image classification, segmentation, object detection and visual question answering systems) to semantic input variations.

    Synthesization and reconstruction of 3D faces by deep neural networks

    The past few decades have witnessed substantial progress towards 3D facial modelling and reconstruction, as it is of high importance for many computer vision and graphics applications including Augmented/Virtual Reality (AR/VR), computer games, movie post-production, image/video editing, medical applications, etc. In the traditional approaches, facial texture and shape are represented as a triangle mesh that can cover identity and expression variation with non-rigid deformation. A dataset of 3D face scans is then densely registered into a common topology in order to construct a linear statistical model. Such models are called 3D Morphable Models (3DMMs) and can be used for 3D face synthesis or reconstruction from a single or a few 2D face images. The works presented in this thesis focus on the modernization of these traditional techniques in the light of recent advances in deep learning and thanks to the availability of large-scale datasets. Since the introduction of 3DMMs over two decades ago, they have seen a lot of progress and are still considered one of the best methodologies to model 3D faces. Nevertheless, there are still several aspects of them that need to be upgraded to the "deep era". Firstly, conventional 3DMMs are built by linear statistical approaches such as Principal Component Analysis (PCA), which omits high-frequency information by its nature. While this does not curtail shape, which is often smooth in the original data, texture models are heavily afflicted by losing high-frequency details and photorealism. Secondly, the existing 3DMM fitting approaches rely on very primitive (e.g. RGB values, sparse landmarks) or hand-crafted features (e.g. HOG, SIFT) as supervision that are sensitive to "in-the-wild" conditions (e.g. lighting, pose, occlusion), or that somewhat miss identity/expression resemblance with the target image.
Finally, shape, texture, and expression modalities are modelled separately, ignoring the correlation among them, which places a fundamental limit on the synthesis of semantically meaningful 3D faces. Moreover, photorealistic 3D face synthesis has not been studied thoroughly in the literature. This thesis attempts to address the above-mentioned issues by harnessing the power of deep neural networks and generative adversarial networks, as explained below. Due to the linear texture models, many of the state-of-the-art methods are still not capable of reconstructing facial textures with high-frequency details. For this, we take a radically different approach and build a high-quality texture model with Generative Adversarial Networks (GANs) that preserves details. That is, we utilize GANs to train a very powerful generator of facial texture in the UV space. We then show that it is possible to employ this generator network as a statistical texture prior for 3DMM fitting. The resulting texture reconstructions are plausible and photorealistic, as GANs are faithful to the real-data distribution in both low- and high-frequency domains. Then, we revisit from a new perspective the conventional 3DMM fitting approaches that use non-linear optimization to find the optimal latent parameters that best reconstruct the test image. We propose to optimize the parameters with the supervision of pretrained deep identity features through our end-to-end differentiable framework. In order to be robust towards initialization and expedite the fitting process, we also propose a novel self-supervised regression-based approach. We demonstrate excellent 3D face reconstructions that are photorealistic and identity preserving and achieve for the first time, to the best of our knowledge, facial texture reconstruction with high-frequency details.
In order to extend the non-linear texture model for photorealistic 3D face synthesis, we present a methodology that generates high-quality texture, shape, and normals jointly. To do so, we propose a novel GAN that can generate data from different modalities while exploiting their correlations. Furthermore, we demonstrate how we can condition the generation on the expression and create faces with various facial expressions. Additionally, we study another approach for photorealistic face synthesis by 3D guidance. This study proposes to generate 3D faces with a linear 3DMM and then augment their 2D rendering with an image-to-image translation network to the photorealistic face domain. Both works demonstrate excellent photorealistic face synthesis and show that the generated faces improve face recognition benchmarks when used as synthetic training data. Finally, we study expression reconstruction for personalized 3D face models, where we improve the generalization and robustness of expression encoding. First, we propose a 3D augmentation approach on 2D head-mounted camera images to increase robustness to perspective changes. We also propose to train a generic expression encoder network by increasing the number of identities with a novel multi-id personalized model training architecture in a self-supervised manner. Both approaches show promising results in both qualitative and quantitative experiments. Open Acces

    Face analysis and deepfake detection

    This thesis concerns deep-learning-based face-related research topics. We explore how to improve the performance of several face systems when confronting challenging variations. In Chapter 1, we provide an introduction and background information on the theme, and we list the main research questions of this dissertation. In Chapter 2, we provide a synthetic face data generator with fully controlled variations and propose a detailed experimental comparison of the main characteristics that influence face detection performance. The results show that our synthetic dataset could complement face detectors to become more robust against specific features in the real world. Our analysis also reveals that a variety of data augmentation is necessary to address differences in performance. In Chapter 3, we propose an age estimation method for handling large pose variations in unconstrained face images. A Wasserstein-based GAN model is used to complete the full UV texture representation. The proposed AgeGAN method simultaneously learns to capture the facial UV texture map and age characteristics. In Chapter 4, we propose a maximum mean discrepancy (MMD) based cross-domain face forgery detection method. The center and triplet losses are also incorporated to ensure that the learned features are shared by multiple domains and provide better generalization abilities to unseen deepfake samples. In Chapter 5, we introduce an end-to-end framework to predict ages from face videos. Clustering-based transfer learning is used to provide proper prediction for imbalanced datasets.
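Chapter 4 of this thesis builds on maximum mean discrepancy (MMD), a kernel-based measure of how far apart two feature distributions are. The sketch below shows only the standard biased estimate of squared MMD with an RBF kernel. The helper names `rbf` and `mmd2`, the `gamma` value and the toy samples are assumptions for illustration, not the thesis's implementation, which additionally uses center and triplet losses.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of feature vectors."""
    def mean_kernel(a, b):
        total = sum(rbf(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 1-D "features" from two hypothetical domains.
source = [[0.0], [0.1]]
target = [[5.0], [5.1]]
print(mmd2(source, source))  # identical samples give zero discrepancy
print(mmd2(source, target) > 0.0)
```

Used as a training loss, a small value indicates that features from the two domains are hard to tell apart, which is the property cross-domain detection relies on.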

    Learning Invariant Representations of Images for Computational Pathology


    Going Deeper into Action Recognition: A Survey

    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis has evolved from earlier schemes that were often limited to controlled environments to today's advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications, from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved ever more rapidly, so that what used to be state of the art is superseded within a short time. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then navigate into the realm of deep-learning-based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable fallbacks, in the hope of raising fresh questions and motivating new research directions for the reader.

    A survey on generative adversarial networks for imbalance problems in computer vision tasks

    Any computer vision application development starts with acquiring images and data, followed by preprocessing and pattern recognition steps to perform a task. When the acquired images are highly imbalanced and not adequate, the desired task may not be achievable. Unfortunately, the occurrence of imbalance problems in acquired image datasets is inevitable in certain complex real-world problems such as anomaly detection, emotion recognition, medical image analysis, fraud detection, metallic surface defect detection, disaster prediction, etc. The performance of computer vision algorithms can significantly deteriorate when the training dataset is imbalanced. In recent years, Generative Adversarial Networks (GANs) have gained immense attention from researchers across a variety of application domains due to their capability to model complex real-world image data. Importantly, GANs can not only be used to generate synthetic images; their adversarial learning idea has also shown good potential for restoring balance in imbalanced datasets. In this paper, we examine the most recent developments in GAN-based techniques for addressing imbalance problems in image data. The real-world challenges and implementations of synthetic image generation based on GANs are extensively covered in this survey. Our survey first introduces various imbalance problems in computer vision tasks and their existing solutions, and then examines key concepts such as deep generative image models and GANs. After that, we propose a taxonomy that summarizes GAN-based techniques for addressing imbalance problems in computer vision tasks into three major categories: 1. image-level imbalances in classification, 2. object-level imbalances in object detection and 3. pixel-level imbalances in segmentation tasks. We elaborate on the imbalance problems of each group and provide GAN-based solutions for each group.
Readers will understand how GAN-based techniques can handle the problem of imbalances and boost the performance of computer vision algorithms.

    An Analysis on Adversarial Machine Learning: Methods and Applications

    Deep learning has witnessed astonishing advancement in the last decade and revolutionized many fields ranging from computer vision to natural language processing. A prominent field of research that enabled such achievements is adversarial learning, which investigates the behavior and functionality of a learning model in the presence of an adversary. Adversarial learning consists of two major trends. The first trend analyzes the susceptibility of machine learning models to manipulation in the decision-making process and aims to improve the robustness to such manipulations. The second trend exploits adversarial games between components of the model to enhance the learning process. This dissertation aims to provide an analysis of these two sides of adversarial learning and harness their potential for improving the robustness and generalization of deep models. In the first part of the dissertation, we study the adversarial susceptibility of deep learning models. We provide an empirical analysis of the extent of vulnerability by proposing two adversarial attacks that explore the geometric and frequency-domain characteristics of inputs to manipulate deep decisions. Afterward, we formalize the susceptibility of deep networks using the first-order approximation of the predictions and extend the theory to the ensemble classification scheme. Inspired by theoretical findings, we formalize a reliable and practical defense against adversarial examples to robustify ensembles. We extend this part by investigating the shortcomings of adversarial training and highlight that the popular momentum stochastic gradient descent, developed essentially for natural training, is not well suited to optimization in adversarial training since it is not designed to be robust against the chaotic behavior of gradients in this setup. Motivated by these observations, we develop an optimization method that is more suitable for adversarial training.
In the second part of the dissertation, we harness adversarial learning to enhance the generalization and performance of deep networks in discriminative and generative tasks. We develop several models for biometric identification, including fingerprint distortion rectification and latent fingerprint reconstruction. In particular, we develop a ridge reconstruction model based on generative adversarial networks that estimates the missing ridge information in latent fingerprints. We introduce a novel modification that enables the generator network to preserve the ID information during the reconstruction process. To address the scarcity of data, e.g., in latent fingerprint analysis, we develop a supervised augmentation technique that combines input examples based on their salient regions. Our findings suggest that adversarial learning improves the performance and reliability of deep networks in a wide range of applications.
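The first part of this dissertation refers to a first-order approximation of a model's predictions. As a generic illustration (not the dissertation's own derivation), the standard linearization of a differentiable score $f$ around an input $x$ under a small perturbation $\delta$ is sketched below. The symbols $f$, $x$, $\delta$ and the budget $\epsilon$ are assumptions introduced here.

```latex
% First-order (Taylor) approximation of the prediction under a perturbation
f(x + \delta) \approx f(x) + \nabla_x f(x)^{\top} \delta

% Worst-case change under an L-infinity budget \|\delta\|_\infty \le \epsilon
\max_{\|\delta\|_\infty \le \epsilon} \nabla_x f(x)^{\top} \delta
  = \epsilon \, \|\nabla_x f(x)\|_1,
\qquad
\delta^{\star} = \epsilon \, \operatorname{sign}\!\big(\nabla_x f(x)\big)
```

Under this approximation, the sensitivity of a prediction to a bounded perturbation is governed by the norm of the input gradient, which is one common way to reason about adversarial susceptibility.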