10 research outputs found

    Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos

    We propose KeypointGAN, a new method for recognizing the pose of objects from a single image that learns using only unlabelled videos and a weak empirical prior on the object poses. Video frames differ primarily in the pose of the objects they contain, so our method distils the pose information by analyzing the differences between frames. The distillation uses a new dual representation of the geometry of objects as a set of 2D keypoints, and as a pictorial representation, i.e. a skeleton image. This has three benefits: (1) it provides a tight 'geometric bottleneck' which disentangles pose from appearance, (2) it can leverage powerful image-to-image translation networks to map between photometry and geometry, and (3) it allows empirical pose priors to be incorporated in the learning process. The pose priors are obtained from unpaired data, such as a different dataset or a different modality such as mocap, so that no annotated image is ever used in learning the pose recognition network. In standard benchmarks for pose recognition for humans and faces, our method achieves state-of-the-art performance among methods that do not require any labelled images for training. Comment: CVPR 2020 (oral). Project page: http://www.robots.ox.ac.uk/~vgg/research/unsupervised_pose
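The dual representation described above (a set of 2D keypoints and a skeleton image) can be illustrated with a minimal sketch. The function `render_skeleton`, the edge list, and the 3-joint chain are hypothetical names and data chosen for illustration; the abstract does not specify how the skeleton image is drawn, and a learning pipeline would need a differentiable renderer rather than this hard rasterisation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def render_skeleton(
    keypoints: Sequence[Point],
    edges: Sequence[Tuple[int, int]],
    height: int,
    width: int,
) -> List[List[int]]:
    """Rasterise the limbs joining pairs of keypoints into a binary image."""
    image = [[0] * width for _ in range(height)]
    for a, b in edges:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Sample enough points along the segment to avoid gaps.
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for i in range(steps + 1):
            t = i / steps
            x = round(x0 + t * (x1 - x0))
            y = round(y0 + t * (y1 - y0))
            if 0 <= x < width and 0 <= y < height:
                image[y][x] = 1
    return image


# Hypothetical 3-joint chain (illustrative only): head -> hip -> foot.
keypoints = [(2.0, 0.0), (2.0, 3.0), (4.0, 5.0)]
edges = [(0, 1), (1, 2)]
skeleton = render_skeleton(keypoints, edges, height=6, width=6)
for row in skeleton:
    print("".join("#" if v else "." for v in row))
```

The same pose is thus available both as a short list of coordinates and as an image that an image-to-image network could consume.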

    Shape Consistent 2D Keypoint Estimation under Domain Shift

    Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under domain shift, i.e. when the training (source) and the test (target) images differ significantly in visual appearance. Our method seamlessly combines three components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture uses domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, we propose a novel loss that combines an adversarial term, which ensures aligned predictions in the output space, with a geometric consistency term, which enforces coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task.
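The geometric consistency term mentioned above can be sketched as follows, under assumptions that are not in the abstract: the perturbation is a horizontal flip, the predictor outputs a single keypoint, and the penalty is a squared distance. All function names are hypothetical, and `centroid_predictor` is a toy stand-in for a network.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Keypoint = Tuple[float, float]  # (x, y)


def flip_image(image: Image) -> Image:
    """Perturbation T applied to the image: horizontal flip."""
    return [row[::-1] for row in image]


def flip_keypoint(kp: Keypoint, width: int) -> Keypoint:
    """The same perturbation T applied to a keypoint coordinate."""
    x, y = kp
    return (width - 1 - x, y)


def centroid_predictor(image: Image) -> Keypoint:
    """Toy predictor: the intensity-weighted centre of the image."""
    total = sum(sum(row) for row in image)
    x = sum(v * j for row in image for j, v in enumerate(row)) / total
    y = sum(v * i for i, row in enumerate(image) for v in row) / total
    return (x, y)


def consistency_loss(predict: Callable[[Image], Keypoint], image: Image) -> float:
    """Squared distance between T(f(x)) and f(T(x))."""
    width = len(image[0])
    expected = flip_keypoint(predict(image), width)
    actual = predict(flip_image(image))
    return (expected[0] - actual[0]) ** 2 + (expected[1] - actual[1]) ** 2


def biased_predictor(image: Image) -> Keypoint:
    """A predictor that is not flip-consistent: it always drifts right."""
    x, y = centroid_predictor(image)
    return (x + 1.0, y)


# One bright pixel at column 1, row 2 of a 3x4 image.
image = [[0.0] * 4 for _ in range(3)]
image[2][1] = 1.0
loss_good = consistency_loss(centroid_predictor, image)
loss_bad = consistency_loss(biased_predictor, image)
print(loss_good, loss_bad)
```

A predictor that commutes with the perturbation incurs no penalty, while one whose output ignores the geometry of the perturbation is penalised, which is why such a term needs no target labels.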

    Relation-Based Associative Joint Location for Human Pose Estimation in Videos

    Video-based human pose estimation (HPE) is a vital yet challenging task. While deep learning methods have made significant progress on HPE, most approaches detect each joint independently, which damages the structural information of the pose. In this paper, unlike prior methods, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) to locate joints associatively. Specifically, we design a lightweight joint relation extractor (JRE) that models the pose structural features and generates joint heatmaps associatively by modeling the relation between any two joints heuristically, instead of building each joint heatmap independently. In effect, the proposed JRE module models the spatial configuration of human poses through the relationship between any two joints. Moreover, given the temporal semantic continuity of videos, the pose semantic information in the current frame is beneficial for guiding the location of joints in the next frame. Therefore, we use the idea of knowledge reuse to propagate the pose semantic information between consecutive frames. In this way, the proposed RPSTN captures the temporal dynamics of poses. In space, the JRE module can infer invisible joints from their relationship to other visible joints. In time, the proposed model can transfer the pose semantic features from a non-occluded frame to an occluded frame to locate occluded joints. Therefore, our method is robust to occlusion and achieves state-of-the-art results on two challenging datasets, which demonstrates its effectiveness for video-based human pose estimation. We will release the code and models publicly.

    Landmarks Augmentation with Manifold-Barycentric Oversampling

    The training of Generative Adversarial Networks (GANs) requires a large amount of data, stimulating the development of new augmentation methods to alleviate the challenge. Oftentimes, these methods either fail to produce enough new data or expand the dataset beyond the original manifold. In this paper, we propose a new augmentation method that, thanks to optimal transport theory, guarantees that the new data stay within the original data manifold. The proposed algorithm finds cliques in the nearest-neighbors graph and, at each sampling iteration, randomly draws one clique to compute the Wasserstein barycenter with random uniform weights. These barycenters then become the new natural-looking elements that one could add to the dataset. We apply this approach to the problem of landmark detection and augment the available annotation in both unpaired and semi-supervised scenarios. Additionally, the idea is validated on cardiac data for the task of medical segmentation. Our approach reduces overfitting and improves the quality metrics beyond the original data outcome and beyond the result obtained with popular modern augmentation methods. Comment: 11 pages, 4 figures, 3 tables. I.B. and N.B. contributed equally. D.V.D. is the corresponding author
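The sampling loop described above (find cliques in a nearest-neighbour graph, draw one, combine its members with random weights) can be sketched as below. This is a simplified stand-in, not the paper's algorithm: it assumes every sample is an ordered list of landmarks with known correspondence, so the Wasserstein barycenter is replaced by a plain weighted mean of corresponding landmarks, and it restricts cliques to triangles. All names and the toy data are hypothetical.

```python
import itertools
import math
import random
from typing import List, Sequence, Set, Tuple

Shape = List[Tuple[float, float]]  # ordered landmarks of one sample


def shape_distance(a: Shape, b: Shape) -> float:
    """Euclidean distance between two shapes with matching landmark order."""
    return math.sqrt(
        sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    )


def knn_edges(shapes: Sequence[Shape], k: int) -> Set[frozenset]:
    """Undirected edges linking each shape to its k nearest neighbours."""
    edges = set()
    for i, s in enumerate(shapes):
        others = sorted(
            (j for j in range(len(shapes)) if j != i),
            key=lambda j: shape_distance(s, shapes[j]),
        )
        for j in others[:k]:
            edges.add(frozenset((i, j)))
    return edges


def triangles(n: int, edges: Set[frozenset]) -> List[Tuple[int, int, int]]:
    """All 3-cliques of the graph (brute force, fine for small n)."""
    return [
        c
        for c in itertools.combinations(range(n), 3)
        if all(frozenset(p) in edges for p in itertools.combinations(c, 2))
    ]


def simplex_weights(m: int, rng: random.Random) -> List[float]:
    """Weights drawn uniformly from the simplex (normalised exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_new_shape(shapes, cliques, rng: random.Random) -> Shape:
    """Draw one clique and return the weighted mean of its shapes.

    Assumption: with fixed landmark correspondence, the weighted mean
    stands in for the Wasserstein barycenter used in the paper.
    """
    clique = rng.choice(cliques)
    w = simplex_weights(len(clique), rng)
    n_landmarks = len(shapes[clique[0]])
    return [
        (
            sum(wi * shapes[i][p][0] for wi, i in zip(w, clique)),
            sum(wi * shapes[i][p][1] for wi, i in zip(w, clique)),
        )
        for p in range(n_landmarks)
    ]


# Toy data: three similar two-landmark shapes and one outlier.
shapes = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.1, 0.0), (1.1, 0.0)],
    [(0.0, 0.1), (1.0, 0.1)],
    [(5.0, 5.0), (6.0, 5.0)],
]
cliques = triangles(len(shapes), knn_edges(shapes, k=2))
rng = random.Random(0)
new_shape = sample_new_shape(shapes, cliques, rng)
print(cliques, new_shape)
```

Because each new sample is a convex combination of neighbouring samples, it cannot leave the region spanned by its clique, which is the intuition behind staying close to the data manifold.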

    From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI

    The rising popularity of explainable artificial intelligence (XAI) for understanding high-performing black boxes has raised the question of how to evaluate explanations of machine learning (ML) models. While interpretability and explainability are often presented as a subjectively validated binary property, we consider them a multi-faceted concept. We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated to comprehensively assess the quality of an explanation. Our so-called Co-12 properties serve as a categorization scheme for systematically reviewing the evaluation practice of more than 300 papers published in the last 7 years at major AI and ML conferences that introduce an XAI method. We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users. We also contribute to the call for objective, quantifiable evaluation methods by presenting an extensive overview of quantitative XAI evaluation methods. This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods. This also opens up opportunities to include quantitative metrics as optimization criteria during model training in order to optimize for accuracy and interpretability simultaneously. Comment: Link to website added: https://utwente-dmb.github.io/xai-papers

    Towards accurate multi-person pose estimation in the wild

    In this thesis we are concerned with the problem of articulated human pose estimation and pose tracking in images and video sequences. Human pose estimation is the task of localising the major joints of a human skeleton in natural images. It is one of the most important visual recognition tasks in scenes containing humans, with numerous applications in robotics, virtual and augmented reality, gaming and healthcare, among others. Articulated human pose tracking requires tracking multiple persons in a video sequence while simultaneously estimating full body poses. This task is important for analysing surveillance footage, activity recognition, sports analytics, etc. Most prior work focused on the pose estimation of single pre-localised humans, whereas here we address the case of multiple people in real-world images, which entails several challenges such as person-person overlaps in highly crowded scenes, an unknown number of people, or people entering and leaving video sequences. The first contribution is a multi-person pose estimation algorithm based on the bottom-up detection-by-grouping paradigm. Unlike the widespread top-down approaches, our method detects body joints and pairwise relations between them in a single forward pass of a convolutional neural network. Multi-person parsing is performed by optimizing a joint objective based on a multicut graph partitioning framework. Secondly, we extend our pose estimation approach to articulated multi-person pose tracking in videos. Our approach performs multi-target tracking and pose estimation in a holistic manner by optimising a single objective. We further simplify and refine the formulation, which allows us to reach close to real-time performance. Thirdly, we propose a large-scale dataset and a benchmark for articulated multi-person tracking. It is the first dataset of video sequences comprising complex multi-person scenes and fully annotated tracks with 2D keypoints.
Our fourth contribution is a method for estimating 3D body pose using on-body wearable cameras. Our approach uses a pair of downward-facing, head-mounted cameras and captures the entire body. This egocentric approach is free of the limitations of traditional setups with external cameras and can estimate body poses in very crowded environments. Our final contribution goes beyond human pose estimation and is in the field of deep learning of 3D object shapes. In particular, we address the case of reconstructing 3D objects from weak supervision. Our approach represents objects as 3D point clouds and is able to learn them with 2D supervision only and without requiring camera pose information at training time. We design a differentiable renderer of point clouds as well as a novel loss formulation for dealing with camera pose ambiguity.