Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos
Video annotation is expensive and time-consuming. Consequently, datasets for
multi-person pose estimation and tracking are less diverse and more sparsely
annotated than large-scale image datasets for human pose estimation.
This makes it challenging to learn deep-learning-based models for
multi-person pose tracking that associate keypoints across frames robustly
under nuisance factors such as motion blur and occlusion. To address this
issue, we propose an approach that relies on keypoint correspondences for
associating persons in videos. Instead of training the network for estimating
keypoint correspondences on video data, it is trained on large-scale image
datasets for human pose estimation using self-supervision. Combined with a
top-down framework for human pose estimation, we use keypoint correspondences
to (i) recover missed pose detections and (ii) associate pose detections across
video frames. Our approach achieves state-of-the-art results for multi-frame
pose estimation and multi-person pose tracking on the PoseTrack 2017 and
PoseTrack 2018 datasets.
Comment: Submitted to ECCV 2020
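The association step described above can be conveyed with a toy sketch: score each pair of poses across two frames by an OKS-style keypoint similarity, then match detections to tracks greedily. All names, the Gaussian similarity, and the greedy matcher are illustrative assumptions, not the paper's learned correspondence network.

```python
import math

# Hypothetical sketch: associate pose detections across frames by keypoint
# similarity. A pose is a list of (x, y) keypoints; the real method learns
# correspondences with a network, whereas this toy uses geometric distance.

def pose_similarity(pose_a, pose_b, scale=1.0):
    """Mean Gaussian similarity over corresponding keypoints (OKS-like)."""
    sims = []
    for (xa, ya), (xb, yb) in zip(pose_a, pose_b):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        sims.append(math.exp(-d2 / (2.0 * scale ** 2)))
    return sum(sims) / len(sims)

def associate(prev_poses, curr_poses, threshold=0.5):
    """Greedy one-to-one matching of current detections to previous tracks."""
    pairs = []
    for i, p in enumerate(prev_poses):
        for j, c in enumerate(curr_poses):
            pairs.append((pose_similarity(p, c), i, j))
    pairs.sort(reverse=True)  # consider highest-similarity pairs first
    used_prev, used_curr, matches = set(), set(), {}
    for s, i, j in pairs:
        if s < threshold or i in used_prev or j in used_curr:
            continue
        matches[j] = i  # current detection j continues track i
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

A Hungarian assignment could replace the greedy loop; the greedy variant is kept here only for brevity.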
Learning Person Re-identification Models from Videos with Weak Supervision
Most person re-identification methods, being supervised techniques, suffer
from the burden of massive annotation requirements. Unsupervised methods
overcome this need for labeled data, but perform poorly compared to the
supervised alternatives. In order to cope with this issue, we introduce the
problem of learning person re-identification models from videos with weak
supervision. The weak nature of the supervision arises from the requirement of
video-level labels, i.e. the person identities that appear in the video, in
contrast to the more precise frame-level annotations. Towards this goal, we propose a
multiple instance attention learning framework for person re-identification
using such video-level labels. Specifically, we first cast the video person
re-identification task into a multiple instance learning setting, in which
person images in a video are collected into a bag. The relations between videos
with similar labels can be utilized to identify persons; on top of that, we
introduce a co-person attention mechanism which mines the similarity
correlations between videos with person identities in common. The attention
weights are obtained based on all person images instead of person tracklets in
a video, making our learned model less affected by noisy annotations. Extensive
experiments demonstrate the superiority of the proposed method over the related
methods on two weakly labeled person re-identification datasets.
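The multiple-instance view above can be sketched in a few lines: person images from one video form a bag, and attention weights pool per-image features into a single bag-level representation that is scored against the video-level identity labels. The functions below are a minimal illustrative stand-in, not the paper's learned attention network.

```python
import math

# Toy sketch of attention pooling in a multiple-instance setting: each
# feature vector corresponds to one person image in a video's bag, and the
# attention scores (learned in the real method, given here) weight them.

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(features, attn_scores):
    """Weighted sum of per-image features using softmax attention weights."""
    weights = softmax(attn_scores)
    dim = len(features[0])
    bag = [0.0] * dim
    for w, f in zip(weights, features):
        for k in range(dim):
            bag[k] += w * f[k]
    return bag  # one bag-level embedding for the whole video
```

Because the weights are computed over all person images rather than tracklets, a few noisy images simply receive low attention instead of corrupting a whole tracklet's representation.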
ReMarNet: Conjoint Relation and Margin Learning for Small-Sample Image Classification
Despite achieving state-of-the-art performance, deep learning methods
generally require a large amount of labeled data during training and may suffer
from overfitting when the sample size is small. To ensure good generalizability
of deep networks under small sample sizes, learning discriminative features is
crucial. To this end, several loss functions have been proposed to encourage
large intra-class compactness and inter-class separability. In this paper, we
propose to enhance the discriminative power of features from a new perspective
by introducing a novel neural network termed Relation-and-Margin learning
Network (ReMarNet). Our method assembles two networks with different backbones
so as to learn features that perform well under both of the two classification
mechanisms described next. Specifically, a relation network
is used to learn the features that can support classification based on the
similarity between a sample and a class prototype; meanwhile, a fully
connected network with the cross-entropy loss is used for classification via
the decision boundary. Experiments on four image datasets demonstrate that our
approach is effective in learning discriminative features from a small set of
labeled samples and achieves competitive performance against state-of-the-art
methods. Code is available at https://github.com/liyunyu08/ReMarNet.
Comment: IEEE TCSVT 202
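The two classification mechanisms the abstract contrasts can be illustrated with a toy nearest-prototype scorer: class prototypes are mean feature vectors, and a sample is assigned to the class whose prototype is most similar, while a separate cross-entropy-trained head would draw a decision boundary. This is an illustrative sketch under assumed names, not ReMarNet's actual networks.

```python
# Toy sketch of prototype-based classification (the relation-network side of
# ReMarNet's two mechanisms). Features are plain lists of floats here; the
# real method learns them with deep backbones.

def prototypes(features_by_class):
    """Mean feature vector per class: {label: [feature, ...]} -> {label: proto}."""
    protos = {}
    for label, feats in features_by_class.items():
        dim = len(feats[0])
        protos[label] = [sum(f[k] for f in feats) / len(feats) for k in range(dim)]
    return protos

def classify_by_similarity(x, protos):
    """Assign x to the class whose prototype is closest (negative squared L2)."""
    def neg_dist(p):
        return -sum((a - b) ** 2 for a, b in zip(x, p))
    return max(protos, key=lambda label: neg_dist(protos[label]))
```

A linear classifier trained with cross-entropy would instead score `x` against learned class weights; ReMarNet's point is that features serving both mechanisms at once generalise better from small samples.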
Towards accurate multi-person pose estimation in the wild
In this thesis we are concerned with the problem of articulated human pose estimation and pose tracking in images and video sequences. Human pose estimation is the task of localising the major joints of a human skeleton in natural images and is one of the most important visual recognition tasks in scenes containing humans, with numerous applications in robotics, virtual and augmented reality, gaming and healthcare, among others. Articulated human pose tracking requires tracking multiple persons in a video sequence while simultaneously estimating their full body poses. This task is important for analysing surveillance footage, activity recognition, sports analytics, etc. Most of the prior work focused on pose estimation of single pre-localised humans, whereas here we address the case of multiple people in real-world images, which entails several challenges such as person-person overlaps in highly crowded scenes, an unknown number of people, or people entering and leaving video sequences.

The first contribution is a multi-person pose estimation algorithm based on the bottom-up detection-by-grouping paradigm. Unlike the widespread top-down approaches, our method detects body joints and pairwise relations between them in a single forward pass of a convolutional neural network. Multi-person parsing is performed by optimizing a joint objective based on a multicut graph partitioning framework. Secondly, we extend our pose estimation approach to articulated multi-person pose tracking in videos. Our approach performs multi-target tracking and pose estimation in a holistic manner by optimising a single objective. We further simplify and refine the formulation, which allows us to reach close to real-time performance. Thirdly, we propose a large-scale dataset and a benchmark for articulated multi-person tracking. It is the first dataset of video sequences comprising complex multi-person scenes and fully annotated tracks with 2D keypoints.
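The detection-by-grouping idea can be conveyed with a toy stand-in: where the thesis partitions a graph of detected joints by optimising a multicut objective, the sketch below simply merges joints whose pairwise affinity exceeds a threshold using union-find. All names and the threshold are hypothetical, and greedy merging is only a proxy for the actual joint optimisation.

```python
# Toy stand-in for multi-person parsing via grouping: joints are graph nodes,
# pairwise affinities are edge scores, and connected groups of high-affinity
# joints form person hypotheses. The thesis solves this as a multicut
# partitioning problem; here we merge greedily with union-find.

def group_joints(num_joints, affinities, threshold=0.5):
    """affinities: {(i, j): score}. Returns sorted clusters of joint indices."""
    parent = list(range(num_joints))

    def find(i):
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in affinities.items():
        if score >= threshold:
            parent[find(i)] = find(j)  # merge the two joint groups

    clusters = {}
    for i in range(num_joints):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Unlike this greedy proxy, the multicut formulation decides all merges jointly and can also reject contradictory edges, which matters in crowded scenes.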
Our fourth contribution is a method for estimating 3D body pose using on-body wearable cameras. Our approach uses a pair of downward-facing, head-mounted cameras and captures the entire body. This egocentric approach is free of the limitations of traditional setups with external cameras and can estimate body poses in very crowded environments. Our final contribution goes beyond human pose estimation and is in the field of deep learning of 3D object shapes. In particular, we address the case of reconstructing 3D objects from weak supervision. Our approach represents objects as 3D point clouds and is able to learn them with 2D supervision only, without requiring camera pose information at training time. We design a differentiable renderer of point clouds as well as a novel loss formulation for dealing with camera pose ambiguity.
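The core of learning 3D shapes from 2D supervision is that the rendering function must be differentiable in the 3D points. A minimal sketch of the projection step such a renderer builds on, assuming a simple pinhole camera with a hypothetical focal length (the thesis's actual renderer also handles occlusion and camera pose ambiguity):

```python
# Illustrative pinhole projection of a 3D point cloud to the image plane.
# Because (x, y, z) -> (f*x/z, f*y/z) is a smooth function of the points,
# gradients of any 2D image loss can flow back to the 3D shape.

def project_points(points3d, focal=1.0):
    """Project points (x, y, z) with z > 0 to 2D coordinates (f*x/z, f*y/z)."""
    return [(focal * x / z, focal * y / z) for x, y, z in points3d]
```

A differentiable renderer then splats the projected points into an image and compares it against the 2D supervision, e.g. silhouettes.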