72 research outputs found
Invariant Teacher and Equivariant Student for Unsupervised 3D Human Pose Estimation
We propose a novel method based on a teacher-student learning framework for 3D
human pose estimation without any 3D annotations or side information. To solve
this unsupervised-learning problem, the teacher network adopts
pose-dictionary-based modeling for regularization to estimate a physically
plausible 3D pose. To handle the decomposition ambiguity in the teacher
network, we propose a cycle-consistent architecture promoting a 3D
rotation-invariant property to train the teacher network. To further improve
the estimation accuracy, the student network adopts a novel graph convolution
network for flexibility to directly estimate the 3D coordinates. Another
cycle-consistent architecture promoting a 3D rotation-equivariant property is
adopted to exploit geometry consistency, together with knowledge distillation
from the teacher network to improve the pose estimation performance. We conduct
extensive experiments on Human3.6M and MPI-INF-3DHP. Our method reduces the 3D
joint prediction error by 11.4% compared to state-of-the-art unsupervised
methods and also outperforms many weakly-supervised methods that use side
information on Human3.6M. Code will be available at
https://github.com/sjtuxcx/ITES. Comment: Accepted at AAAI 2021
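The rotation-based cycle consistency the abstract describes can be sketched in a few lines. This is an illustrative toy, not the authors' released code: the `lift` network, the orthographic projection, and the restriction to y-axis rotations are all simplifying assumptions.

```python
import numpy as np

def random_rotation_y(rng):
    """Random rotation about the vertical (y) axis."""
    a = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project(pose_3d):
    """Orthographic projection: drop the depth (z) coordinate."""
    return pose_3d[:, :2]

def cycle_consistency_loss(lift, kpts_2d, rng):
    """lift: hypothetical network mapping (J, 2) 2D keypoints to (J, 3) joints."""
    pose_3d = lift(kpts_2d)           # lift the 2D pose to 3D
    R = random_rotation_y(rng)
    rotated = pose_3d @ R.T           # rotate the estimated 3D pose
    new_view_2d = project(rotated)    # re-project to a synthetic 2D view
    relifted = lift(new_view_2d)      # lift the synthetic view again
    # rotating the re-lifted pose back should recover the original estimate
    return float(np.mean((relifted @ R - pose_3d) ** 2))
```

Minimizing this loss over random rotations encourages the lifter to produce 3D estimates that are consistent across viewpoints, which is the rotation-invariance the teacher network exploits.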
Anatomy-guided domain adaptation for 3D in-bed human pose estimation
3D human pose estimation is a key component of clinical monitoring systems.
The clinical applicability of deep pose estimation models, however, is limited
by their poor generalization under domain shifts along with their need for
sufficient labeled training data. As a remedy, we present a novel domain
adaptation method, adapting a model from a labeled source to a shifted
unlabeled target domain. Our method comprises two complementary adaptation
strategies based on prior knowledge about human anatomy. First, we guide the
learning process in the target domain by constraining predictions to the space
of anatomically plausible poses. To this end, we embed the prior knowledge into
an anatomical loss function that penalizes asymmetric limb lengths, implausible
bone lengths, and implausible joint angles. Second, we propose to filter pseudo
labels for self-training according to their anatomical plausibility and
incorporate the concept into the Mean Teacher paradigm. We unify both
strategies in a point cloud-based framework applicable to unsupervised and
source-free domain adaptation. Evaluation is performed for in-bed pose
estimation under two adaptation scenarios, using the public SLP dataset and a
newly created dataset. Our method consistently outperforms various
state-of-the-art domain adaptation methods, surpasses the baseline model by
31%/66%, and reduces the domain gap by 65%/82%. Source code is available at
https://github.com/multimodallearning/da-3dhpe-anatomy. Comment: submitted to Medical Image Analysis
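An anatomy-based plausibility loss of the kind the abstract describes can be sketched as follows. The joint indices, bone pairs, and length bounds below are illustrative assumptions, not the authors' values; only the idea (penalize left/right asymmetry and out-of-range bone lengths) comes from the abstract.

```python
import numpy as np

# (parent, child) index pairs defining bones; left/right correspondences
LEFT_BONES = [(5, 7), (7, 9)]    # e.g. left shoulder-elbow, elbow-wrist
RIGHT_BONES = [(6, 8), (8, 10)]  # mirrored right-side bones
BONE_RANGE = (0.15, 0.65)        # plausible bone length in metres (assumed)

def bone_lengths(pose, bones):
    return np.array([np.linalg.norm(pose[c] - pose[p]) for p, c in bones])

def anatomical_loss(pose):
    """pose: (J, 3) array of predicted 3D joints."""
    left = bone_lengths(pose, LEFT_BONES)
    right = bone_lengths(pose, RIGHT_BONES)
    sym = np.abs(left - right).sum()              # asymmetry penalty
    lo, hi = BONE_RANGE
    lengths = np.concatenate([left, right])
    oob = (np.maximum(lo - lengths, 0.0)          # too short
           + np.maximum(lengths - hi, 0.0)).sum() # too long
    return float(sym + oob)
```

A symmetric pose with bones inside the assumed range incurs zero loss, so the term only steers predictions away from implausible configurations.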
ScarceNet: Animal Pose Estimation with Scarce Annotations
Animal pose estimation is an important but under-explored task due to the
lack of labeled data. In this paper, we tackle the task of animal pose
estimation with scarce annotations, where only a small set of labeled data and
unlabeled images are available. At the core of the solution to this problem
setting is the use of the unlabeled data to compensate for the lack of
well-labeled animal pose data. To this end, we propose the ScarceNet, a pseudo
label-based approach to generate artificial labels for the unlabeled images.
The pseudo labels, which are generated with a model trained with the small set
of labeled images, are generally noisy and can hurt the performance when
directly used for training. To solve this problem, we first use a small-loss
trick to select reliable pseudo labels. Although effective, the selection
process is wasteful since numerous high-loss samples are left unused. We
further propose to identify reusable samples from the high-loss samples based
on an agreement check. Pseudo labels are re-generated to provide supervision
for those reusable samples. Lastly, we introduce a student-teacher framework to
enforce a consistency constraint since there are still samples that are neither
reliable nor reusable. By combining the reliable pseudo label selection with
the reusable sample re-labeling and the consistency constraint, we can make
full use of the unlabeled data. We evaluate our approach on the challenging
AP-10K dataset, where our approach outperforms existing semi-supervised
approaches by a large margin. We also test on the TigDog dataset, where our
approach can achieve better performance than domain adaptation based approaches
when only very few annotations are available. Our code is available at the
project website. Comment: Accepted to CVPR 2023
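The two selection steps the abstract describes, the small-loss trick followed by an agreement check on the remaining samples, can be sketched as below. This is an illustrative sketch, not the released code; the keep ratio and agreement threshold are assumed hyperparameters.

```python
import numpy as np

def select_pseudo_labels(losses, preds_a, preds_b,
                         keep_ratio=0.3, agree_thresh=2.0):
    """
    losses:           (N,) per-sample loss of the pseudo labels
    preds_a, preds_b: (N, J, 2) keypoint predictions from two models/views
    Returns boolean masks (reliable, reusable).
    """
    # small-loss trick: the lowest-loss fraction is treated as reliable
    n_keep = max(1, int(keep_ratio * len(losses)))
    order = np.argsort(losses)
    reliable = np.zeros(len(losses), dtype=bool)
    reliable[order[:n_keep]] = True

    # agreement check on the remaining high-loss samples: mean keypoint
    # distance between the two predictions must fall below a threshold
    dist = np.linalg.norm(preds_a - preds_b, axis=-1).mean(axis=-1)
    reusable = (~reliable) & (dist < agree_thresh)
    return reliable, reusable
```

Samples in neither mask would then be handled by the consistency constraint of the student-teacher framework rather than by pseudo-label supervision.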
EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning
Learning to predict agent motions with relationship reasoning is important
for many applications. In motion prediction tasks, maintaining motion
equivariance under Euclidean geometric transformations and invariance of agent
interaction is a critical and fundamental principle. However, such equivariance
and invariance properties are overlooked by most existing methods. To fill this
gap, we propose EqMotion, an efficient equivariant motion prediction model with
invariant interaction reasoning. To achieve motion equivariance, we propose an
equivariant geometric feature learning module to learn a Euclidean
transformable feature through dedicated designs of equivariant operations. To
reason about agents' interactions, we propose an invariant interaction reasoning
module to achieve a more stable interaction modeling. To further promote more
comprehensive motion features, we propose an invariant pattern feature learning
module to learn an invariant pattern feature, which cooperates with the
equivariant geometric feature to enhance network expressiveness. We conduct
experiments for the proposed model on four distinct scenarios: particle
dynamics, molecule dynamics, human skeleton motion prediction and pedestrian
trajectory prediction. Experimental results show that our method is not only
generally applicable, but also achieves state-of-the-art prediction
performance on all four tasks, improving over prior methods by
24.0%/30.1%/8.6%/9.2%, respectively. Code is
available at https://github.com/MediaBrain-SJTU/EqMotion. Comment: Accepted to CVPR 2023
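The equivariance principle the abstract states can be made concrete with a small numeric check: for a Euclidean rotation R, an equivariant predictor f satisfies f(XR) = f(X)R. The linear toy predictor below is an assumption for illustration only, not EqMotion's architecture; it mixes time steps, not coordinates, so rotation commutes with it exactly.

```python
import numpy as np

def predict(x):
    """Toy equivariant predictor: a fixed linear map over the time axis.
    x: (3, 3) past trajectory (3 time steps, 3D coordinates)."""
    W = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.2, 0.5]])  # mixes time steps, not coordinates
    return W @ x

def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

x = np.random.default_rng(1).normal(size=(3, 3))
R = rotation_z(0.7)
# equivariance: transforming the input then predicting equals
# predicting then transforming the output
assert np.allclose(predict(x @ R.T), predict(x) @ R.T)
```

A general neural predictor does not satisfy this identity for free, which is why the paper designs dedicated equivariant operations rather than relying on data augmentation.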
Visual Representation Learning with Limited Supervision
The quality of a computer vision system is proportional to the rigor of the data representation it is built upon. Learning expressive representations of images is therefore the centerpiece of almost every computer vision application, including image search, object detection and classification, human re-identification, object tracking, pose understanding, image-to-image translation, and embodied agent navigation, to name a few. Deep neural networks are the most common modern approach to representation learning. Their limitation, however, is that deep representation learning methods require extremely large amounts of manually labeled data for training. Clearly, annotating vast amounts of images for various environments is infeasible due to cost and time constraints. This dependence on labeled data is a prime restriction on the pace of development of visual recognition systems.
In order to cope with the exponentially growing amounts of visual data generated daily, machine learning algorithms have to at least strive to scale at a similar rate.
The second challenge consists in the learned representations having to generalize to novel objects, classes, environments and tasks in order to accommodate to the diversity of the visual world.
Despite the ever-growing number of recent publications tangentially addressing the topic of learning generalizable representations, efficient generalization is yet to be achieved. This dissertation attempts to tackle the problem of learning visual representations that can generalize to novel settings while requiring few labeled examples.
In this research, we study the limitations of the existing supervised representation learning approaches and propose a framework that improves the generalization of learned features by exploiting visual similarities between images which are not captured by provided manual annotations. Furthermore, to mitigate the common requirement of large scale manually annotated datasets, we propose several approaches that can learn expressive representations without human-attributed labels, in a self-supervised fashion, by grouping highly-similar samples into surrogate classes based on progressively learned representations.
The development of computer vision as a science depends on a machine's ability to capture and disentangle image attributes that were once thought to be perceivable only by humans. Particular interest has therefore been dedicated to analyzing the means of artistic expression and style, a more complex task than merely decomposing an image into colors and pixels. The ultimate test of this ability is style transfer, which involves altering the style of an image while keeping its content. An effective solution to style transfer requires learning an image representation that allows disentangling image style from content.
Moreover, particular artistic styles come with idiosyncrasies that affect which content details should be preserved and which discarded.
Another pitfall is that it is impossible to obtain pixel-wise annotations of style, or of how the style should be altered.
We address this problem by proposing an unsupervised approach that enables encoding the image content in such a way that is required by a particular style.
The proposed approach exchanges the style of an input image by first extracting the content representation in a style-aware way and then rendering it in a new style using a style-specific decoder network, achieving compelling results in image and video stylization.
Finally, we combine supervised and self-supervised representation learning techniques for the task of human and animal pose understanding. The proposed method enables transfer of the representation learned for recognizing human poses to proximal mammal species without using labeled animal images. This approach is not limited to dense pose estimation and could potentially enable autonomous agents, from robots to self-driving cars, to retrain themselves and adapt to novel environments by learning from previous experiences.
A Survey on Generative Diffusion Model
Deep learning shows excellent potential in generation tasks thanks to deep
latent representations. Generative models are classes of models that can
generate observations randomly with respect to certain implied parameters.
Recently, the diffusion model has become a rising class of generative models
owing to its powerful generation ability, and great achievements have already
been reached. Beyond computer vision, speech generation, bioinformatics, and
natural language processing, further applications remain to be explored in
this field. However, the diffusion model has inherent drawbacks: a slow
generation process, restriction to single data types, low likelihood, and an
inability to perform dimension reduction, which have led to many enhanced
works. This survey summarizes the field of the diffusion model. We first state
the main problem with two landmark works -- DDPM and DSM -- and a unified
landmark work -- Score SDE. Then, we present improved techniques for existing
problems in the diffusion-based model field, covering model speed-up, data
structure diversification, likelihood optimization, and dimension reduction.
Regarding existing models, we also provide a benchmark of FID score, IS, and
NLL at specific NFE. Moreover, applications of diffusion models are
introduced, including computer vision, sequence modeling, audio, and AI for
science. Finally, we summarize the field together with its limitations and
further directions. A summary of the existing well-classified methods is in
our GitHub: https://github.com/chq1155/A-Survey-on-Generative-Diffusion-Model
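The DDPM forward (noising) process that such surveys take as their starting point can be sketched in closed form: q(x_t | x_0) is Gaussian with mean sqrt(alpha_bar_t) x_0 and variance (1 - alpha_bar_t) I. The linear schedule and step count below follow common DDPM defaults but are illustrative choices.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in closed form, without iterating."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8,))
x_mid = q_sample(x0, 500, rng)
x_end = q_sample(x0, T - 1, rng)  # by the last step, nearly pure noise
```

The slow generation the survey criticizes comes from the reverse process, which must undo these T steps one at a time; the speed-up techniques it covers aim to shorten that chain.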
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey
State-of-the-art deep learning models are often trained with a large amount of costly labeled training data. However, requiring exhaustive manual annotations may degrade the model's generalizability in the limited-label regime. Semi-supervised learning and unsupervised learning offer promising paradigms to learn from an abundance of unlabeled visual data. Recent progress in these paradigms has indicated the strong benefits of leveraging unlabeled data to improve model generalization and provide better model initialization. In this survey, we review the recent advanced deep learning algorithms on semi-supervised learning (SSL) and unsupervised learning (UL) for visual recognition from a unified perspective. To offer a holistic understanding of the state-of-the-art in these areas, we propose a unified taxonomy. We categorize existing representative SSL and UL with comprehensive and insightful analysis to highlight their design rationales in different learning scenarios and applications in different computer vision tasks. Lastly, we discuss the emerging trends and open challenges in SSL and UL to shed light on future critical research directions.