A Sampling Approach to Generating Closely Interacting 3D Pose-pairs from 2D Annotations
We introduce a data-driven method to generate a large number of plausible, closely interacting 3D human pose-pairs, for a given motion category, e.g., wrestling or salsa dance. With much difficulty in acquiring close interactions using 3D sensors, our approach utilizes abundant existing video data which cover many human activities. Instead of treating the data generation problem as one of reconstruction, either through 3D acquisition or direct 2D-to-3D data lifting from video annotations, we present a solution based on Markov Chain Monte Carlo (MCMC) sampling. With a focus on efficient sampling over the space of close interactions, rather than pose spaces, we develop a novel representation called interaction coordinates (IC) to encode both poses and their interactions in an integrated manner. Plausibility of a 3D pose-pair is then defined based on the ICs and with respect to the annotated 2D pose-pairs from video. We show that our sampling-based approach is able to efficiently synthesize a large volume of plausible, closely interacting 3D pose-pairs which provide a good coverage of the input 2D pose-pairs.
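The paper's interaction-coordinate representation is specific to pose-pairs, but the sampling machinery it builds on is standard Metropolis-Hastings. A minimal generic sketch (my own illustration, not the paper's code), with a toy 1-D density standing in for the IC-based plausibility score:

```python
import math
import random

def metropolis_hastings(plausibility, init, propose, n_samples, seed=0):
    """Generic Metropolis-Hastings sampler.

    plausibility: unnormalized density over states (a stand-in here for
    an IC-based plausibility score); propose: symmetric proposal;
    plausibility(init) must be positive. Returns the chain of states.
    """
    rng = random.Random(seed)
    x, px = init, plausibility(init)
    samples = []
    for _ in range(n_samples):
        y = propose(x, rng)
        py = plausibility(y)
        # accept with probability min(1, p(y)/p(x)) for a symmetric proposal
        if rng.random() < min(1.0, py / px):
            x, px = y, py
        samples.append(x)
    return samples

# Toy run: sample a scalar from an unnormalized standard Gaussian.
density = lambda x: math.exp(-0.5 * x * x)
step = lambda x, rng: x + rng.gauss(0.0, 0.5)
draws = metropolis_hastings(density, 0.0, step, 5000)
```

In the paper's setting the state would be a pose-pair encoded in ICs rather than a scalar, but the accept/reject loop is the same.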
Hand Keypoint Detection in Single Images using Multiview Bootstrapping
We present an approach that uses a multi-camera system to train fine-grained
detectors for keypoints that are prone to occlusion, such as the joints of a
hand. We call this procedure multiview bootstrapping: first, an initial
keypoint detector is used to produce noisy labels in multiple views of the
hand. The noisy detections are then triangulated in 3D using multiview geometry
or marked as outliers. Finally, the reprojected triangulations are used as new
labeled training data to improve the detector. We repeat this process,
generating more labeled data in each iteration. We derive a result analytically
relating the minimum number of views to achieve target true and false positive
rates for a given detector. The method is used to train a hand keypoint
detector for single images. The resulting keypoint detector runs in realtime on
RGB images and has accuracy comparable to methods that use depth sensors. The
single view detector, triangulated over multiple views, enables 3D markerless
hand motion capture with complex object interactions.
Comment: CVPR 2017
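A minimal sketch of the triangulate-and-reproject step at the heart of the bootstrap loop, using standard DLT triangulation (my own illustration; the paper's outlier marking is omitted, and the toy cameras below assume identity intrinsics and known extrinsics):

```python
import numpy as np

def triangulate(projections, points2d):
    """Linear (DLT) triangulation of one keypoint seen in several views.

    projections: list of 3x4 camera projection matrices P_i;
    points2d: list of (x, y) detections, one per view.
    Solves the homogeneous system A X = 0 via SVD.
    """
    rows = []
    for P, (x, y) in zip(projections, points2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]            # right singular vector of the smallest singular value
    return X[:3] / X[3]   # dehomogenize

def reproject(P, X):
    """Project a 3D point back into a view (used to relabel detections)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy setup: two cameras observing a known 3D point.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # shifted along x
X_true = np.array([0.5, 0.2, 4.0])
obs = [reproject(P1, X_true), reproject(P2, X_true)]
X_hat = triangulate([P1, P2], obs)
```

With noisy detections, the reprojections of `X_hat` become the new training labels; views whose detections disagree with the triangulation are the ones marked as outliers.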
Generative Proxemics: A Prior for 3D Social Interaction from Images
Social interaction is a fundamental aspect of human behavior and
communication. The way individuals position themselves in relation to others,
also known as proxemics, conveys social cues and affects the dynamics of social
interaction. Reconstructing such interaction from images presents challenges
because of mutual occlusion and the limited availability of large training
datasets. To address this, we present a novel approach that learns a prior over
the 3D proxemics of two people in close social interaction and demonstrate its use
for single-view 3D reconstruction. We start by creating 3D training data of
interacting people using image datasets with contact annotations. We then model
the proxemics using a novel denoising diffusion model called BUDDI that learns
the joint distribution over the poses of two people in close social
interaction. Sampling from our generative proxemics model produces realistic 3D
human interactions, which we validate through a perceptual study. We use BUDDI
in reconstructing two people in close proximity from a single image without any
contact annotation via an optimization approach that uses the diffusion model
as a prior. Our approach recovers accurate and plausible 3D social interactions
from noisy initial estimates, outperforming state-of-the-art methods. Our code,
data, and model are available at our project website: muelea.github.io/buddi.
Crowdsourcing in Computer Vision
Computer vision systems require large amounts of manually annotated data to
properly learn challenging visual concepts. Crowdsourcing platforms offer an
inexpensive method to capture human knowledge and understanding, for a vast
number of visual perception tasks. In this survey, we describe the types of
annotations computer vision researchers have collected using crowdsourcing, and
how they have ensured that this data is of high quality while annotation effort
is minimized. We begin by discussing data collection on both classic (e.g.,
object recognition) and recent (e.g., visual story-telling) vision tasks. We
then summarize key design decisions for creating effective data collection
interfaces and workflows, and present strategies for intelligently selecting
the most important data instances to annotate. Finally, we conclude with some
thoughts on the future of crowdsourcing in computer vision.
Comment: A 69-page meta review of the field, Foundations and Trends in Computer Graphics and Vision, 2016
Improving the Performance of Autoencoder-Based Computer Vision Models Using Body Embeddings
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Industrial Engineering, August 2021. Advisor: Jonghun Park.
Deep learning models have dominated the field of computer vision, achieving state-of-the-art performance in various tasks. In particular, with recent increases in images and videos of people being posted on social media, research on computer vision tasks for analyzing human visual information is being used in various ways.
This thesis addresses classifying fashion styles and measuring motion similarity as two computer vision tasks related to humans. In real-world fashion style classification problems, the number of samples collected for each style class varies according to the fashion trend at the time of data collection, resulting in class imbalance. In this thesis, to cope with this class imbalance problem, generalized few-shot learning, in which both minority classes and majority classes are used for learning and evaluation, is employed. Additionally, the modalities of the foreground images, cropped to show only the body and fashion item parts, and the fashion attribute information are reflected in the fashion image embedding through a variational autoencoder. The K-fashion dataset collected from a Korean fashion shopping mall is used for the model training and evaluation.
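The fashion-embedding model draws samples from a variational autoencoder's latent space; a minimal sketch of the standard reparameterization trick such models rely on (illustrative NumPy, not the thesis's code, and without the encoder/decoder networks):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: draw z = mu + sigma * eps with
    eps ~ N(0, I). In an autodiff framework this keeps the sample
    differentiable w.r.t. (mu, log_var); plain NumPy here, for illustration.
    """
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

# Drawing one embedding from an approximate posterior q(z|x)
# with (hypothetical) predicted mean and log-variance:
rng = np.random.default_rng(0)
z = reparameterize(np.array([1.0, -2.0]), np.zeros(2), rng)
```

In a cross-modal setup like CADA-VAE, each modality's encoder predicts its own `(mu, log_var)` and the latent spaces are aligned so samples can be mixed across modalities.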
Motion similarity measurement is used as a sub-module in various tasks such as action recognition, anomaly detection, and person re-identification; however, it has attracted less attention than the other tasks because the same motion can be represented differently depending on the performer's body structure and camera angle. The lack of public datasets for model training and evaluation also makes research challenging. Therefore, we propose an artificial dataset for model training and use an autoencoder architecture to learn motion embeddings disentangled from the body structure and camera angle attributes. The autoencoder is designed to generate motion embeddings for each body part so that motion similarity can be measured per body part. Furthermore, motion speed is synchronized by matching patches performing similar motions using dynamic time warping. The similarity score dataset for evaluation was collected through a crowdsourcing platform utilizing videos from NTU RGB+D 120, a dataset for action recognition.
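The speed-synchronization step relies on dynamic time warping. A minimal sketch of the classic DTW recurrence over 1-D sequences (the thesis applies it to patches of per-body-part embeddings; this toy version uses scalar features):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    a, b: per-frame feature values; returns the total cost of the
    optimal monotonic alignment (O(len(a) * len(b)) dynamic program).
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of match / insertion / deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# A motion aligned with a time-stretched copy of itself costs 0,
# which is exactly the speed-invariance the thesis needs.
fast = [0.0, 1.0, 2.0, 1.0, 0.0]
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
```

Replacing `abs(a[i-1] - b[j-1])` with a distance between per-frame embedding vectors gives the multivariate version.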
When the proposed models were verified with each evaluation dataset, both outperformed the baselines. In the fashion style classification problem, the proposed model showed the most balanced performance among all models, without bias toward either the minority or the majority classes. In addition, in the motion similarity measurement experiments, the correlation coefficient of the proposed model with the human-measured similarity scores was higher than that of the baselines.
Chapter 1 Introduction
1.1 Background and motivation
1.2 Research contribution
1.2.1 Fashion style classification
1.2.2 Human motion similarity
1.2.3 Summary of the contributions
1.3 Thesis outline
Chapter 2 Literature Review
2.1 Fashion style classification
2.1.1 Machine learning and deep learning-based approaches
2.1.2 Class imbalance
2.1.3 Variational autoencoder
2.2 Human motion similarity
2.2.1 Measuring the similarity between two people
2.2.2 Human body embedding
2.2.3 Datasets for measuring the similarity
2.2.4 Triplet and quadruplet losses
2.2.5 Dynamic time warping
Chapter 3 Fashion Style Classification
3.1 Dataset for fashion style classification: K-fashion
3.2 Multimodal variational inference for fashion style classification
3.2.1 CADA-VAE
3.2.2 Generating multimodal features
3.2.3 Classifier training with cyclic oversampling
3.3 Experimental results for fashion style classification
3.3.1 Implementation details
3.3.2 Settings for experiments
3.3.3 Experimental results on K-fashion
3.3.4 Qualitative analysis
3.3.5 Effectiveness of the cyclic oversampling
Chapter 4 Motion Similarity Measurement
4.1 Datasets for motion similarity
4.1.1 Synthetic motion dataset: SARA dataset
4.1.2 NTU RGB+D 120 similarity annotations
4.2 Framework for measuring motion similarity
4.2.1 Body part embedding model
4.2.2 Measuring motion similarity
4.3 Experimental results for measuring motion similarity
4.3.1 Implementation details
4.3.2 Experimental results on NTU RGB+D 120 similarity annotations
4.3.3 Visualization of motion latent clusters
4.4 Application
4.4.1 Real-world application with dancing videos
4.4.2 Tuning similarity scores to match human perception
Chapter 5 Conclusions
5.1 Summary and contributions
5.2 Limitations and future research
Appendices
Chapter A NTU RGB+D 120 Similarity Annotations
A.1 Data collection
A.2 AMT score analysis
Chapter B Data Cleansing of NTU RGB+D 120 Skeletal Data
Chapter C Motion Sequence Generation Using Mixamo
Bibliography
Abstract (in Korean)
Vision for Social Robots: Human Perception and Pose Estimation
In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene.
The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention.
First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person's face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error.
Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images.
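One concrete form of the "physical properties of the human body" usable as a supervisory signal is a bone-length consistency penalty. A hedged sketch of such a loss (my own illustration, not the thesis's exact formulation; joint indices and reference lengths below are hypothetical):

```python
import numpy as np

def bone_length_loss(pose3d, bones, ref_lengths):
    """Penalize deviation of predicted bone lengths from known
    skeleton proportions, giving a 3D training signal that needs
    no 3D ground-truth joint positions.

    pose3d: (J, 3) predicted joint positions; bones: list of
    (parent, child) joint-index pairs; ref_lengths: one known
    length per bone.
    """
    lengths = np.array([np.linalg.norm(pose3d[c] - pose3d[p])
                        for p, c in bones])
    return float(np.mean((lengths - np.asarray(ref_lengths)) ** 2))

# Toy skeleton: two unit-length bones (e.g. upper and lower arm).
pose = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
loss = bone_length_loss(pose, [(0, 1), (1, 2)], [1.0, 1.0])
```

In training, a term like this is combined with 2D reprojection error so that, of the many 3D poses consistent with the 2D evidence, only anatomically plausible ones score well.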
Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
Pushing the envelope for estimating poses and actions via full 3D reconstruction
Estimating the poses and actions of human bodies and hands is an important task in the computer vision community due to its vast applications, including human-computer interaction, virtual and augmented reality, and medical image analysis.
Challenges: There are many in-the-wild challenges in this task (see chapter 1). Among them, this thesis focuses on two challenges that can be alleviated by incorporating 3D geometry: (1) the inherent 2D-to-3D ambiguity driven by the non-linear 2D projection process when capturing 3D objects, and (2) the lack of sufficient, high-quality annotated datasets due to the high dimensionality of subjects' attribute space and the inherent difficulty of annotating 3D coordinate values.
Contributions: We first jointly tackle the 2D-to-3D ambiguity and the data insufficiency by (1) explicitly reconstructing 2.5D and 3D samples and using them as new training data to train a pose estimator. Next, we (2) encode 3D geometry into the training process of the action recognizer to reduce the 2D-to-3D ambiguity. In the appendix, we propose (3) a new synthetic hand pose dataset that can support more complete attribute changes and multi-modal experiments in the future.
Experiments: Throughout the experiments, we found that (1) 2.5D depth map reconstruction and data augmentation improve the accuracy of a depth-based hand pose estimation algorithm; (2) 3D mesh reconstruction can be used to generate new RGB data, which improves the accuracy of an RGB-based dense hand pose estimation algorithm; and (3) 3D geometry from 3D poses and scene layouts can be utilized to reduce the 2D-to-3D ambiguity in action recognition.
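The 2D-to-3D ambiguity the thesis targets can be seen in a few lines: under a pinhole camera, every 3D point along a viewing ray lands on the same pixel, so depth cannot be recovered from one image alone. A toy sketch (my own illustration, assuming unit focal length and a camera at the origin):

```python
import numpy as np

def project(X, f=1.0):
    """Pinhole projection of a 3D point (x, y, z) onto the image
    plane: (f * x / z, f * y / z)."""
    return f * X[:2] / X[2]

# Two distinct 3D points on the same viewing ray...
X_near = np.array([0.3, 0.4, 2.0])
X_far = 3.0 * X_near  # same ray, three times farther away
# ...produce identical pixels, which is why extra 3D geometry
# (reconstruction, scene layout, body priors) is needed to choose
# among the candidate 3D interpretations.
```
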
Learning to See with Minimal Human Supervision
Deep learning has significantly advanced computer vision in the past decade, paving the way for practical applications such as facial recognition and autonomous driving. However, current techniques depend heavily on human supervision, limiting their broader deployment. This dissertation tackles this problem by introducing algorithms and theories to minimize human supervision in three key areas: data, annotations, and neural network architectures, in the context of various visual understanding tasks such as object detection, image restoration, and 3D generation.
First, we present self-supervised learning algorithms to handle in-the-wild images and videos that traditionally require time-consuming manual curation and labeling. We demonstrate that when a deep network is trained to be invariant to geometric and photometric transformations, representations from its intermediate layers are highly predictive of object semantic parts such as eyes and noses. This insight offers a simple unsupervised learning framework that significantly improves the efficiency and accuracy of few-shot landmark prediction and matching. We then present a technique for learning single-view 3D object pose estimation models by utilizing in-the-wild videos where objects turn (e.g., cars in roundabouts). This technique achieves competitive performance with respect to existing state-of-the-art without requiring any manual labels during training. We also contribute an Accidental Turntables Dataset, containing a challenging set of 41,212 images of cars in cluttered backgrounds, motion blur, and illumination changes that serve as a benchmark for 3D pose estimation.
Second, we address variations in labeling styles across different annotators, which lead to a type of noisy label referred to as a heterogeneous label. This variability in human annotation can cause subpar performance during both the training and testing phases. To mitigate this, we have developed a framework that models the labeling styles of individual annotators, reducing the impact of human annotation variations and enhancing the performance of standard object detection models. We have also applied this framework to analyze ecological data, which are often collected opportunistically across different case studies without consistent annotation guidelines. Through this application, we have obtained several insightful observations into large-scale bird migration behaviors and their relationship to climate change.
Our next study explores the challenges of designing neural networks, an area that lacks a comprehensive theoretical understanding. By linking deep neural networks with Gaussian processes, we propose a novel Bayesian interpretation of the deep image prior, which parameterizes a natural image as the output of a convolutional network with random parameters and random input. This approach offers valuable insights to optimize the design of neural networks for various image restoration tasks.
Lastly, we introduce several machine-learning techniques to reconstruct and edit 3D shapes from 2D images with minimal human effort. We first present a generic multi-modal generative model that bridges 2D images and 3D shapes via a shared latent space, and demonstrate its applications on versatile 3D shape generation and manipulation tasks. Additionally, we develop a framework for joint estimation of 3D neural scene representation and camera poses. This approach outperforms prior works and allows us to operate in the general SE(3) camera pose setting, unlike the baselines. The results also indicate this method can be complementary to classical structure-from-motion (SfM) pipelines, as it compares favorably to SfM on low-texture and low-resolution images.