1,859 research outputs found

    A Sampling Approach to Generating Closely Interacting 3D Pose-pairs from 2D Annotations

    We introduce a data-driven method to generate a large number of plausible, closely interacting 3D human pose-pairs for a given motion category, e.g., wrestling or salsa dance. Since close interactions are difficult to acquire with 3D sensors, our approach utilizes abundant existing video data covering many human activities. Instead of treating the data generation problem as one of reconstruction, either through 3D acquisition or direct 2D-to-3D lifting of video annotations, we present a solution based on Markov Chain Monte Carlo (MCMC) sampling. With a focus on efficient sampling over the space of close interactions, rather than pose spaces, we develop a novel representation called interaction coordinates (IC) to encode both poses and their interactions in an integrated manner. Plausibility of a 3D pose-pair is then defined based on the ICs and with respect to the annotated 2D pose-pairs from video. We show that our sampling-based approach is able to efficiently synthesize a large volume of plausible, closely interacting 3D pose-pairs which provide good coverage of the input 2D pose-pairs.
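
    The paper's IC-based plausibility function isn't reproduced in this abstract, but the sampling machinery it plugs into is standard. Below is a minimal random-walk Metropolis-Hastings sketch over a flattened pose-pair vector; log_score is a hypothetical stand-in for an unnormalized log-plausibility such as the one the ICs would define.

```python
import numpy as np

def metropolis_hastings(log_score, init_pair, n_samples=10000, step=0.05, seed=0):
    """Random-walk Metropolis-Hastings over a flattened pose-pair vector.

    log_score: hypothetical stand-in for an IC-based plausibility,
    returning an unnormalized log-probability for a pose-pair vector.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(init_pair, dtype=float)
    fx = log_score(x)
    samples = []
    for _ in range(n_samples):
        cand = x + rng.normal(scale=step, size=x.shape)  # symmetric proposal
        fc = log_score(cand)
        if np.log(rng.uniform()) < fc - fx:  # accept w.p. min(1, exp(fc - fx))
            x, fx = cand, fc
        samples.append(x.copy())
    return np.stack(samples)

# Toy usage: sampling a 6-D standard normal as the "plausibility" model.
draws = metropolis_hastings(lambda v: -0.5 * v @ v, np.zeros(6))
```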

    Hand Keypoint Detection in Single Images using Multiview Bootstrapping

    We present an approach that uses a multi-camera system to train fine-grained detectors for keypoints that are prone to occlusion, such as the joints of a hand. We call this procedure multiview bootstrapping: first, an initial keypoint detector is used to produce noisy labels in multiple views of the hand. The noisy detections are then triangulated in 3D using multiview geometry or marked as outliers. Finally, the reprojected triangulations are used as new labeled training data to improve the detector. We repeat this process, generating more labeled data in each iteration. We derive a result analytically relating the minimum number of views to achieve target true and false positive rates for a given detector. The method is used to train a hand keypoint detector for single images. The resulting keypoint detector runs in real time on RGB images and has accuracy comparable to methods that use depth sensors. The single-view detector, triangulated over multiple views, enables 3D markerless hand motion capture with complex object interactions. Comment: CVPR 2017.
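
    The triangulate-or-reject step of multiview bootstrapping rests on standard multiview geometry. The sketch below shows generic DLT triangulation and the reprojection error a bootstrapping loop could threshold to mark outlier views; it illustrates the geometry, not the paper's exact implementation (view-subset selection and retraining are omitted).

```python
import numpy as np

def triangulate_dlt(cams, pts2d):
    """Linear (DLT) triangulation of one keypoint from several views.

    cams: list of 3x4 projection matrices; pts2d: matching (x, y) detections.
    Each view contributes two rows of the homogeneous system A X = 0.
    """
    rows = []
    for P, (x, y) in zip(cams, pts2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]  # inhomogeneous 3D point

def reproj_error(P, X, xy):
    """Pixel distance between a 2D detection and the reprojected 3D point,
    the quantity a bootstrapping loop would threshold to flag outlier views."""
    p = P @ np.append(X, 1.0)
    return float(np.linalg.norm(p[:2] / p[2] - np.asarray(xy)))
```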

    Generative Proxemics: A Prior for 3D Social Interaction from Images

    Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others, also known as proxemics, conveys social cues and affects the dynamics of social interaction. Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this, we present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction and demonstrate its use for single-view 3D reconstruction. We start by creating 3D training data of interacting people using image datasets with contact annotations. We then model the proxemics using a novel denoising diffusion model called BUDDI that learns the joint distribution over the poses of two people in close social interaction. Sampling from our generative proxemics model produces realistic 3D human interactions, which we validate through a perceptual study. We use BUDDI to reconstruct two people in close proximity from a single image, without any contact annotation, via an optimization approach that uses the diffusion model as a prior. Our approach recovers accurate and plausible 3D social interactions from noisy initial estimates, outperforming state-of-the-art methods. Our code, data, and model are available at our project website: muelea.github.io/buddi.
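
    BUDDI's architecture and training details aren't given in this abstract, so the following is only a schematic sketch of the "diffusion model as a prior" optimization pattern: a frozen denoiser (here a hypothetical callable) supplies a target that pulls pose parameters toward the learned interaction manifold while a data term fits the image evidence.

```python
import torch

def fit_with_diffusion_prior(theta, data_loss, denoiser, steps=200, lr=1e-2,
                             w_prior=0.1, t=0.1):
    """Minimal sketch of optimization with a diffusion model as a prior.

    data_loss(theta): image-evidence term, e.g. 2D keypoint reprojection error.
    denoiser(theta, t): hypothetical frozen diffusion model returning its
    denoised estimate of theta at noise level t; the quadratic term below
    pulls theta toward that estimate, i.e. toward the learned manifold.
    """
    theta = theta.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        with torch.no_grad():
            target = denoiser(theta, t)  # treated as a constant each step
        loss = data_loss(theta) + w_prior * ((theta - target) ** 2).sum()
        loss.backward()
        opt.step()
    return theta.detach()
```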

    Crowdsourcing in Computer Vision

    Computer vision systems require large amounts of manually annotated data to properly learn challenging visual concepts. Crowdsourcing platforms offer an inexpensive method to capture human knowledge and understanding for a vast number of visual perception tasks. In this survey, we describe the types of annotations computer vision researchers have collected using crowdsourcing, and how they have ensured that this data is of high quality while annotation effort is minimized. We begin by discussing data collection on both classic (e.g., object recognition) and recent (e.g., visual story-telling) vision tasks. We then summarize key design decisions for creating effective data collection interfaces and workflows, and present strategies for intelligently selecting the most important data instances to annotate. Finally, we conclude with some thoughts on the future of crowdsourcing in computer vision. Comment: A 69-page meta review of the field. Foundations and Trends in Computer Graphics and Vision, 2016.
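
    As one concrete instance of the quality-control theme the survey covers, a common pattern is to collect redundant labels per item and aggregate by majority vote, flagging low-agreement items for re-annotation. The sketch below assumes a plain dict of crowd labels; the 0.6 agreement threshold is an arbitrary illustrative choice.

```python
from collections import Counter

def majority_vote(labels_per_item, min_agreement=0.6):
    """Aggregate redundant crowd labels per item by majority vote and
    flag items whose agreement falls below min_agreement for review."""
    consensus, needs_review = {}, []
    for item, labels in labels_per_item.items():
        label, count = Counter(labels).most_common(1)[0]
        consensus[item] = label
        if count / len(labels) < min_agreement:
            needs_review.append(item)
    return consensus, needs_review

# Toy usage: three annotators per image.
votes = {"img1": ["cat", "cat", "dog"], "img2": ["car", "bus", "dog"]}
labels, flagged = majority_vote(votes)  # img2 gets flagged for re-annotation
```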

    신체 μž„λ² λ”©μ„ ν™œμš©ν•œ μ˜€ν† μΈμ½”λ” 기반 컴퓨터 λΉ„μ „ λͺ¨ν˜•μ˜ μ„±λŠ₯ κ°œμ„ 

    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Industrial Engineering, August 2021. Advisor: λ°•μ’…ν—Œ. Deep learning models have dominated the field of computer vision, achieving state-of-the-art performance in various tasks. In particular, with recent increases in images and videos of people being posted on social media, research on computer vision tasks for analyzing human visual information is being used in various ways. This thesis addresses two human-related computer vision tasks: classifying fashion styles and measuring motion similarity. In real-world fashion style classification problems, the number of samples collected for each style class varies according to the fashion trend at the time of data collection, resulting in class imbalance. To cope with this class imbalance problem, this thesis employs generalized few-shot learning, in which both minority and majority classes are used for learning and evaluation. Additionally, two modalities, foreground images cropped to show only the body and fashion items, and fashion attribute information, are reflected in the fashion image embedding through a variational autoencoder. The K-fashion dataset, collected from a Korean fashion shopping mall, is used for model training and evaluation. Motion similarity measurement is used as a sub-module in various tasks such as action recognition, anomaly detection, and person re-identification; however, it has attracted less attention than those tasks because the same motion can be represented differently depending on the performer's body structure and camera angle. The lack of public datasets for model training and evaluation also makes research challenging. Therefore, we propose a synthetic dataset for model training and, using an autoencoder architecture, learn motion embeddings that are disentangled from body structure and camera angle attributes. The autoencoder is designed to generate motion embeddings for each body part so that motion similarity can be measured per body part. Furthermore, motion speed is synchronized by matching patches performing similar motions using dynamic time warping. The similarity score dataset for evaluation was collected through a crowdsourcing platform using videos from NTU RGB+D 120, an action recognition dataset. When the proposed models were verified on their respective evaluation datasets, both outperformed the baselines. In the fashion style classification problem, the proposed model showed the most balanced performance among all models, without bias toward either the minority or the majority classes. In addition, in the motion similarity measurement experiments, the correlation coefficient of the proposed model with the human-measured similarity scores was higher than that of the baselines.

    Table of contents:
    Chapter 1 Introduction
        1.1 Background and motivation
        1.2 Research contribution
            1.2.1 Fashion style classification
            1.2.2 Human motion similarity
            1.2.3 Summary of the contributions
        1.3 Thesis outline
    Chapter 2 Literature Review
        2.1 Fashion style classification
            2.1.1 Machine learning and deep learning-based approaches
            2.1.2 Class imbalance
            2.1.3 Variational autoencoder
        2.2 Human motion similarity
            2.2.1 Measuring the similarity between two people
            2.2.2 Human body embedding
            2.2.3 Datasets for measuring the similarity
            2.2.4 Triplet and quadruplet losses
            2.2.5 Dynamic time warping
    Chapter 3 Fashion Style Classification
        3.1 Dataset for fashion style classification: K-fashion
        3.2 Multimodal variational inference for fashion style classification
            3.2.1 CADA-VAE
            3.2.2 Generating multimodal features
            3.2.3 Classifier training with cyclic oversampling
        3.3 Experimental results for fashion style classification
            3.3.1 Implementation details
            3.3.2 Settings for experiments
            3.3.3 Experimental results on K-fashion
            3.3.4 Qualitative analysis
            3.3.5 Effectiveness of the cyclic oversampling
    Chapter 4 Motion Similarity Measurement
        4.1 Datasets for motion similarity
            4.1.1 Synthetic motion dataset: SARA dataset
            4.1.2 NTU RGB+D 120 similarity annotations
        4.2 Framework for measuring motion similarity
            4.2.1 Body part embedding model
            4.2.2 Measuring motion similarity
        4.3 Experimental results for measuring motion similarity
            4.3.1 Implementation details
            4.3.2 Experimental results on NTU RGB+D 120 similarity annotations
            4.3.3 Visualization of motion latent clusters
        4.4 Application
            4.4.1 Real-world application with dancing videos
            4.4.2 Tuning similarity scores to match human perception
    Chapter 5 Conclusions
        5.1 Summary and contributions
        5.2 Limitations and future research
    Appendices
        Chapter A NTU RGB+D 120 Similarity Annotations
            A.1 Data collection
            A.2 AMT score analysis
        Chapter B Data Cleansing of NTU RGB+D 120 Skeletal Data
        Chapter C Motion Sequence Generation Using Mixamo
    Bibliography
    Abstract (in Korean)
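
    The speed-synchronization step relies on dynamic time warping. A textbook DTW over per-frame feature vectors is sketched below; the thesis's variant matches patches of per-body-part embeddings, which this minimal version does not model.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two motion sequences of shape
    (frames, features): frames performing similar motion are aligned
    before similarity is scored, compensating for speed differences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-wise distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy usage: the same motion played at two speeds aligns cheaply.
slow = np.repeat(np.arange(5.0)[:, None], 2, axis=1)
fast = slow[::2]
print(dtw_distance(slow, fast))
```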

    Vision for Social Robots: Human Perception and Pose Estimation

    In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene. The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention. First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error. Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images. Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
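
    The weak-supervision idea in the second part, exploiting physical properties of the body instead of 3D ground truth, can be illustrated with a simple loss: penalize the 2D reprojection error of predicted 3D joints plus deviation from known limb lengths. The camera callable and tensor shapes below are assumptions for illustration, not the thesis's exact formulation.

```python
import torch

def weak_supervision_loss(pred3d, joints2d, camera, bone_pairs, bone_lengths,
                          w_bone=1.0):
    """Sketch of a weakly supervised 3D pose loss.

    pred3d:  (..., J, 3) predicted 3D joints; joints2d: (..., J, 2) 2D labels.
    camera:  hypothetical callable projecting 3D joints to 2D.
    bone_pairs / bone_lengths: limb definitions and their known lengths,
    acting as a physical prior in place of 3D ground truth.
    """
    reproj = ((camera(pred3d) - joints2d) ** 2).sum(-1).mean()
    pred_len = torch.stack(
        [(pred3d[..., i, :] - pred3d[..., j, :]).norm(dim=-1)
         for i, j in bone_pairs], dim=-1)
    bone = ((pred_len - bone_lengths) ** 2).mean()
    return reproj + w_bone * bone
```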

    Pushing the envelope for estimating poses and actions via full 3D reconstruction

    Estimating poses and actions of human bodies and hands is an important task in the computer vision community due to its vast applications, including human-computer interaction, virtual and augmented reality, and medical image analysis. Challenges: There are many in-the-wild challenges in this task (see Chapter 1). Among them, this thesis focuses on two challenges that can be relieved by incorporating 3D geometry: (1) the inherent 2D-to-3D ambiguity driven by the non-linear 2D projection process when capturing 3D objects, and (2) the lack of sufficient, high-quality annotated datasets due to the high dimensionality of the subjects' attribute space and the inherent difficulty of annotating 3D coordinate values. Contributions: We first jointly tackle the 2D-to-3D ambiguity and the insufficient-data issue by (1) explicitly reconstructing 2.5D and 3D samples and using them as new training data for a pose estimator. Next, we (2) encode 3D geometry in the training process of the action recognizer to reduce the 2D-to-3D ambiguity. In the appendix, we propose (3) a new synthetic hand pose dataset that can be used for more complete attribute changes and multi-modal experiments in the future. Experiments: Throughout the experiments, we found that (1) 2.5D depth map reconstruction and data augmentation can improve the accuracy of a depth-based hand pose estimation algorithm, (2) 3D mesh reconstruction can be used to generate new RGB data that improves the accuracy of an RGB-based dense hand pose estimation algorithm, and (3) 3D geometry from 3D poses and scene layouts can be successfully utilized to reduce the 2D-to-3D ambiguity in the action recognition problem.
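
    Finding (2), using reconstructed 3D samples to generate new training data, can be illustrated with a toy sketch: project a recovered 3D pose under random camera rotations to synthesize additional 2D samples. The pinhole parameters below are arbitrary illustrative choices, not the thesis's setup.

```python
import numpy as np

def augment_views(joints3d, n_views=8, focal=1000.0, depth=4.0, seed=0):
    """Project one 3D pose (J, 3) under random yaw rotations to synthesize
    extra 2D training samples, a simple instance of easing data scarcity
    through 3D reconstruction. Returns an array of shape (n_views, J, 2)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_views):
        a = rng.uniform(0, 2 * np.pi)
        R = np.array([[np.cos(a), 0.0, np.sin(a)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(a), 0.0, np.cos(a)]])   # yaw rotation
        cam = joints3d @ R.T + np.array([0.0, 0.0, depth])  # place before camera
        out.append(focal * cam[:, :2] / cam[:, 2:3])        # pinhole projection
    return np.stack(out)
```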