
    On Symbiosis of Attribute Prediction and Semantic Segmentation

    In this paper, we propose to employ semantic segmentation to improve person-related attribute prediction. The core idea lies in the fact that the probability of an attribute appearing in an image is far from uniform in the spatial domain. We build our attribute prediction model jointly with a deep semantic segmentation network. This harnesses the localization cues learned by the semantic segmentation to guide the attention of the attribute prediction to the regions where different attributes naturally show up. Therefore, in addition to prediction, we are able to localize the attributes despite merely having access to image-level labels (weak supervision) during training. We first propose semantic segmentation-based pooling and gating, respectively denoted as SSP and SSG. In the former, the estimated segmentation masks are used to pool the final activations of the attribute prediction network from multiple semantically homogeneous regions. In SSG, the same idea is applied to the intermediate layers of the network. SSP and SSG, while effective, impose heavy memory utilization, since each channel of the activations is pooled/gated with all the semantic segmentation masks. To circumvent this, we propose Symbiotic Augmentation (SA), where we learn only one mask per activation channel. SA allows the model to either pick one or combine (by weighted superposition) multiple semantic maps in order to generate the proper mask for each channel. SA simultaneously applies the same mechanism to the reverse problem, leveraging the output logits of attribute prediction to guide the semantic segmentation task. We evaluate our proposed methods for facial attributes on the CelebA and LFWA datasets, and for whole-body attributes on the WIDER Attribute and Berkeley Attributes of People benchmarks. Our proposed methods achieve superior results compared to previous works.
    Comment: Accepted for publication in PAMI. arXiv admin note: substantial text overlap with arXiv:1704.0874
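
    As a minimal sketch of the per-channel masking behind SA, assuming PyTorch: the module below (the name SAGate and all sizes are illustrative assumptions, not the paper's code) learns one weight vector over the K semantic maps for each activation channel and gates the features with the resulting superposed mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGate(nn.Module):
    """Gate each activation channel with one mask formed as a weighted
    superposition of the semantic segmentation maps (SA-style sketch)."""
    def __init__(self, num_channels: int, num_semantic_maps: int):
        super().__init__()
        # One learnable weight vector over the K semantic maps per channel.
        self.mix = nn.Parameter(torch.zeros(num_channels, num_semantic_maps))

    def forward(self, feats: torch.Tensor, seg_maps: torch.Tensor) -> torch.Tensor:
        # feats:    (B, C, H, W) activations of the attribute network
        # seg_maps: (B, K, H, W) softmax output of the segmentation network
        weights = F.softmax(self.mix, dim=1)              # (C, K): pick one map or blend several
        masks = torch.einsum('ck,bkhw->bchw', weights, seg_maps)
        return feats * masks                              # one mask per channel, not K per channel

# Usage sketch:
feats = torch.randn(2, 256, 14, 14)
seg = torch.softmax(torch.randn(2, 5, 14, 14), dim=1)
gated = SAGate(256, 5)(feats, seg)
```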

    PoNA: Pose-guided non-local attention for human pose transfer

    Human pose transfer, which aims at transferring the appearance of a given person to a target pose, is very challenging and important in many applications. Previous work ignores the guidance of pose features or uses only a local attention mechanism, leading to implausible and blurry results. We propose a new human pose transfer method using a generative adversarial network (GAN) with simplified cascaded blocks. In each block, we propose a pose-guided non-local attention (PoNA) mechanism with a long-range dependency scheme to select more important regions of image features to transfer. We also design a pre-posed image-guided pose feature update and a post-posed pose-guided image feature update to better utilize the pose and image features. Our network is simple, stable, and easy to train. Quantitative and qualitative results on the Market-1501 and DeepFashion datasets show the efficacy and efficiency of our model. Compared with state-of-the-art methods, our model generates sharper and more realistic images with rich details, while having fewer parameters and faster speed. Furthermore, our generated images can help to alleviate data insufficiency for person re-identification.
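
    To make the attention mechanism concrete, here is a rough PyTorch sketch of a pose-guided non-local block: queries and keys come from pose features, so pose decides which long-range image regions to attend to. The module name, channel sizes, and residual form are illustrative assumptions rather than PoNA's exact design, and both feature maps are assumed to share spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseGuidedNonLocal(nn.Module):
    """Non-local attention over image features, with queries/keys
    derived from pose features (illustrative sketch)."""
    def __init__(self, img_ch: int, pose_ch: int, inner_ch: int = 64):
        super().__init__()
        self.q = nn.Conv2d(pose_ch, inner_ch, 1)
        self.k = nn.Conv2d(pose_ch, inner_ch, 1)
        self.v = nn.Conv2d(img_ch, inner_ch, 1)
        self.out = nn.Conv2d(inner_ch, img_ch, 1)

    def forward(self, img_feat: torch.Tensor, pose_feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = img_feat.shape
        q = self.q(pose_feat).flatten(2).transpose(1, 2)        # (B, HW, C')
        k = self.k(pose_feat).flatten(2)                        # (B, C', HW)
        v = self.v(img_feat).flatten(2).transpose(1, 2)         # (B, HW, C')
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW) long-range weights
        out = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return img_feat + self.out(out)                         # residual, as in non-local blocks
```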

    Slim DensePose: Thrifty Learning from Sparse Annotations and Motion Cues

    DensePose supersedes traditional landmark detectors by densely mapping image pixels to body surface coordinates. This power, however, comes at a greatly increased annotation cost, as supervising the model requires manually labeling hundreds of points per pose instance. In this work, we therefore seek methods to significantly slim down the DensePose annotations, proposing more efficient data collection strategies. In particular, we demonstrate that if annotations are collected in video frames, their efficacy can be multiplied for free by using motion cues. To explore this idea, we introduce DensePose-Track, a dataset of videos where selected frames are annotated in the traditional DensePose manner. Then, building on geometric properties of the DensePose mapping, we use the video dynamics to propagate ground-truth annotations in time as well as to learn from Siamese equivariance constraints. Having performed an exhaustive empirical evaluation of various data annotation and learning strategies, we demonstrate that doing so can deliver significantly improved pose estimation results over strong baselines. However, despite what is suggested by some recent works, we show that merely synthesizing motion patterns by applying geometric transformations to isolated frames is significantly less effective, and that motion cues help much more when they are extracted from videos.
    Comment: CVPR 201
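
    The equivariance constraint can be sketched as a Siamese loss: the prediction on a warped frame should equal the warped prediction on the original frame. The PyTorch snippet below uses simple affine warps as a stand-in; in the paper the transformations come from video motion, so treat this as an assumption-laden illustration rather than their pipeline.

```python
import torch
import torch.nn.functional as F

def warp(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Apply per-sample affine transforms theta (B, 2, 3) to maps x (B, C, H, W)."""
    grid = F.affine_grid(theta, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def equivariance_loss(model, frame: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    pred = model(frame)                       # dense predictions (B, C, H, W)
    pred_of_warped = model(warp(frame, theta))
    warped_pred = warp(pred, theta)           # transform the predictions the same way
    return F.l1_loss(pred_of_warped, warped_pred)
```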

    μ˜μƒμ‹œμ°© ν•™μŠ΅:옷과 μ‚¬λžŒμ˜ 뢄리λ₯Ό 톡해 옷의 상세 정보λ₯Ό μ‚΄λ €μ£ΌλŠ” μ‹œμ°©λ°©λ²•

    Master's thesis, Department of Data Science, Graduate School of Data Science, Seoul National University, February 2023. Advisor: 이쀀석.
    Virtual try-on, fitting an image of a garment to an image of a person, has progressed rapidly in recent years. However, existing virtual try-on methods still struggle to faithfully represent various details of the clothes when worn. In this paper, we propose a simple yet effective method to better preserve details of the clothing and the person by introducing an additional fitting step after geometric warping. This minimal modification helps to effectively learn disentangled representations of the clothing from the wearer. By disentangling these two major components for virtual try-on, we are able to preserve the wearer-agnostic structure and details of the clothing, and thus can fit a garment naturally to a variety of poses and body shapes. Moreover, we propose a novel evaluation framework, applicable to any metric, that better reflects the semantics of clothes fitting. Through extensive experiments, we empirically verify that the proposed method not only learns to disentangle the clothing from the wearer, but also preserves details of the clothing in the try-on results.
    Contents: Introduction; Related Work; Preliminary; The Proposed Method: DP-VTON; Experiments; Summary; Supplementary materials; Bibliography; Abstract in Korean.
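
    A high-level sketch of the warp-then-fit idea, assuming PyTorch: after the garment is geometrically warped, a small fitting module predicts a detail-preserving residual and a blending mask. The module name and architecture below are placeholders for illustration, not the thesis implementation (DP-VTON).

```python
import torch
import torch.nn as nn

class FittingStep(nn.Module):
    """Refine a geometrically warped garment before compositing it
    onto the person image (illustrative sketch)."""
    def __init__(self, ch: int = 3):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(ch * 2, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, ch + 1, 3, padding=1),    # ch residual channels + 1 mask channel
        )

    def forward(self, warped_cloth: torch.Tensor, person: torch.Tensor) -> torch.Tensor:
        out = self.refine(torch.cat([warped_cloth, person], dim=1))
        residual, mask = out[:, :-1], torch.sigmoid(out[:, -1:])
        fitted = warped_cloth + residual            # correct warping artifacts, keep details
        return mask * fitted + (1 - mask) * person  # composite garment onto wearer
```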

    Dataset Pre-Processing and Artificial Augmentation, Network Architecture and Training Parameters used in Appropriate Training of Convolutional Neural Networks for Classification Based Computer Vision Applications: A Survey

    Training a Convolutional Neural Network (CNN) based classifier depends on a large number of factors. These factors involve tasks such as aggregating an apt dataset, arriving at a suitable CNN architecture, processing the dataset, and selecting the training parameters to arrive at the desired classification results. This review covers pre-processing and dataset augmentation techniques used in various CNN-based classification studies. In many classification problems, the quality of the dataset largely determines how well the CNN trains, and this quality is judged by the variation in the data for every class. A ready-made dataset with such variation is rarely available, and although a large dataset is recommended, one is likewise seldom directly available. In some cases, the noise present in the dataset may not prove useful for training, while in others, researchers prefer to add noise to certain images to make the network less vulnerable to unwanted variations. Hence, researchers use artificial digital imaging techniques to derive variations in the dataset and to remove or add noise. Thus, the presented paper accumulates state-of-the-art works that used pre-processing and artificial augmentation of the dataset before training. After data augmentation comes training, which includes the proper selection of several parameters and a suitable CNN architecture. This paper also covers the network characteristics, dataset characteristics, and training methodologies used in biomedical imaging, vision modules of autonomous driverless cars, and a few general vision-based applications.
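
    As a concrete instance of the kind of pipeline the survey covers, the torchvision sketch below chains resolution normalization, geometric and photometric variation, and noise injection; the particular transforms and magnitudes are illustrative choices, not recommendations drawn from the paper.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    # Injected noise can make the network less vulnerable to unwanted variations.
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                         # normalize input resolution
    transforms.RandomHorizontalFlip(),                     # geometric variation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # photometric variation
    transforms.ToTensor(),                                 # PIL image -> float tensor in [0, 1]
    transforms.Lambda(add_gaussian_noise),                 # artificial noise augmentation
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```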

    Describing Images by Semantic Modeling using Attributes and Tags

    This dissertation addresses the problem of describing images using visual attributes and textual tags, a fundamental task that narrows the semantic gap between the visual reasoning of humans and machines. Automatic image annotation assigns relevant textual tags to images. In this dissertation, we propose a query-specific formulation based on Weighted Multi-view Non-negative Matrix Factorization to perform automatic image annotation. Our proposed technique seamlessly adapts to changes in the training data, naturally solves the problem of feature fusion, and handles the challenge of rare tags. Unlike tags, attributes are category-agnostic, hence their combinations model an exponential number of semantic labels. Motivated by the fact that most attributes describe local properties, we propose exploiting localization cues, through semantic parsing of the human face and body, to improve person-related attribute prediction. We also demonstrate that image-level attribute labels can be effectively used as weak supervision for the task of semantic segmentation. Next, we analyze selfie images by utilizing tags and attributes. We collect the first large-scale selfie dataset and annotate it with different attributes covering characteristics such as gender, age, race, facial gestures, and hairstyle. We then study the popularity and sentiments of the selfies given an estimated appearance of various semantic concepts. In brief, we automatically infer what makes a good selfie. Despite its extensive usage, the deep learning literature falls short in understanding the characteristics and behavior of Batch Normalization. We conclude this dissertation by providing a fresh view, in light of information geometry and Fisher kernels, of why batch normalization works. We propose Mixture Normalization, which disentangles modes of variation in the underlying distribution of the layer outputs, and confirm that it effectively accelerates training of different batch-normalized architectures, including Inception-V3, Densely Connected Networks, and Deep Convolutional Generative Adversarial Networks, while achieving better generalization error.
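
    The mixture idea can be sketched as normalizing every sample against each mode of the layer's output distribution and mixing the results by soft assignment. The simplified PyTorch snippet below assumes the responsibilities resp are given (e.g., the posterior of a fitted Gaussian mixture) and omits the learnable scale and shift of standard batch normalization.

```python
import torch

def mixture_normalize(x: torch.Tensor, resp: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """x: (N, D) layer outputs; resp: (N, K) soft mode assignments, rows sum to 1."""
    n_k = resp.sum(0)[:, None]                    # effective samples per mode (K, 1)
    mu = (resp.T @ x) / n_k                       # per-mode means (K, D)
    var = (resp.T @ x.pow(2)) / n_k - mu.pow(2)   # per-mode variances (K, D)
    # Normalize with respect to every mode, then mix by responsibility.
    x_hat = (x[:, None, :] - mu[None]) / torch.sqrt(var[None] + eps)  # (N, K, D)
    return (resp[:, :, None] * x_hat).sum(1)
```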