On Symbiosis of Attribute Prediction and Semantic Segmentation
In this paper, we propose to employ semantic segmentation to improve
person-related attribute prediction. The core idea is that the probability of
an attribute appearing in an image is far from uniform across the spatial
domain. We build our attribute prediction model jointly with a deep
semantic segmentation network. This harnesses the localization cues learned by
the semantic segmentation to guide the attention of the attribute prediction to
the regions where different attributes naturally show up. Therefore, in
addition to prediction, we are able to localize the attributes despite merely
having access to image-level labels (weak supervision) during training. We
first propose semantic segmentation-based pooling and gating, respectively
denoted as SSP and SSG. In the former, the estimated segmentation masks are
used to pool the final activations of the attribute prediction network, from
multiple semantically homogeneous regions. In SSG, the same idea is applied to
the intermediate layers of the network. SSP and SSG, while effective, impose
heavy memory utilization since each channel of the activations is pooled/gated
with all the semantic segmentation masks. To circumvent this, we propose
Symbiotic Augmentation (SA), where we learn only one mask per activation
channel. SA allows the model to either pick one, or combine (weighted
superposition) multiple semantic maps, in order to generate the proper mask for
each channel. SA simultaneously applies the same mechanism to the reverse
problem by leveraging output logits of attribute prediction to guide the
semantic segmentation task. We evaluate our proposed methods for facial
attributes on the CelebA and LFWA datasets, while benchmarking on WIDER Attribute and
Berkeley Attributes of People for whole-body attributes. Our proposed methods
achieve superior results compared to previous works.
Comment: Accepted for publication in PAMI. arXiv admin note: substantial text overlap with arXiv:1704.0874
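The pooling and mask-combination mechanisms described in this abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function names, shapes, and the hard-coded epsilon are assumptions made for illustration.

```python
import numpy as np

def ssp_pool(activations, masks):
    """Semantic segmentation-based pooling (SSP), sketched.

    activations: (C, H, W) final feature maps of the attribute network.
    masks:       (S, H, W) soft segmentation masks, one per semantic region.
    Returns (S, C): each channel average-pooled inside each region, so the
    attribute head sees features from semantically homogeneous regions.
    """
    S = masks.shape[0]
    C = activations.shape[0]
    pooled = np.zeros((S, C))
    for s in range(S):
        w = masks[s]
        # mask-weighted average over the spatial domain
        pooled[s] = (activations * w).sum(axis=(1, 2)) / (w.sum() + 1e-8)
    return pooled

def sa_masks(masks, mixing_logits):
    """Symbiotic Augmentation (SA), sketched: one mask per channel, built
    as a softmax-weighted superposition of the S semantic maps, so the
    model can pick one map or combine several.

    masks:         (S, H, W) segmentation maps.
    mixing_logits: (C, S) learned mixing weights, one row per channel.
    Returns (C, H, W): a single pooling/gating mask per channel.
    """
    w = np.exp(mixing_logits - mixing_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # per-channel softmax over maps
    return np.einsum('cs,shw->chw', w, masks)
```

Note how SA reduces memory traffic relative to SSP/SSG: instead of pooling every channel with all S masks, each channel gets exactly one combined mask.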
PoNA: Pose-guided non-local attention for human pose transfer
Human pose transfer, which aims at transferring the appearance of a given person to a target pose, is very challenging and important in many applications. Previous work ignores the guidance of pose features or uses only local attention mechanisms, leading to implausible and blurry results. We propose a new human pose transfer method using a generative adversarial network (GAN) with simplified cascaded blocks. In each block, we propose a pose-guided non-local attention (PoNA) mechanism with a long-range dependency scheme to select more important regions of image features to transfer. We also design a pre-posed image-guided pose feature update and a post-posed pose-guided image feature update to better utilize the pose and image features. Our network is simple, stable, and easy to train. Quantitative and qualitative results on the Market-1501 and DeepFashion datasets show the efficacy and efficiency of our model. Compared with state-of-the-art methods, our model generates sharper and more realistic images with rich details, while having fewer parameters and running faster. Furthermore, our generated images can help to alleviate data insufficiency for person re-identification.
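The core of a pose-guided non-local block can be sketched as attention whose queries come from the pose stream while keys and values come from the image stream, so every position can attend to every other (long-range dependencies). A minimal NumPy sketch under assumed shapes; the function and projection names are hypothetical, and the paper's block operates on convolutional feature maps inside a GAN rather than flat matrices.

```python
import numpy as np

def pose_guided_nonlocal(img_feat, pose_feat, Wq, Wk, Wv):
    """Pose-guided non-local attention, sketched.

    img_feat:  (N, C) image features, one row per spatial position.
    pose_feat: (N, C) pose features at the same positions.
    Wq, Wk, Wv: (C, D) learned projections.
    Queries come from the pose stream, so the pose guides which image
    regions are selected; attention spans all N positions.
    """
    q = pose_feat @ Wq                      # queries from pose  (N, D)
    k = img_feat @ Wk                       # keys from image    (N, D)
    v = img_feat @ Wv                       # values from image  (N, D)
    att = (q @ k.T) / np.sqrt(q.shape[1])   # position-to-position scores
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)   # row-wise softmax
    return att @ v                          # pose-selected image features
```

Because every row of the attention matrix is a distribution over all positions, a blurry local neighborhood is no longer the only source of appearance information, which is the contrast the abstract draws with purely local attention.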
Slim DensePose: Thrifty Learning from Sparse Annotations and Motion Cues
DensePose supersedes traditional landmark detectors by densely mapping image
pixels to body surface coordinates. This power, however, comes at a greatly
increased annotation time, as supervising the model requires manually labeling
hundreds of points per pose instance. In this work, we thus seek methods to
significantly slim down the DensePose annotations, proposing more efficient
data collection strategies. In particular, we demonstrate that if annotations
are collected in video frames, their efficacy can be multiplied for free by
using motion cues. To explore this idea, we introduce DensePose-Track, a
dataset of videos where selected frames are annotated in the traditional
DensePose manner. Then, building on geometric properties of the DensePose
mapping, we use the video dynamics to propagate ground-truth annotations in time
as well as to learn from Siamese equivariance constraints. Having performed
exhaustive empirical evaluation of various data annotation and learning
strategies, we demonstrate that doing so can deliver significantly improved
pose estimation results over strong baselines. However, contrary to what is
suggested by some recent works, we show that merely synthesizing motion
patterns by applying geometric transformations to isolated frames is
significantly less effective, and that motion cues help much more when they are
extracted from videos.
Comment: CVPR 201
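The "annotations for free" idea, carrying sparse ground-truth points along with a motion estimate so that a single labeled frame supervises its neighbors, can be sketched as follows. This is a hypothetical NumPy illustration of propagation through a dense flow field; the paper additionally exploits the geometric properties of the DensePose mapping and Siamese equivariance constraints.

```python
import numpy as np

def propagate_annotations(points, flow):
    """Propagate sparse DensePose annotations to the next frame using
    motion cues, sketched.

    points: (K, 2) annotated (x, y) pixel positions in frame t.
    flow:   (H, W, 2) estimated flow from frame t to frame t+1.
    The surface (UV) labels attached to each point travel with it for
    free; only the image positions move.
    """
    xs = points[:, 0].astype(int)
    ys = points[:, 1].astype(int)
    # look up the displacement under each annotated point and apply it
    return points + flow[ys, xs]
```

The abstract's negative result fits this picture: if `flow` is merely a synthetic geometric transformation of an isolated frame, the propagated labels add little, whereas flow extracted from real video carries genuinely new appearance and deformation.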
Clothing Try-on Learning: A Try-on Method that Preserves Clothing Details by Disentangling the Clothes and the Wearer
Thesis (Master's) -- Graduate School of Data Science, Seoul National University, February 2023. Advisor: Joonseok Lee.
Virtual try-on, fitting an image of a garment to an image of a person, has rapidly progressed recently. However, existing virtual try-on methods still struggle to faithfully represent various details of the clothes when worn. In this paper, we propose a simple yet effective method to better preserve details of the clothing
and person by introducing an additional fitting step after geometric warping.
This minimal modification helps to effectively learn disentangled representations of the clothing from the wearer. By disentangling these two major components for virtual try-on, we are able to preserve the wearer-agnostic structure and details of the clothing, and thus can fit a garment naturally to a variety of poses and body shapes. Moreover, we propose a novel evaluation framework applicable to any metric, to better reflect the semantics of clothes fitting. From extensive experiments, we empirically verify that the proposed method not only learns to disentangle clothing from the wearer, but also preserves details of the clothing on the try-on results.
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Preliminary
Chapter 4. The Proposed Method: DP-VTON
Chapter 5. Experiments
Chapter 6. Summary
Chapter 7. Supplementary materials
Bibliography
Abstract in Korean
Dataset Pre-Processing and Artificial Augmentation, Network Architecture and Training Parameters used in Appropriate Training of Convolutional Neural Networks for Classification Based Computer Vision Applications: A Survey
Training a Convolutional Neural Network (CNN) based classifier depends on a large number of factors. These factors involve tasks such as aggregating an apt dataset, arriving at a suitable CNN architecture, processing the dataset, and selecting the training parameters needed to reach the desired classification results. This review covers the pre-processing and dataset augmentation techniques used in various CNN-based classification studies. In many classification problems, the quality of the dataset determines how well the CNN trains, and this quality is judged by the variation in data within every class. Such a ready-made dataset is rarely available, for many practical reasons, and although a large dataset is recommended, one is again not usually available directly. In some cases, the noise present in the dataset may not prove useful for training, while in others, researchers prefer to add noise to certain images to make the network less vulnerable to unwanted variations. Hence, researchers use artificial digital-imaging techniques to derive variations in the dataset and to remove or add noise. This paper therefore accumulates state-of-the-art works that applied pre-processing and artificial augmentation to the dataset before training. The step after data augmentation is training itself, which includes proper selection of several parameters and a suitable CNN architecture. This paper also covers the network characteristics, dataset characteristics, and training methodologies used in biomedical imaging, vision modules of autonomous driverless cars, and a few general vision-based applications.
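As a concrete illustration of the artificial augmentation this survey catalogues, a minimal NumPy sketch follows. The specific operations and parameter values are illustrative assumptions, not taken from any surveyed work.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Derive extra variations of one training image via simple digital
    imaging operations: horizontal flip, a small brightness shift, and
    additive Gaussian noise (to make the network less vulnerable to
    unwanted variations).

    image: (H, W) array with values in [0, 1].
    Returns a list of augmented copies, all clipped back to [0, 1].
    """
    variants = [image[:, ::-1].copy()]                  # horizontal flip
    variants.append(np.clip(image + 0.1, 0.0, 1.0))     # brightness shift
    noisy = image + rng.normal(0.0, noise_std, image.shape)
    variants.append(np.clip(noisy, 0.0, 1.0))           # noise injection
    return variants
```

Each original image thus yields several labeled samples, which is the cheap route to the per-class variation the survey argues a quality dataset needs.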
Describing Images by Semantic Modeling using Attributes and Tags
This dissertation addresses the problem of describing images using visual attributes and textual tags, a fundamental task that narrows the semantic gap between the visual reasoning of humans and machines. Automatic image annotation assigns relevant textual tags to images. In this dissertation, we propose a query-specific formulation based on Weighted Multi-view Non-negative Matrix Factorization to perform automatic image annotation. Our proposed technique seamlessly adapts to changes in the training data, naturally solves the problem of feature fusion, and handles the challenge of rare tags. Unlike tags, attributes are category-agnostic, hence their combination models an exponential number of semantic labels. Motivated by the fact that most attributes describe local properties, we propose exploiting localization cues, through semantic parsing of the human face and body, to improve person-related attribute prediction. We also demonstrate that image-level attribute labels can be effectively used as weak supervision for the task of semantic segmentation. Next, we analyze selfie images by utilizing tags and attributes. We collect the first large-scale selfie dataset and annotate it with different attributes covering characteristics such as gender, age, race, facial gestures, and hairstyle. We then study the popularity and sentiments of the selfies given an estimated appearance of various semantic concepts. In brief, we automatically infer what makes a good selfie. Despite its extensive usage, the deep learning literature falls short in understanding the characteristics and behavior of Batch Normalization. We conclude this dissertation by providing a fresh view, in light of information geometry and Fisher kernels, on why batch normalization works.
We propose Mixture Normalization, which disentangles modes of variation in the underlying distribution of the layer outputs, and confirm that it effectively accelerates training of different batch-normalized architectures, including Inception-V3, Densely Connected Networks, and Deep Convolutional Generative Adversarial Networks, while achieving better generalization error.
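The Mixture Normalization idea, per-mode statistics instead of one batch-wide mean and variance, can be sketched as follows, assuming hard component assignments for simplicity. The dissertation estimates a mixture model over activations; the function name and this simplified hard-assignment form are assumptions for illustration.

```python
import numpy as np

def mixture_normalize(x, assignments, K, eps=1e-5):
    """Normalize each sample by the statistics of its own mixture
    component, rather than by the single batch-wide mean/variance
    that standard batch normalization uses.

    x:           (N, C) activations.
    assignments: (N,) component index per sample (e.g. from a GMM).
    K:           number of mixture components.
    """
    out = np.zeros_like(x, dtype=float)
    for k in range(K):
        idx = assignments == k
        if not idx.any():
            continue
        mu = x[idx].mean(axis=0)         # per-component mean
        var = x[idx].var(axis=0)         # per-component variance
        out[idx] = (x[idx] - mu) / np.sqrt(var + eps)
    return out
```

When the layer output really is multi-modal (as in GAN training), whitening each mode separately avoids the blurring that a single shared mean and variance would introduce.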