On Symbiosis of Attribute Prediction and Semantic Segmentation
In this paper, we propose to employ semantic segmentation to improve
person-related attribute prediction. The core idea is that the probability of
an attribute appearing in an image is far from uniform across the spatial
domain. We build our attribute prediction model jointly with a deep
semantic segmentation network. This harnesses the localization cues learned by
the semantic segmentation to guide the attention of the attribute prediction to
the regions where different attributes naturally show up. Therefore, in
addition to prediction, we are able to localize the attributes despite merely
having access to image-level labels (weak supervision) during training. We
first propose semantic segmentation-based pooling and gating, respectively
denoted as SSP and SSG. In the former, the estimated segmentation masks are
used to pool the final activations of the attribute prediction network, from
multiple semantically homogeneous regions. In SSG, the same idea is applied to
the intermediate layers of the network. SSP and SSG, while effective, impose
heavy memory utilization since each channel of the activations is pooled/gated
with all the semantic segmentation masks. To circumvent this, we propose
Symbiotic Augmentation (SA), where we learn only one mask per activation
channel. SA allows the model to either pick one, or combine (weighted
superposition) multiple semantic maps, in order to generate the proper mask for
each channel. SA simultaneously applies the same mechanism to the reverse
problem by leveraging output logits of attribute prediction to guide the
semantic segmentation task. We evaluate our proposed methods for facial
attributes on the CelebA and LFWA datasets, while benchmarking on WIDER Attribute and
Berkeley Attributes of People for whole-body attributes. Our proposed methods
achieve superior results compared to previous works.
Comment: Accepted for publication in PAMI. arXiv admin note: substantial text overlap with arXiv:1704.0874
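The pooling and mask-combination mechanisms described in this abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function names, shapes, and the hard-coded epsilon are assumptions made for illustration.

```python
import numpy as np

def ssp_pool(activations, masks):
    """Semantic segmentation-based pooling (SSP), sketched.

    activations: (C, H, W) final feature maps of the attribute network.
    masks:       (S, H, W) soft segmentation masks, one per semantic region.
    Returns (S, C): each channel average-pooled inside each region, so the
    attribute head sees features from semantically homogeneous regions.
    """
    S = masks.shape[0]
    C = activations.shape[0]
    pooled = np.zeros((S, C))
    for s in range(S):
        w = masks[s]
        # mask-weighted average over the spatial domain
        pooled[s] = (activations * w).sum(axis=(1, 2)) / (w.sum() + 1e-8)
    return pooled

def sa_masks(masks, mixing_logits):
    """Symbiotic Augmentation (SA), sketched: one mask per channel, built
    as a softmax-weighted superposition of the S semantic maps, so the
    model can pick one map or combine several.

    masks:         (S, H, W) segmentation maps.
    mixing_logits: (C, S) learned mixing weights, one row per channel.
    Returns (C, H, W): a single pooling/gating mask per channel.
    """
    w = np.exp(mixing_logits - mixing_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # per-channel softmax over maps
    return np.einsum('cs,shw->chw', w, masks)
```

Note how SA reduces memory traffic relative to SSP/SSG: instead of pooling every channel with all S masks, each channel gets exactly one combined mask.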
PoNA: Pose-guided non-local attention for human pose transfer
Human pose transfer, which aims at transferring the appearance of a given person to a target pose, is very challenging and important in many applications. Previous work ignores the guidance of pose features or uses only local attention mechanisms, leading to implausible and blurry results. We propose a new human pose transfer method using a generative adversarial network (GAN) with simplified cascaded blocks. In each block, we propose a pose-guided non-local attention (PoNA) mechanism with a long-range dependency scheme to select more important regions of image features to transfer. We also design a pre-posed image-guided pose feature update and a post-posed pose-guided image feature update to better utilize the pose and image features. Our network is simple, stable, and easy to train. Quantitative and qualitative results on the Market-1501 and DeepFashion datasets show the efficacy and efficiency of our model. Compared with state-of-the-art methods, our model generates sharper and more realistic images with rich details, while having fewer parameters and running faster. Furthermore, our generated images can help to alleviate data insufficiency for person re-identification.
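The core of a pose-guided non-local block can be sketched as attention whose queries come from the pose stream while keys and values come from the image stream, so every position can attend to every other (long-range dependencies). A minimal NumPy sketch under assumed shapes; the function and projection names are hypothetical, and the paper's block operates on convolutional feature maps inside a GAN rather than flat matrices.

```python
import numpy as np

def pose_guided_nonlocal(img_feat, pose_feat, Wq, Wk, Wv):
    """Pose-guided non-local attention, sketched.

    img_feat:  (N, C) image features, one row per spatial position.
    pose_feat: (N, C) pose features at the same positions.
    Wq, Wk, Wv: (C, D) learned projections.
    Queries come from the pose stream, so the pose guides which image
    regions are selected; attention spans all N positions.
    """
    q = pose_feat @ Wq                      # queries from pose  (N, D)
    k = img_feat @ Wk                       # keys from image    (N, D)
    v = img_feat @ Wv                       # values from image  (N, D)
    att = (q @ k.T) / np.sqrt(q.shape[1])   # position-to-position scores
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)   # row-wise softmax
    return att @ v                          # pose-selected image features
```

Because every row of the attention matrix is a distribution over all positions, a blurry local neighborhood is no longer the only source of appearance information, which is the contrast the abstract draws with purely local attention.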
Slim DensePose: Thrifty Learning from Sparse Annotations and Motion Cues
DensePose supersedes traditional landmark detectors by densely mapping image
pixels to body surface coordinates. This power, however, comes at a greatly
increased annotation time, as supervising the model requires manually labeling
hundreds of points per pose instance. In this work, we thus seek methods to
significantly slim down the DensePose annotations, proposing more efficient
data collection strategies. In particular, we demonstrate that if annotations
are collected in video frames, their efficacy can be multiplied for free by
using motion cues. To explore this idea, we introduce DensePose-Track, a
dataset of videos where selected frames are annotated in the traditional
DensePose manner. Then, building on geometric properties of the DensePose
mapping, we use the video dynamics to propagate ground-truth annotations in time
as well as to learn from Siamese equivariance constraints. Having performed
exhaustive empirical evaluation of various data annotation and learning
strategies, we demonstrate that doing so can deliver significantly improved
pose estimation results over strong baselines. However, contrary to what is
suggested by some recent works, we show that merely synthesizing motion
patterns by applying geometric transformations to isolated frames is
significantly less effective, and that motion cues help much more when they are
extracted from videos.
Comment: CVPR 201
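The "annotations for free" idea, carrying sparse ground-truth points along with a motion estimate so that a single labeled frame supervises its neighbors, can be sketched as follows. This is a hypothetical NumPy illustration of propagation through a dense flow field; the paper additionally exploits the geometric properties of the DensePose mapping and Siamese equivariance constraints.

```python
import numpy as np

def propagate_annotations(points, flow):
    """Propagate sparse DensePose annotations to the next frame using
    motion cues, sketched.

    points: (K, 2) annotated (x, y) pixel positions in frame t.
    flow:   (H, W, 2) estimated flow from frame t to frame t+1.
    The surface (UV) labels attached to each point travel with it for
    free; only the image positions move.
    """
    xs = points[:, 0].astype(int)
    ys = points[:, 1].astype(int)
    # look up the displacement under each annotated point and apply it
    return points + flow[ys, xs]
```

The abstract's negative result fits this picture: if `flow` is merely a synthetic geometric transformation of an isolated frame, the propagated labels add little, whereas flow extracted from real video carries genuinely new appearance and deformation.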
Clothing Try-on Learning: A Try-on Method that Preserves Clothing Details by Disentangling the Clothes and the Wearer
Thesis (Master's) -- Graduate School of Data Science, Seoul National University, February 2023. Advisor: Joonseok Lee.
Virtual try-on, fitting an image of a garment to an image of a person, has rapidly progressed recently. However, existing virtual try-on methods still struggle to faithfully represent various details of the clothes when worn. In this paper, we propose a simple yet effective method to better preserve details of the clothing
and person by introducing an additional fitting step after geometric warping.
This minimal modification helps to effectively learn disentangled representations of the clothing from the wearer. By disentangling these two major components for virtual try-on, we are able to preserve the wearer-agnostic structure and details of the clothing, and thus can fit a garment naturally to a variety of poses and body shapes. Moreover, we propose a novel evaluation framework applicable to any metric, to better reflect the semantics of clothes fitting. From extensive experiments, we empirically verify that the proposed method not only learns to disentangle clothing from the wearer, but also preserves details of the clothing on the try-on results.
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Preliminary
Chapter 4. The Proposed Method: DP-VTON
Chapter 5. Experiments
Chapter 6. Summary
Chapter 7. Supplementary materials
Bibliography
Abstract in Korean
Dataset Pre-Processing and Artificial Augmentation, Network Architecture and Training Parameters used in Appropriate Training of Convolutional Neural Networks for Classification Based Computer Vision Applications: A Survey
Training a Convolutional Neural Network (CNN) based classifier depends on a large number of factors. These factors involve tasks such as aggregating an apt dataset, arriving at a suitable CNN architecture, processing the dataset, and selecting the training parameters needed to reach the desired classification results. This review covers the pre-processing and dataset augmentation techniques used in various CNN-based classification studies. In many classification problems, the quality of the dataset determines how well the CNN trains, and this quality is judged by the variation in data within every class. Such a ready-made dataset is rarely available, for many practical reasons, and although a large dataset is recommended, one is again not usually available directly. In some cases, the noise present in the dataset may not prove useful for training, while in others, researchers prefer to add noise to certain images to make the network less vulnerable to unwanted variations. Hence, researchers use artificial digital-imaging techniques to derive variations in the dataset and to remove or add noise. This paper therefore accumulates state-of-the-art works that applied pre-processing and artificial augmentation to the dataset before training. The step after data augmentation is training itself, which includes proper selection of several parameters and a suitable CNN architecture. This paper also covers the network characteristics, dataset characteristics, and training methodologies used in biomedical imaging, vision modules of autonomous driverless cars, and a few general vision-based applications.
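As a concrete illustration of the artificial augmentation this survey catalogues, a minimal NumPy sketch follows. The specific operations and parameter values are illustrative assumptions, not taken from any surveyed work.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Derive extra variations of one training image via simple digital
    imaging operations: horizontal flip, a small brightness shift, and
    additive Gaussian noise (to make the network less vulnerable to
    unwanted variations).

    image: (H, W) array with values in [0, 1].
    Returns a list of augmented copies, all clipped back to [0, 1].
    """
    variants = [image[:, ::-1].copy()]                  # horizontal flip
    variants.append(np.clip(image + 0.1, 0.0, 1.0))     # brightness shift
    noisy = image + rng.normal(0.0, noise_std, image.shape)
    variants.append(np.clip(noisy, 0.0, 1.0))           # noise injection
    return variants
```

Each original image thus yields several labeled samples, which is the cheap route to the per-class variation the survey argues a quality dataset needs.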
Describing Images by Semantic Modeling using Attributes and Tags
This dissertation addresses the problem of describing images using visual attributes and textual tags, a fundamental task that narrows the semantic gap between the visual reasoning of humans and machines. Automatic image annotation assigns relevant textual tags to images. In this dissertation, we propose a query-specific formulation based on Weighted Multi-view Non-negative Matrix Factorization to perform automatic image annotation. Our proposed technique seamlessly adapts to changes in the training data, naturally solves the problem of feature fusion, and handles the challenge of rare tags. Unlike tags, attributes are category-agnostic, hence their combination models an exponential number of semantic labels. Motivated by the fact that most attributes describe local properties, we propose exploiting localization cues, through semantic parsing of the human face and body, to improve person-related attribute prediction. We also demonstrate that image-level attribute labels can be effectively used as weak supervision for the task of semantic segmentation. Next, we analyze selfie images by utilizing tags and attributes. We collect the first large-scale selfie dataset and annotate it with different attributes covering characteristics such as gender, age, race, facial gestures, and hairstyle. We then study the popularity and sentiments of the selfies given an estimated appearance of various semantic concepts. In brief, we automatically infer what makes a good selfie. Despite its extensive usage, the deep learning literature falls short in understanding the characteristics and behavior of Batch Normalization. We conclude this dissertation by providing a fresh view, in light of information geometry and Fisher kernels, on why batch normalization works.
We propose Mixture Normalization, which disentangles modes of variation in the underlying distribution of the layer outputs, and confirm that it effectively accelerates training of different batch-normalized architectures, including Inception-V3, Densely Connected Networks, and Deep Convolutional Generative Adversarial Networks, while achieving better generalization error.
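The Mixture Normalization idea, per-mode statistics instead of one batch-wide mean and variance, can be sketched as follows, assuming hard component assignments for simplicity. The dissertation estimates a mixture model over activations; the function name and this simplified hard-assignment form are assumptions for illustration.

```python
import numpy as np

def mixture_normalize(x, assignments, K, eps=1e-5):
    """Normalize each sample by the statistics of its own mixture
    component, rather than by the single batch-wide mean/variance
    that standard batch normalization uses.

    x:           (N, C) activations.
    assignments: (N,) component index per sample (e.g. from a GMM).
    K:           number of mixture components.
    """
    out = np.zeros_like(x, dtype=float)
    for k in range(K):
        idx = assignments == k
        if not idx.any():
            continue
        mu = x[idx].mean(axis=0)         # per-component mean
        var = x[idx].var(axis=0)         # per-component variance
        out[idx] = (x[idx] - mu) / np.sqrt(var + eps)
    return out
```

When the layer output really is multi-modal (as in GAN training), whitening each mode separately avoids the blurring that a single shared mean and variance would introduce.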