Instance-level Facial Attributes Transfer with Geometry-Aware Flow
We address the problem of instance-level facial attribute transfer without
paired training data, e.g. faithfully transferring the exact mustache from a
source face to a target face. This is a more challenging task than the
conventional semantic-level attribute transfer, which only preserves the
generic attribute style instead of instance-level traits. We propose the use of
geometry-aware flow, which serves as a well-suited representation for modeling
the transformation between instance-level facial attributes. Specifically, we
leverage facial landmarks as geometric guidance to learn the differentiable
flows automatically, despite the large pose gap between the source and target faces.
Geometry-aware flow is able to warp the source face attribute into the target
face context and generate a warp-and-blend result. To compensate for the
potential appearance gap between source and target faces, we propose a
hallucination sub-network that produces an appearance residual to further
refine the warp-and-blend result. Finally, a cycle-consistency framework
consisting of both attribute transfer module and attribute removal module is
designed, so that abundant unpaired face images can be used as training data.
Extensive evaluations validate the capability of our approach in transferring
instance-level facial attributes faithfully across large pose and appearance
gaps. Thanks to the flow representation, our approach can readily be applied to
generate realistic details on high-resolution images.
Comment: To appear in AAAI 2019. Code and models are available at:
https://github.com/wdyin/GeoGA
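A minimal sketch of the warp-and-blend step described in the abstract, assuming PyTorch. `flow_net` and `hallucination_net` are hypothetical placeholders standing in for the paper's flow and hallucination sub-networks, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def transfer_attribute(source, target, mask, flow_net, hallucination_net):
    """source, target: (B, 3, H, W) images; mask: (B, 1, H, W) attribute region."""
    B, _, H, W = source.shape

    # Predict a dense flow field mapping target coordinates back into the source.
    flow = flow_net(source, target)  # (B, 2, H, W), offsets in normalized [-1, 1] units

    # Build a sampling grid: identity grid plus the predicted flow.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    grid = grid + flow.permute(0, 2, 3, 1)

    # "Warp": bring the source attribute into the target's pose.
    warped = F.grid_sample(source, grid, align_corners=True)

    # "Blend": composite the warped attribute region into the target face.
    blended = mask * warped + (1 - mask) * target

    # Hallucinate an appearance residual to close the source/target appearance gap.
    residual = hallucination_net(blended, target)
    return blended + residual
```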
Deep Learning Face Attributes in the Wild
Predicting face attributes in the wild is challenging due to complex face
variations. We propose a novel deep learning framework for attribute prediction
in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly
with attribute tags, but pre-trained differently. LNet is pre-trained by
massive general object categories for face localization, while ANet is
pre-trained by massive face identities for attribute prediction. This framework
not only outperforms the state-of-the-art by a large margin, but also reveals
valuable facts on learning face representation.
(1) It shows how the performances of face localization (LNet) and attribute
prediction (ANet) can be improved by different pre-training strategies.
(2) It reveals that although the filters of LNet are fine-tuned only with
image-level attribute tags, their response maps over entire images have strong
indication of face locations. This fact enables training LNet for face
localization with only image-level annotations, but without face bounding boxes
or landmarks, which are required by all attribute recognition works.
(3) It also demonstrates that the high-level hidden neurons of ANet
automatically discover semantic concepts after pre-training with massive face
identities, and such concepts are significantly enriched after fine-tuning with
attribute tags. Each attribute can be well explained with a sparse linear
combination of these concepts.
Comment: To appear in International Conference on Computer Vision (ICCV) 2015
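A minimal sketch of the sparse-decomposition analysis in point (3): explaining an attribute as a sparse linear combination of high-level neuron responses ("concepts"). The array names are hypothetical, and scikit-learn's Lasso is one way to obtain such a sparse combination rather than necessarily the authors' tool.

```python
import numpy as np
from sklearn.linear_model import Lasso

# neuron_responses: (num_faces, num_neurons) activations from ANet's high-level layer
# attribute_labels: (num_faces,) binary labels for one attribute, e.g. "Smiling"
rng = np.random.default_rng(0)
neuron_responses = rng.standard_normal((1000, 512))
attribute_labels = (neuron_responses[:, :5].sum(axis=1) > 0).astype(float)

# L1 regularization drives most coefficients to zero, leaving a sparse set of
# concept neurons that jointly explain the attribute.
model = Lasso(alpha=0.05).fit(neuron_responses, attribute_labels)
concept_idx = np.flatnonzero(model.coef_)
print(f"{len(concept_idx)} of {neuron_responses.shape[1]} neurons selected:", concept_idx[:10])
```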
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Talking face generation aims to synthesize a sequence of face images that
correspond to a clip of speech. This is a challenging task because face
appearance variation and semantics of speech are coupled together in the subtle
movements of the talking face regions. Existing works either construct a
face appearance model for specific subjects or model the transformation between
lip motion and speech. In this work, we integrate both aspects and enable
arbitrary-subject talking face generation by learning disentangled audio-visual
representation. We find that the talking face sequence is actually a
composition of both subject-related information and speech-related information.
These two spaces are then explicitly disentangled through a novel
associative-and-adversarial training process. This disentangled representation
has an advantage where both audio and video can serve as inputs for generation.
Extensive experiments show that the proposed approach generates realistic
talking face sequences on arbitrary subjects with much clearer lip motion
patterns than previous work. We also demonstrate the learned audio-visual
representation is extremely useful for the tasks of automatic lip reading and
audio-video retrieval.
Comment: AAAI Conference on Artificial Intelligence (AAAI 2019) Oral
Presentation. Code, models, and video results are available on our webpage:
https://liuziwei7.github.io/projects/TalkingFace.htm
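A minimal sketch, assuming PyTorch, of the disentangled generation interface described above: the decoder consumes a subject embedding plus a speech-content embedding, and the content embedding can come from either the audio or the video encoder. Module names, architectures, and dimensions are hypothetical placeholders, not the paper's networks.

```python
import torch
import torch.nn as nn

class TalkingFaceGenerator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Toy encoders/decoder; the real model uses convolutional networks.
        self.identity_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
        self.video_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(80 * 16, dim))
        self.decoder = nn.Sequential(nn.Linear(2 * dim, 3 * 64 * 64), nn.Sigmoid())

    def forward(self, identity_frame, audio=None, video=None):
        pid = self.identity_enc(identity_frame)  # subject-related space
        # Because the two spaces are disentangled, speech content can be taken
        # from whichever modality is available at generation time.
        wid = self.audio_enc(audio) if audio is not None else self.video_enc(video)
        out = self.decoder(torch.cat([pid, wid], dim=1))
        return out.view(-1, 3, 64, 64)

gen = TalkingFaceGenerator()
frame = torch.rand(2, 3, 64, 64)
mel = torch.rand(2, 80 * 16)             # a short mel-spectrogram window, flattened
video_clip = torch.rand(2, 3, 64, 64)
from_audio = gen(frame, audio=mel)       # audio-driven generation
from_video = gen(frame, video=video_clip)  # video-driven generation
```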
Mix-and-Match Tuning for Self-Supervised Semantic Segmentation
Deep convolutional networks for semantic image segmentation typically require
large-scale labeled data, e.g. ImageNet and MS COCO, for network pre-training.
To reduce annotation efforts, self-supervised semantic segmentation is recently
proposed to pre-train a network without any human-provided labels. The key of
this new form of learning is to design a proxy task (e.g. image colorization),
from which a discriminative loss can be formulated on unlabeled data. Many
proxy tasks, however, lack the critical supervision signals that could induce
discriminative representation for the target image segmentation task. Thus
self-supervision's performance is still far from that of supervised
pre-training. In this study, we overcome this limitation by incorporating a
"mix-and-match" (M&M) tuning stage in the self-supervision pipeline. The
proposed approach is readily pluggable to many self-supervision methods and
does not use more annotated samples than the original process. Yet, it is
capable of boosting the performance of the target image segmentation task to
surpass its fully-supervised pre-trained counterpart. The improvement is made
possible by better harnessing the limited pixel-wise annotations in the target
dataset. Specifically, we first introduce the "mix" stage, which sparsely
samples and mixes patches from the target set to reflect rich and diverse local
patch statistics of target images. A "match" stage then forms a class-wise
connected graph, which can be used to derive a strong triplet-based
discriminative loss for fine-tuning the network. Our paradigm follows the
standard practice in existing self-supervised studies and no extra data or
label is required. With the proposed M&M approach, for the first time, a
self-supervision method can achieve comparable or even better performance
compared to its ImageNet pre-trained counterpart on both PASCAL VOC2012 dataset
and CityScapes dataset.
Comment: To appear in AAAI 2018 as a spotlight paper. More details at the
project page: http://mmlab.ie.cuhk.edu.hk/projects/M%26M
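A minimal sketch, assuming PyTorch, of the "match" stage's triplet-based loss: patch features sharing a class label are pulled together and patches of different classes are pushed apart. The sampling and graph construction here are simplified stand-ins for the class-wise connected graph described in the abstract.

```python
import torch
import torch.nn.functional as F

def patch_triplet_loss(embeddings, labels, margin=0.5):
    """embeddings: (N, D) features of sampled patches; labels: (N,) patch classes."""
    loss, count = embeddings.new_zeros(()), 0
    for i in range(len(labels)):
        pos = (labels == labels[i]).nonzero(as_tuple=True)[0]
        neg = (labels != labels[i]).nonzero(as_tuple=True)[0]
        if len(pos) < 2 or len(neg) == 0:
            continue
        # Pick one positive (same class, i.e. "connected" in the graph) and one negative.
        p, n = pos[pos != i][0], neg[0]
        d_ap = F.pairwise_distance(embeddings[i:i+1], embeddings[p:p+1])
        d_an = F.pairwise_distance(embeddings[i:i+1], embeddings[n:n+1])
        loss = loss + F.relu(d_ap - d_an + margin).squeeze()
        count += 1
    return loss / max(count, 1)

# Example: 8 mixed patches from the target set, 128-d features, 3 classes.
feats = torch.randn(8, 128, requires_grad=True)
classes = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
patch_triplet_loss(feats, classes).backward()
```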
Distributed Estimation and Inference with Statistical Guarantees
This paper studies hypothesis testing and parameter estimation in the context
of the divide and conquer algorithm. In a unified likelihood based framework,
we propose new test statistics and point estimators obtained by aggregating
various statistics from $k$ subsamples of size $n/k$, where $n$ is the sample
size. In both low dimensional and high dimensional settings, we address the
important question of how to choose $k$ as $n$ grows large, providing a
theoretical upper bound on $k$ such that the information loss due to the divide
and conquer algorithm is negligible. In other words, the resulting estimators
have the same inferential efficiencies and estimation rates as a practically
infeasible oracle with access to the full sample. Thorough numerical results
are provided to back up the theory.
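A minimal numpy sketch of the divide-and-conquer aggregation discussed above: the full sample of size $n$ is split into $k$ subsamples of size $n/k$, an estimator (here the sample mean, standing in for a general likelihood-based statistic) is computed on each, and the $k$ results are averaged. The data and parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 20
data = rng.normal(loc=1.5, scale=2.0, size=n)

# Compute the estimator on each of the k subsamples of size n/k, then aggregate.
sub_estimates = [chunk.mean() for chunk in np.array_split(data, k)]
dac_estimate = np.mean(sub_estimates)

# For the sample mean the aggregated estimator matches the full-sample ("oracle")
# estimator exactly; for general statistics the paper bounds how large k can be
# before the divide-and-conquer step loses efficiency.
print(dac_estimate, data.mean())
```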