Learning to Hallucinate Face Images via Component Generation and Enhancement
We propose a two-stage method for face hallucination. First, we generate
facial components of the input image using CNNs. These components represent the
basic facial structures. Second, we synthesize fine-grained facial structures
from high-resolution training images. The details of these structures are
transferred into the facial components for enhancement. Therefore, we generate
facial components to approximate the ground-truth global appearance in the first
stage and enhance them by recovering details in the second stage. The
experiments demonstrate that our method performs favorably against
state-of-the-art methods.
Comment: IJCAI 2017. Project page:
http://www.cs.cityu.edu.hk/~yibisong/ijcai17_sr/index.htm
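To make the two-stage idea concrete, below is a minimal PyTorch-style sketch, assuming a small upsampling CNN for the first stage and naive nearest-patch detail transfer from a single high-resolution exemplar for the second; the layer sizes, patch size, and matching scheme are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComponentGenerator(nn.Module):
    """Stage 1: a CNN mapping a low-resolution face to a coarse component image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, lr_face):
        # Upsample first, then predict the coarse facial-component image.
        up = F.interpolate(lr_face, scale_factor=4, mode="bilinear",
                           align_corners=False)
        return self.net(up)

def enhance_with_exemplar(coarse, exemplar, patch=8):
    """Stage 2 (toy version): add the high-frequency residual of the closest
    exemplar patch to every patch of the coarse output."""
    c = F.unfold(coarse, patch, stride=patch).transpose(1, 2)    # (B, N, 3*p*p)
    e = F.unfold(exemplar, patch, stride=patch).transpose(1, 2)  # (B, M, 3*p*p)
    idx = torch.cdist(c, e).argmin(dim=2)                        # nearest patch
    matched = torch.gather(e, 1, idx.unsqueeze(-1).expand(-1, -1, e.size(2)))
    detail = matched - matched.mean(dim=2, keepdim=True)         # high frequency
    return F.fold((c + detail).transpose(1, 2), coarse.shape[-2:],
                  patch, stride=patch)

gen = ComponentGenerator()
coarse = gen(torch.rand(1, 3, 32, 32))                  # (1, 3, 128, 128)
enhanced = enhance_with_exemplar(coarse, torch.rand(1, 3, 128, 128))
```

In the paper, the second stage searches fine-grained structures across many high-resolution training images rather than a single exemplar as in this toy version.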
Stylizing Face Images via Multiple Exemplars
We address the problem of transferring the style of a headshot photo to face
images. Existing methods using a single exemplar lead to inaccurate results
when the exemplar does not contain sufficient stylized facial components for a
given photo. In this work, we propose an algorithm to stylize face images using
multiple exemplars containing different subjects in the same style. Patch
correspondences between an input photo and multiple exemplars are established
using a Markov Random Field (MRF), which enables accurate local energy transfer
via Laplacian stacks. As image patches from multiple exemplars are used, the
boundaries of facial components on the target image are inevitably
inconsistent. The artifacts are removed by a post-processing step using an
edge-preserving filter. Experimental results show that the proposed algorithm
consistently produces visually pleasing results.
Comment: In CVIU 2017. Project Page:
http://www.cs.cityu.edu.hk/~yibisong/cviu17/index.htm
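A minimal sketch of the Laplacian-stack local energy transfer mentioned above, assuming a single already-aligned grayscale exemplar (in the paper, correspondences across multiple exemplars come from the MRF); the sigma schedule, stack depth, and gain clipping are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def laplacian_stack(img, levels=4, sigma0=2.0):
    """Full-resolution band-pass decomposition plus a low-pass residual."""
    stack, prev = [], img.astype(np.float64)
    for k in range(levels):
        blurred = gaussian_filter(prev, sigma=sigma0 * (2 ** k))
        stack.append(prev - blurred)      # band-pass detail at this scale
        prev = blurred
    stack.append(prev)                    # low-pass residual
    return stack

def transfer_energy(input_stack, exemplar_stack, eps=1e-4):
    """Rescale each input band so its local energy matches the exemplar's."""
    out = []
    for li, le in zip(input_stack[:-1], exemplar_stack[:-1]):
        ei = gaussian_filter(li ** 2, sigma=4.0)      # local energy maps
        ee = gaussian_filter(le ** 2, sigma=4.0)
        gain = np.sqrt((ee + eps) / (ei + eps))
        out.append(li * np.clip(gain, 0.5, 2.0))      # clip to limit artifacts
    out.append(exemplar_stack[-1])        # take the exemplar's low-pass layer
    return sum(out)

photo = np.random.rand(128, 128)
exemplar = np.random.rand(128, 128)
stylized = transfer_energy(laplacian_stack(photo), laplacian_stack(exemplar))
```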
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without
human-annotated labels. While previous efforts address the problem by designing
novel self-supervised tasks using video data, the learned features are merely
frame-based, and thus not applicable to many video analysis tasks where
spatio-temporal features prevail. In this paper, we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzle-based pretext tasks that are hard even for
humans to solve, the proposed task is consistent with inherent human visual
habits and is therefore easy to answer. We conduct extensive experiments with C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
https://github.com/laura-wang/video_repres_mas.
Comment: CVPR 201
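A hedged sketch of how motion-statistics labels of this kind could be derived from raw video, using Farneback optical flow to locate the fastest-moving region and its dominant direction; the 4x4 block grid and 8-way angle quantization are assumptions, as the paper's exact pattern design may differ.

```python
import cv2
import numpy as np

def motion_statistics_label(frames, grid=4, n_angles=8):
    """Return (index of the fastest-moving block, its dominant flow direction).

    frames: list of same-sized grayscale uint8 frames.
    """
    h, w = frames[0].shape
    mag_sum = np.zeros((h, w), np.float64)
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mag_sum += mag
        flows.append((mag, ang))
    # Fastest-moving block on a grid x grid partition of the frame.
    bh, bw = h // grid, w // grid
    blocks = mag_sum[:grid*bh, :grid*bw].reshape(grid, bh, grid, bw).sum(axis=(1, 3))
    block_idx = int(blocks.argmax())
    # Dominant direction inside that block: magnitude-weighted angle histogram.
    r, c = divmod(block_idx, grid)
    hist = np.zeros(n_angles)
    for mag, ang in flows:
        m = mag[r*bh:(r+1)*bh, c*bw:(c+1)*bw].ravel()
        a = ang[r*bh:(r+1)*bh, c*bw:(c+1)*bw].ravel()
        bins = (a / (2 * np.pi) * n_angles).astype(int) % n_angles
        np.add.at(hist, bins, m)
    return block_idx, int(hist.argmax())

# Toy usage with random frames; real inputs would be grayscale video frames.
frames = [np.random.randint(0, 255, (64, 64), np.uint8) for _ in range(4)]
print(motion_statistics_label(frames))
```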
Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
This paper proposes a novel pretext task to address the self-supervised video
representation learning problem. Specifically, given an unlabeled video clip,
we compute a series of spatio-temporal statistical summaries, such as the
spatial location and dominant direction of the largest motion, the spatial
location and dominant color of the largest color diversity along the temporal
axis, etc. Then a neural network is built and trained to yield the statistical
summaries given the video frames as inputs. In order to alleviate the learning
difficulty, we employ several spatial partitioning patterns to encode rough
spatial locations instead of exact spatial Cartesian coordinates. Our approach
is inspired by the observation that the human visual system is sensitive to rapidly
changing contents in the visual field, and only needs impressions about rough
spatial locations to understand the visual contents. To validate the
effectiveness of the proposed approach, we conduct extensive experiments with
four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results
show that our approach outperforms the existing approaches across these
backbone networks on four downstream video analysis tasks including action
recognition, video retrieval, dynamic scene recognition, and action similarity
labeling. The source code is publicly available at:
https://github.com/laura-wang/video_repres_sts.
Comment: Accepted by TPAMI. An extension of our previous work at
arXiv:1904.0359
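As a toy illustration of encoding rough spatial locations with partitioning patterns instead of exact Cartesian coordinates, the sketch below maps a point to a class label under two assumed patterns, a uniform grid and concentric rings; the actual partitioning patterns used in the paper may differ.

```python
import numpy as np

def grid_label(y, x, h, w, grid=4):
    """Uniform grid pattern: which of the grid*grid cells contains (y, x)?"""
    return int(y * grid // h) * grid + int(x * grid // w)

def ring_label(y, x, h, w, rings=4):
    """Concentric pattern: which ring around the image center contains (y, x)?"""
    cy, cx = (h - 1) / 2, (w - 1) / 2
    # Normalize by the distance to the farthest corner so labels stay in range.
    r = np.hypot(y - cy, x - cx) / np.hypot(cy, cx)
    return min(int(r * rings), rings - 1)

# A network is then trained with a classification loss over these coarse
# labels rather than regressing exact (y, x) coordinates.
loc = (37, 90)                          # e.g. center of the largest motion
print(grid_label(*loc, h=112, w=112))   # cell index in [0, 15]
print(ring_label(*loc, h=112, w=112))   # ring index in [0, 3]
```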
Laplacian Denoising Autoencoder
While deep neural networks have been shown to perform remarkably well in many
machine learning tasks, labeling a large amount of ground truth data for
supervised training is usually very costly to scale. Therefore, learning robust
representations with unlabeled data is critical in relieving human effort and
vital for many downstream tasks. Recent advances in unsupervised and
self-supervised learning approaches for visual data have benefited greatly from
domain knowledge. Here we are interested in a more generic unsupervised
learning framework that can be easily generalized to other domains. In this
paper, we propose to learn data representations with a novel type of denoising
autoencoder, where the noisy input data is generated by corrupting latent clean
data in the gradient domain. This can be naturally generalized to span multiple
scales with a Laplacian pyramid representation of the input data. In this way,
the agent learns more robust representations that exploit the underlying data
structures across multiple scales. Experiments on several visual benchmarks
demonstrate that better representations can be learned with the proposed
approach, compared to its counterpart with single-scale corruption and other
approaches. Furthermore, we demonstrate that the learned representations
transfer well to other downstream vision tasks.
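A minimal single-channel sketch of the multi-scale corruption described above, assuming Gaussian noise injected into the band-pass levels of a Laplacian pyramid before reconstructing the corrupted input; the pyramid depth and noise scale are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(img, levels=3):
    """Laplacian pyramid of a 2D (grayscale) image."""
    pyr, cur = [], img.astype(np.float64)
    for _ in range(levels):
        low = gaussian_filter(cur, sigma=1.0)
        down = low[::2, ::2]                                  # downsample by 2
        up = zoom(down, 2, order=1)[:cur.shape[0], :cur.shape[1]]
        pyr.append(cur - up)                                  # band-pass level
        cur = down
    pyr.append(cur)                                           # low-pass residual
    return pyr

def reconstruct(pyr):
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = zoom(cur, 2, order=1)[:band.shape[0], :band.shape[1]] + band
    return cur

def corrupt(img, sigma=0.1, seed=0):
    """Corrupt the band-pass (gradient-like) levels, keep the low-pass intact."""
    rng = np.random.default_rng(seed)
    pyr = laplacian_pyramid(img)
    noisy = [b + rng.normal(0.0, sigma, b.shape) for b in pyr[:-1]] + [pyr[-1]]
    return reconstruct(noisy)

# The autoencoder would then be trained to map corrupt(x) back to x.
x = np.random.rand(64, 64)
x_noisy = corrupt(x)
```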
Joint Face Hallucination and Deblurring via Structure Generation and Detail Enhancement
We address the problem of restoring a high-resolution face image from a
blurry low-resolution input. This problem is difficult as super-resolution and
deblurring need to be tackled simultaneously. Moreover, existing algorithms
cannot handle face images well because low-resolution face images lack the
texture that is especially critical for deblurring. In this paper, we propose
an effective algorithm by utilizing the domain-specific knowledge of human
faces to recover high-quality faces. We first propose a facial-component-guided
deep Convolutional Neural Network (CNN) to restore a coarse face image, denoted
as the base image; the facial components are automatically generated from the
input face image. However, the CNN-based method cannot
handle image details well. We further develop a novel exemplar-based detail
enhancement algorithm via facial component matching. Extensive experiments show
that the proposed method outperforms the state-of-the-art algorithms both
quantitatively and qualitatively.
Comment: In IJCV 201
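A small sketch of what the facial-component-guided restoration network could look like: the blurry low-resolution input is upsampled and concatenated with automatically generated component maps before restoration. The channel counts and the number of component maps are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComponentGuidedCNN(nn.Module):
    def __init__(self, n_components=5):
        super().__init__()
        # 3 RGB channels + one soft mask per facial component (eyes, nose, ...).
        self.net = nn.Sequential(
            nn.Conv2d(3 + n_components, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, blurry_lr, component_maps):
        # Upsample the degraded input to the component-map resolution,
        # then restore conditioned on the component guidance.
        up = F.interpolate(blurry_lr, size=component_maps.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.net(torch.cat([up, component_maps], dim=1))

base = ComponentGuidedCNN()(torch.rand(1, 3, 32, 32), torch.rand(1, 5, 128, 128))
```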
Iterative Reconstruction Based on Latent Diffusion Model for Sparse Data Reconstruction
Reconstructing computed tomography (CT) images from sparse measurements is a
well-known ill-posed inverse problem. Iterative Reconstruction (IR) algorithms
are a common solution to such inverse problems. However, recent IR methods require
paired data and the approximation of the inverse projection matrix. To address
those problems, we present Latent Diffusion Iterative Reconstruction (LDIR), a
pioneering zero-shot method that extends IR with a pre-trained Latent Diffusion
Model (LDM) as an accurate and efficient data prior. By approximating the prior
distribution with an unconditional latent diffusion model, LDIR is the first
method to successfully integrate iterative reconstruction and LDM in an
unsupervised manner. LDIR makes the reconstruction of high-resolution images
more efficient. Moreover, LDIR utilizes the gradient from the data-fidelity
term to guide the sampling process of the LDM; therefore, LDIR does not need
the approximation of the inverse projection matrix and can solve various CT
reconstruction tasks with a single model. Additionally, to enhance the sample
consistency of the reconstruction, we introduce a novel approach that uses
historical gradient information to guide the gradient updates. Our experiments on
extremely sparse CT data reconstruction tasks show that LDIR outperforms other
state-of-the-art unsupervised methods and even exceeds supervised methods,
establishing it as a leading technique both quantitatively and qualitatively.
Furthermore, LDIR achieves competitive performance on natural image tasks. It is
worth noting that LDIR also exhibits significantly faster execution times and
lower memory consumption than methods with similar network settings. Our code
will be made publicly available.
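A conceptual sketch of gradient-guided sampling with historical gradient information, realized here as an exponential moving average over data-fidelity gradients. The denoiser, forward operator, and step sizes below are placeholders, not the authors' implementation or the LDM interface.

```python
import torch

def guided_step(z, y, A, denoiser, t, grad_hist, eta=0.1, beta=0.9):
    """One reverse step of the diffusion model, nudged by the data fidelity."""
    z = z.detach().requires_grad_(True)
    z_prev = denoiser(z, t)                       # unconditional reverse step
    loss = (A(z_prev) - y).pow(2).sum()           # data fidelity on sparse data
    grad = torch.autograd.grad(loss, z)[0]
    # "Historical gradient": an exponential moving average over past gradients,
    # used here to smooth the guidance across sampling steps.
    grad_hist = beta * grad_hist + (1 - beta) * grad
    return (z_prev - eta * grad_hist).detach(), grad_hist

# Toy usage: a subsampling operator stands in for the sparse CT projection,
# and a shrinking identity stands in for the pre-trained latent denoiser.
A = lambda x: x[::4]                              # keep every 4th measurement
denoiser = lambda z, t: 0.99 * z
z, g, y = torch.randn(64), torch.zeros(64), torch.zeros(16)
for t in reversed(range(10)):
    z, g = guided_step(z, y, A, denoiser, t, g)
```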