Towards Robust Blind Face Restoration with Codebook Lookup Transformer
Blind face restoration is a highly ill-posed problem that often requires
auxiliary guidance to 1) improve the mapping from degraded inputs to desired
outputs, or 2) complement high-quality details lost in the inputs. In this
paper, we demonstrate that a learned discrete codebook prior in a small proxy
space largely reduces the uncertainty and ambiguity of restoration mapping by
casting blind face restoration as a code prediction task, while providing rich
visual atoms for generating high-quality faces. Under this paradigm, we propose
a Transformer-based prediction network, named CodeFormer, to model the global
composition and context of the low-quality faces for code prediction, enabling
the discovery of natural faces that closely approximate the target faces even
when the inputs are severely degraded. To enhance the adaptiveness to different degradations, we also propose a controllable feature transformation
module that allows a flexible trade-off between fidelity and quality. Thanks to
the expressive codebook prior and global modeling, CodeFormer outperforms state-of-the-art methods in both quality and fidelity, showing superior robustness to
degradation. Extensive experimental results on synthetic and real-world
datasets verify the effectiveness of our method.
Comment: Accepted by NeurIPS 2022. Code: https://github.com/sczhou/CodeFormer
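For illustration, here is a minimal PyTorch sketch of the code-prediction idea described above: a Transformer predicts discrete code indices from features of the low-quality face, and restored features are retrieved by looking the predicted codes up in a learned codebook. Module names, sizes, and the training note in the comments are assumptions, not the official CodeFormer implementation.

```python
# Minimal sketch of codebook-lookup restoration (illustrative, not the official
# CodeFormer code; module names and dimensions are assumptions).
import torch
import torch.nn as nn

class CodePredictor(nn.Module):
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)     # learned discrete prior
        layer = nn.TransformerEncoderLayer(d_model=code_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(code_dim, num_codes)       # per-token code classification

    def forward(self, lq_tokens):
        # lq_tokens: (B, T, code_dim) features extracted from the low-quality face
        ctx = self.transformer(lq_tokens)                     # global composition and context
        logits = self.to_logits(ctx)                          # (B, T, num_codes)
        codes = logits.argmax(dim=-1)                         # predicted code indices
        return self.codebook(codes), logits                   # quantized features for a decoder

# Training would supervise `logits` with cross-entropy against code indices
# obtained from a VQ autoencoder pretrained on high-quality faces (assumption).
```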
Deep Depth Completion of a Single RGB-D Image
The goal of our work is to complete the depth channel of an RGB-D image.
Commodity-grade depth cameras often fail to sense depth for shiny, bright,
transparent, and distant surfaces. To address this problem, we train a deep
network that takes an RGB image as input and predicts dense surface normals and
occlusion boundaries. Those predictions are then combined with raw depth
observations provided by the RGB-D camera to solve for depths for all pixels,
including those missing in the original observation. This method was chosen
over others (e.g., inpainting depths directly) as the result of extensive
experiments with a new depth completion benchmark dataset, where holes are
filled in training data through the rendering of surface reconstructions
created from multiview RGB-D scans. Experiments with different network inputs,
depth representations, loss functions, optimization methods, inpainting
methods, and deep depth estimation networks show that our proposed approach
provides better depth completions than these alternatives.
Comment: Accepted by CVPR 2018 (Spotlight). Project webpage: http://deepcompletion.cs.princeton.edu/. This version includes supplementary materials which provide more implementation details, quantitative evaluation, and qualitative results. Due to the file size limit, please check the project website for the high-res paper.
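For intuition, a hedged sketch of the final solve step described above: dense depth is recovered from a sparse linear least-squares system that combines raw depth observations with a smoothness term down-weighted at predicted occlusion boundaries. This is a simplified stand-in for the paper's normal-based objective; the function name, weights, and neighborhood structure are illustrative.

```python
# Simplified global optimization for depth completion: data term on observed
# pixels plus boundary-weighted smoothness, solved as sparse least squares.
# Illustrative stand-in, not the authors' normal-based formulation.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def complete_depth(raw_depth, boundary_prob, w_data=1.0, w_smooth=1.0):
    H, W = raw_depth.shape
    idx = lambda y, x: y * W + x
    rows, cols, vals, rhs = [], [], [], []
    r = 0
    # Data term: keep observed (non-zero) raw depths.
    for y in range(H):
        for x in range(W):
            if raw_depth[y, x] > 0:
                rows.append(r); cols.append(idx(y, x)); vals.append(w_data)
                rhs.append(w_data * raw_depth[y, x]); r += 1
    # Smoothness term: neighboring depths agree, down-weighted across boundaries.
    for y in range(H):
        for x in range(W):
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < H and nx < W:
                    w = w_smooth * (1.0 - boundary_prob[y, x])
                    rows += [r, r]; cols += [idx(y, x), idx(ny, nx)]
                    vals += [w, -w]; rhs.append(0.0); r += 1
    A = sp.csr_matrix((vals, (rows, cols)), shape=(r, H * W))
    return lsqr(A, np.asarray(rhs))[0].reshape(H, W)
```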
PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance
Exploiting pre-trained diffusion models for restoration has recently become a
favored alternative to the traditional task-specific training approach.
Previous works have achieved noteworthy success by limiting the solution space
using explicit degradation models. However, these methods often fall short when
faced with complex degradations as they generally cannot be precisely modeled.
In this paper, we propose PGDiff by introducing partial guidance, a fresh
perspective that is more adaptable to real-world degradations compared to
existing works. Rather than specifically defining the degradation process, our
approach models the desired properties, such as image structure and color
statistics of high-quality images, and applies this guidance during the reverse
diffusion process. These properties are readily available and make no
assumptions about the degradation process. When combined with a diffusion
prior, this partial guidance can deliver appealing results across a range of
restoration tasks. Additionally, PGDiff can be extended to handle composite
tasks by consolidating multiple high-quality image properties, achieved by
integrating the guidance from respective tasks. Experimental results
demonstrate that our method not only outperforms existing diffusion-prior-based
approaches but also competes favorably with task-specific models.
Comment: GitHub: https://github.com/pq-yang/PGDiff
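A minimal sketch of how such gradient-based partial guidance can be injected into one reverse-diffusion step: the clean-image estimate is nudged along the gradient of a property loss (here, a toy color-statistics loss) before stepping to the previous timestep. The denoiser, noise schedule, and loss below are placeholders and this is not the official PGDiff code.

```python
# Hedged sketch of partial guidance inside a reverse diffusion step.
import torch

@torch.enable_grad()
def guided_reverse_step(x_t, t, denoiser, alpha_bar, property_loss, scale=1.0):
    # x_t: current noisy sample; alpha_bar: cumulative noise schedule (1D tensor)
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)                                    # predicted noise (placeholder model)
    x0_hat = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    # Partial guidance: push the clean estimate toward the desired property
    # (e.g. color statistics of high-quality images) via its gradient.
    grad = torch.autograd.grad(property_loss(x0_hat), x_t)[0]
    x0_guided = x0_hat - scale * grad
    # Deterministic (DDIM-style) move to the previous timestep with the guided estimate.
    a_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    return (a_prev.sqrt() * x0_guided + (1 - a_prev).sqrt() * eps).detach()

# Example property loss: match assumed per-channel color statistics.
def color_stat_loss(x0_hat, target_mean=0.0, target_std=0.5):
    return (x0_hat.mean() - target_mean) ** 2 + (x0_hat.std() - target_std) ** 2
```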
Blind Face Restoration for Under-Display Camera via Dictionary Guided Transformer
By hiding the front-facing camera below the display panel, Under-Display
Camera (UDC) provides users with a full-screen experience. However, due to the
characteristics of the display, images taken by UDC suffer from significant
quality degradation. Methods have been proposed to tackle UDC image restoration, and advances have been achieved. However, there are still no specialized methods and datasets for restoring UDC face images, which may be the most common problem in the UDC scenario. To this end, considering the color filtering, brightness attenuation, and diffraction in the UDC imaging process, we propose a two-stage UDC Degradation Model Network, named UDC-DMNet, to synthesize UDC images by modeling the UDC imaging process. Then we use UDC-DMNet and
high-quality face images from FFHQ and CelebA-Test to create UDC face training
datasets FFHQ-P/T and testing datasets CelebA-Test-P/T for UDC face
restoration. We propose a novel dictionary-guided transformer network named
DGFormer. Introducing a facial component dictionary and the characteristics of UDC images into the restoration makes DGFormer capable of addressing blind face restoration in UDC scenarios. Experiments show that our DGFormer and UDC-DMNet achieve state-of-the-art performance.
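As a rough illustration of the three degradation factors the abstract names (this is not UDC-DMNet itself, whose two stages are learned), a hand-set simulation might look like the sketch below; the color filter, brightness factor, and diffraction PSF are assumed inputs.

```python
# Hedged sketch of a simple UDC degradation simulation: color filtering,
# brightness attenuation, and diffraction blur. Illustrative only; UDC-DMNet
# replaces these hand-set operations with learned ones.
import torch
import torch.nn.functional as F

def simulate_udc(hq_img, color_filter, brightness, psf):
    # hq_img: (B, 3, H, W) in [0, 1]; color_filter: (3,) per-channel transmittance
    # brightness: scalar attenuation factor; psf: (k, k) diffraction kernel
    x = hq_img * color_filter.view(1, 3, 1, 1)                      # color filtering
    x = x * brightness                                              # brightness attenuation
    kernel = psf[None, None].repeat(3, 1, 1, 1)                     # same PSF per channel
    x = F.conv2d(x, kernel, padding=psf.shape[-1] // 2, groups=3)   # diffraction blur
    return x.clamp(0, 1)
```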
3D Human Face Reconstruction and 2D Appearance Synthesis
3D human face reconstruction has been an active research area for decades due to its wide range of applications, such as animation, recognition, and 3D-driven appearance synthesis. Although commodity depth sensors have become widely available in recent years, image-based face reconstruction remains highly valuable, as images are much easier to acquire and store.
In this dissertation, we first propose three image-based face reconstruction approaches, each based on different assumptions about the input.
In the first approach, face geometry is extracted from multiple key frames of a video sequence with different head poses. This approach assumes a calibrated camera.
As the first approach is limited to videos, the second approach focuses on a single image. This approach also refines the geometry by adding fine-grained details using shading cues. We propose a novel albedo estimation and linear optimization algorithm in this approach.
In the third approach, we further relax the constraints on the input to arbitrary in-the-wild images. Our proposed approach can robustly reconstruct high-quality models even with extreme expressions and large poses.
We then explore the applicability of our face reconstructions in four interesting applications: video face beautification, generating personalized facial blendshapes from image sequences, face video stylization, and video face replacement. We demonstrate the great potential of our reconstruction approaches in these real-world applications. In particular, with the recent surge of interest in VR/AR, it is increasingly common to see people wearing head-mounted displays (HMDs). However, the large occlusion on the face is a big obstacle for people to communicate in a face-to-face manner. In another application, we explore hardware/software solutions for synthesizing face images in the presence of HMDs. We design two setups (experimental and mobile) that integrate two near-IR cameras and one color camera to solve this problem. With our algorithm and prototype, we can achieve photo-realistic results.
We further propose a deep neural network to solve the HMD removal problem by treating it as a face inpainting problem. This approach doesn't need special hardware and runs in real time with satisfying results.
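As a rough sketch of treating HMD removal as masked face inpainting (an assumed toy encoder-decoder, not the dissertation's network), the model can take the occluded face plus the HMD mask and fill only the masked region:

```python
# Illustrative masked-inpainting sketch (assumed architecture, not the thesis's model).
import torch
import torch.nn as nn

class InpaintNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(4, 64, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, face, mask):
        # face: (B, 3, H, W); mask: (B, 1, H, W), 1 where the HMD occludes the face
        out = self.dec(self.enc(torch.cat([face * (1 - mask), mask], dim=1)))
        return face * (1 - mask) + out * mask   # fill only the occluded region
```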
Audio-Visual Learning for Scene Understanding
Multimodal deep learning aims at combining the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we focus our study on audio-visual deep learning, aiming to mimic with our networks how humans perceive the world.
Our research includes images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones by combining their raw audio signals with a beamforming algorithm. They better mimic the human auditory system, which cannot be replicated using just one microphone, since a single microphone alone cannot provide spatial sound cues.
However, as microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time.
As a solution, we propose to distill acoustic image content into audio features during training in order to handle their absence at test time. This is done for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning.
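A minimal sketch of the generalized-distillation objective just described, assuming a teacher trained on acoustic images and an audio-only student; the loss weights, temperature, and feature-matching term are illustrative choices rather than the thesis's exact formulation.

```python
# Hedged sketch of a distillation loss: the audio student matches ground-truth
# labels, the teacher's softened predictions, and the teacher's acoustic-image
# features, so the spatial modality can be dropped at test time.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, student_feat, teacher_logits, teacher_feat,
                      labels, T=2.0, alpha=0.5, beta=0.5):
    ce = F.cross_entropy(student_logits, labels)               # supervised term
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)           # softened teacher predictions
    feat = F.mse_loss(student_feat, teacher_feat)              # match acoustic-image features
    return ce + alpha * soft + beta * feat
```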
Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame. Therefore, when only a standard video is available, we are able to synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization.
Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, reconstructing the missing region to be not only visually plausible but also semantically consistent with the related sound. This also includes cross-modal generation in the limiting case of a completely missing or hidden visual modality: our method naturally handles it, being able to generate images from sound.
In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.