145 research outputs found
Learning Joint Spatial-Temporal Transformations for Video Inpainting
High-quality video inpainting that completes missing regions in video frames
is a promising yet challenging task. State-of-the-art approaches adopt
attention models to complete a frame by searching for missing content in
reference frames, and then complete whole videos frame by frame. However,
these approaches can suffer from inconsistent attention results along spatial
and temporal dimensions, which often leads to blurriness and temporal artifacts
in videos. In this paper, we propose to learn a joint Spatial-Temporal
Transformer Network (STTN) for video inpainting. Specifically, we
simultaneously fill missing regions in all input frames by self-attention, and
propose to optimize STTN with a spatial-temporal adversarial loss. To show the
superiority of the proposed model, we conduct both quantitative and qualitative
evaluations by using standard stationary masks and more realistic moving object
masks. Demo videos are available at https://github.com/researchmm/STTN.
Comment: Accepted by ECCV 2020.
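To make the idea of joint completion concrete, the core step can be pictured as one self-attention pass over patch tokens gathered from every input frame at once, so that a missing patch in one frame can borrow content from any other frame. The PyTorch sketch below illustrates only that idea; the patch size, channel width, and head count are illustrative assumptions, not the authors' STTN implementation.

import torch
import torch.nn as nn

class SpatialTemporalSelfAttention(nn.Module):
    # Toy joint spatial-temporal self-attention: tokens from all frames
    # attend to each other in a single pass (assumed sizes, not STTN itself).
    def __init__(self, dim=256, heads=4, patch=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, frames, masks):
        # frames: (B, T, 3, H, W); masks: (B, T, 1, H, W), 1 marks missing pixels.
        b, t, c, h, w = frames.shape
        x = self.to_tokens((frames * (1 - masks)).flatten(0, 1))   # (B*T, D, H/p, W/p)
        d, hp, wp = x.shape[1:]
        tokens = x.flatten(2).transpose(1, 2).reshape(b, t * hp * wp, d)
        # Every token from every frame attends to every other token, so holes
        # in all frames are filled jointly rather than frame by frame.
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.reshape(b * t, hp, wp, d).permute(0, 3, 1, 2)
        return self.to_pixels(out).reshape(b, t, c, h, w)

frames = torch.rand(1, 5, 3, 64, 64)
masks = (torch.rand(1, 5, 1, 64, 64) > 0.9).float()
completed = SpatialTemporalSelfAttention()(frames, masks)
print(completed.shape)  # torch.Size([1, 5, 3, 64, 64])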
DeepDR: Deep Structure-Aware RGB-D Inpainting for Diminished Reality
Diminished reality (DR) refers to the removal of real objects from the
environment by virtually replacing them with their background. Modern DR
frameworks use inpainting to hallucinate unobserved regions. While recent deep
learning-based inpainting is promising, the DR use case is complicated by the
need to generate coherent structure and 3D geometry (i.e., depth), in
particular for advanced applications, such as 3D scene editing. In this paper,
we propose DeepDR, the first RGB-D inpainting framework that fulfills all
requirements of DR: plausible image and geometry inpainting with coherent
structure, running at real-time frame rates with minimal temporal artifacts.
Our structure-aware generative network allows us to explicitly condition color
and depth outputs on the scene semantics, overcoming the difficulty of
reconstructing sharp and consistent boundaries in regions with complex
backgrounds. Experimental results show that the proposed framework can
outperform related work qualitatively and quantitatively.
Comment: 11 pages, 8 figures + 13 pages, 10 figures supplementary. Accepted at 3DV 202
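The structure-aware conditioning described above can be sketched as a generator that sees the masked color, the masked depth, and a semantic label map, and produces a color head and a depth head. The following toy PyTorch module only illustrates that input/output arrangement; the layer widths, the one-hot semantic encoding, and the class count are assumptions for illustration, not the DeepDR network.

import torch
import torch.nn as nn

class SemanticsConditionedInpainter(nn.Module):
    # Toy RGB-D inpainter conditioned on a semantic map (assumed layout).
    def __init__(self, num_classes=13, width=64):
        super().__init__()
        in_ch = 3 + 1 + 1 + num_classes          # RGB + depth + mask + semantics
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.rgb_head = nn.Conv2d(width, 3, 3, padding=1)    # inpainted color
        self.depth_head = nn.Conv2d(width, 1, 3, padding=1)  # inpainted depth

    def forward(self, rgb, depth, mask, semantics):
        # mask: 1 marks the removed object; semantics: one-hot (B, C, H, W) map
        # that tells the network where object and background boundaries lie.
        x = torch.cat([rgb * (1 - mask), depth * (1 - mask), mask, semantics], dim=1)
        feat = self.encoder(x)
        return self.rgb_head(feat), self.depth_head(feat)

rgb = torch.rand(1, 3, 128, 128)
depth = torch.rand(1, 1, 128, 128)
mask = (torch.rand(1, 1, 128, 128) > 0.85).float()
semantics = torch.zeros(1, 13, 128, 128)
semantics[:, 0] = 1.0
out_rgb, out_depth = SemanticsConditionedInpainter()(rgb, depth, mask, semantics)
print(out_rgb.shape, out_depth.shape)  # (1, 3, 128, 128) (1, 1, 128, 128)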
FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer
Limited by hardware cost and system size, a camera's Field-of-View (FoV) is not
always satisfactory. However, from a spatio-temporal perspective, information
beyond the camera's physical FoV is readily available and can be obtained
"for free" from the past. In this paper, we propose a novel task termed
Beyond-FoV Estimation, which aims to exploit past visual cues to bidirectionally
break through the physical FoV of a camera. We put forward FlowLens, an
architecture that expands the FoV by propagating features explicitly via
optical flow and implicitly via a novel clip-recurrent transformer, which has
two appealing features: 1) FlowLens comprises a newly proposed Clip-Recurrent
Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global
information accumulated in the temporal dimension. 2) A multi-branch Mix Fusion
Feed Forward Network (MixF3N) is integrated to enhance the spatially-precise
flow of local features. To foster training and evaluation, we establish
KITTI360-EX, a dataset for outer- and inner-FoV expansion. Extensive
experiments on both video inpainting and beyond-FoV estimation tasks show that
FlowLens achieves state-of-the-art performance. Code will be made publicly
available at https://github.com/MasterHow/FlowLens.
Comment: Code will be made publicly available at https://github.com/MasterHow/FlowLens
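The two propagation paths named in the abstract, explicit warping by optical flow and implicit retrieval from a recurrent clip cache, can be sketched as follows. The feature sizes, the rolling token cache, and the module names here are assumptions for illustration, not the published FlowLens architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat, flow):
    # Warp features from a past frame into the current view with a flow field
    # given in pixels, shape (B, 2, H, W) holding (dx, dy) per pixel.
    b, c, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(feat)       # (H, W, 2), xy order
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)         # add per-pixel offsets
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1               # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(feat, grid, align_corners=True)

class ClipRecurrentHub(nn.Module):
    # Keeps a rolling cache of past-clip tokens; the current frame queries it
    # with cross-attention (the implicit propagation path).
    def __init__(self, dim=128, heads=4, cache_size=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cache_size = cache_size
        self.cache = None                                       # (B, N, dim)

    def forward(self, tokens):
        memory = tokens if self.cache is None else torch.cat([self.cache, tokens], dim=1)
        out, _ = self.attn(tokens, memory, memory)
        self.cache = memory[:, -self.cache_size:].detach()      # roll the cache forward
        return out

past_feat = torch.rand(1, 128, 32, 32)
flow = torch.rand(1, 2, 32, 32) * 4                             # small synthetic flow
warped = flow_warp(past_feat, flow)                             # explicit path
tokens = warped.flatten(2).transpose(1, 2)                      # (1, 1024, 128)
fused = ClipRecurrentHub()(tokens)                              # implicit path
print(warped.shape, fused.shape)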
Zoom-to-Inpaint: Image Inpainting with High-Frequency Details
Although deep learning has enabled a huge leap forward in image inpainting,
current methods are often unable to synthesize realistic high-frequency
details. In this paper, we propose applying super-resolution to coarsely
reconstructed outputs, refining them at high resolution, and then downscaling
the output to the original resolution. By introducing high-resolution images to
the refinement network, our framework is able to reconstruct finer details that
are usually smoothed out due to spectral bias, the tendency of neural networks
to reconstruct low frequencies better than high frequencies. To assist training
the refinement network on large upscaled holes, we propose a progressive
learning technique in which the size of the missing regions increases as
training progresses. Our zoom-in, refine and zoom-out strategy, combined with
high-resolution supervision and progressive learning, constitutes a
framework-agnostic approach for enhancing high-frequency details that can be
applied to any CNN-based inpainting method. We provide qualitative and
quantitative evaluations along with an ablation analysis to show the
effectiveness of our approach. This seemingly simple yet powerful approach
outperforms state-of-the-art inpainting methods.
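Since the zoom-in, refine, and zoom-out strategy is described as framework-agnostic, it can be sketched as a small wrapper around any coarse inpainter: upscale the coarse result, run a refinement CNN at the higher resolution, and downscale back. The refiner, the scale factor, and the progressive hole-size schedule below are illustrative assumptions rather than the paper's exact networks or schedule.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRefiner(nn.Module):
    # Stand-in refinement network operating at the upscaled resolution.
    def __init__(self, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, x):
        return x + self.net(x)            # residual correction of fine detail

def zoom_refine_zoom_out(coarse, refiner, scale=2):
    # Zoom in: upscale the coarse result; refine at high resolution; zoom out.
    hi = F.interpolate(coarse, scale_factor=scale, mode="bilinear", align_corners=False)
    hi = refiner(hi)
    return F.interpolate(hi, scale_factor=1 / scale, mode="bilinear", align_corners=False)

def hole_size(step, total_steps, max_size=64):
    # Progressive learning: the missing square grows as training proceeds.
    return int(max_size * min(1.0, (step + 1) / total_steps))

coarse = torch.rand(1, 3, 128, 128)       # pretend output of any coarse inpainter
out = zoom_refine_zoom_out(coarse, TinyRefiner())
print(out.shape, [hole_size(s, 4) for s in range(4)])  # holes: 16, 32, 48, 64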