Single Stage Virtual Try-on via Deformable Attention Flows
Virtual try-on aims to generate a photo-realistic fitting result given an
in-shop garment and a reference person image. Existing methods usually build
multi-stage frameworks that handle clothes warping and body blending
separately, or rely heavily on intermediate parser-based labels, which may be
noisy or even inaccurate. To address these challenges, we propose a
single-stage try-on framework by developing a novel Deformable Attention Flow
(DAFlow), which applies the deformable attention scheme to multi-flow
estimation. With only pose keypoints as guidance, self- and cross-deformable
attention flows are estimated for the reference person and the garment images,
respectively. By sampling multiple flow fields, feature-level and pixel-level
information from different semantic areas is simultaneously extracted and
merged through the attention mechanism. This enables simultaneous clothes
warping and body synthesis, leading to photo-realistic results in an
end-to-end manner. Extensive experiments on two
try-on datasets demonstrate that our proposed method achieves state-of-the-art
performance both qualitatively and quantitatively. Furthermore, additional
experiments on the other two image editing tasks illustrate the versatility of
our method for multi-view synthesis and image animation.
Comment: ECCV 202
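The multi-flow sampling idea above can be sketched in a few lines of NumPy. This is an illustrative toy (integer offsets, nearest-neighbour sampling, hand-supplied attention weights), not the paper's actual DAFlow implementation:

```python
import numpy as np

def deformable_attention_sample(feat, flows, attn):
    """Merge K flow-sampled versions of a feature map with attention weights.

    feat:  (H, W, C) source feature map
    flows: (K, H, W, 2) per-head pixel offsets (one flow field per head)
    attn:  (K, H, W) softmax weights over the K sampled candidates
    """
    H, W, C = feat.shape
    K = flows.shape[0]
    out = np.zeros((H, W, C))
    ys, xs = np.mgrid[0:H, 0:W]
    for k in range(K):
        # displaced (and clipped) sampling coordinates for flow field k
        y = np.clip(ys + flows[k, ..., 0], 0, H - 1).astype(int)
        x = np.clip(xs + flows[k, ..., 1], 0, W - 1).astype(int)
        out += attn[k][..., None] * feat[y, x]
    return out
```

With zero flows and uniform attention, the output reduces to the input feature map, which makes the merging behaviour easy to check.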
Semantic-aware Network for Aerial-to-Ground Image Synthesis
Aerial-to-ground image synthesis is an emerging and challenging problem that
aims to synthesize a ground image from an aerial image. Due to the highly
different layout and object representation between the aerial and ground
images, existing approaches usually fail to transfer the components of the
aerial scene into the ground scene. In this paper, we propose a novel framework
to address these challenges by imposing enhanced structural alignment and
semantic awareness. We introduce a novel semantic-attentive feature
transformation module that reconstructs complex geographic structures by
aligning the aerial features to the ground layout. Furthermore, we
propose semantic-aware loss functions by leveraging a pre-trained segmentation
network. The network is thereby forced to synthesize realistic objects across
various classes, as losses are calculated separately for different classes and
balanced. Extensive experiments, including comparisons with previous methods and
ablation studies show the effectiveness of the proposed framework both
qualitatively and quantitatively.
Comment: ICIP 2021. Code is available at https://github.com/jinhyunj/SANe
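The class-balancing idea behind the semantic-aware losses can be sketched as follows; the function name and the plain L1 distance are illustrative assumptions, not the paper's exact loss formulation:

```python
import numpy as np

def class_balanced_l1(pred, target, seg_labels, num_classes):
    """Average per-class L1 losses so rare classes count as much as common ones.

    pred, target: (H, W, C) images.
    seg_labels:   (H, W) integer class map, e.g. from a pretrained
                  segmentation network run on the target image.
    """
    losses = []
    for c in range(num_classes):
        mask = seg_labels == c
        if mask.any():
            # loss restricted to the pixels of class c
            losses.append(np.abs(pred[mask] - target[mask]).mean())
    return float(np.mean(losses))
```

Because each class contributes one term regardless of its pixel count, a small but important class (e.g. a road sign) is not drowned out by large background regions.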
2D Object Detection with Transformers: A Review
The astounding performance of Transformers in natural language processing (NLP)
has motivated researchers to explore their use in computer vision tasks. The
DEtection TRansformer (DETR) introduces transformers to object detection by
formulating detection as a set prediction problem, eliminating the need for
proposal generation and post-processing steps. It is a state-of-the-art (SOTA)
method for object
detection, particularly in scenarios where the number of objects in an image is
relatively small. Despite the success of DETR, it suffers from slow training
convergence and performance drops for small objects. Many improvements have
therefore been proposed to address these issues, leading to substantial
refinements of DETR. Since 2020, transformer-based object detection has
attracted increasing interest and demonstrated impressive performance. Although
numerous surveys have been conducted on transformers in vision in general, a
review regarding advancements made in 2D object detection using transformers is
still missing. This paper gives a detailed review of twenty-one papers about
recent developments in DETR. We begin with the basic modules of Transformers,
such as self-attention, object queries and input features encoding. Then, we
cover the latest advancements in DETR, including backbone modification, query
design and attention refinement. We also compare all detection transformers in
terms of performance and network design. We hope this study will increase
researchers' interest in solving existing challenges in applying transformers
to the object detection domain. Researchers can follow newer
improvements in detection transformers on this webpage available at:
https://github.com/mindgarage-shan/trans_object_detection_surve
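DETR's set prediction view relies on a one-to-one matching between predicted and ground-truth boxes. A toy sketch of that matching step (brute-force permutation search over a precomputed cost matrix, instead of the Hungarian algorithm DETR actually uses):

```python
import itertools
import numpy as np

def match_predictions(costs):
    """Find the one-to-one assignment of predictions to ground truths
    that minimizes total matching cost.

    costs[i, j] = cost of matching prediction i to ground-truth object j
    (in DETR this combines class probability and box-overlap terms).
    Permutation search is fine for a toy number of queries; real DETR
    uses the O(n^3) Hungarian algorithm.
    """
    n_gt = costs.shape[1]
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(costs.shape[0]), n_gt):
        total = sum(costs[p, j] for j, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best
```

Unmatched predictions (here, any query index absent from the returned permutation) are supervised toward a "no object" class, which is what removes the need for NMS-style post-processing.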
Deep learning methods for 360 monocular depth estimation and point cloud semantic segmentation
Monocular depth estimation and point cloud segmentation are essential tasks for 3D scene understanding in computer vision. Depth estimation for omnidirectional images is challenging due to the spherical distortion issue and the limited availability of large-scale labeled datasets. We propose two separate works for 360 monocular depth estimation. In the first work, we propose a novel, model-agnostic, two-stage pipeline for omnidirectional monocular depth estimation. Our proposed framework, PanoDepth, takes one 360 image as input, produces one or more synthesized views in the first stage, and feeds the original image and the synthesized images into a subsequent stereo matching stage. Utilizing explicit stereo-based geometric constraints, PanoDepth can generate dense, high-quality depth. In the second work, we propose a 360 monocular depth estimation pipeline, OmniFusion, to tackle the spherical distortion issue. Our pipeline transforms a 360 image into less-distorted perspective patches (i.e., tangent images) to obtain patch-wise predictions via a CNN, and then merges the patch-wise results into the final output. To handle the discrepancy between patch-wise predictions, a major issue affecting merging quality, we propose a new framework with (i) a geometry-aware feature fusion mechanism that combines 3D geometric features with 2D image features, (ii) a self-attention-based transformer architecture that conducts a global aggregation of patch-wise information, and (iii) an iterative depth refinement mechanism that further refines the estimated depth based on the more accurate geometric features. Experiments show that both PanoDepth and OmniFusion achieve state-of-the-art performance on several 360 monocular depth estimation benchmarks. For point cloud analysis, we mainly focus on defining effective local point convolution operators. We propose two approaches: SPNet and Point-Voxel CNN.
For the former, we propose a novel point convolution operator named Shell Point Convolution (SPConv) as the building block for shape encoding and local context learning. Specifically, SPConv splits the 3D neighborhood space into shells, aggregates local features on manually designed kernel points, and performs convolution on the shells. For the latter, we present a novel lightweight convolutional neural network that uses the point-voxel convolution (PVC) layer as its building block. Each PVC layer has two parallel branches: a voxel branch and a point branch. In the voxel branch, we aggregate local features on non-empty voxel centers to reduce the geometric information loss caused by voxelization, then apply volumetric convolutions to enhance local neighborhood geometry encoding. In the point branch, we use a Multi-Layer Perceptron (MLP) to extract fine-detailed point-wise features. Outputs from the two branches are adaptively fused via a feature selection module. Experimental results show that SPConv and PVC layers are effective for local shape encoding, and our proposed networks perform well in semantic segmentation tasks.
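The two-branch PVC layer can be caricatured in NumPy as below; the per-voxel averaging, the single linear map, and the scalar gate are simplified stand-ins for the real volumetric convolutions, MLP, and feature selection module:

```python
import numpy as np

def pvc_layer(points, feats, voxel_size, w_point, gate):
    """One simplified point-voxel convolution (PVC) layer.

    points: (N, 3) coordinates; feats: (N, C) per-point features.
    Voxel branch: average the features of all points that fall in the
    same voxel (a stand-in for convolving on non-empty voxel centers).
    Point branch: a per-point linear layer (a stand-in for the MLP
    extracting fine-detailed point-wise features).
    gate in [0, 1] blends the two branches (a stand-in for the
    adaptive feature selection module).
    """
    keys = np.floor(points / voxel_size).astype(int)
    voxel_out = np.empty_like(feats)
    for key in np.unique(keys, axis=0):
        mask = (keys == key).all(axis=1)
        voxel_out[mask] = feats[mask].mean(axis=0)   # coarse geometry
    point_out = feats @ w_point                      # fine point detail
    return gate * voxel_out + (1 - gate) * point_out
```

Setting the gate to 1 returns pure voxel-averaged features and 0 returns pure point-wise features, which makes the role of the fusion step explicit.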
Recent Progress in Transformer-based Medical Image Analysis
The transformer is primarily used in the field of natural language
processing. Recently, it has been adopted and shows promise in the computer
vision (CV) field. Medical image analysis (MIA), as a critical branch of CV,
also greatly benefits from this state-of-the-art technique. In this review, we
first recap the core component of the transformer, the attention mechanism, and
the detailed structures of the transformer. After that, we depict the recent
progress of the transformer in the field of MIA. We organize the applications
in a sequence of different tasks, including classification, segmentation,
captioning, registration, detection, enhancement, localization, and synthesis.
The mainstream classification and segmentation tasks are further divided into
eleven medical image modalities. The many experiments studied in this review
illustrate that transformer-based methods outperform existing methods across
multiple evaluation metrics. Finally, we
discuss the open challenges and future opportunities in this field. This
task-modality review with the latest contents, detailed information, and
comprehensive comparison may greatly benefit the broad MIA community.
Comment: Accepted to Computers in Biology and Medicine
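The core component such reviews recap is scaled dot-product attention, softmax(QK^T/sqrt(d))V; a minimal single-head NumPy version:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q: (n_q, d) queries; K: (n_k, d) keys; V: (n_k, d_v) values.
    Each output row is a weighted average of the value rows, with
    weights given by query-key similarity.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```

When a query is strongly aligned with a single key, the output collapses to that key's value row, which is the retrieval behaviour the attention mechanism provides.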
TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation
Video frame interpolation (VFI) aims to synthesize an intermediate frame
between two consecutive frames. State-of-the-art approaches usually adopt a
two-step solution: 1) generating locally warped pixels via flow-based motion
estimation, and 2) blending the warped pixels to form a full
frame through deep neural synthesis networks. However, due to the inconsistent
warping from the two consecutive frames, the warped features for new frames are
usually not aligned, which leads to distorted and blurred frames, especially
when large and complex motions occur. To solve this issue, in this paper we
propose a novel Trajectory-aware Transformer for Video Frame Interpolation
(TTVFI). In particular, we formulate the warped features with inconsistent
motions as query tokens, and formulate relevant regions in a motion trajectory
from two original consecutive frames into keys and values. Self-attention is
learned on relevant tokens along the trajectory to blend the pristine features
into intermediate frames through end-to-end training. Experimental results
demonstrate that our method outperforms other state-of-the-art methods in four
widely-used VFI benchmarks. Both code and pre-trained models will be released
soon.
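Step 1) of the two-step solution, flow-based warping, can be sketched as below (nearest-neighbour sampling for brevity; practical VFI pipelines use bilinear interpolation):

```python
import numpy as np

def backward_warp(img, flow):
    """Warp an image toward an intermediate frame with a flow field.

    img: (H, W, C) source frame.
    flow[y, x] = (dy, dx) means pixel (y, x) of the warped frame is
    read from (y + dy, x + dx) of the source frame, clipped to bounds.
    """
    H, W, _ = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.rint(ys + flow[..., 0]), 0, H - 1).astype(int)
    sx = np.clip(np.rint(xs + flow[..., 1]), 0, W - 1).astype(int)
    return img[sy, sx]
```

Warping each of the two source frames with its own flow yields the misaligned candidate features whose inconsistency motivates the trajectory-aware attention in step 2).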