Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images
Recovering the 3D representation of an object from single-view or multi-view
RGB images by deep neural networks has attracted increasing attention in the
past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural
networks (RNNs) to fuse multiple feature maps extracted from input images
sequentially. However, when given the same set of input images with different
orders, RNN-based approaches are unable to produce consistent reconstruction
results. Moreover, due to long-term memory loss, RNNs cannot fully exploit
input images to refine reconstruction results. To solve these problems, we
propose a novel framework for single-view and multi-view 3D reconstruction,
named Pix2Vox. By using a well-designed encoder-decoder, it generates a coarse
3D volume from each input image. Then, a context-aware fusion module is
introduced to adaptively select high-quality reconstructions for each part
(e.g., table legs) from different coarse 3D volumes to obtain a fused 3D
volume. Finally, a refiner further refines the fused 3D volume to generate the
final output. Experimental results on the ShapeNet and Pix3D benchmarks indicate that the proposed Pix2Vox outperforms state-of-the-art methods by a large margin. Furthermore, the proposed method is 24 times faster than 3D-R2N2 in terms of backward inference time. Experiments on unseen ShapeNet 3D categories demonstrate the superior generalization ability of our method.
Comment: ICCV 2019
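To make the context-aware fusion idea concrete, the following is a minimal PyTorch sketch that scores every voxel of each coarse volume, normalizes the scores across views with a softmax, and blends the volumes by those weights. The module name, layer sizes, and scoring head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContextAwareFusion(nn.Module):
    """Sketch of per-part adaptive fusion: score every voxel of each
    coarse volume, softmax the scores across views, and blend."""

    def __init__(self, hidden_channels=9):
        super().__init__()
        # Hypothetical scoring head; the paper's exact layers differ.
        self.score = nn.Sequential(
            nn.Conv3d(1, hidden_channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(hidden_channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, coarse_volumes):
        # coarse_volumes: (batch, n_views, D, H, W)
        b, v, d, h, w = coarse_volumes.shape
        scores = self.score(coarse_volumes.reshape(b * v, 1, d, h, w))
        scores = scores.reshape(b, v, d, h, w)
        weights = torch.softmax(scores, dim=1)  # per-voxel view weights
        return (weights * coarse_volumes).sum(dim=1)  # fused (batch, D, H, W)

# Usage: fuse three coarse volumes per sample into one.
fused = ContextAwareFusion()(torch.rand(2, 3, 32, 32, 32))
```

Because the softmax is taken over the view dimension independently at every voxel, each part of the object (e.g., a table leg) can be taken from whichever view reconstructed it best, and the result is invariant to the order of the input views.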
Spatio-Temporal Deformable Attention Network for Video Deblurring
The key success factor of video deblurring methods is compensating for the blurry pixels of the mid-frame with the sharp pixels of the adjacent video frames. Therefore, mainstream methods align the adjacent frames using estimated optical flows and fuse the aligned frames for restoration. However, these methods sometimes generate unsatisfactory results because they rarely consider the blur level of each pixel, and may therefore introduce blurry pixels from the adjacent frames. In fact, not all pixels in the video frames are sharp and beneficial for deblurring. To address this problem, we propose the spatio-temporal deformable attention network (STDANet) for video deblurring,
which extracts the information of sharp pixels by considering the pixel-wise
blur levels of the video frames. Specifically, STDANet is an encoder-decoder
network combined with a motion estimator and a spatio-temporal deformable attention (STDA) module, where the motion estimator predicts coarse optical flows that serve as base offsets for locating the corresponding sharp pixels in the STDA module. Experimental results indicate that the proposed STDANet performs favorably against state-of-the-art methods on the GoPro, DVD, and BSD datasets.
Comment: ECCV 2022
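To illustrate the base-offset idea, here is a simplified PyTorch sketch that warps neighbor-frame features toward the mid-frame using a coarse optical flow and then gates the warped features with a learned per-pixel score. The single flow offset and the sigmoid gate are simplifying assumptions; the paper's STDA module learns additional deformable offsets and attention weights on top of the flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_guided_sample(neighbor_feat, flow):
    """Warp neighbor-frame features toward the mid-frame, using a
    coarse optical flow (dx, dy per pixel) as the base offset."""
    b, c, h, w = neighbor_feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device),
        torch.arange(w, device=flow.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).float()           # (H, W, 2), (x, y)
    coords = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)  # add flow offsets
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    nx = 2.0 * coords[..., 0] / (w - 1) - 1.0
    ny = 2.0 * coords[..., 1] / (h - 1) - 1.0
    grid = torch.stack((nx, ny), dim=-1)
    return F.grid_sample(neighbor_feat, grid, align_corners=True)

class SharpnessGate(nn.Module):
    """Hypothetical attention gate: weight the warped features by a
    learned per-pixel score so that blurrier pixels contribute less."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, mid_feat, warped_feat):
        attn = torch.sigmoid(self.score(torch.cat((mid_feat, warped_feat), dim=1)))
        return mid_feat + attn * warped_feat
```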
Distinctive action sketch for human action recognition
Recent developments in computer vision have led to renewed interest in sketch-related research, and considerable evidence has demonstrated the significance of sketches. However, there have been few in-depth studies of sketch-based action analysis so far. In this paper, we propose an approach to discover the most distinctive sketches for action recognition. The action sketches should satisfy two characteristics: sketchability and objectiveness. Primitive sketches are prepared using structured-forests-based fast edge detection, while Faster R-CNN is used in parallel to detect the persons. On completion of these two stages, distinctive action sketch mining is carried out. We then present four kinds of sketch pooling methods to obtain a uniform representation for action videos. The experimental results show that the proposed method achieves impressive performance against several compared methods on two public datasets.
The work was supported in part by the National Science Foundation of China (61472103, 61772158, 61702136, and 61701273) and an Australian Research Council (ARC) grant (DP150104645).
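A rough sketch of such a pipeline is shown below, using OpenCV's structured-forests edge detector for the sketchability stage and person boxes (which would come from Faster R-CNN) for the objectiveness stage. The model path, the box-masking step, and the max-pooling variant are assumptions for illustration; the paper's four pooling methods and mining procedure are not specified in this abstract.

```python
import cv2
import numpy as np

# Pretrained structured-forests edge model (path is an assumption;
# requires opencv-contrib-python for the ximgproc module).
detector = cv2.ximgproc.createStructuredEdgeDetection("model.yml.gz")

def primitive_sketch(frame_bgr):
    """Sketchability stage: structured-forests fast edge detection."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return detector.detectEdges(rgb)  # edge-strength map in [0, 1]

def person_masked_sketch(edges, person_boxes):
    """Objectiveness stage: keep only edges inside detected person
    boxes; the boxes would come from a Faster R-CNN person detector."""
    mask = np.zeros_like(edges)
    for x1, y1, x2, y2 in person_boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return edges * mask

def max_sketch_pooling(sketch_maps):
    """One plausible pooling variant: per-pixel max over all frames,
    yielding a single uniform representation for the video."""
    return np.maximum.reduce(sketch_maps)
```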
SSAH: Semi-supervised Adversarial Deep Hashing with Self-paced Hard Sample Generation
Deep hashing methods have proven effective and efficient for large-scale Web media search. The success of these data-driven methods largely
depends on collecting sufficient labeled data, which is usually a crucial
limitation in practical cases. Current solutions to this issue utilize Generative Adversarial Networks (GANs) to augment data for semi-supervised learning. However, existing GAN-based methods treat image generation and hash learning as two isolated processes, which limits the effectiveness of generation. Moreover, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-supervised Self-paced Adversarial Hashing method, named SSAH, to solve the above problems in a unified
framework. The SSAH method consists of an adversarial network (A-Net) and a
hashing network (H-Net). To improve the quality of generated images, first, the A-Net learns hard samples with multi-scale occlusions and multi-angle rotated deformations that compete against the learning of accurate hash codes. Second, we design a novel self-paced hard-sample generation policy to gradually increase the hashing difficulty of the generated samples. To make use of the semantic information in unlabeled data, we propose a semi-supervised consistency loss. The experimental results show that our method significantly improves over state-of-the-art models on both widely used hashing datasets and fine-grained datasets.
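For intuition, here is a minimal PyTorch sketch of what the consistency term and the self-paced schedule could look like. Both formulations are illustrative assumptions on my part; the abstract does not give the exact losses, so this is a plausible instance of the idea rather than the paper's definition.

```python
import torch
import torch.nn.functional as F

def consistency_loss(codes_unlabeled, codes_generated):
    """Illustrative semi-supervised consistency term: pull the relaxed
    (tanh) hash codes of an unlabeled image and its generated hard
    variant together, so unlabeled semantics shape the codes."""
    return F.mse_loss(torch.tanh(codes_unlabeled), torch.tanh(codes_generated))

def self_paced_severity(epoch, total_epochs, max_severity=1.0):
    """Hypothetical self-paced schedule: linearly ramp the severity of
    the generated occlusions/rotations so harder samples appear later
    in training."""
    return max_severity * min(1.0, epoch / float(total_epochs))
```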