Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetic under
the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of the cold-weapon era. This paper extensively reviews 400+
object detection papers in light of their technical evolution, spanning more
than a quarter-century (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed-up techniques, and recent state-of-the-art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an
in-depth analysis of their challenges as well as technical improvements in
recent years.
Comment: This work has been submitted to the IEEE TPAMI for possible publication.
High-Quality 3D Face Reconstruction with Affine Convolutional Networks
Recent works based on convolutional encoder-decoder architecture and 3DMM
parameterization have shown great potential for canonical view reconstruction
from a single input image. Conventional CNN architectures benefit from
exploiting the spatial correspondence between the input and output pixels.
However, in 3D face reconstruction, the spatial misalignment between the input
image (e.g., a face) and the canonical/UV output makes the feature
encoding-decoding process quite challenging. In this paper, to tackle this
problem, we propose a new network architecture, namely Affine Convolution
Networks, which enables CNN-based approaches to handle spatially
non-corresponding input and output images while maintaining high-fidelity
output. In our method, an affine transformation matrix is
learned from the affine convolution layer for each spatial location of the
feature maps. In addition, we represent 3D human heads in UV space with
multiple components, including diffuse maps for texture representation,
position maps for geometry representation, and light maps for recovering more
complex lighting conditions in the real world. All the components can be
trained without any manual annotations. Our method is free of parametric
models and can generate high-quality UV maps at a resolution of 512 x 512
pixels, whereas previous approaches normally generate 256 x 256 pixels or
smaller. Our code will be released once the paper is accepted.
Comment: 9 pages, 11 figures
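To make the core idea concrete, here is a minimal PyTorch sketch of what a per-location affine convolution could look like. The layer name, the flattened 2x3 parameterization, and all dimensions are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineConv2d(nn.Module):
    """Hypothetical per-location affine convolution: a small head predicts
    a 2x3 affine matrix at every spatial position, which warps the input
    feature map before an ordinary convolution reads it."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Predicts 6 affine parameters (a flattened 2x3 matrix) per location.
        self.affine_head = nn.Conv2d(in_ch, 6, kernel_size, padding=kernel_size // 2)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        # Start at the identity transform so the layer behaves like a plain conv.
        nn.init.zeros_(self.affine_head.weight)
        with torch.no_grad():
            self.affine_head.bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        b, _, h, w = x.shape
        theta = self.affine_head(x).view(b, 2, 3, h, w)  # per-pixel 2x3 matrix
        # Normalized homogeneous coordinates in [-1, 1] for every location.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (h, w, 3)
        # Apply each location's own affine matrix to its coordinate.
        grid = torch.einsum("bijhw,hwj->bhwi", theta, base)        # (b, h, w, 2)
        warped = F.grid_sample(x, grid, align_corners=True)
        return self.conv(warped)
```

Initializing the head to the identity lets the layer start out as an ordinary convolution and learn warps only where the input/output misalignment requires them.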
Dense Pixel-to-Pixel Harmonization via Continuous Image Representation
High-resolution (HR) image harmonization is of great significance in
real-world applications such as image synthesis and image editing. However, due
to the high memory costs, existing dense pixel-to-pixel harmonization methods
mainly focus on processing low-resolution (LR) images. Some recent works
combine them with color-to-color transformations but are either limited
to certain resolutions or heavily dependent on hand-crafted image filters. In this
work, we explore leveraging the implicit neural representation (INR) and
propose a novel image Harmonization method based on Implicit neural Networks
(HINet), which, to the best of our knowledge, is the first dense pixel-to-pixel
method applicable to HR images without any hand-crafted filter design. Inspired
by the Retinex theory, we decouple the MLPs into two parts to respectively
capture the content and environment of composite images. A Low-Resolution Image
Prior (LRIP) network is designed to alleviate the Boundary Inconsistency
problem, and we also propose new designs for the training and inference
process. Extensive experiments have demonstrated the effectiveness of our
method compared with state-of-the-art methods. Furthermore, some interesting
and practical applications of the proposed method are explored. Our code is
available at https://github.com/WindVChen/INR-Harmonization.
Comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).
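As a rough illustration of the Retinex-inspired decoupling, the sketch below splits the implicit network into a content MLP and an environment MLP whose outputs combine multiplicatively; all names, layer sizes, and the sigmoid gating are illustrative assumptions, not HINet's published architecture.

```python
import torch
import torch.nn as nn

class RetinexStyleINR(nn.Module):
    """Illustrative decoupled implicit network: a content branch and an
    environment branch, combined multiplicatively as in Retinex theory
    (image = reflectance x illumination)."""

    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(2 + feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3))
        self.content_mlp = mlp()      # reflectance-like RGB of the composite
        self.environment_mlp = mlp()  # illumination-like per-pixel gain

    def forward(self, coords, feats):
        # coords: (n, 2) pixel coordinates in [-1, 1];
        # feats:  (n, feat_dim) local features from a composite-image encoder.
        x = torch.cat([coords, feats], dim=-1)
        content = torch.sigmoid(self.content_mlp(x))
        environment = torch.sigmoid(self.environment_mlp(x))
        return content * environment  # harmonized RGB as a Retinex-style product
```

Because the coordinates are continuous inputs, the same network can be queried at any output resolution, which is what frees a dense pixel-to-pixel method from a fixed LR grid.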
Implicit Ray-Transformers for Multi-view Remote Sensing Image Segmentation
The mainstream CNN-based remote sensing (RS) image semantic segmentation
approaches typically rely on massive labeled training data. Such a paradigm
struggles with RS multi-view scene segmentation from limited labeled views
because it does not consider the 3D information within the scene.
In this paper, we propose the "Implicit Ray-Transformer (IRT)" based on Implicit
Neural Representation (INR), for RS scene semantic segmentation with sparse
labels (such as 4-6 labels per 100 images). We explore a new way of introducing
multi-view 3D structure priors to the task for accurate and view-consistent
semantic segmentation. The proposed method includes a two-stage learning
process. In the first stage, we optimize a neural field to encode the color and
3D structure of the remote sensing scene based on multi-view images. In the
second stage, we design a Ray Transformer to leverage the relations between the
neural field 3D features and 2D texture features for learning better semantic
representations. Unlike previous methods that consider only a 3D prior or
2D features, we incorporate both 2D texture information and the 3D prior by
broadcasting CNN features to the point features along each sampled ray. To
verify the effectiveness of the proposed method, we construct a challenging
dataset containing six synthetic sub-datasets collected from the Carla platform
and three real sub-datasets from Google Maps. Experiments show that the
proposed method outperforms the CNN-based methods and the state-of-the-art
INR-based segmentation methods on both quantitative and qualitative metrics.
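A minimal sketch of the broadcasting idea follows: a per-pixel CNN texture feature is copied to every sample on that pixel's ray, fused with the neural-field point features, and attended over with a small transformer. Module names, dimensions, and the weighted aggregation are illustrative assumptions rather than IRT's exact design.

```python
import torch
import torch.nn as nn

class RayTransformerSketch(nn.Module):
    """Illustrative fusion of 3D neural-field features with a broadcast 2D
    CNN feature, followed by self-attention along each ray."""

    def __init__(self, dim=64, n_heads=4, n_classes=6):
        super().__init__()
        self.fuse = nn.Linear(dim * 2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, point_feats, pixel_feat, weights):
        # point_feats: (rays, samples, dim) neural-field features at ray samples
        # pixel_feat:  (rays, dim) CNN texture feature of each ray's pixel
        # weights:     (rays, samples) volume-rendering weights along the ray
        samples = point_feats.shape[1]
        broadcast = pixel_feat.unsqueeze(1).expand(-1, samples, -1)
        tokens = self.fuse(torch.cat([point_feats, broadcast], dim=-1))
        tokens = self.encoder(tokens)  # attend across samples of the same ray
        # Aggregate along the ray to one semantic logit vector per pixel.
        ray_feat = (weights.unsqueeze(-1) * tokens).sum(dim=1)
        return self.head(ray_feat)
```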
Continuous Cross-resolution Remote Sensing Image Change Detection
Most contemporary supervised Remote Sensing (RS) image Change Detection (CD)
approaches are customized for equal-resolution bitemporal images. Real-world
applications raise the need for cross-resolution change detection, i.e., CD
based on bitemporal images with different spatial resolutions. Given training
samples of a fixed bitemporal resolution difference (ratio) between the
high-resolution (HR) image and the low-resolution (LR) one, current
cross-resolution methods may fit a certain ratio but lack adaptation to other
resolution differences. Toward continuous cross-resolution CD, we propose
scale-invariant learning to enforce the model to consistently predict HR
results given synthesized samples of varying resolution differences.
Concretely, we synthesize blurred versions of the HR image by random
downsampled reconstructions to reduce the gap between HR and LR images. We
introduce coordinate-based representations to decode per-pixel predictions by
feeding the coordinate query and corresponding multi-level embedding features
into an MLP that implicitly learns the shape of land-cover changes, thereby
aiding the recognition of blurred objects in the LR image. Moreover, considering
that spatial resolution mainly affects the local textures, we apply
local-window self-attention to align bitemporal features during the early
stages of the encoder. Extensive experiments on two synthesized datasets and
one real-world different-resolution CD dataset verify the effectiveness of the
proposed method. Our method significantly outperforms several vanilla CD
methods and two cross-resolution CD methods on all three datasets in both
in-distribution and out-of-distribution settings. The empirical results suggest
that our method could yield relatively consistent HR change predictions
regardless of varying bitemporal resolution ratios. Our code is available at
https://github.com/justchenhao/SILI_CD.
Comment: 21 pages, 11 figures. Accepted article by IEEE TGRS.
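The scale-invariant training idea can be sketched as a simple augmentation: randomly downsample one temporal image and reconstruct it back to the original size, so the blurred result stands in for an LR acquisition at a varying ratio. The function below is a hypothetical illustration; the ratio range, bicubic resampling, and the model call are assumptions, not the paper's exact procedure.

```python
import random
import torch
import torch.nn.functional as F

def random_blur_by_resample(hr, min_ratio=1.0, max_ratio=8.0):
    """Degrade an HR image (b, c, h, w) by downsampling with a random
    ratio and reconstructing back to the original size."""
    _, _, h, w = hr.shape
    ratio = random.uniform(min_ratio, max_ratio)
    small = (max(1, int(h / ratio)), max(1, int(w / ratio)))
    lr = F.interpolate(hr, size=small, mode="bicubic", align_corners=False)
    return F.interpolate(lr, size=(h, w), mode="bicubic", align_corners=False)

# During training, one temporal image is randomly degraded while the model
# is still supervised with the HR change map, e.g. (hypothetical call):
#   pred = model(t1, random_blur_by_resample(t2))
```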
Continuous Remote Sensing Image Super-Resolution based on Context Interaction in Implicit Function Space
Despite its fruitful applications in remote sensing, image super-resolution
is troublesome to train and deploy because it handles different resolution
magnifications with separate models. Accordingly, we propose a
highly applicable super-resolution framework, FunSR, which handles
different magnifications with a unified model by exploiting context interaction
within implicit function space. FunSR comprises a functional representor, a
functional interactor, and a functional parser. Specifically, the representor
transforms the low-resolution image from Euclidean space to multi-scale
pixel-wise function maps; the interactor enables pixel-wise function expression
with global dependencies; and the parser, which is parameterized by the
interactor's output, converts the discrete coordinates with additional
attributes to RGB values. Extensive experimental results demonstrate that FunSR
achieves state-of-the-art performance in both fixed-magnification and
continuous-magnification settings; meanwhile, it enables many convenient
applications thanks to its unified nature.
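To illustrate the coordinate-to-RGB decoding that makes continuous magnification possible, here is a generic coordinate-MLP sketch in the spirit of the functional parser. Note that FunSR's actual parser is parameterized by the interactor's output, whereas this standalone module, with its fixed MLP weights, dimensions, and cell-size attribute, is only an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordinateParserSketch(nn.Module):
    """Illustrative coordinate-to-RGB decoder: an MLP maps a query
    coordinate plus attributes (a locally sampled feature and the query
    cell size) to an RGB value at any requested output resolution."""

    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2 + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, feat_map, out_h, out_w):
        # feat_map: (b, feat_dim, h, w) function maps from an encoder.
        b = feat_map.shape[0]
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, out_h, device=feat_map.device),
            torch.linspace(-1, 1, out_w, device=feat_map.device), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        feats = F.grid_sample(feat_map, coords, mode="bilinear",
                              align_corners=True).permute(0, 2, 3, 1)
        # The query cell size tells the MLP how large one output pixel is.
        cell = coords.new_tensor([2.0 / out_w, 2.0 / out_h]).expand_as(coords)
        rgb = self.mlp(torch.cat([feats, coords, cell], dim=-1))
        return rgb.permute(0, 3, 1, 2)  # (b, 3, out_h, out_w)
```

Because the query grid is constructed at call time, the same trained module can be asked for any (out_h, out_w), which is the essence of serving all magnifications with one model.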