Dense xUnit Networks
Deep net architectures have constantly evolved over the past few years,
leading to significant advancements in a wide array of computer vision tasks.
However, besides high accuracy, many applications also require a low
computational load and limited memory footprint. To date, efficiency has
typically been achieved either by architectural choices at the macro level
(e.g. using skip connections or pruning techniques) or modifications at the
level of the individual layers (e.g. using depth-wise convolutions or channel
shuffle operations). Interestingly, much less attention has been devoted to the
role of the activation functions in constructing efficient nets. Recently,
Kligvasser et al. showed that incorporating spatial connections within the
activation functions enables a significant boost in performance in image
restoration tasks, at any given budget of parameters. However, the
effectiveness of their xUnit module has only been tested on simple small
models, which are not characteristic of those used in high-level vision tasks.
In this paper, we adopt and improve the xUnit activation, show how it can be
incorporated into the DenseNet architecture, and illustrate its high
effectiveness for classification and image restoration tasks alike. While the
DenseNet architecture is extremely efficient to begin with, our dense xUnit net
(DxNet) can typically achieve the same performance with far fewer parameters.
For example, on ImageNet, our DxNet outperforms a ReLU-based DenseNet having
30% more parameters and achieves state-of-the-art results for this budget of
parameters. Furthermore, in denoising and super-resolution, DxNet significantly
improves upon all existing lightweight solutions, including the xUnit-based
nets of Kligvasser et al.
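For readers unfamiliar with the module, below is a minimal PyTorch sketch of the original xUnit idea (the exact layer ordering and sizes are illustrative, and the improved variant used in DxNet is not reproduced here): the pointwise nonlinearity is replaced by a spatial gate computed with a depth-wise convolution.

```python
import torch
import torch.nn as nn

class XUnit(nn.Module):
    """Minimal sketch of an xUnit-style spatial activation."""
    def __init__(self, channels, kernel_size=9):
        super().__init__()
        self.gate = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # the depth-wise conv gives the activation a spatial extent
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Gaussian gate in (0, 1]: each feature is modulated by its
        # spatial neighborhood instead of a pointwise ReLU
        return x * torch.exp(-self.gate(x) ** 2)
```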
NTIRE 2020 Challenge on Image and Video Deblurring
Motion blur is one of the most common degradation artifacts in dynamic scene
photography. This paper reviews the NTIRE 2020 Challenge on Image and Video
Deblurring. In this challenge, we present the evaluation results from 3
competition tracks as well as the proposed solutions. Track 1 aims to develop
single-image deblurring methods focusing on restoration quality. On Track 2,
the image deblurring methods are executed on a mobile platform to find the
balance of the running speed and the restoration accuracy. Track 3 targets
developing video deblurring methods that exploit the temporal relation between
input frames. The three tracks had 163, 135, and 102 registered participants,
respectively, and 9, 4, and 7 teams competed in the final testing phase. The
winning methods demonstrate state-of-the-art performance on image and video
deblurring tasks.
Comment: To be published in CVPR 2020 Workshop (New Trends in Image Restoration and Enhancement)
CARAFE: Content-Aware ReAssembly of FEatures
Feature upsampling is a key operation in a number of modern convolutional
network architectures, e.g. feature pyramids. Its design is critical for dense
prediction tasks such as object detection and semantic/instance segmentation.
In this work, we propose Content-Aware ReAssembly of FEatures (CARAFE), a
universal, lightweight and highly effective operator to fulfill this goal.
CARAFE has several appealing properties: (1) Large field of view. Unlike
previous works (e.g. bilinear interpolation) that only exploit a sub-pixel
neighborhood, CARAFE can aggregate contextual information within a large
receptive field. (2) Content-aware handling. Instead of using a fixed kernel
for all samples (e.g. deconvolution), CARAFE enables instance-specific
content-aware handling, which generates adaptive kernels on-the-fly. (3)
Lightweight and fast to compute. CARAFE introduces little computational
overhead and can be readily integrated into modern network architectures. We
conduct comprehensive evaluations on standard benchmarks in object detection,
instance/semantic segmentation, and inpainting. CARAFE shows consistent and
substantial gains across all tasks (1.2%, 1.3%, 1.8%, and 1.1 dB,
respectively) with negligible computational overhead. It has great potential
to serve as a strong building block for future research. Code and models are
available at https://github.com/open-mmlab/mmdetection.
Comment: ICCV 2019 Camera Ready (Oral)
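To make the two-step design concrete, here is a naive, hedged PyTorch sketch of the operator: a kernel-prediction branch produces one normalized k x k reassembly kernel per upsampled location, and the features are reassembled as a weighted sum over source neighborhoods. The compressor width and kernel sizes are illustrative rather than the paper's exact settings, and the unfold-based reassembly trades memory for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFESketch(nn.Module):
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)  # channel compressor
        # predicts one k_up x k_up kernel per output (upsampled) location
        self.encode = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2,
                                k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k
        # 1) content-aware kernel prediction, softmax-normalized per location
        ker = F.pixel_shuffle(self.encode(self.compress(x)), s)  # (b, k*k, s*h, s*w)
        ker = F.softmax(ker, dim=1)
        # 2) reassembly: weighted sum over each source k x k neighborhood
        unf = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
        unf = unf.repeat_interleave(s, dim=3).repeat_interleave(s, dim=4)
        return (unf * ker.unsqueeze(1)).sum(dim=2)  # (b, c, s*h, s*w)
```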
NTIRE 2020 Challenge on Real Image Denoising: Dataset, Methods and Results
This paper reviews the NTIRE 2020 challenge on real image denoising with
focus on the newly introduced dataset, the proposed methods and their results.
The challenge is a new version of the previous NTIRE 2019 challenge on real
image denoising that was based on the SIDD benchmark. This challenge is based
on newly collected validation and testing image datasets and is hence named
SIDD+. This challenge has two tracks for quantitatively evaluating image
denoising performance in (1) the Bayer-pattern rawRGB and (2) the standard RGB
(sRGB) color spaces. Each track had ~250 registered participants. A total of 22
teams, proposing 24 methods, competed in the final phase of the challenge. The
proposed methods by the participating teams represent the current
state-of-the-art performance in image denoising targeting real noisy images.
The newly collected SIDD+ datasets are publicly available at:
https://bit.ly/siddplus_data
Single Image Super-Resolution via Residual Neuron Attention Networks
Deep Convolutional Neural Networks (DCNNs) have achieved impressive
performance in Single Image Super-Resolution (SISR). To further improve the
performance, existing CNN-based methods generally focus on designing deeper
architecture of the network. However, we argue that blindly increasing a
network's depth is not the most sensible approach. In this paper, we propose a novel
end-to-end Residual Neuron Attention Networks (RNAN) for more efficient and
effective SISR. Structurally, our RNAN is a sequential integration of the
well-designed Global Context-enhanced Residual Groups (GCRGs), which extracts
super-resolved features from coarse to fine. Our GCRG is designed with two
novelties. First, a Residual Neuron Attention (RNA) mechanism is proposed in
each block of the GCRG to reveal the relevance of neurons for better feature
representation. Second, a Global Context (GC) block is embedded at the end of
each GCRG to effectively model the global contextual information. Experimental
results demonstrate that our RNAN achieves results comparable to
state-of-the-art methods in terms of both quantitative metrics and visual
quality, but with a simpler network architecture.
Comment: 6 pages, 4 figures, Accepted by IEEE ICIP 2020
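The abstract does not detail the internals of the GC block; the sketch below follows the common GCNet-style formulation (a softmax-pooled global context vector followed by a bottleneck transform), which is one plausible reading and should be treated as an assumption rather than RNAN's exact design.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, 1)  # spatial attention logits
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # softmax-pooled global context vector, one per image
        weights = self.attn(x).view(b, 1, h * w).softmax(dim=-1)
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))
        context = context.view(b, c, 1, 1)
        # bottleneck transform, broadcast-added back onto the features
        return x + self.transform(context)
```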
Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks
Convolutional Neural Networks (CNNs) generate feature representations of
complex objects by collecting hierarchical semantic sub-features from
different parts. These sub-features are usually distributed in grouped form in
the feature vector of each layer, representing various semantic entities.
However, the activation of these sub-features is often spatially affected by
similar patterns and noisy backgrounds, resulting in erroneous localization and
identification. We propose a Spatial Group-wise Enhance (SGE) module that can
adjust the importance of each sub-feature by generating an attention factor for
each spatial location in each semantic group, so that every individual group
can autonomously enhance its learnt expression and suppress possible noise. The
attention factors are only guided by the similarities between the global and
local feature descriptors inside each group, thus the design of SGE module is
extremely lightweight, with almost no extra parameters and calculations.
Despite being trained with only category-level supervision, the SGE component is
extremely effective in highlighting multiple active areas with various
high-order semantics (such as the dog's eyes, nose, etc.). When integrated with
popular CNN backbones, SGE can significantly boost the performance of image
recognition tasks. Specifically, based on the ResNet50 backbone, SGE achieves
a 1.2% Top-1 accuracy improvement on the ImageNet benchmark and a 1.0-2.0%
AP gain on the COCO benchmark across a wide range of detectors
(Faster/Mask/Cascade R-CNN and RetinaNet). Code and pretrained models are
available at https://github.com/implus/PytorchInsight.
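A condensed PyTorch sketch of the mechanism as described (the linked repository holds the reference implementation): each group's global descriptor, obtained by average pooling, is compared with every spatial location, and the normalized similarity map gates the group's features.

```python
import torch
import torch.nn as nn

class SpatialGroupEnhance(nn.Module):
    def __init__(self, groups=64):
        super().__init__()
        self.groups = groups
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.weight = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.ones(1, groups, 1, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.groups
        x = x.view(b * g, c // g, h, w)
        # similarity between each position and the group's global descriptor
        sim = (x * self.pool(x)).sum(dim=1, keepdim=True)  # (b*g, 1, h, w)
        # normalize the similarity map within each group
        t = sim.view(b * g, -1)
        t = (t - t.mean(dim=1, keepdim=True)) / (t.std(dim=1, keepdim=True) + 1e-5)
        t = t.view(b, g, h, w) * self.weight + self.bias
        # sigmoid attention factor scales every location in the group
        x = x * torch.sigmoid(t.view(b * g, 1, h, w))
        return x.view(b, c, h, w)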
Adapting Image Super-Resolution State-of-the-arts and Learning Multi-model Ensemble for Video Super-Resolution
Recently, image super-resolution has been widely studied and achieved
significant progress by leveraging the power of deep convolutional neural
networks. However, there has been limited advancement in video super-resolution
(VSR) due to the complex temporal patterns in videos. In this paper, we
investigate how to adapt state-of-the-art methods of image super-resolution for
video super-resolution. The proposed adaptation method is straightforward. The
information among successive frames is well exploited, while the overhead on
the original image super-resolution method is negligible. Furthermore, we
propose a learning-based method to ensemble the outputs from multiple
super-resolution models. Our methods show superior performance and rank second
in Track 1 of the NTIRE 2019 Video Super-Resolution Challenge.
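The abstract leaves the ensemble network unspecified. Purely as an illustration, a learning-based ensemble could predict per-pixel softmax weights over the candidate outputs; everything in this sketch, including the LearnedEnsemble name and layer sizes, is hypothetical.

```python
import torch
import torch.nn as nn

class LearnedEnsemble(nn.Module):
    """Hypothetical per-pixel fusion of M candidate SR outputs."""
    def __init__(self, num_models, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * num_models, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_models, 3, padding=1),
        )

    def forward(self, outputs):  # list of (b, 3, H, W) candidate frames
        stack = torch.stack(outputs, dim=1)         # (b, M, 3, H, W)
        b, m, c, h, w = stack.shape
        wts = self.net(stack.view(b, m * c, h, w))  # (b, M, H, W)
        wts = wts.softmax(dim=1).unsqueeze(2)       # (b, M, 1, H, W)
        return (stack * wts).sum(dim=1)             # blended output
```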
Learning a Wavelet-like Auto-Encoder to Accelerate Deep Neural Networks
Accelerating deep neural networks (DNNs) has been attracting increasing
attention as it can benefit a wide range of applications, e.g., enabling mobile
systems with limited computing resources to own powerful visual recognition
ability. A practical strategy toward this goal usually relies on a two-stage
process: operating on the trained DNNs (e.g., approximating the convolutional
filters with tensor decomposition) and fine-tuning the amended network, leading
to difficulty in balancing the trade-off between acceleration and maintaining
recognition performance. In this work, aiming at a general and comprehensive
way for neural network acceleration, we develop a Wavelet-like Auto-Encoder
(WAE) that decomposes the original input image into two low-resolution channels
(sub-images) and incorporate the WAE into the classification neural networks
for joint training. The two decomposed channels, in particular, are encoded to
carry the low-frequency information (e.g., image profiles) and the
high-frequency information (e.g., image details or noise), respectively, and enable reconstructing the
original input image through the decoding process. Then, we feed the
low-frequency channel into a standard classification network such as VGG or
ResNet and employ a very lightweight network to fuse with the high-frequency
channel to obtain the classification result. Compared to existing DNN
acceleration solutions, our framework has the following advantages: i) it is
compatible with any existing convolutional neural network for classification
without amending its structure; ii) the WAE provides an interpretable way to
preserve the main components of the input image for classification.
Comment: Accepted at AAAI 2018
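A toy sketch of the WAE wiring, under the simplifying assumption of single-layer encoders and decoder (the paper's actual networks and training losses are more elaborate): the image is encoded into two half-resolution sub-images that jointly reconstruct it, and only the low-frequency one would feed the heavy classifier.

```python
import torch
import torch.nn as nn

class WaveletLikeAE(nn.Module):
    """Sketch: encode an image into two half-resolution sub-images
    (low- and high-frequency) that can reconstruct the input."""
    def __init__(self):
        super().__init__()
        self.enc_low = nn.Conv2d(3, 3, 4, stride=2, padding=1)   # image profiles
        self.enc_high = nn.Conv2d(3, 3, 4, stride=2, padding=1)  # details / noise
        self.dec = nn.ConvTranspose2d(6, 3, 4, stride=2, padding=1)

    def forward(self, img):
        low, high = self.enc_low(img), self.enc_high(img)
        recon = self.dec(torch.cat([low, high], dim=1))
        return low, high, recon
```

Running the backbone on the half-resolution low-frequency channel cuts its convolutional FLOPs roughly fourfold, which is where the acceleration comes from.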
HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers
High-resolution representations (HR) are essential for dense prediction tasks
such as segmentation, detection, and pose estimation. Learning HR
representations is typically ignored in previous Neural Architecture Search
(NAS) methods that focus on image classification. This work proposes a novel
NAS method, called HR-NAS, which is able to find efficient and accurate
networks for different tasks, by effectively encoding multiscale contextual
information while maintaining high-resolution representations. In HR-NAS, we
renovate the NAS search space as well as its searching strategy. To better
encode multiscale image contexts in the search space of HR-NAS, we first
carefully design a lightweight transformer, whose computational complexity can
be dynamically changed with respect to different objective functions and
computation budgets. To maintain high-resolution representations of the learned
networks, HR-NAS adopts a multi-branch architecture that provides convolutional
encoding of multiple feature resolutions, inspired by HRNet. Last, we propose
an efficient fine-grained search strategy to train HR-NAS, which effectively
explores the search space, and finds optimal architectures given various tasks
and computation resources. HR-NAS is capable of achieving state-of-the-art
trade-offs between performance and FLOPs for three dense prediction tasks and
an image classification task, given only small computational budgets. For
example, HR-NAS surpasses SqueezeNAS, which is specially designed for semantic
segmentation, while improving efficiency by 45.9%. Code is available at
https://github.com/dingmyu/HR-NAS.
Comment: Accepted by CVPR 2021 (Oral)
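The multi-branch design is only outlined in the abstract; a toy two-branch fusion step in the HRNet spirit might look like the following sketch, where the channel counts and fusion choices are assumptions and the low-resolution branch is taken to be at half the high-resolution branch's size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """Toy sketch: keep a high-resolution stream and a low-resolution
    stream, and exchange information between them."""
    def __init__(self, c_hi, c_lo):
        super().__init__()
        self.hi_to_lo = nn.Conv2d(c_hi, c_lo, 3, stride=2, padding=1)  # downsample
        self.lo_to_hi = nn.Conv2d(c_lo, c_hi, 1)                       # align channels

    def forward(self, hi, lo):
        # each branch keeps its resolution but absorbs the other's features
        lo_out = lo + self.hi_to_lo(hi)
        hi_out = hi + F.interpolate(self.lo_to_hi(lo), size=hi.shape[-2:],
                                    mode="bilinear", align_corners=False)
        return hi_out, lo_out
```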
Zoom-In-to-Check: Boosting Video Interpolation via Instance-level Discrimination
We propose a lightweight video frame interpolation algorithm. Our key
innovation is an instance-level supervision that allows information to be
learned from the high-resolution version of similar objects. Our experiment
shows that the proposed method can generate state-of-the-art results across
different datasets with a fraction of the computational resources (time and memory) of
competing methods. Given two image frames, a cascade network creates an
intermediate frame with 1) a flow-warping module that computes coarse
bi-directional optical flow and creates an interpolated image via flow-based
warping, followed by 2) an image synthesis module to make fine-scale
corrections. In the learning stage, object detection proposals are generated on
the interpolated image. Lower-resolution objects are zoomed into, and an
adversarial loss trained on high-resolution objects guides the system toward
instance-level refinement, correcting details of object shape and boundaries.
Comment: CVPR 2019 camera-ready, supplementary video:
https://youtu.be/q-_wIRq26D
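The flow-warping step of such a cascade can be sketched with grid_sample; the helper below is an assumption-laden illustration rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp img (b, c, h, w) with a per-pixel flow (b, 2, h, w), flow[:, 0] = dx."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys)).float()  # (2, h, w), x coordinate first
    coords = grid.unsqueeze(0) + flow     # follow the flow field
    # normalize coordinates to [-1, 1] as grid_sample expects
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)  # (b, h, w, 2)
    return F.grid_sample(img, grid_n, align_corners=True)
```

To synthesize the middle frame, each input frame would be warped halfway along the estimated bi-directional flow, the two warps blended, and the result passed to the synthesis module for fine-scale correction.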