Adapting Image Super-Resolution State-of-the-arts and Learning Multi-model Ensemble for Video Super-Resolution
Recently, image super-resolution has been widely studied and has achieved
significant progress by leveraging the power of deep convolutional neural
networks. However, video super-resolution (VSR) has advanced more slowly
because of the complex temporal patterns in videos. In this paper, we
investigate how to adapt state-of-the-art image super-resolution methods to
video super-resolution. The proposed adaptation is straightforward:
information among successive frames is well exploited, while the overhead on
the original image super-resolution method is negligible. Furthermore, we
propose a learning-based method to ensemble the outputs of multiple
super-resolution models. Our methods show superior performance and ranked
second in Track 1 of the NTIRE 2019 Video Super-Resolution Challenge.
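To make the ensemble idea concrete, here is a minimal sketch (a hypothetical design, not necessarily the authors' exact architecture): a small convolutional network predicts per-pixel softmax weights over the candidate outputs of several super-resolution models and fuses them into a single frame.

```python
# Minimal sketch of a learned multi-model ensemble (hypothetical design):
# a small CNN predicts per-pixel softmax weights over K candidate
# super-resolved frames and fuses them.
import torch
import torch.nn as nn

class LearnedEnsemble(nn.Module):
    def __init__(self, num_models: int, hidden: int = 32):
        super().__init__()
        # Input: K candidate RGB outputs concatenated along channels.
        self.weight_net = nn.Sequential(
            nn.Conv2d(3 * num_models, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_models, 3, padding=1),
        )

    def forward(self, candidates: torch.Tensor) -> torch.Tensor:
        # candidates: (B, K, 3, H, W) outputs from K SR models.
        b, k, c, h, w = candidates.shape
        logits = self.weight_net(candidates.reshape(b, k * c, h, w))
        weights = logits.softmax(dim=1).unsqueeze(2)  # (B, K, 1, H, W)
        return (weights * candidates).sum(dim=1)      # fused (B, 3, H, W)

# Usage: fuse the outputs of three SR models on a batch of frames.
ens = LearnedEnsemble(num_models=3)
fused = ens(torch.rand(2, 3, 3, 64, 64))
```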
RepGN: Object Detection with Relational Proposal Graph Network
Region-based object detectors achieve state-of-the-art performance, but
few consider modeling the relations among proposals. In this paper, we explore
the idea of modeling the relationships among proposals for object detection
from a graph-learning perspective. Specifically, we present the relational
proposal graph network (RepGN), which is defined on object proposals, with
semantic and spatial relations modeled as edges. By integrating our RepGN
module into object detectors, relation and context constraints are introduced
into the feature extraction of regions as well as bounding-box regression and
classification. Besides, we propose a novel graph-cut-based pooling layer
for hierarchical coarsening of the graph, which empowers the RepGN module to
exploit inter-regional correlation and scene description in a hierarchical
manner. We perform extensive experiments on the COCO object detection dataset
and show promising results.
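As an illustration of relational message passing over proposals, the sketch below builds edges from pairwise IoU only (the paper also models semantic relations) and applies one residual graph update to the proposal features; all names are illustrative.

```python
# Illustrative message passing over proposals with spatial (IoU) edges.
import torch
import torch.nn as nn

def pairwise_iou(boxes: torch.Tensor) -> torch.Tensor:
    # boxes: (N, 4) as (x1, y1, x2, y2)
    x1 = torch.max(boxes[:, None, 0], boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter + 1e-6)

class ProposalGraphLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        adj = pairwise_iou(boxes)                  # (N, N) spatial edges
        adj = adj / adj.sum(dim=1, keepdim=True)   # row-normalize
        return feats + torch.relu(self.proj(adj @ feats))  # residual update

# Usage with five random but valid proposal boxes.
xy = torch.rand(5, 2) * 50
wh = torch.rand(5, 2) * 20 + 1
boxes = torch.cat([xy, xy + wh], dim=1)  # (x1, y1, x2, y2)
layer = ProposalGraphLayer(dim=256)
refined = layer(torch.randn(5, 256), boxes)
```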
CompactNet: Platform-Aware Automatic Optimization for Convolutional Neural Networks
Convolutional Neural Network (CNN) based Deep Learning (DL) has achieved
great progress in many real-life applications. Meanwhile, because complex
model structures run up against strict latency and memory restrictions,
implementing CNN models on resource-limited platforms is becoming more
challenging. This work proposes a solution, called CompactNet (project URL:
https://github.com/CompactNet/CompactNet), which automatically optimizes a
pre-trained CNN model on a specific resource-limited platform given a specific
target of inference speedup. Guided by a simulator of the target platform,
CompactNet progressively trims the pre-trained network by removing redundant
filters until the target speedup is reached, generating an optimal
platform-specific model while maintaining accuracy. We evaluate our work on
two platforms of a Huawei Mate 10 smartphone: a mobile ARM CPU and a
machine-learning accelerator NPU (Cambricon-1A ISA). For MobileNetV2, the
state-of-the-art slim CNN model made for embedded platforms, CompactNet
achieves up to a 1.8x kernel computation speedup with equal or even higher
accuracy for image classification tasks on the CIFAR-10 dataset.
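A toy sketch of the progressive trimming loop follows; the simulator and the trimming criterion here are hypothetical stand-ins (CompactNet's actual redundancy criterion and fine-tuning steps are more involved). Filters are removed until the simulated latency meets the target speedup.

```python
def simulate_latency(filter_counts):
    # Hypothetical platform simulator: here, cost grows with the square of
    # each layer's filter count (a real simulator models the device).
    return sum(n * n for n in filter_counts) * 1e-6

def compact(filter_counts, target_speedup, trim_step=8):
    base = simulate_latency(filter_counts)
    counts = list(filter_counts)
    while simulate_latency(counts) > base / target_speedup:
        if all(n <= trim_step for n in counts):
            break  # target unreachable at this minimum width
        # Trim the widest layer (a stand-in for the paper's redundancy
        # criterion); a short fine-tune would follow each trim in practice.
        i = max(range(len(counts)), key=lambda j: counts[j])
        counts[i] = max(counts[i] - trim_step, trim_step)
    return counts

print(compact([64, 128, 256, 256], target_speedup=1.8))
```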
Multi-Scale Dual-Branch Fully Convolutional Network for Hand Parsing
Recently, fully convolutional neural networks (FCNs) have shown strong
performance in image parsing, including scene parsing and object parsing.
Unlike generic object parsing tasks, hand parsing is more challenging due to
small size, complex structure, heavy self-occlusion, and ambiguous texture.
In this paper, we propose a novel parsing framework, the Multi-Scale
Dual-Branch Fully Convolutional Network (MSDB-FCN), for hand parsing tasks.
Our network employs a dual-branch architecture to extract features of the
hand area, focusing attention on the hand itself. These features are used to
generate multi-scale features with a pyramid pooling strategy. To better
encode multi-scale features, we design a Deconvolution and Bilinear
Interpolation Block (DB-Block) for upsampling and merging features of
different scales. To address data imbalance, a common problem in many
computer vision tasks including hand parsing, we propose a generalization of
Focal Loss, namely Multi-Class Balanced Focal Loss, to tackle data imbalance
in multi-class classification. Extensive experiments on the RHD-PARSING
dataset demonstrate that our MSDB-FCN achieves state-of-the-art performance
for hand parsing.
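As a sketch of what a multi-class balanced focal loss can look like (the paper's exact formulation may differ), the snippet below combines the standard focal modulation (1 - p_t)^gamma with per-class balance weights, e.g. inverse class frequencies.

```python
# Sketch of a multi-class balanced focal loss (illustrative formulation).
import torch
import torch.nn.functional as F

def balanced_focal_loss(logits, target, class_weights, gamma=2.0):
    # logits: (N, C); target: (N,) int64 labels; class_weights: (C,)
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    alpha = class_weights[target]  # per-sample balance weight
    return (-alpha * (1 - pt) ** gamma * log_pt).mean()

logits = torch.randn(8, 5)
target = torch.randint(0, 5, (8,))
w = torch.tensor([1.0, 2.0, 0.5, 1.5, 1.0])  # e.g. inverse class frequency
loss = balanced_focal_loss(logits, target, w)
```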
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Current video representations rely heavily on learning from manually
annotated video datasets, which are time-consuming and expensive to acquire.
We observe that videos are naturally accompanied by abundant text information,
such as YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture the correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge number of pair-instance classes, and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video
dataset (Instagram-300k) to demonstrate their effectiveness. Without further
fine-tuning, the learnt models obtain competitive results for action
classification on Kinetics under the linear classification protocol. Moreover,
our visual model provides an effective initialization for fine-tuning on
downstream tasks, yielding a remarkable performance gain for action
recognition on UCF101 and HMDB51 compared with existing state-of-the-art
self-supervised training methods. In addition, our CPD model yields a new
state of the art for zero-shot action recognition on UCF101 by directly
utilizing the learnt visual-textual embeddings. The code will be made
available at
https://github.com/MCG-NJU/CPD-Video.
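A minimal sketch of the pair-discrimination objective follows, using an in-batch InfoNCE-style contrastive loss as a common stand-in for the paper's noise-contrastive estimation (the temperature and batch construction are assumptions).

```python
# In-batch cross-modal pair discrimination (InfoNCE-style sketch).
import torch
import torch.nn.functional as F

def pair_discrimination_loss(video_emb, text_emb, tau=0.07):
    # video_emb, text_emb: (B, D) embeddings of paired clips and their
    # titles/captions; the other in-batch pairs serve as noise samples.
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.T / tau            # (B, B) similarity matrix
    labels = torch.arange(v.size(0))  # diagonal entries are true pairs
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = pair_discrimination_loss(torch.randn(16, 512), torch.randn(16, 512))
```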
Progressive Stochastic Binarization of Deep Networks
A plethora of recent research has focused on improving the memory footprint
and inference speed of deep networks by reducing the complexity of (i)
numerical representations (for example, by deterministic or stochastic
quantization) and (ii) arithmetic operations (for example, by binarization of
weights).
We propose a stochastic binarization scheme for deep networks that allows for
efficient inference on hardware by restricting itself to additions of small
integers and fixed shifts. Unlike previous approaches, the underlying
randomized approximation is progressive, thus permitting an adaptive control of
the accuracy of each operation at run-time. In a low-precision setting, we
match the accuracy of previous binarized approaches. Our representation is
unbiased: it approaches continuous computation with increasing sample size. In
a high-precision regime, the computational costs are competitive with previous
quantization schemes. Progressive stochastic binarization also permits
localized, dynamic accuracy control within a single network, thereby providing
a new tool for adaptively focusing computational attention.
We evaluate our method on networks of various architectures, already
pretrained on ImageNet. With representational costs comparable to previous
schemes, we obtain accuracies close to the original floating point
implementation. This includes pruned networks, except for the known special
case of certain types of separable convolutions. By focusing computational
attention using progressive sampling, we further reduce inference costs on
ImageNet by up to 33% (before network pruning).
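The unbiased, progressive property can be illustrated with a toy stochastic binarizer (not the paper's hardware-oriented shift-and-add scheme): each sample is a random sign whose expectation equals x, so averaging more samples recovers the continuous value with variance shrinking as 1/S.

```python
# Toy unbiased progressive stochastic binarization.
import torch

def stochastic_binarize(x, num_samples):
    # x in [-1, 1]; P(b = +1) = (1 + x) / 2 gives E[b] = x (unbiased).
    p = (x.clamp(-1, 1) + 1) / 2
    samples = (torch.rand(num_samples, *x.shape) < p).float() * 2 - 1
    return samples.mean(dim=0)  # variance shrinks as 1/num_samples

x = torch.rand(1000) * 2 - 1
for s in (1, 4, 16, 64):
    err = (stochastic_binarize(x, s) - x).abs().mean()
    print(f"{s:3d} samples: mean abs error {err:.3f}")
```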
Learning Gaussian Instance Segmentation in Point Clouds
This paper presents a novel method for instance segmentation of 3D point
clouds. The proposed method is called Gaussian Instance Center Network (GICN),
which can approximate the distributions of instance centers scattered in the
whole scene as Gaussian center heatmaps. Based on the predicted heatmaps, a
small number of center candidates can be easily selected for the subsequent
predictions with efficiency, including i) predicting the instance size of each
center to decide a range for extracting features, ii) generating bounding boxes
for centers, and iii) producing the final instance masks. GICN is a
single-stage, anchor-free, and end-to-end architecture that is easy to train
and efficient at inference. Benefiting from the center-dictated mechanism
with adaptive instance-size selection, our method achieves state-of-the-art
performance on the task of 3D instance segmentation on the ScanNet and S3DIS
datasets.
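As an illustrative analogue of the Gaussian center heatmaps (a simplified stand-in for GICN's learned prediction, with made-up parameters), the sketch below computes per-point target heat from the distance to the nearest instance center and selects a small number of candidates with a distance-based non-maximum suppression.

```python
# Gaussian center heatmaps on points and candidate selection (illustrative).
import torch

def center_heatmap(points, centers, sigma=0.5):
    # points: (N, 3); centers: (M, 3) ground-truth instance centers
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, M)
    return torch.exp(-d2 / (2 * sigma ** 2)).max(dim=1).values      # (N,)

def select_candidates(points, heat, thresh=0.7, radius=1.0):
    order = heat.argsort(descending=True)
    keep = []
    for i in order:
        if heat[i] < thresh:
            break
        # Keep a point only if it is far from already-kept candidates.
        if all((points[i] - points[j]).norm() > radius for j in keep):
            keep.append(i)
    return torch.stack([points[i] for i in keep]) if keep else points[:0]

pts = torch.rand(2048, 3) * 10
ctrs = torch.tensor([[2.0, 2.0, 2.0], [7.0, 7.0, 7.0]])
heat = center_heatmap(pts, ctrs)
print(select_candidates(pts, heat).shape)
```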
Towards More Efficient and Effective Inference: The Joint Decision of Multi-Participants
Existing approaches that improve the performance of convolutional neural
networks by optimizing local architectures or deepening the networks tend to
increase model size significantly. To deploy neural networks on edge devices,
which are in great demand, reducing the scale of the networks is crucial.
However, compressing networks easily degrades image-processing performance. In
this paper, we propose a method that is suitable for edge devices while
improving the efficiency and effectiveness of inference. The joint decision of
multiple participants, mainly comprising multiple layers and multiple
networks, achieves higher classification accuracy (by up to 0.26% on CIFAR-10
and 4.49% on CIFAR-100) with a similar total number of parameters for
classical convolutional neural networks.
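A minimal sketch of such a joint decision (a hypothetical setup, not the paper's exact scheme): average the softmax outputs of several participating classifiers, e.g. heads attached to different layers and sibling networks, instead of relying on a single large model.

```python
# Joint decision by averaging softmax outputs of multiple participants.
import torch

def joint_decision(logit_list):
    # logit_list: (B, C) logits from each participating head/network.
    probs = [l.softmax(dim=1) for l in logit_list]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)

heads = [torch.randn(4, 10) for _ in range(3)]  # three participants
print(joint_decision(heads))
```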
Translate the Facial Regions You Like Using Region-Wise Normalization
Though GAN (Generative Adversarial Network) based techniques have greatly
advanced the performance of image synthesis and face translation, only a few
works in the literature provide region-based style encoding and translation.
In this paper, we propose a region-wise normalization framework for
region-level face translation. While per-region style is encoded using an
existing approach, we build a so-called RIN (region-wise normalization) block
to individually inject the styles into per-region feature maps and then fuse
them for the following convolution and upsampling. Both the shape and texture
of different regions can thus be translated to various target styles. A region
matching loss has also been proposed to significantly reduce the interference
between regions during the translation process. Extensive experiments on three
publicly available datasets, i.e. Morph, RaFD, and CelebAMask-HQ, suggest that
our approach demonstrates a large improvement over state-of-the-art methods
such as StarGAN, SEAN, and FUNIT. Our approach has the further advantage of
precise control over the regions to be translated. As a result, region-level
expression changes and step-by-step makeup can be achieved. The video demo is
available at
https://youtu.be/ceRqsbzXAfk.
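A sketch of region-wise style injection in the spirit of a RIN block (simplified; the paper's actual block may differ): instance-normalize the features, then apply per-region scale and shift predicted from each region's style code, masked and summed back into a single feature map.

```python
# Simplified region-wise normalization block.
import torch
import torch.nn as nn

class RegionWiseNorm(nn.Module):
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(style_dim, channels)
        self.to_beta = nn.Linear(style_dim, channels)

    def forward(self, feat, masks, styles):
        # feat: (B, C, H, W); masks: (B, R, H, W) soft region masks summing
        # to 1 per pixel; styles: (B, R, style_dim) per-region style codes.
        x = self.norm(feat)
        out = torch.zeros_like(x)
        for r in range(masks.size(1)):
            g = self.to_gamma(styles[:, r])[:, :, None, None]
            b = self.to_beta(styles[:, r])[:, :, None, None]
            out = out + masks[:, r:r + 1] * (g * x + b)
        return out

rin = RegionWiseNorm(channels=64, style_dim=128)
feat = torch.randn(2, 64, 32, 32)
masks = torch.rand(2, 3, 32, 32).softmax(dim=1)  # three soft regions
out = rin(feat, masks, torch.randn(2, 3, 128))
```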
Adaptive Exploration for Unsupervised Person Re-Identification
Due to domain bias, directly deploying a deep person re-identification
(re-ID) model trained on one dataset often yields considerably poor accuracy
on another dataset. In this paper, we propose an Adaptive Exploration (AE)
method to address the domain-shift problem for re-ID in an unsupervised
manner. Specifically, in the target domain, the re-ID model is induced to
1) maximize distances between all person images and 2) minimize distances
between similar person images. In the first case, by treating each person
image as an individual class, a non-parametric classifier with a feature
memory is exploited to encourage person images to move far away from each
other. In the second case, according to a similarity threshold, our method
adaptively selects neighborhoods for each person image in the feature space.
By treating these similar person images as the same class, the non-parametric
classifier forces them to stay closer. However, a problem with the adaptive
selection is that when an image has too many neighborhoods, it is more likely
to attract other images as its neighborhoods. As a result, a minority of
images may select a large number of neighborhoods while a majority of images
have only a few. To address this issue, we additionally integrate a balance
strategy into the adaptive selection. We evaluate our method with two
protocols. The first, called "target-only re-ID", uses only the unlabeled
target data for training. The second, called "domain adaptive re-ID", uses
both the source data and the target data during training. Experimental
results on large-scale re-ID datasets demonstrate the effectiveness of our
method. Our code has been released at
https://github.com/dyh127/Adaptive-Exploration-for-Unsupervised-Person-Re-Identification.
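A minimal sketch of the two ingredients described above (illustrative, not the authors' full AE method): a feature memory acting as a non-parametric classifier in which every image is its own class, and adaptive neighborhood selection by a similarity threshold.

```python
# Feature memory classifier and adaptive neighborhood selection (sketch).
import torch
import torch.nn.functional as F

class FeatureMemory:
    def __init__(self, num_images, dim, momentum=0.5, tau=0.05):
        self.bank = F.normalize(torch.randn(num_images, dim), dim=1)
        self.momentum, self.tau = momentum, tau

    def logits(self, feats):
        # Non-parametric classifier: each stored image is its own class.
        return F.normalize(feats, dim=1) @ self.bank.T / self.tau

    def update(self, feats, indices):
        # Momentum update of the stored features for the given images.
        f = F.normalize(feats, dim=1)
        self.bank[indices] = F.normalize(
            self.momentum * self.bank[indices] + (1 - self.momentum) * f,
            dim=1)

def select_neighbors(feats, bank, threshold=0.6):
    # Adaptive selection: neighbors are all images whose cosine similarity
    # to the query exceeds the threshold (a balance strategy would cap how
    # many neighborhoods any single image may join).
    sim = F.normalize(feats, dim=1) @ bank.T
    return [torch.nonzero(row > threshold).flatten() for row in sim]

mem = FeatureMemory(num_images=100, dim=128)
feats = torch.randn(4, 128)
scores = mem.logits(feats)                  # (4, 100) instance logits
nbrs = select_neighbors(feats, mem.bank)    # per-image neighbor indices
```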