Spatial Shortcut Network for Human Pose Estimation
Like many computer vision problems, human pose estimation is challenging in that recognizing a body part requires information not only from the local area but also from areas at a large spatial distance. To pass information spatially, large convolutional kernels and deep layers are normally used, introducing a high computation cost and a large parameter space. Fortunately for pose estimation, the human body is geometrically structured in images, which enables modeling of spatial dependencies. In this paper, we propose a spatial shortcut network for the pose estimation task, in which information can flow spatially more easily. We evaluate our model with detailed analyses and show that it achieves outstanding performance with a smaller structure.
Comment: 12 pages
CU-Net: Coupled U-Nets
We design a new connectivity pattern for the U-Net architecture. Given
several stacked U-Nets, we couple each U-Net pair through the connections of
their semantic blocks, resulting in the coupled U-Nets (CU-Net). The coupling
connections could make the information flow more efficiently across U-Nets. The
feature reuse across U-Nets makes each U-Net very parameter efficient. We
evaluate the coupled U-Nets on two benchmark datasets of human pose estimation.
Both accuracy and the number of model parameters are compared. The CU-Net obtains accuracy comparable to state-of-the-art methods while having at least 60% fewer parameters than other approaches.
Comment: BMVC 2018 (Oral)
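A minimal sketch of the coupling idea described above, assuming PyTorch; this is an illustrative reading of the abstract, not the authors' released code, and all module names and layer sizes are assumptions. Each semantic block of a later U-Net takes, besides its own input, the feature map of the matching block from the previous U-Net, concatenated along channels and reduced back with a 1x1 convolution.

```python
import torch
import torch.nn as nn

class SemanticBlock(nn.Module):
    """One conv block at a fixed resolution inside a U-Net (illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class CoupledBlock(nn.Module):
    """Block of a later U-Net: concatenates its own input with the feature map
    of the matching block from the previous U-Net (the coupling connection),
    then reduces channels with a 1x1 conv so features are reused, not copied."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)
        self.block = SemanticBlock(ch, ch)

    def forward(self, x, partner_feat):
        x = self.fuse(torch.cat([x, partner_feat], dim=1))
        return self.block(x)
```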
NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction
In this paper, we propose a novel Convolutional Neural Network (CNN)
structure for general-purpose multi-task learning (MTL), which enables
automatic feature fusing at every layer from different tasks. This is in
contrast with the most widely used MTL CNN structures which empirically or
heuristically share features on some specific layers (e.g., share all the
features except the last convolutional layer). The proposed layerwise feature
fusing scheme is formulated by combining existing CNN components in a novel
way, with clear mathematical interpretability as discriminative dimensionality
reduction, which is referred to as Neural Discriminative Dimensionality
Reduction (NDDR). Specifically, we first concatenate features with the same
spatial resolution from different tasks according to their channel dimension.
Then, we show that the discriminative dimensionality reduction can be fulfilled
by 1x1 Convolution, Batch Normalization, and Weight Decay in one CNN. The use
of existing CNN components ensures the end-to-end training and the
extensibility of the proposed NDDR layer to various state-of-the-art CNN
architectures in a "plug-and-play" manner. The detailed ablation analysis shows
that the proposed NDDR layer is easy to train and also robust to different
hyperparameters. Experiments on different task sets with various base network
architectures demonstrate the promising performance and desirable
generalizability of our proposed method. The code of our paper is available at
https://github.com/ethanygao/NDDR-CNN.
Comment: 11 pages, 3 figures, 9 tables
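The NDDR layer can be sketched directly from the description above: concatenate same-resolution features from the tasks along the channel dimension, then apply a 1x1 convolution and batch normalization per task branch, with weight decay supplied by the optimizer. The class below is a hedged PyTorch reading of that description, not the code at the linked repository; names and the two-task setup are assumptions.

```python
import torch
import torch.nn as nn

class NDDRLayer(nn.Module):
    """Fuses same-resolution features from two tasks via 1x1 conv + BN.
    Each task receives back a tensor with its original channel count."""
    def __init__(self, channels):
        super().__init__()
        # One 1x1 conv per task, reading the concatenated (2 * channels) input.
        self.conv_a = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.conv_b = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.bn_a = nn.BatchNorm2d(channels)
        self.bn_b = nn.BatchNorm2d(channels)

    def forward(self, feat_a, feat_b):
        fused = torch.cat([feat_a, feat_b], dim=1)   # concat along channels
        out_a = self.bn_a(self.conv_a(fused))        # dimensionality reduction for task A
        out_b = self.bn_b(self.conv_b(fused))        # dimensionality reduction for task B
        return out_a, out_b

# The weight decay mentioned in the abstract would come from the optimizer, e.g.
# torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```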
Human Pose Regression by Combining Indirect Part Detection and Contextual Information
In this paper, we propose an end-to-end trainable regression approach for
human pose estimation from still images. We use the proposed Soft-argmax
function to convert feature maps directly to joint coordinates, resulting in a
fully differentiable framework. Our method is able to learn heat map representations indirectly, without the additional step of generating artificial ground truth. Consequently, contextual information can be included in the pose predictions in a seamless way. We evaluated our method on two very challenging datasets, the Leeds Sports Pose (LSP) and MPII Human Pose datasets, reaching the best performance among all existing regression methods and results comparable to the state-of-the-art detection-based approaches.
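A minimal sketch of the Soft-argmax idea referenced above, assuming PyTorch: a spatial softmax turns each heat map into a probability distribution, and joint coordinates are recovered as the expected x/y position, keeping the mapping differentiable. Tensor shapes, normalization to [0, 1], and the function name are assumptions, not the authors' exact formulation.

```python
import torch

def soft_argmax(heatmaps):
    """heatmaps: (batch, joints, H, W) -> coordinates: (batch, joints, 2) in [0, 1]."""
    b, j, h, w = heatmaps.shape
    # Spatial softmax: each heat map becomes a probability distribution over pixels.
    probs = torch.softmax(heatmaps.reshape(b, j, -1), dim=-1).reshape(b, j, h, w)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    # Expected coordinate = sum over the grid of position * probability.
    exp_y = (probs.sum(dim=3) * ys).sum(dim=2)
    exp_x = (probs.sum(dim=2) * xs).sum(dim=2)
    return torch.stack([exp_x, exp_y], dim=-1)
```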
Smart Device based Initial Movement Detection of Cyclists using Convolutional Neuronal Networks
For future traffic scenarios, we envision interconnected traffic participants who exchange information about their current state, e.g., position and predicted intentions, allowing them to act in a cooperative manner. Vulnerable road users (VRUs), e.g., pedestrians and cyclists, will be equipped with smart devices that can be used to detect their intentions and transmit the detected intentions to approaching cars, so that their drivers can be warned. In this article, we focus on detecting the initial movement of cyclists using smart devices. Smart devices provide the necessary sensors, namely accelerometer and gyroscope, and are therefore an excellent instrument for quickly detecting movement transitions (e.g., from waiting to moving). Convolutional Neural Networks have proven to be the state-of-the-art solution for many problems, with an ever increasing range of applications. We therefore model initial movement detection as a classification problem. In terms of Organic Computing (OC), it can be seen as a step towards self-awareness and self-adaptation. We apply residual network architectures to the task of detecting the initial starting movement of cyclists.
Comment: 12 pages, accepted for publication at OC-DDC 2018, Würzburg, Germany
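As a rough illustration of the setup described above, and not the authors' network, the classifier could be a small 1D residual CNN over windows of the six accelerometer/gyroscope channels predicting a waiting vs. moving label; PyTorch, all layer sizes, and the class names below are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """1D residual block over the time axis of an IMU window."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, 3, padding=1)
        self.bn1 = nn.BatchNorm1d(ch)
        self.conv2 = nn.Conv1d(ch, ch, 3, padding=1)
        self.bn2 = nn.BatchNorm1d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # residual shortcut

class StartingMovementNet(nn.Module):
    """Input: (batch, 6, T) windows of accelerometer + gyroscope samples."""
    def __init__(self, n_classes=2, width=32):
        super().__init__()
        self.stem = nn.Conv1d(6, width, 7, padding=3)
        self.blocks = nn.Sequential(ResBlock1d(width), ResBlock1d(width))
        self.head = nn.Linear(width, n_classes)

    def forward(self, x):
        x = self.blocks(self.stem(x))
        return self.head(x.mean(dim=-1))   # global average pooling over time
```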
Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference
Modern convolutional neural networks apply the same operations on every pixel
in an image. However, not all image regions are equally important. To address
this inefficiency, we propose a method to dynamically apply convolutions
conditioned on the input image. We introduce a residual block where a small
gating branch learns which spatial positions should be evaluated. These
discrete gating decisions are trained end-to-end using the Gumbel-Softmax
trick, in combination with a sparsity criterion. Our experiments on CIFAR,
ImageNet and MPII show that our method has better focus on the region of
interest and better accuracy than existing methods, at a lower computational
complexity. Moreover, we provide an efficient CUDA implementation of our
dynamic convolutions using a gather-scatter approach, achieving a significant
improvement in inference speed with MobileNetV2 residual blocks. On human pose
estimation, a task that is inherently spatially sparse, the processing speed is
increased by 60% with no loss in accuracy.
Comment: CVPR 2020 (poster), https://github.com/thomasverelst/dyncon
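A minimal sketch of the spatial gating mechanism described above, assuming PyTorch; it is not the authors' implementation. A small gating branch emits two logits per position, a hard Gumbel-Softmax sample selects which positions are evaluated, and a sparsity term can penalize the fraction of active positions. The dense mask here only emulates the behavior for training; the reported speedups rely on a sparse gather/scatter CUDA kernel that this sketch does not include.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyGatedResBlock(nn.Module):
    """Residual block whose conv branch is only applied at positions
    selected by a learned spatial gate (emulated with a dense mask)."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        # Gating branch: two logits (execute / skip) per spatial position.
        self.gate = nn.Conv2d(ch, 2, kernel_size=1)

    def forward(self, x, tau=1.0):
        logits = self.gate(x)                                              # (B, 2, H, W)
        mask = F.gumbel_softmax(logits, tau=tau, hard=True, dim=1)[:, :1]  # (B, 1, H, W)
        out = self.conv(x) * mask + x                                      # identity path kept everywhere
        sparsity = mask.mean()                                             # fraction of evaluated pixels
        return F.relu(out), sparsity

# A sparsity criterion could penalize deviation from a target execution rate, e.g.
# loss = task_loss + alpha * (sparsity - target_rate) ** 2
```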
Generate What You Can't See - a View-dependent Image Generation
In order to operate autonomously, a robot should explore the environment and
build a model of each of the surrounding objects. A common approach is to
carefully scan the whole workspace. This is time-consuming. It is also often
impossible to reach all the viewpoints required to acquire full knowledge about
the environment. Humans can perform shape completion of occluded objects by
relying on past experience. Therefore, we propose a method that generates
images of an object from various viewpoints using a single input RGB image. A
deep neural network is trained to imagine the object appearance from many
viewpoints. We present the whole pipeline, which takes a single RGB image as
input and returns a sequence of RGB and depth images of the object. The method
utilizes a CNN-based object detector to extract the object from the natural
scene. Then, the proposed network generates a set of RGB and depth images. We
show the results both on a synthetic dataset and on real images.
Comment: Submitted to IROS 2019. Copyright 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Supplementary video: https://youtu.be/gCAoJ7BM5F
Residual Codean Autoencoder for Facial Attribute Analysis
Facial attributes can provide rich ancillary information which can be
utilized for different applications such as targeted marketing, human computer
interaction, and law enforcement. This research focuses on facial attribute
prediction using a novel deep learning formulation, termed the R-Codean autoencoder. The paper first presents a Cosine similarity based loss function for an autoencoder, which is then incorporated into the Euclidean distance based autoencoder to formulate R-Codean. The proposed loss function thus aims to incorporate both the magnitude and the direction of image vectors during feature
learning. Further, inspired by the utility of shortcut connections in deep
models to facilitate learning of optimal parameters, without incurring the
problem of vanishing gradient, the proposed formulation is extended to
incorporate shortcut connections in the architecture. The proposed R-Codean autoencoder is utilized in a facial attribute prediction framework which incorporates a patch-based weighting mechanism to assign higher weights to relevant patches for each attribute. The experimental results on the publicly available CelebA and LFWA datasets demonstrate the efficacy of the proposed approach in addressing this challenging problem.
Comment: Accepted in Pattern Recognition Letters
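As an illustrative, non-authoritative reading of the combined loss described above: reconstruction can be penalized with both a Euclidean term (magnitude) and one minus the cosine similarity (direction) between the input and reconstructed image vectors. PyTorch, the weighting factor, and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def codean_loss(x, x_rec, lam=1.0):
    """Combines a magnitude (Euclidean) term and a direction (cosine) term
    between flattened input and reconstruction vectors; lam is an assumed weight."""
    x = x.flatten(start_dim=1)
    x_rec = x_rec.flatten(start_dim=1)
    euclidean = F.mse_loss(x_rec, x)                              # magnitude term
    cosine = 1.0 - F.cosine_similarity(x_rec, x, dim=1).mean()    # direction term
    return euclidean + lam * cosine
```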
A Survey of the Recent Architectures of Deep Convolutional Neural Networks
Deep Convolutional Neural Network (CNN) is a special type of Neural Networks,
which has shown exemplary performance on several competitions related to
Computer Vision and Image Processing. Some of the exciting application areas of
CNN include Image Classification and Segmentation, Object Detection, Video
Processing, Natural Language Processing, and Speech Recognition. The powerful
learning ability of deep CNN is primarily due to the use of multiple feature
extraction stages that can automatically learn representations from the data.
The availability of large amounts of data and improvements in hardware technology have accelerated research in CNNs, and recently interesting deep
CNN architectures have been reported. Several inspiring ideas to bring
advancements in CNNs have been explored, such as the use of different
activation and loss functions, parameter optimization, regularization, and
architectural innovations. However, the significant improvement in the
representational capacity of the deep CNN is achieved through architectural
innovations. Notably, the ideas of exploiting spatial and channel information,
depth and width of architecture, and multi-path information processing have
gained substantial attention. Similarly, the idea of using a block of layers as
a structural unit is also gaining popularity. This survey thus focuses on the
intrinsic taxonomy present in the recently reported deep CNN architectures and,
consequently, classifies the recent innovations in CNN architectures into seven
different categories. These seven categories are based on spatial exploitation,
depth, multi-path, width, feature-map exploitation, channel boosting, and
attention. Additionally, an elementary understanding of CNN components, current challenges, and applications of CNNs is also provided.
Comment: Number of Pages: 70, Number of Figures: 11, Number of Tables: 11. Artif Intell Rev (2020)
Dynamic Filter Networks
In a traditional convolutional layer, the learned filters stay fixed after
training. In contrast, we introduce a new framework, the Dynamic Filter
Network, where filters are generated dynamically conditioned on an input. We
show that this architecture is a powerful one, with increased flexibility
thanks to its adaptive nature, yet without an excessive increase in the number
of model parameters. A wide variety of filtering operations can be learned this
way, including local spatial transformations, but also others like selective
(de)blurring or adaptive feature extraction. Moreover, multiple such layers can
be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness
of the dynamic filter network on the tasks of video and stereo prediction, and
reach state-of-the-art performance on the moving MNIST dataset with a much
smaller model. By visualizing the learned filters, we illustrate that the
network has picked up flow information by only looking at unlabelled training
data. This suggests that the network can be used to pretrain networks for
various supervised tasks in an unsupervised way, like optical flow and depth
estimation.
Comment: Submitted to NIPS16; X. Jia and B. De Brabandere contributed equally to this work and are listed in alphabetical order
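A minimal sketch of the dynamic filter idea described above, in which filters are generated at run time from the input rather than staying fixed after training. This illustrative PyTorch version generates one depthwise k x k filter per channel for each sample and applies it with a grouped convolution; the filter-generating network, shapes, and the softmax normalization are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterLayer(nn.Module):
    """Generates a per-sample, per-channel k x k filter from the input and
    applies it to that same input (depthwise dynamic convolution)."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.channels, self.k = channels, k
        # Filter-generating network: pools the input and predicts k*k weights per channel.
        self.generator = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels * k * k),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        filters = self.generator(x).reshape(b * c, 1, self.k, self.k)
        # Normalize each generated kernel so its weights sum to one (an assumed choice).
        filters = torch.softmax(filters.flatten(1), dim=1).reshape_as(filters)
        # Grouped-conv trick: fold the batch into channels so every sample
        # is convolved with its own generated filters.
        out = F.conv2d(x.reshape(1, b * c, h, w), filters,
                       padding=self.k // 2, groups=b * c)
        return out.reshape(b, c, h, w)
```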