Joint Maximum Purity Forest with Application to Image Super-Resolution
In this paper, we propose a novel random-forest scheme, namely Joint Maximum
Purity Forest (JMPF), for classification, clustering, and regression tasks. In
the JMPF scheme, the original feature space is transformed into a compactly
pre-clustered feature space, via a trained rotation matrix. The rotation matrix
is obtained through an iterative quantization process, where the input data
belonging to different classes are clustered to the respective vertices of the
new feature space with maximum purity. In the new feature space, orthogonal
hyperplanes, which are employed at the split-nodes of decision trees in random
forests, can tackle the clustering problems effectively. We evaluated our
proposed method on public benchmark datasets for regression and classification
tasks, and experiments showed that JMPF remarkably outperforms other
state-of-the-art random-forest-based approaches. Furthermore, we applied JMPF
to image super-resolution, because the transformed, compact features are more
discriminative for the clustering-regression scheme. Experimental results on
several public benchmark datasets also showed that the JMPF-based image
super-resolution scheme is consistently superior to recent state-of-the-art
image super-resolution algorithms.
Comment: 18 pages, 7 figures.
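The rotation step described above can be sketched with a minimal ITQ-style routine: alternate between snapping the rotated points to the nearest vertex of the binary hypercube and solving an orthogonal Procrustes problem for the rotation. This is a generic sketch of the iterative-quantization idea under stated assumptions, not the authors' exact JMPF training procedure; all names are illustrative.

```python
import numpy as np

def iterative_quantization(X, n_iters=50, seed=0):
    """Learn an orthogonal rotation R that maps the (zero-centred) rows of X
    close to the vertices of the binary hypercube {-1, +1}^d."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Start from a random orthogonal matrix.
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))
    for _ in range(n_iters):
        B = np.sign(X @ R)          # assign each point to its nearest vertex
        B[B == 0] = 1.0
        # Orthogonal Procrustes: min_R ||B - X R||_F with X^T B = U S V^T.
        U, _, Vt = np.linalg.svd(X.T @ B)
        R = U @ Vt
    return R

X = np.random.default_rng(1).standard_normal((200, 4))
X -= X.mean(axis=0)
R = iterative_quantization(X)
print(R.shape)  # (4, 4); R is orthogonal by construction
```

In the JMPF setting, axis-aligned split hyperplanes in the rotated space then act like the vertex assignment above, which is why purity improves after rotation.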
A Survey of the Recent Architectures of Deep Convolutional Neural Networks
A Deep Convolutional Neural Network (CNN) is a special type of neural network,
which has shown exemplary performance on several competitions related to
Computer Vision and Image Processing. Some of the exciting application areas of
CNN include Image Classification and Segmentation, Object Detection, Video
Processing, Natural Language Processing, and Speech Recognition. The powerful
learning ability of deep CNN is primarily due to the use of multiple feature
extraction stages that can automatically learn representations from the data.
The availability of large amounts of data and improvements in hardware
technology have accelerated research in CNNs, and recently interesting deep
CNN architectures have been reported. Several inspiring ideas to bring
advancements in CNNs have been explored, such as the use of different
activation and loss functions, parameter optimization, regularization, and
architectural innovations. However, the significant improvement in the
representational capacity of the deep CNN is achieved through architectural
innovations. Notably, the ideas of exploiting spatial and channel information,
depth and width of architecture, and multi-path information processing have
gained substantial attention. Similarly, the idea of using a block of layers as
a structural unit is also gaining popularity. This survey thus focuses on the
intrinsic taxonomy present in the recently reported deep CNN architectures and,
consequently, classifies the recent innovations in CNN architectures into seven
different categories. These seven categories are based on spatial exploitation,
depth, multi-path, width, feature-map exploitation, channel boosting, and
attention. Additionally, the elementary understanding of CNN components,
current challenges, and applications of CNN are also provided.
Comment: 70 pages, 11 figures, 11 tables. Artif Intell Rev (2020).
Robust Emotion Recognition from Low Quality and Low Bit Rate Video: A Deep Learning Approach
Emotion recognition from facial expressions is tremendously useful,
especially when coupled with smart devices and wireless multimedia
applications. However, the inadequate network bandwidth often limits the
spatial resolution of the transmitted video, which will heavily degrade the
recognition reliability. We develop a novel framework to achieve robust emotion
recognition from low bit rate video. While video frames are downsampled at the
encoder side, the decoder is embedded with a deep network model for joint
super-resolution (SR) and recognition. Notably, we propose a novel max-mix
training strategy, leading to a single "One-for-All" model that is remarkably
robust to a vast range of downsampling factors. That makes our framework well
adapted for the varied bandwidths in real transmission scenarios, without
hampering scalability or efficiency. The proposed framework is evaluated on the
AVEC 2016 benchmark, and demonstrates significantly improved stand-alone
recognition performance, as well as rate-distortion (R-D) performance, compared
with either directly recognizing from LR frames or separating SR and
recognition.
Comment: Accepted by the Seventh International Conference on Affective
Computing and Intelligent Interaction (ACII 2017).
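Training a single model across many downsampling factors can be sketched as a data-augmentation loop: each training frame is degraded by a randomly chosen factor before being fed to the network. This is a simplified illustration of mixed-resolution training, not the paper's exact "max-mix" schedule; the factor set and the block-average/nearest-neighbour degradation are assumptions.

```python
import numpy as np

def degrade(frame, factor):
    """Simulate a low-resolution input: block-average downsample by `factor`,
    then nearest-neighbour upsample back to the original size."""
    h, w = frame.shape
    assert h % factor == 0 and w % factor == 0
    lr = frame.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(lr, factor, axis=0), factor, axis=1)

def mixed_factor_batch(frames, factors=(1, 2, 4), seed=0):
    """One 'mixed' batch: each frame degraded by a randomly chosen factor,
    so one model is exposed to the whole range of resolutions."""
    rng = np.random.default_rng(seed)
    return np.stack([degrade(f, rng.choice(factors)) for f in frames])

frames = np.random.default_rng(1).standard_normal((8, 32, 32))
batch = mixed_factor_batch(frames)
print(batch.shape)  # (8, 32, 32)
```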
Learning with Rethinking: Recurrently Improving Convolutional Neural Networks through Feedback
Recent years have witnessed the great success of convolutional neural network
(CNN) based models in the field of computer vision. CNN is able to learn
hierarchically abstracted features from images in an end-to-end training
manner. However, most of the existing CNN models only learn features through a
feedforward structure and no feedback information from top to bottom layers is
exploited to enable the networks to refine themselves. In this paper, we
propose a "Learning with Rethinking" algorithm. By adding a feedback layer and
producing the emphasis vector, the model is able to recurrently boost the
performance based on previous prediction. Particularly, it can be employed to
boost any pre-trained models. This algorithm is tested on four object
classification benchmark datasets: CIFAR-100, CIFAR-10, MNIST-background-image,
and the ILSVRC-2012 dataset. The results demonstrate the advantage of
training CNN models with the proposed feedback mechanism.
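The feedback idea above can be illustrated with a two-pass toy model: the first-pass class posteriors are mapped back to an emphasis vector over feature channels, and the reweighted features produce a second-pass prediction. This is a minimal numpy sketch of the emphasis mechanism, not the paper's architecture; the weight matrices are hypothetical stand-ins for learned layers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W_feat2cls = rng.standard_normal((10, 16)) * 0.1    # classifier over 16 channels
W_cls2feat = np.abs(rng.standard_normal((16, 10)))  # feedback layer (hypothetical)

x = rng.standard_normal(16)         # feature vector from convolutional layers
p1 = softmax(W_feat2cls @ x)        # first-pass prediction
emphasis = W_cls2feat @ p1          # emphasis vector over feature channels
x2 = x * emphasis                   # channel-wise reweighted features
p2 = softmax(W_feat2cls @ x2)       # refined second-pass prediction
print(p2.shape)  # (10,)
```

The same feedback pass can in principle be repeated, each iteration refining the prediction of the previous one.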
Chaining Identity Mapping Modules for Image Denoising
We propose to learn a fully-convolutional network model that consists of a
Chain of Identity Mapping Modules (CIMM) for image denoising. The CIMM
structure possesses two distinctive features that are important for the noise
removal task. Firstly, each residual unit employs identity mappings as the skip
connections and receives pre-activated input in order to preserve the gradient
magnitude propagated in both the forward and backward directions. Secondly, by
utilizing dilated kernels for the convolution layers in the residual branch, in
other words within an identity mapping module, each neuron in the last
convolution layer can observe the full receptive field of the first layer.
After being trained on the BSD400 dataset, the proposed network produces
remarkably higher numerical accuracy and better visual image quality than the
state-of-the-art when being evaluated on conventional benchmark images and the
BSD68 dataset.
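The receptive-field claim above follows from a standard identity: for stride-1 convolutions, each layer with kernel size k and dilation d adds (k - 1) * d to the receptive field. A small helper makes this concrete; the example layer configuration is illustrative, not the exact CIMM module.

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 conv layers, each given as
    (kernel_size, dilation): rf = 1 + sum((k - 1) * d)."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# Four 3x3 convolutions with exponentially growing dilation:
print(receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)]))  # 31
```

Doubling the dilation per layer grows the receptive field exponentially in depth, which is how the last convolution layer of a module can cover the full receptive field of the first.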
Pose-adaptive Hierarchical Attention Network for Facial Expression Recognition
Multi-view facial expression recognition (FER) is a challenging task because
the appearance of an expression varies with pose. To alleviate the influence of
poses, recent methods either perform pose normalization or learn separate FER
classifiers for each pose. However, these methods usually have two stages and
rely on good performance of pose estimators. Different from existing methods,
we propose a pose-adaptive hierarchical attention network (PhaNet) that can
jointly recognize facial expressions and poses in unconstrained
environments. Specifically, PhaNet discovers the regions most relevant to the
facial expression by an attention mechanism in hierarchical scales, and the
most informative scales are then selected to learn the pose-invariant and
expression-discriminative representations. PhaNet is end-to-end trainable by
minimizing the hierarchical attention losses, the FER loss and pose loss with
dynamically learned loss weights. We validate the effectiveness of the proposed
PhaNet on three multi-view datasets (BU-3DFE, Multi-pie, and KDEF) and two
in-the-wild FER datasets (AffectNet and SFEW). Extensive experiments
demonstrate that our framework outperforms the state-of-the-art under both
within-dataset and cross-dataset settings, achieving average accuracies of
84.92\%, 93.53\%, 88.5\%, 54.82\%, and 31.25\%, respectively.
Comment: 12 pages, 15 figures.
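The region-attention step can be sketched generically: score each candidate region, softmax-normalise the scores into attention weights, and pool the region features into one attended representation. This is a minimal sketch of soft spatial attention, not PhaNet's hierarchical architecture; the scoring vector stands in for a learned layer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(region_features, score_vec):
    """Score each region, normalise the scores into attention weights,
    and pool the regions into a single attended feature vector."""
    scores = region_features @ score_vec        # (n_regions,)
    weights = softmax(scores)                   # non-negative, sum to 1
    return weights @ region_features, weights   # (dim,), (n_regions,)

rng = np.random.default_rng(0)
regions = rng.standard_normal((6, 32))  # 6 candidate regions, 32-D features each
v = rng.standard_normal(32)             # hypothetical learned scoring vector
attended, w = attend(regions, v)
print(attended.shape)  # (32,)
```

Applying this at several scales and then selecting the most informative scales gives the hierarchical variant described in the abstract.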
Face Recognition in Low Quality Images: A Survey
Low-resolution face recognition (LRFR) has received increasing attention over
the past few years. Its applications lie widely in the real-world environment
when high-resolution or high-quality images are hard to capture. One of the
biggest demands for LRFR technologies is video surveillance. As the number
of surveillance cameras in cities increases, the captured videos will
need to be processed automatically. However, those videos or images are usually
captured with large standoffs, arbitrary illumination conditions, and diverse
angles of view. Faces in these images are generally small in size. Several
studies have addressed this problem with techniques like super-resolution,
deblurring, or learning a relationship between different resolution domains. In
this paper, we provide a comprehensive review of approaches to low-resolution
face recognition in the past five years. First, a general problem definition is
given. Then, a systematic analysis of the works on this topic is presented
by category. In addition to describing the methods, we also focus on datasets
and experiment settings. We further address the related works on unconstrained
low-resolution face recognition and compare them with results that use
synthetic low-resolution data. Finally, we summarize the general limitations
and suggest priorities for future effort.
Comment: There are some mistakes in this paper which may be misleading to the
reader, and we will not have a new version in the short term. We will resubmit
once it is corrected.
Unsupervised and Unregistered Hyperspectral Image Super-Resolution with Mutual Dirichlet-Net
Hyperspectral images (HSI) provide rich spectral information that has
contributed to performance improvements in numerous computer vision tasks.
However, this comes at the expense of the images' spatial resolution.
Hyperspectral image super-resolution (HSI-SR) addresses this problem by fusing
low resolution (LR) HSI with multispectral image (MSI) carrying much higher
spatial resolution (HR). All existing HSI-SR approaches require the LR HSI and
HR MSI to be well registered and the reconstruction accuracy of the HR HSI
relies heavily on the registration accuracy of different modalities. This paper
exploits the uncharted problem domain of HSI-SR without the requirement of
multi-modality registration. Given the unregistered LR HSI and HR MSI with
overlapped regions, we design a unique unsupervised learning structure linking
the two unregistered modalities by projecting them into the same statistical
space through the same encoder. The mutual information (MI) is further adopted
to capture the non-linear statistical dependencies between the representations
from two modalities (carrying spatial information) and their raw inputs. By
maximizing the MI, spatial correlations between different modalities can be
well characterized to further reduce the spectral distortion. A collaborative
norm is employed as the reconstruction error instead of the more
common norm, so that individual pixels can be recovered as accurately as
possible. With this design, the network is able to extract correlated spectral
and spatial information from unregistered images while better preserving the
spectral information. The proposed method is referred to as unregistered and
unsupervised mutual Dirichlet Net (-MDN). Extensive experimental results
using benchmark HSI datasets demonstrate the superior performance of -MDN
as compared to the state-of-the-art.
Comment: Submitted to IEEE Transactions on Image Processing.
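The quantity being maximized above is standard mutual information, I(X; Y) = sum p(x, y) log(p(x, y) / (p(x) p(y))). The paper estimates it over learned representations; as a self-contained illustration of the definition, a simple histogram-based estimate already separates dependent from independent signals. The bin count and toy data here are assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based mutual-information estimate (in nats) between
    two 1-D samples x and y."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                            # joint distribution
    px = pxy.sum(axis=1, keepdims=True)         # marginal of x
    py = pxy.sum(axis=0, keepdims=True)         # marginal of y
    nz = pxy > 0                                # 0 * log 0 := 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
a = rng.standard_normal(10000)
b = a + 0.1 * rng.standard_normal(10000)   # strongly dependent on a
c = rng.standard_normal(10000)             # independent of a
print(mutual_information(a, b) > mutual_information(a, c))  # True
```

Because the estimate is a KL divergence between the joint and the product of marginals, it is non-negative and vanishes only under (estimated) independence.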
Transfer Metric Learning: Algorithms, Applications and Outlooks
Distance metric learning (DML) aims to find an appropriate way to reveal the
underlying data relationships. It is critical in many machine learning, pattern
recognition, and data mining algorithms, and usually requires a large amount of
label information (such as class labels or pair/triplet constraints) to achieve
satisfactory performance. However, the label information may be insufficient in
real-world applications due to the high labeling cost, and DML may fail in this
case. Transfer metric learning (TML) is able to mitigate this issue for DML in
the domain of interest (target domain) by leveraging knowledge/information from
other related domains (source domains). Although it has achieved a certain level
of development, TML has had limited success in various aspects such as selective
transfer, theoretical understanding, handling complex data, big data and
extreme cases. In this survey, we present a systematic review of the TML
literature. In particular, we group TML into different categories according to
different settings and metric transfer strategies, such as direct metric
approximation, subspace approximation, distance approximation, and distribution
approximation. A summarization and insightful discussion of the various TML
approaches and their applications will be presented. Finally, we indicate some
challenges and provide possible future directions.
Comment: 14 pages, 5 figures.
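The metric that DML (and hence TML) typically learns is a Mahalanobis distance d_M(x, y) = sqrt((x - y)^T M (x - y)) with M positive semi-definite; what TML transfers across domains is, in effect, the matrix M. A minimal sketch of the distance itself, with M = L^T L so that it equals a Euclidean distance in the linearly transformed space:

```python
import numpy as np

def mahalanobis(x, y, M):
    """Mahalanobis distance d_M(x, y) = sqrt((x - y)^T M (x - y))."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

rng = np.random.default_rng(0)
L = rng.standard_normal((3, 3))
M = L.T @ L                     # PSD by construction, so d_M is a valid metric
x, y = rng.standard_normal(3), rng.standard_normal(3)

# With M = I the distance reduces to the ordinary Euclidean distance.
print(np.isclose(mahalanobis(x, y, np.eye(3)), np.linalg.norm(x - y)))  # True
```

The factorization M = L^T L also shows why d_M(x, y) = ||Lx - Ly||: learning the metric is equivalent to learning a linear embedding.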
In Defense of Single-column Networks for Crowd Counting
Crowd counting, usually addressed via density estimation, has become an
increasingly important topic in computer vision due to its widespread
applications in video surveillance, urban planning, and intelligence gathering.
However, it is essentially a challenging task because of the greatly varied
sizes of objects, coupled with severe occlusions and vague appearance of
extremely small individuals. Existing methods heavily rely on multi-column
learning architectures to extract multi-scale features, which, however, suffer
from heavy computational cost that is especially undesirable for crowd
counting. In this
paper, we propose the single-column counting network (SCNet) for efficient
crowd counting without relying on multi-column networks. SCNet consists of
residual fusion modules (RFMs) for multi-scale feature extraction, a pyramid
pooling module (PPM) for information fusion, and a sub-pixel convolutional
module (SPCM) followed by a bilinear upsampling layer for resolution recovery.
These proposed modules enable our SCNet to fully capture multi-scale features
in a compact single-column architecture and estimate high-resolution density
maps in an efficient way. In addition, we provide a principled paradigm for
density map generation and data augmentation for training, which shows further
improved performance. Extensive experiments on three benchmark datasets show
that our SCNet delivers new state-of-the-art performance and surpasses previous
methods by large margins, which demonstrates the great effectiveness of SCNet
as a single-column network for crowd counting.
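The sub-pixel convolutional module mentioned above recovers resolution with a depth-to-space rearrangement ("pixel shuffle"): a convolution produces C * r^2 channels, which are rearranged into C channels at r-times the spatial size. A minimal numpy sketch of the rearrangement step alone (the surrounding convolutions are omitted):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space: rearrange (C*r^2, H, W) -> (C, H*r, W*r), the
    upsampling step used by sub-pixel convolutional modules."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)        # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)      # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)  # 4 = 1 * 2^2 channels
y = pixel_shuffle(x, 2)
print(y.shape)  # (1, 6, 6)
```

Each 2x2 output block is filled from the corresponding position of the four input channels, so no interpolation is performed and the upsampling is learned entirely by the preceding convolution.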