Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach
The past decades have witnessed the rapid development of image and video
coding techniques in the era of big data. However, the signal fidelity-driven
coding pipeline design limits the capability of the existing image/video coding
frameworks to fulfill the needs of both machine and human vision. In this
paper, we come up with a novel image coding framework by leveraging both the
compressive and the generative models, to support machine vision and human
perception tasks jointly. Given an input image, feature analysis is first
applied, and a generative model is then employed to reconstruct the image from
the features and additional reference pixels; in this work, compact edge maps
are extracted as the features to connect both kinds of vision in a scalable
way. The compact edge map serves as the base layer for machine vision tasks,
and the reference pixels act as an enhancement layer to guarantee
signal fidelity for human vision. By introducing advanced generative models, we
train a flexible network to reconstruct images from compact feature
representations and the reference pixels. Experimental results demonstrate the
superiority of our framework in both human visual quality and facial landmark
detection, providing useful evidence for the emerging standardization efforts
on MPEG VCM (Video Coding for Machines).
Comment: Project page: https://williamyang1991.github.io/projects/VCM-Face
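A minimal sketch of the layered idea described above, in Python (not the authors' implementation; extract_edge_map, sample_reference_pixels, and the toy generator are illustrative placeholders): the compact edge map forms the base layer consumed by machine-vision tasks, and sparse reference pixels form the enhancement layer used by a generative decoder to restore fidelity for human viewing.

```python
import numpy as np

def extract_edge_map(img, thresh=0.1):
    """Base layer: a compact binary edge map (simple gradient edges as a
    stand-in for a learned edge extractor)."""
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))
    return ((gx + gy) > thresh).astype(np.uint8)

def sample_reference_pixels(img, stride=16):
    """Enhancement layer: a sparse grid of reference pixels carrying
    fidelity information for human viewing."""
    ys, xs = np.mgrid[0:img.shape[0]:stride, 0:img.shape[1]:stride]
    return np.stack([ys.ravel(), xs.ravel(), img[ys, xs].ravel()], axis=1)

def encode(img):
    return {"base": extract_edge_map(img),            # machine-vision layer
            "enhance": sample_reference_pixels(img)}  # human-vision layer

def decode_for_machine(stream):
    # Machine-vision tasks (e.g. landmark detection) read only the base layer.
    return stream["base"]

def decode_for_human(stream, generator):
    # A generative model reconstructs the image from edges + reference pixels.
    return generator(stream["base"], stream["enhance"])

if __name__ == "__main__":
    img = np.random.rand(128, 128).astype(np.float32)   # toy grayscale image
    stream = encode(img)
    edges = decode_for_machine(stream)
    recon = decode_for_human(stream, generator=lambda e, r: e.astype(np.float32))
    print(edges.shape, recon.shape)
```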
Robust Emotion Recognition from Low Quality and Low Bit Rate Video: A Deep Learning Approach
Emotion recognition from facial expressions is tremendously useful,
especially when coupled with smart devices and wireless multimedia
applications. However, inadequate network bandwidth often limits the
spatial resolution of the transmitted video, which heavily degrades
recognition reliability. We develop a novel framework to achieve robust emotion
recognition from low bit rate video. While video frames are downsampled at the
encoder side, the decoder is embedded with a deep network model for joint
super-resolution (SR) and recognition. Notably, we propose a novel max-mix
training strategy, leading to a single "One-for-All" model that is remarkably
robust to a vast range of downsampling factors. That makes our framework well
adapted for the varied bandwidths in real transmission scenarios, without
hampering scalability or efficiency. The proposed framework is evaluated on the
AVEC 2016 benchmark and demonstrates significantly better stand-alone
recognition performance, as well as rate-distortion (R-D) performance, than
either directly recognizing from LR frames or separating SR and recognition.
Comment: Accepted by the Seventh International Conference on Affective
Computing and Intelligent Interaction (ACII2017)
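A minimal sketch of the mixed-downsampling training idea (a rough reading of the "One-for-All" strategy, not the authors' code; the tiny model and the factor set are assumptions): each training batch is downsampled by a randomly chosen factor, so a single joint SR-and-recognition model sees the full range of degradations.

```python
import torch
import torch.nn.functional as F
from torch import nn

class JointSRRecognizer(nn.Module):
    """Toy stand-in for the joint super-resolution + recognition decoder."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.cls = nn.Linear(16, n_classes)

    def forward(self, lr_frames, out_size):
        # Upsample to a common working resolution, then classify.
        x = F.interpolate(lr_frames, size=out_size, mode="bilinear",
                          align_corners=False)
        return self.cls(self.features(x).flatten(1))

def train_step(model, optimizer, hr_frames, labels,
               factors=(2, 4, 8)):   # assumed range of downsampling factors
    """One mixed-factor update: each batch uses a randomly chosen factor,
    so one model stays robust across the whole range."""
    f = int(torch.randint(len(factors), (1,)))
    lr = F.interpolate(hr_frames, scale_factor=1.0 / factors[f],
                       mode="bilinear", align_corners=False)
    logits = model(lr, out_size=hr_frames.shape[-2:])
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```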
Deep Sparse Subspace Clustering
In this paper, we present a deep extension of Sparse Subspace Clustering,
termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere
distribution assumption for the learned deep features, DSSC can infer a new
data affinity matrix by simultaneously satisfying the sparsity principle of SSC
and the nonlinearity given by neural networks. One appealing advantage of DSSC
is that, when the original real-world data do not meet the class-specific
linear subspace distribution assumption, it can employ the hierarchical
nonlinear transformations of neural networks to make the assumption valid. To
the best of our knowledge, this is among the first deep
learning-based subspace clustering methods. Extensive experiments on four
real-world datasets show that the proposed DSSC is significantly superior to 12
existing subspace clustering methods.
Comment: The initial version was completed at the beginning of 201
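A minimal sketch of the core objective as this abstract describes it, under assumptions (the tiny MLP, the lambda weight, and the exact loss form are illustrative, not the paper's formulation): deep features are projected onto the unit sphere and a sparse self-expressive matrix C is learned so that Z ≈ CZ, after which |C| + |C|ᵀ would serve as the affinity matrix for spectral clustering.

```python
import torch
from torch import nn

class DSSCSketch(nn.Module):
    """Toy sketch: deep features on the unit sphere + a sparse
    self-expressive coefficient matrix (the SSC principle)."""
    def __init__(self, dim_in, dim_feat, n_samples):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, 64), nn.ReLU(),
                                 nn.Linear(64, dim_feat))
        self.C = nn.Parameter(1e-3 * torch.randn(n_samples, n_samples))

    def forward(self, x):
        z = self.net(x)
        z = z / z.norm(dim=1, keepdim=True)          # unit-sphere constraint
        c = self.C - torch.diag(torch.diag(self.C))  # forbid trivial self-representation
        return z, c

def dssc_loss(z, c, lam=0.1):
    recon = ((z - c @ z) ** 2).sum()   # self-expressiveness: Z ~= C Z
    sparsity = c.abs().sum()           # l1 sparsity as in SSC
    return recon + lam * sparsity

# Usage: z, c = model(x); loss = dssc_loss(z, c); loss.backward()
# Affinity for spectral clustering: (c.abs() + c.abs().T).detach()
```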
Deferred Neural Rendering: Image Synthesis using Neural Textures
The modern computer graphics pipeline can synthesize images at remarkable
visual quality; however, it requires well-defined, high-quality 3D content as
input. In this work, we explore the use of imperfect 3D content, for instance
obtained from photometric reconstructions with noisy and incomplete surface
geometry, while still aiming to produce photo-realistic (re-)renderings. To
address this challenging problem, we introduce Deferred Neural Rendering, a new
paradigm for image synthesis that combines the traditional graphics pipeline
with learnable components. Specifically, we propose Neural Textures, which are
learned feature maps that are trained as part of the scene capture process.
Similar to traditional textures, neural textures are stored as maps on top of
3D mesh proxies; however, the high-dimensional feature maps contain
significantly more information, which can be interpreted by our new deferred
neural rendering pipeline. Both neural textures and deferred neural renderer
are trained end-to-end, enabling us to synthesize photo-realistic images even
when the original 3D content was imperfect. In contrast to traditional,
black-box 2D generative neural networks, our 3D representation gives us
explicit control over the generated output, and allows for a wide range of
application domains. For instance, we can synthesize temporally-consistent
video re-renderings of recorded 3D scenes as our representation is inherently
embedded in 3D space. This way, neural textures can be utilized to coherently
re-render or manipulate existing video content in both static and dynamic
environments at real-time rates. We show the effectiveness of our approach in
several experiments on novel view synthesis, scene editing, and facial
reenactment, and compare to state-of-the-art approaches that leverage the
standard graphics pipeline as well as conventional generative neural networks.
Comment: Video: https://youtu.be/z-pVip6WeyY SIGGRAPH 201
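A minimal sketch of the central mechanism described above, with assumed architecture details (texture resolution, feature width, and decoder depth are placeholders, not the paper's configuration): a learnable feature texture is sampled with the UV map rasterized from the (possibly imperfect) 3D proxy, and a small CNN decodes the sampled features into an RGB image.

```python
import torch
import torch.nn.functional as F
from torch import nn

class NeuralTextureRenderer(nn.Module):
    """Toy deferred-neural-rendering sketch: a learned feature texture is
    sampled with the rasterized UV map, then decoded into an RGB image."""
    def __init__(self, tex_res=256, feat_dim=16):
        super().__init__()
        # Neural texture: a learnable feature map stored on the mesh proxy.
        self.texture = nn.Parameter(torch.randn(1, feat_dim, tex_res, tex_res) * 0.01)
        self.decoder = nn.Sequential(                 # deferred neural renderer
            nn.Conv2d(feat_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, uv):
        # uv: (B, H, W, 2) per-pixel texture coordinates in [-1, 1],
        # obtained by rasterizing the 3D proxy geometry.
        feats = F.grid_sample(self.texture.expand(uv.shape[0], -1, -1, -1),
                              uv, align_corners=False)
        return self.decoder(feats)

if __name__ == "__main__":
    uv = torch.rand(2, 128, 128, 2) * 2 - 1   # stand-in for a rasterized UV map
    print(NeuralTextureRenderer()(uv).shape)   # (2, 3, 128, 128)
```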
Deep Boosting: Joint Feature Selection and Analysis Dictionary Learning in Hierarchy
This work investigates how the traditional image classification pipelines can
be extended into a deep architecture, inspired by recent successes of deep
neural networks. We propose a deep boosting framework based on layer-by-layer
joint feature boosting and dictionary learning. In each layer, we construct a
dictionary of filters by combining the filters from the lower layer, and
iteratively optimize the image representation with a joint
discriminative-generative formulation, i.e., minimization of the empirical
classification error plus regularization of analysis image generation over the
training images. For optimization, we alternate between two steps: i) to
minimize the classification error, we select the most discriminative features
using the gentle AdaBoost algorithm; ii) given the selected features, we update
the filters to minimize the regularization on the analysis image representation
using gradient descent. Once the optimization has converged, we learn the
higher-layer representation in the same way. Our model
delivers several distinct advantages. First, our layer-wise optimization
provides the potential to build very deep architectures. Second, the generated
image representation is compact and meaningful. In several visual recognition
tasks, our framework outperforms existing state-of-the-art approaches.
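A minimal sketch of step i) above, under assumptions (regression-stump weak learners over per-filter responses; not the paper's exact procedure): Gentle AdaBoost greedily picks the filter responses that most reduce the weighted classification error, and the selected indices would then drive the filter update in step ii).

```python
import numpy as np

def gentleboost_select(responses, labels, n_rounds=10):
    """Toy feature-selection step: Gentle AdaBoost with regression stumps.
    responses: (n_samples, n_filters) filter responses; labels in {-1, +1}."""
    n, d = responses.shape
    w = np.full(n, 1.0 / n)          # sample weights
    F_pred = np.zeros(n)             # accumulated strong classifier
    selected = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            x = responses[:, j]
            # Weighted least-squares stump f(x) = a*x + b.
            sw = w.sum()
            mx, my = (w * x).sum() / sw, (w * labels).sum() / sw
            cov = (w * (x - mx) * (labels - my)).sum()
            var = (w * (x - mx) ** 2).sum() + 1e-12
            a, b = cov / var, my - (cov / var) * mx
            err = (w * (labels - (a * x + b)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, a, b)
        _, j, a, b = best
        f = a * responses[:, j] + b
        F_pred += f
        w *= np.exp(-labels * f)     # re-weight samples, as in AdaBoost
        w /= w.sum()
        selected.append(j)           # index of the chosen (most discriminative) filter
    return selected, np.sign(F_pred)
```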
Recover Canonical-View Faces in the Wild with Deep Neural Networks
Face images in the wild undergo large intra-personal variations, such as
poses, illuminations, occlusions, and low resolutions, which cause great
challenges to face-related applications. This paper addresses this challenge by
proposing a new deep learning framework that can recover the canonical view of
face images. It dramatically reduces the intra-person variances, while
maintaining the inter-person discriminativeness. Unlike existing face
reconstruction methods, which were either evaluated in controlled 2D
environments or employed 3D information, our approach directly learns the
transformation from face images with a complex set of variations to their canonical views.
At the training stage, to avoid the costly process of labeling canonical-view
images from the training set by hand, we have devised a new measurement to
automatically select or synthesize a canonical-view image for each identity. As
an application, this face recovery approach is used for face verification.
Facial features are learned from the recovered canonical-view face images by
using a facial component-based convolutional neural network. Our approach
achieves state-of-the-art performance on the LFW dataset.
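A minimal sketch of the two ingredients described above, with assumed specifics (the symmetry-plus-sharpness score and the tiny network are stand-ins, not the paper's actual selection measurement or architecture): a canonical-view target is picked automatically for each identity, and a network is trained to regress wild faces to that target.

```python
import torch
import torch.nn.functional as F
from torch import nn

def canonical_score(face):
    """Toy stand-in for a selection measurement: prefer faces that are
    left-right symmetric (roughly frontal) and sharp."""
    sym = -F.mse_loss(face, torch.flip(face, dims=[-1]))
    sharp = face.diff(dim=-1).abs().mean() + face.diff(dim=-2).abs().mean()
    return sym + sharp

def pick_canonical(faces_of_identity):
    # faces_of_identity: (N, 3, H, W) images of one person "in the wild".
    scores = torch.stack([canonical_score(f) for f in faces_of_identity])
    return faces_of_identity[scores.argmax()]

class FaceRecoveryNet(nn.Module):
    """Toy encoder-decoder regressing a canonical view from a wild face."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, wild_faces, canonical_targets):
    loss = F.mse_loss(model(wild_faces), canonical_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```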
End-to-End Facial Deep Learning Feature Compression with Teacher-Student Enhancement
In this paper, we propose a novel end-to-end feature compression scheme by
leveraging the representation and learning capability of deep neural networks,
towards intelligent front-end equipped analysis with promising accuracy and
efficiency. In particular, the extracted features are compactly coded in an
end-to-end manner by optimizing the rate-distortion cost to achieve
feature-in-feature representation. In order to further improve the compression
performance, we present a latent-code-level teacher-student enhancement model,
which can efficiently transfer the low bit-rate representation into a high
bit-rate one. Such a strategy further allows us to adaptively shift the
representation cost to decoding computations, leading to more flexible feature
compression with enhanced decoding capability. We verify the effectiveness of
the proposed model on facial features, and experimental results reveal better
compression performance in terms of rate-accuracy compared with existing
models.
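A minimal sketch of the two training objectives as this abstract describes them, with assumed details (the linear transforms, the rate proxy, and the lambda weight are illustrative): features pass through an analysis transform, a quantized latent, and a synthesis transform trained with a rate-distortion loss, while a separate teacher-student term pushes the low-rate latent toward a higher-rate one.

```python
import torch
import torch.nn.functional as F
from torch import nn

class FeatureCodec(nn.Module):
    """Toy end-to-end feature codec: analysis/synthesis transforms around a
    quantized latent, trained with a rate + distortion objective."""
    def __init__(self, feat_dim=512, latent_dim=64):
        super().__init__()
        self.encode = nn.Linear(feat_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, feat_dim)

    def forward(self, feat):
        y = self.encode(feat)
        y_hat = y + (torch.round(y) - y).detach()   # straight-through quantizer
        return self.decode(y_hat), y_hat

def rd_loss(feat, recon, y_hat, lam=0.01):
    distortion = F.mse_loss(recon, feat)            # feature fidelity
    rate = y_hat.abs().mean()                       # crude proxy for coding cost
    return distortion + lam * rate

def enhancement_loss(student_latent, teacher_latent):
    """Teacher-student step: push the low bit-rate latent toward the latent
    produced at a higher bit rate, shifting cost to the decoder side."""
    return F.mse_loss(student_latent, teacher_latent.detach())
```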
Deep Learning-Based Video Coding: A Review and A Case Study
The past decade has witnessed great success of deep learning technology in
many disciplines, especially in computer vision and image processing. However,
deep learning-based video coding remains in its infancy. This paper reviews
representative works on using deep learning for image/video coding, which has
been an actively developing research area since 2015. We divide the related
works into two categories: new coding schemes built primarily upon deep
networks (deep schemes), and deep network-based coding tools (deep tools) to be
used within traditional coding schemes or together with traditional coding
tools. For deep schemes, pixel probability modeling and auto-encoders are the
two main approaches, which can be viewed as predictive coding and transform
coding, respectively. For deep
tools, there have been several proposed techniques using deep learning to
perform intra-picture prediction, inter-picture prediction, cross-channel
prediction, probability distribution prediction, transform, post- or in-loop
filtering, down- and up-sampling, as well as encoding optimizations. In the
hope of advocating the research of deep learning-based video coding, we present
a case study of our developed prototype video codec, namely Deep Learning Video
Coding (DLVC). DLVC features two deep tools that are both based on
convolutional neural network (CNN), namely CNN-based in-loop filter (CNN-ILF)
and CNN-based block adaptive resolution coding (CNN-BARC). Both tools help
improve the compression efficiency by a significant margin. With the two deep
tools as well as other non-deep coding tools, DLVC achieves an average of 39.6%
and 33.0% bit savings over HEVC under the random-access and low-delay
configurations, respectively. The source code of DLVC has been released for
future research.
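A minimal sketch of a CNN-based in-loop filter in the spirit of CNN-ILF (the layer count and width are assumptions, not the DLVC configuration): a small residual network predicts a correction that is added back to the reconstructed frame before it is used as a reference.

```python
import torch
from torch import nn

class CNNInLoopFilter(nn.Module):
    """Toy residual CNN in the spirit of an in-loop filter (CNN-ILF):
    it predicts a correction added back to the reconstructed frame."""
    def __init__(self, channels=1, width=32, depth=4):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(width, channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, recon_frame):
        # Residual learning: output = reconstruction + predicted correction.
        return recon_frame + self.body(recon_frame)

if __name__ == "__main__":
    frame = torch.rand(1, 1, 64, 64)    # a decoded luma block, for illustration
    print(CNNInLoopFilter()(frame).shape)
```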
Privacy-Preserving Deep Inference for Rich User Data on The Cloud
Deep neural networks are increasingly being used in a variety of machine
learning applications applied to rich user data on the cloud. However, this
approach introduces a number of privacy and efficiency challenges, as the cloud
operator can perform secondary inferences on the available data. Recently,
advances in edge processing have paved the way for more efficient, and private,
data processing at the source for simple tasks and lighter models, though they
remain a challenge for larger, and more complicated models. In this paper, we
present a hybrid approach for breaking down large, complex deep models for
cooperative, privacy-preserving analytics. We do this by breaking down popular
deep architectures and fine-tuning them in a particular way. We then evaluate
the privacy benefits of this approach based on the information exposed to the
cloud service. We also assess the local inference cost of different layers on a
modern handset for mobile applications. Our evaluations show that by using
certain kinds of fine-tuning and embedding techniques, at a small processing
cost, we can greatly reduce the amount of information available to unintended
tasks applied to the data features on the cloud, thereby achieving the desired
tradeoff between privacy and performance.
Comment: arXiv admin note: substantial text overlap with arXiv:1703.0295
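A minimal sketch of the layer-splitting idea, under assumptions (mobilenet_v2 from torchvision as a stand-in backbone and an arbitrary split depth; the paper's fine-tuning and embedding steps are omitted): early layers run on the device and only an intermediate feature map is shipped to the cloud.

```python
import torch
from torch import nn
from torchvision import models

def split_model(depth=6):
    """Toy layer split (not the paper's exact partition): the first `depth`
    feature blocks run on the device, the rest run on the cloud."""
    net = models.mobilenet_v2(weights=None)     # stand-in backbone
    blocks = list(net.features.children())
    device_part = nn.Sequential(*blocks[:depth])
    cloud_part = nn.Sequential(*blocks[depth:], nn.AdaptiveAvgPool2d(1),
                               nn.Flatten(), net.classifier)
    return device_part, cloud_part

if __name__ == "__main__":
    device_part, cloud_part = split_model()
    image = torch.rand(1, 3, 224, 224)
    embedding = device_part(image)       # only this feature leaves the device
    logits = cloud_part(embedding)       # the cloud never sees raw pixels
    print(embedding.shape, logits.shape)
```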
Convolutional Neural Networks with Transformed Input based on Robust Tensor Network Decomposition
Tensor network decomposition, which originated in quantum physics to model
entangled many-particle quantum systems, turns out to be a promising
mathematical technique for representing and processing big data in a
parsimonious manner. In this study, we show that tensor networks can
systematically partition structured data, e.g., color images, for distributed
storage and communication in a privacy-preserving manner. Leveraging the sea of
big data and metadata privacy, empirical results show that neighbouring
subtensors with implicit information stored in tensor network formats cannot be
identified for data reconstruction. This technique complements the existing
encryption and randomization techniques, which store explicit data
representations in one place and are highly susceptible to adversarial attacks
such as side-channel attacks and de-anonymization. Furthermore, we propose a theory
for adversarial examples that mislead convolutional neural networks to
misclassification using subspace analysis based on singular value decomposition
(SVD). The theory is extended to analyze higher-order tensors using
tensor-train SVD (TT-SVD); it helps to explain the level of susceptibility of
different datasets to adversarial attacks, the structural similarity of
different adversarial attacks including global and localized attacks, and the
efficacy of different adversarial defenses based on input transformation. An
efficient and adaptive algorithm based on robust TT-SVD is then developed to
detect strong and static adversarial attacks.
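A minimal TT-SVD sketch (the standard tensor-train decomposition via successive truncated SVDs; the rank cap and toy image are assumptions, and this is not the paper's adaptive detection algorithm): the tensor is factored into a train of 3-way cores, and the retained singular subspaces are the kind of structure the susceptibility analysis above builds on.

```python
import numpy as np

def tt_svd(tensor, max_rank=8):
    """Tensor-train decomposition via successive truncated SVDs (TT-SVD)."""
    shape = tensor.shape
    cores, rank = [], 1
    mat = tensor.reshape(rank * shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(rank, shape[k], r))   # k-th 3-way core
        rank = r
        mat = (np.diag(s[:r]) @ vt[:r]).reshape(rank * shape[k + 1], -1)
    cores.append(mat.reshape(rank, shape[-1], 1))            # last core
    return cores

def tt_reconstruct(cores):
    """Contract the train of cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(0).squeeze(-1)

if __name__ == "__main__":
    img = np.random.rand(32, 32, 3)          # toy stand-in for a color image
    cores = tt_svd(img, max_rank=8)
    approx = tt_reconstruct(cores)
    print([c.shape for c in cores])
    print(np.linalg.norm(approx - img) / np.linalg.norm(img))  # truncation error
```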