Non-Volume Preserving-based Feature Fusion Approach to Group-Level Expression Recognition on Crowd Videos
Group-level emotion recognition (ER) is a growing research area, as the
demand for assessing crowds of all sizes is rising in both the security
arena and social media. This work extends earlier ER investigations, which
focused on group-level ER either in single images or within a single video,
by fully investigating group-level expression recognition on
crowd videos. In this paper, we propose an effective deep feature level fusion
mechanism to model the spatial-temporal information in the crowd videos. In our
approach, the fusion process is performed in the deep feature domain by a
generative probabilistic model, Non-Volume Preserving Fusion (NVPF), that
models spatial relationships among features. Furthermore, we extend the
proposed spatial NVPF approach to a spatial-temporal NVPF (TNVPF) approach
that learns the temporal information between frames. In order to demonstrate the robustness and
effectiveness of each component in the proposed approach, three experiments
were conducted: (i) evaluation on the AffectNet database to benchmark the proposed
EmoNet for recognizing facial expressions; (ii) evaluation on EmotiW2018 to
benchmark the proposed deep feature level fusion mechanism NVPF; and (iii)
evaluation of the proposed TNVPF on a new Group-level Emotion on Crowd Videos
(GECV) dataset composed of 627 videos collected from publicly available
sources. The GECV dataset is a collection of videos containing crowds of people.
Each video is labeled with emotion categories at three levels: individual
faces, groups of people, and the entire video frame.
Comment: Under review at Pattern Recognition
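The abstract does not spell out the NVPF architecture; "non-volume preserving" suggests coupling transforms in the spirit of RealNVP, whose Jacobian determinant is not constrained to one. Below is a minimal, hypothetical numpy sketch of one such affine coupling step applied to a pair of stacked face features; the matrices W_s and W_t and the single-layer scale/translation networks are illustrative stand-ins, not the paper's design.

```python
# Hypothetical sketch of a non-volume-preserving (affine) coupling step,
# in the spirit of RealNVP; the paper's actual NVPF layers may differ.
import numpy as np

rng = np.random.default_rng(0)

def affine_coupling(x, W_s, W_t):
    """Split features in half; transform one half conditioned on the other.

    y1 = x1
    y2 = x2 * exp(s(x1)) + t(x1)
    log|det J| = sum(s(x1)), nonzero in general, hence "non-volume preserving".
    """
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s = np.tanh(x1 @ W_s)          # scale network (a single layer here)
    t = x1 @ W_t                   # translation network
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=-1)       # log-determinant of the Jacobian
    return np.concatenate([x1, y2], axis=-1), log_det

# Fuse two per-face deep features by stacking them and passing the stack
# through the coupling transform (illustrative only).
feat_a = rng.normal(size=(1, 64))   # deep feature of face A
feat_b = rng.normal(size=(1, 64))   # deep feature of face B
x = np.concatenate([feat_a, feat_b], axis=-1)
W_s = rng.normal(scale=0.1, size=(64, 64))
W_t = rng.normal(scale=0.1, size=(64, 64))
fused, log_det = affine_coupling(x, W_s, W_t)
print(fused.shape, log_det)
```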
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of the different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
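To make the Hamming-space idea concrete, here is a minimal sketch of binary cross-modal retrieval: each modality is projected by a hashing matrix (randomly initialized here, learned in practice), binarized by sign, and ranked by Hamming distance. All dimensions and the projection matrices P_img and P_txt are assumptions for illustration.

```python
# Minimal sketch of binary representation retrieval: map image and text
# features into a shared Hamming space, then rank by Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
n_bits = 64

# Stand-ins for learned hashing projections (one per modality).
P_img = rng.normal(size=(512, n_bits))
P_txt = rng.normal(size=(300, n_bits))

def to_binary(features, projection):
    """Project real-valued features and binarize with the sign function."""
    return (features @ projection > 0).astype(np.uint8)

img_codes = to_binary(rng.normal(size=(1000, 512)), P_img)  # database
txt_code = to_binary(rng.normal(size=(1, 300)), P_txt)      # query

# Hamming distance = number of differing bits: XOR, then count ones.
hamming = (img_codes ^ txt_code).sum(axis=1)
top10 = np.argsort(hamming)[:10]
print("closest database items:", top10)
```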
Multi-view Laplacian Eigenmaps Based on Bag-of-Neighbors For RGBD Human Emotion Recognition
Human emotion recognition is an important direction in the fields of biometrics
and information forensics. However, most existing human emotion research is
based on a single RGB view. In this paper, we introduce an RGBD video-emotion
dataset and an RGBD face-emotion dataset for research. To the best of our knowledge,
the former may be the first RGBD video-emotion dataset. We propose a new supervised
nonlinear multi-view Laplacian eigenmaps (MvLE) approach and a
multi-hidden-layer out-of-sample network (MHON) for RGB-D human emotion
recognition. To obtain better representations of the RGB and depth views, MvLE
maps the training set of both views from their original spaces into a common
subspace. As the RGB and depth views lie in different spaces, a new distance
metric, bag of neighbors (BON), is used in MvLE to obtain similar distributions
for the two views. Finally, MHON is used to obtain the low-dimensional representations
of test data and predict their labels. MvLE can handle cases in which the RGB
and depth views have different feature dimensions, and even different numbers of
samples and classes. Our methods can also be easily extended to more than two
views. The experimental results indicate the effectiveness of our methods over
several state-of-the-art methods.
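As context for MvLE, the sketch below shows the standard single-view Laplacian eigenmaps pipeline it builds on: a kNN graph, the normalized graph Laplacian, and the smallest nonzero eigenvectors as the embedding. The multi-view extension and the BON metric from the paper are not reproduced; all parameters here are illustrative.

```python
# Sketch of standard (single-view) Laplacian eigenmaps, the building block
# that MvLE extends with a multi-view, BON-based distance metric.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, n_components=2, n_neighbors=10):
    # Symmetrized kNN adjacency with binary connectivity weights.
    W = kneighbors_graph(X, n_neighbors, mode="connectivity")
    W = 0.5 * (W + W.T)
    L = laplacian(W.toarray(), normed=True)
    # The eigenvectors of the smallest nonzero eigenvalues give the embedding.
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:n_components + 1]

X = np.random.default_rng(0).normal(size=(200, 50))  # e.g. RGB-view features
Y = laplacian_eigenmaps(X, n_components=2)
print(Y.shape)  # (200, 2)
```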
Privacy-Preserving Deep Inference for Rich User Data on The Cloud
Deep neural networks are increasingly being used in a variety of machine
learning applications that operate on rich user data on the cloud. However, this
approach introduces a number of privacy and efficiency challenges, as the cloud
operator can perform secondary inferences on the available data. Recently,
advances in edge processing have paved the way for more efficient and private
data processing at the source for simple tasks and lighter models, though
larger and more complicated models remain a challenge. In this paper, we
present a hybrid approach for breaking down large, complex deep models for
cooperative, privacy-preserving analytics. We do this by splitting popular
deep architectures and fine-tuning them in a particular way. We then
evaluate the privacy benefits of this approach based on the information exposed
to the cloud service. We also assess the local inference cost of different
layers on a modern handset for mobile applications. Our evaluations show that
by using certain kinds of fine-tuning and embedding techniques, and at a small
processing cost, we can greatly reduce the level of information available to
unintended tasks applied to the data features on the cloud, thereby achieving
the desired tradeoff between privacy and performance.
Comment: arXiv admin note: substantial text overlap with arXiv:1703.0295
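A minimal sketch of the split-inference idea the abstract describes: early layers run on the device and only their intermediate features are shipped to the cloud. The toy backbone and the split point are assumptions; the paper's fine-tuning and embedding steps are omitted.

```python
# Sketch of hybrid (split) inference: early layers run on the device, only
# intermediate features reach the cloud. Models and split are illustrative.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)

split = 4  # layers [0, split) stay on the device
device_part = backbone[:split]
cloud_part = backbone[split:]

image = torch.randn(1, 3, 224, 224)   # raw user data never leaves the device
features = device_part(image)         # privacy-relevant reduction happens here
logits = cloud_part(features)         # cloud only ever sees the features
print(features.shape, logits.shape)
```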
Cross-modal Subspace Learning via Kernel Correlation Maximization and Discriminative Structure Preserving
Measuring similarity between heterogeneous data is still an open problem. Many
approaches have been developed to learn a common subspace in which the
similarity between different modalities can be calculated directly. However,
most existing works focus on learning a latent subspace while the semantic
structural information is not well preserved, so these approaches cannot
achieve the desired results. In this paper, we propose a novel framework, termed
Cross-modal subspace learning via Kernel correlation maximization and
Discriminative structure-preserving (CKD), to solve this problem in two
aspects. Firstly, we construct a shared semantic graph to make each modality
data preserve the neighbor relationship semantically. Secondly, we introduce
the Hilbert-Schmidt Independence Criterion (HSIC) to ensure the consistency
between feature-similarity and semantic-similarity of samples. Our model not
only considers the inter-modality correlation by maximizing the kernel
correlation but also preserves the semantically structural information within
each modality. Extensive experiments are performed to evaluate the proposed
framework on three public datasets. The experimental results demonstrate
that the proposed CKD is competitive with classic subspace learning methods.
Comment: The paper is under consideration at Multimedia Tools and Applications
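The HSIC term can be made concrete with its common empirical estimator, HSIC(K, L) = tr(KHLH) / (n-1)^2, where H = I - (1/n)11^T is the centering matrix. The kernel choices below (an RBF kernel on features, a label-agreement kernel on semantics) are assumptions, not necessarily the paper's.

```python
# Empirical HSIC estimator, as commonly defined:
# HSIC(K, L) = trace(K H L H) / (n - 1)^2, with H = I - (1/n) 11^T.
import numpy as np

def hsic(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def rbf_kernel(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))             # feature representations
labels = rng.integers(0, 3, size=50)          # semantic classes
sem = (labels[:, None] == labels[None, :]).astype(float)  # label kernel
print(hsic(rbf_kernel(feats), sem))  # feature/semantic consistency score
```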
Learning Two-Branch Neural Networks for Image-Text Matching Tasks
Image-language matching tasks have recently attracted a lot of attention in
the computer vision field. These tasks include image-sentence matching, i.e.,
given an image query, retrieving relevant sentences and vice versa, and
region-phrase matching or visual grounding, i.e., matching a phrase to relevant
regions. This paper investigates two-branch neural networks for learning the
similarity between these two data modalities. We propose two network structures
that produce different output representations. The first one, referred to as an
embedding network, learns an explicit shared latent embedding space with a
maximum-margin ranking loss and novel neighborhood constraints. Compared to
standard triplet sampling, we perform improved neighborhood sampling that takes
neighborhood information into consideration while constructing mini-batches.
The second network structure, referred to as a similarity network, fuses the
two branches via element-wise product and is trained with a regression loss to
directly predict a similarity score. Extensive experiments show that our
networks achieve high accuracies for phrase localization on the Flickr30K
Entities dataset and for bi-directional image-sentence retrieval on Flickr30K
and MSCOCO datasets.
Comment: accepted version in TPAMI 201
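A compact sketch of the embedding-network variant: two small MLP branches map image and sentence features into a shared L2-normalized space, trained with a bidirectional max-margin triplet loss. Feature dimensions, network sizes, and the exact loss form are assumptions; the paper's neighborhood sampling is not shown.

```python
# Minimal two-branch embedding network with a max-margin ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # L2-normalized embedding

img_branch, txt_branch = Branch(2048), Branch(6000)

img = img_branch(torch.randn(32, 2048))   # e.g. CNN image features
pos = txt_branch(torch.randn(32, 6000))   # matching sentence features
neg = txt_branch(torch.randn(32, 6000))   # non-matching sentence features

# Matched pairs should be closer than mismatched pairs by a margin.
margin = 0.2
d = lambda a, b: 1 - (a * b).sum(-1)      # cosine distance
loss = F.relu(margin + d(img, pos) - d(img, neg)).mean()
loss.backward()
```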
Mask-Guided Portrait Editing with Conditional GANs
Portrait editing is a popular subject in photo manipulation. Generative
Adversarial Networks (GANs) have advanced the generation of realistic faces and
enabled more flexible face editing. In this paper, we identify three issues in existing
techniques: diversity, quality, and controllability for portrait synthesis and
editing. To address these issues, we propose a novel end-to-end learning
framework that leverages conditional GANs guided by provided face masks for
generating faces. The framework learns feature embeddings for every face
component (e.g., mouth, hair, eyes) separately, contributing to better
correspondences for image translation and local face editing. With the mask,
our network supports many applications, such as mask-driven face synthesis,
face Swap+ (which includes hair in the swap), and local manipulation. It can
also slightly boost face parsing performance when used as a form of data
augmentation.
Comment: To appear in CVPR201
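The mask conditioning can be illustrated by concatenating a per-component mask with the input image before the generator, so the provided layout guides synthesis. The toy generator below is a stand-in for the paper's architecture; the component count and shapes are assumptions.

```python
# Illustrative mask-conditioned generator: the face-component mask is
# concatenated channel-wise with the source image to guide generation.
import torch
import torch.nn as nn

n_components = 5  # e.g. hair, eyes, nose, mouth, skin (assumed)

generator = nn.Sequential(
    nn.Conv2d(3 + n_components, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
)

source = torch.randn(1, 3, 128, 128)                   # source face
mask = torch.randint(0, 2, (1, n_components, 128, 128)).float()
edited = generator(torch.cat([source, mask], dim=1))   # mask-guided output
print(edited.shape)  # (1, 3, 128, 128)
```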
Multimodal Deep Network Embedding with Integrated Structure and Attribute Information
Network embedding is the process of learning low-dimensional representations
for nodes in a network, while preserving node features. Existing studies only
leverage network structure information and focus on preserving structural
features. However, nodes in real-world networks often have a rich set of
attributes providing extra semantic information. It has been demonstrated that
both structural and attribute features are important for network analysis
tasks. To preserve both types of features, we investigate the problem of integrating
structure and attribute information to perform network embedding and propose a
Multimodal Deep Network Embedding (MDNE) method. MDNE captures the non-linear
network structures and the complex interactions among structures and
attributes, using a deep model consisting of multiple layers of non-linear
functions. Since structures and attributes are two different types of
information, a multimodal learning method is adopted to pre-process them and
help the model to better capture the correlations between node structure and
attribute information. We employ both structural proximity and attribute
proximity in the loss function to preserve the respective features and the
representations are obtained by minimizing the loss function. Results of
extensive experiments on four real-world datasets show that the proposed method
performs significantly better than baselines on a variety of tasks, which
demonstrates the effectiveness and generality of our method.
Comment: 15 pages, 10 figures
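A minimal sketch of combining the two proximity terms the abstract mentions: a shared encoder over concatenated structure (adjacency rows) and attributes, with separate reconstruction losses for each, summed with a weight. The layer shapes and the 0.5 weight are assumptions, not the paper's configuration.

```python
# Sketch of joint structural + attribute proximity in one embedding loss.
import torch
import torch.nn as nn

n, d_attr, d_emb = 100, 20, 16
A = (torch.rand(n, n) < 0.05).float()      # adjacency rows (structure)
X = torch.randn(n, d_attr)                 # node attributes

encoder = nn.Linear(n + d_attr, d_emb)     # joint multimodal encoder
dec_struct = nn.Linear(d_emb, n)           # reconstructs structure
dec_attr = nn.Linear(d_emb, d_attr)        # reconstructs attributes

z = torch.relu(encoder(torch.cat([A, X], dim=1)))
loss_struct = ((dec_struct(z) - A) ** 2).mean()   # structural proximity
loss_attr = ((dec_attr(z) - X) ** 2).mean()       # attribute proximity
loss = loss_struct + 0.5 * loss_attr              # weighted combination
loss.backward()
print(z.shape)  # (100, 16) low-dimensional node representations
```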
TransGaGa: Geometry-Aware Unsupervised Image-to-Image Translation
Unsupervised image-to-image translation aims at learning a mapping between
two visual domains. However, learning a translation across large geometry
variations almost always fails. In this work, we present a novel
disentangle-and-translate framework to tackle image-to-image translation for
complex objects. Instead of learning the mapping on the image
space directly, we disentangle the image space into a Cartesian product of the
appearance and geometry latent spaces. Specifically, we first introduce a
geometry prior loss and a conditional VAE loss to encourage the network to
learn independent but complementary representations. The translation is then
built on appearance and geometry space separately. Extensive experiments
demonstrate the superior performance of our method to other state-of-the-art
approaches, especially in the challenging near-rigid and non-rigid objects
translation tasks. In addition, by taking different exemplars as the appearance
references, our method also supports multimodal translation. Project page:
https://wywu.github.io/projects/TGaGa/TGaGa.html
Comment: Accepted to CVPR 2019. Project page:
https://wywu.github.io/projects/TGaGa/TGaGa.html
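The disentangle-and-translate idea in miniature: separate appearance and geometry encoders, a decoder over the concatenated (Cartesian-product) codes, and translation by recombining codes across domains. The linear encoders and decoder below are placeholders for the paper's networks; shapes are assumptions.

```python
# Sketch of translation via disentangled appearance and geometry codes.
import torch
import torch.nn as nn

app_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
geo_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
decoder = nn.Sequential(nn.Linear(256, 3 * 64 * 64), nn.Tanh())

x_a = torch.randn(1, 3, 64, 64)  # image from domain A (e.g. cat)
x_b = torch.randn(1, 3, 64, 64)  # image from domain B (e.g. human face)

# Translate A -> B: keep A's geometry, borrow B's appearance.
code = torch.cat([geo_enc(x_a), app_enc(x_b)], dim=1)
translated = decoder(code).view(1, 3, 64, 64)
print(translated.shape)
```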
Learning Structured Semantic Embeddings for Visual Recognition
Numerous embedding models have recently been explored to incorporate semantic
knowledge into visual recognition. Existing methods typically focus on
minimizing the distance between the corresponding images and texts in the
embedding space but do not explicitly optimize the underlying structure. Our
key observation is that modeling the pairwise image-image relationship improves
the discrimination ability of the embedding model. In this paper, we propose
the structured discriminative and difference constraints to learn
visual-semantic embeddings. First, we exploit the discriminative constraints to
capture the intra- and inter-class relationships of image embeddings. The
discriminative constraints encourage separability for image instances of
different classes. Second, we align the difference vector between a pair of
image embeddings with that of the corresponding word embeddings. The difference
constraints help regularize image embeddings to preserve the semantic
relationships among word embeddings. Extensive evaluations demonstrate the
effectiveness of the proposed structured embeddings for single-label
classification, multi-label classification, and zero-shot recognition.
Comment: 9 pages, 6 figures, 5 tables, conference
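The difference constraint can be written directly as a loss: the difference between two image embeddings is regressed onto the difference of their classes' word embeddings, i.e. minimizing ||(f(x_i) - f(x_j)) - (w_{y_i} - w_{y_j})||^2. The 300-dimensional embeddings below are assumptions (word2vec-sized), not the paper's exact setup.

```python
# Sketch of the difference constraint between image and word embeddings.
import torch
import torch.nn.functional as F

img_i = torch.randn(8, 300, requires_grad=True)  # image embeddings, class i
img_j = torch.randn(8, 300, requires_grad=True)  # image embeddings, class j
w_i = torch.randn(300)                           # word embedding of class i
w_j = torch.randn(300)                           # word embedding of class j

# Align pairwise image differences with the word-embedding difference.
diff_loss = F.mse_loss(img_i - img_j, (w_i - w_j).expand(8, 300))
diff_loss.backward()
print(diff_loss.item())
```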