393 research outputs found
Attention: A Big Surprise for Cross-Domain Person Re-Identification
In this paper, we focus on model generalization and adaptation for
cross-domain person re-identification (Re-ID). Unlike existing cross-domain
Re-ID methods, which leverage auxiliary information from unlabeled
target-domain data, we aim to enhance model generalization and adaptation
through discriminative feature learning, and directly apply a pre-trained model
to new domains (datasets) without using any information from the target
domains. To address the discriminative feature learning problem, we find,
surprisingly, that simply introducing an attention mechanism to adaptively
extract person features for every domain is highly effective. We adopt two
popular types of attention mechanism: long-range dependency based attention
and direct generation based attention. Both can apply attention along the
spatial or channel dimension alone, or along both dimensions combined. The
structure of each attention variant is illustrated. Moreover, we incorporate
the attention results into the final output of the model through a skip
connection, enriching the features with both high- and middle-level semantic
visual information. When a pre-trained model is applied directly to new
domains, this attention incorporation enhances model generalization and
adaptation for cross-domain person Re-ID. We conduct extensive experiments
across three large datasets: Market-1501, DukeMTMC-reID and MSMT17.
Surprisingly, introducing attention alone achieves state-of-the-art
performance, considerably better than cross-domain Re-ID methods that use
auxiliary information from the target domain.
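As a rough illustration of long-range dependency based attention combined with a skip connection, the sketch below lets every spatial position attend to all others and adds the result back to the input. It is an assumption-laden toy, not the paper's architecture: the learned query/key/value projections of a real attention block are omitted, and the names `softmax` and `attend_with_skip` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_skip(features):
    """Apply dot-product attention over positions, then add a skip connection.

    features: list of N position vectors, each a list of C floats.
    Assumption: identity projections stand in for learned ones.
    """
    outputs = []
    for query in features:
        # Similarity of this position to every position (long-range dependency).
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted sum of all position vectors.
        attended = [
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(len(query))
        ]
        # Skip connection: original feature plus attended feature.
        outputs.append([x + a for x, a in zip(query, attended)])
    return outputs


result = attend_with_skip([[1.0, 0.0], [1.0, 0.0]])
```

With two identical position vectors the attention weights are uniform, so the attended vector equals the input and the skip connection doubles it.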
Dual Purpose Hashing
Recent years have seen more and more demand for a unified framework to
address multiple realistic image retrieval tasks concerning both category and
attributes. Considering the scale of modern datasets, hashing is favorable for
its low complexity. However, most existing hashing methods are designed to
preserve a single kind of similarity, and are thus ill-suited to handling
different tasks simultaneously. To overcome this limitation, we propose a new
hashing method, named Dual Purpose Hashing (DPH), which jointly preserves the
category and attribute similarities by exploiting the Convolutional Neural
Network (CNN) models to hierarchically capture the correlations between
category and attributes. Since images with both category and attribute labels
are scarce, our method is designed to take the abundant partially labelled
images on the Internet as training inputs. With such a framework, the binary
codes of new images can be readily obtained by quantizing the outputs of a
binary-like network layer, and the attributes can be easily recovered from the
codes. Experiments on two large-scale datasets show that our dual purpose hash
codes achieve performance comparable to or better than state-of-the-art
methods specifically designed for each individual retrieval task, while being
more compact than the compared methods.
Comment: With supplementary materials added to the en
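The quantization step described above can be sketched as follows. This is a minimal illustration, assuming the binary-like layer emits activations in [0, 1] and that a fixed 0.5 threshold is used. The abstract does not give the actual quantization rule, and the image names and activation values are made up.

```python
def quantize(activations, threshold=0.5):
    """Turn binary-like activations (assumed in [0, 1]) into a list of bits."""
    return [1 if a >= threshold else 0 for a in activations]


def hamming(code_a, code_b):
    """Number of bit positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))


# Hypothetical network outputs for a query and two database images.
query = quantize([0.93, 0.08, 0.71, 0.12])
database = {
    "img_a": quantize([0.88, 0.15, 0.64, 0.05]),
    "img_b": quantize([0.10, 0.95, 0.20, 0.81]),
}

# Retrieval: rank database images by Hamming distance to the query code.
ranked = sorted(database, key=lambda name: hamming(query, database[name]))
```

Ranking by Hamming distance is what makes hashing cheap at scale: comparing codes needs only bit operations rather than floating-point distance computations.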
Maximum mutual information regularized classification
In this paper, a novel pattern classification approach is proposed by
regularizing the classifier learning to maximize mutual information between the
classification response and the true class label. We argue that, with the
learned classifier, the uncertainty of the true class label of a data sample
should be reduced by knowing its classification response as much as possible.
The reduced uncertainty is measured by the mutual information between the
classification response and the true class label. To this end, when learning a
linear classifier, we propose to maximize the mutual information between the
classification responses and true class labels of the training samples, in
addition to minimizing the classification error and reducing the classifier
complexity. An objective function is constructed by modeling mutual
information with entropy estimation, and it is optimized by gradient descent
in an iterative algorithm. Experiments on two real-world pattern
classification problems show significant improvements achieved by maximum
mutual information regularization.
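To make the regularized quantity concrete, the sketch below computes a plug-in estimate of the mutual information between discretized classification responses and true labels. It is only an illustration under stated assumptions: the paper models mutual information with entropy estimation over classifier responses, whereas this toy counts discrete outcomes, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(responses, labels):
    """Plug-in estimate of I(response; label) in bits from paired samples."""
    n = len(labels)
    joint = Counter(zip(responses, labels))
    response_counts = Counter(responses)
    label_counts = Counter(labels)
    mi = 0.0
    for (r, y), count in joint.items():
        p_joint = count / n
        p_r = response_counts[r] / n
        p_y = label_counts[y] / n
        mi += p_joint * math.log2(p_joint / (p_r * p_y))
    return mi


labels = [1, 1, 0, 0]
informative = mutual_information([1, 1, 0, 0], labels)    # responses match labels
uninformative = mutual_information([1, 0, 1, 0], labels)  # responses independent of labels
```

A response that determines the label removes all label uncertainty (one bit for balanced binary labels), while an independent response removes none, which is the behavior the regularizer rewards.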
Fully Learnable Group Convolution for Acceleration of Deep Neural Networks
Benefiting from its great success on many tasks, deep learning is
increasingly used on devices with limited computational resources, e.g.
smartphones and embedded devices. To reduce the high computational and memory
cost, in this work we propose a fully learnable group convolution module (FLGC
for short), which is quite efficient and can be embedded into any deep neural
network for acceleration. Specifically, our proposed method automatically
learns the group structure in the training stage in a fully end-to-end manner,
leading to a better structure than the existing pre-defined, two-step, or
iterative strategies. Moreover, our method can be further combined with
depthwise separable convolution, resulting in a 5 times acceleration over the
vanilla ResNet50 on a single CPU. An additional advantage is that in our FLGC
the number of groups can be set to any value, not necessarily 2^k as in most
existing methods, allowing a better trade-off between accuracy and speed. As
evaluated in our experiments, our method achieves better performance than
existing learnable group convolution and standard group convolution when using
the same number of groups.
Comment: Accepted by CVPR 201
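The cost argument behind group convolution can be sketched with a simple multiply-accumulate count. The code below models a standard group convolution only, not the learned group assignment of FLGC. The layer sizes are made up, and `group_conv_macs` is a hypothetical helper.

```python
def group_conv_macs(c_in, c_out, kernel, out_h, out_w, groups=1):
    """Multiply-accumulate count of a standard (non-learned) group convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each output channel only sees c_in / groups input channels.
    per_position = (c_in // groups) * c_out * kernel * kernel
    return per_position * out_h * out_w


# Hypothetical layer: 60 -> 120 channels, 3x3 kernel, 56x56 output map.
dense = group_conv_macs(60, 120, 3, 56, 56)
speedups = {
    g: dense / group_conv_macs(60, 120, 3, 56, 56, groups=g)
    for g in (1, 2, 3, 4, 5, 6)
}
```

The theoretical reduction equals the number of groups, so allowing group counts such as 3, 5 or 6 gives finer steps on the accuracy/speed trade-off than powers of two alone.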
Learning Expressionlets via Universal Manifold Model for Dynamic Facial Expression Recognition
Facial expression is a temporally dynamic event which can be decomposed into a
set of muscle motions occurring in different facial regions over various time
intervals. For dynamic expression recognition, two key issues, temporal
alignment and semantics-aware dynamic representation, must be taken into
account. In this paper, we attempt to solve both problems via manifold modeling
of videos based on a novel mid-level representation, i.e. the
expressionlet. Specifically, our method contains three key stages: 1)
each expression video clip is characterized as a spatial-temporal manifold
(STM) formed by dense low-level features; 2) a Universal Manifold Model (UMM)
is learned over all low-level features and represented as a set of local modes
to statistically unify all the STMs; 3) the local modes on each STM are
instantiated by fitting to the UMM, and the corresponding expressionlet is
constructed by modeling the variations in each local mode. With the above strategy,
expression videos are naturally aligned both spatially and temporally. To
enhance the discriminative power, the expressionlet-based STM representation is
further processed with discriminant embedding. Our method is evaluated on four
public expression databases, CK+, MMI, Oulu-CASIA, and FERA. In all cases, our
method outperforms the known state-of-the-art by a large margin.
Comment: 12 page
Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships
Context is important for accurate visual recognition. In this work we propose
an object detection algorithm that not only considers object visual appearance,
but also makes use of two kinds of context: scene contextual information and
object relationships within a single image. Object detection is therefore
regarded as both a cognition problem and a reasoning problem when leveraging
this structured information. Specifically, this paper formulates object
detection as a problem of graph structure inference, where, given an image,
the objects are treated as nodes in a graph and the relationships between
objects are modeled as edges in that graph. To this end, we present the
Structure Inference Network (SIN), a detector that incorporates a graphical
model into a typical detection framework (e.g. Faster R-CNN) in order to infer
object state. Comprehensive experiments on the PASCAL VOC and MS COCO datasets
indicate that scene context and object relationships improve the performance
of object detection, yielding more desirable and reasonable outputs.
Comment: published in CVPR 201
Temporal Action Detection by Joint Identification-Verification
Temporal action detection aims not only to recognize the action category but
also to detect the start and end time of each action instance in an untrimmed
video. The key challenge of this task is to accurately classify the action and
determine the temporal boundaries of each action instance. In the temporal
action detection benchmark THUMOS 2014, large variations exist within the same
action category while many similarities exist across different action
categories, which limits the performance of temporal action detection. To
address this problem, we propose a joint Identification-Verification network
to reduce intra-action variations and enlarge inter-action differences. The
joint Identification-Verification network is a siamese network based on 3D
ConvNets, which simultaneously predicts the action categories and the
similarity scores for input pairs of video proposal segments. Extensive
experimental results on the challenging THUMOS 2014 dataset demonstrate the
effectiveness of our proposed method compared to existing state-of-the-art
methods for temporal action detection in untrimmed videos.
Design and Implementation of a General Decision-making Model in RoboCup Simulation
The study of collaboration, coordination and negotiation among different
agents in a multi-agent system (MAS) has long been one of the most challenging
yet popular topics in distributed artificial intelligence research. In this
paper, we propose a general decision-making model for RoboCup simulation, a
typical MAS, rather than defining a different algorithm for each tactic (e.g.
ball handling, passing, shooting and interception) in soccer games, as most
RoboCup simulation teams have done. The general decision-making model is based
on two critical factors in soccer games: the vertical distance to the goal
line and the visual angle for the goalpost. We use these two parameters to
formalize the defensive and offensive decisions in RoboCup simulation, and the
results have been applied in NOVAURO (originally named UJDB), a RoboCup
simulation team of Jiangsu University. Compared with the decision-making model
of Tsinghua University, the world champion team in 2001, this model is
universal and easier to implement.
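The two factors named above can be computed with basic geometry. The sketch below is one plausible reading, not the NOVAURO implementation: it assumes a pitch-centered coordinate system with the attacked goal line at x = 52.5 and a goal width of 14.02. The helper `goal_view` is hypothetical and only handles positions on the field side of the goal line.

```python
import math


def goal_view(px, py, goal_x=52.5, goal_width=14.02):
    """Return (vertical distance to goal line, visual angle between the posts).

    Assumes px <= goal_x, i.e. the player stands on the field side of the line.
    The default field dimensions are assumptions, not taken from the paper.
    """
    distance = abs(goal_x - px)
    # Bearing to each goalpost, measured from the player's position.
    upper = math.atan2(goal_width / 2 - py, goal_x - px)
    lower = math.atan2(-goal_width / 2 - py, goal_x - px)
    return distance, abs(upper - lower)


# A player centered in front of goal, half a goal-width away from the line.
central_distance, central_angle = goal_view(52.5 - 7.01, 0.0)
# The same depth but far out on the wing sees a narrower goal mouth.
wide_distance, wide_angle = goal_view(52.5 - 7.01, 20.0)
```

A decision rule could then compare these two numbers against thresholds, for example favoring a shot when the distance is small and the angle is wide. Any such thresholds would be design choices and are not given in the abstract.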
Learning Mid-level Words on Riemannian Manifold for Action Recognition
Human action recognition remains a challenging task due to the various
sources of video data and large intra-class variations. Exploring effective
and robust representations to handle these challenges has thus become one of
the key issues in recent research. In this paper, we propose a novel
representation approach by constructing mid-level words in videos and encoding
them on a Riemannian manifold. Specifically, we first conduct a global
alignment on the densely extracted low-level features to build a bank of
corresponding feature groups, each of which can be statistically modeled as a
mid-level word lying on a specific Riemannian manifold. Based on these
mid-level words, we construct intrinsic Riemannian codebooks by employing
K-Karcher-means clustering and a Riemannian Gaussian Mixture Model, and
consequently extend three well-studied Euclidean-space encoding methods, i.e.
Bag of Visual Words (BoVW), Vector of Locally Aggregated Descriptors (VLAD),
and Fisher Vector (FV), to their Riemannian manifold versions to obtain the
final action video representations. Our method is evaluated in two tasks on
four popular realistic datasets: action recognition on the YouTube, UCF50 and
HMDB51 databases, and action similarity labeling on the ASLAN database. In all
cases, the reported results are highly competitive with the most recent
state-of-the-art works.
Comment: 10 page
Learning Class Prototypes via Structure Alignment for Zero-Shot Recognition
Zero-shot learning (ZSL) aims to recognize objects of novel classes without
any training samples of those classes, which is achieved by exploiting
semantic information and auxiliary datasets. Recently, most ZSL approaches
have focused on learning visual-semantic embeddings to transfer knowledge from
the auxiliary datasets to the novel classes. However, few works study whether
the semantic information is discriminative for the recognition task. To tackle
this problem, we propose a coupled dictionary learning approach to align the
visual-semantic structures using the class prototypes, where the
discriminative information lying in the visual space is utilized to improve
the less discriminative semantic space. Zero-shot recognition can then be
performed in different spaces by a simple nearest neighbor approach using the
learned class prototypes. Extensive experiments on four benchmark datasets
show the effectiveness of the proposed approach.
Comment: To appear in ECCV 201
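The final recognition step, nearest neighbor search over class prototypes, can be sketched as follows. This minimal illustration assumes Euclidean distance and takes the prototypes as given. The coupled dictionary learning that produces them is not shown, and the class names and vectors are invented.

```python
import math


def nearest_prototype(feature, prototypes):
    """Return the class whose prototype is closest to the feature (Euclidean)."""
    return min(prototypes, key=lambda name: math.dist(feature, prototypes[name]))


# Hypothetical learned prototypes for two unseen classes.
prototypes = {
    "zebra": [0.9, 0.1, 0.8],
    "whale": [0.1, 0.9, 0.2],
}

prediction = nearest_prototype([0.8, 0.2, 0.7], prototypes)
```

Because classification reduces to a distance lookup, the same function works in whichever space the prototypes live in, as long as test features are embedded in that space.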