Dual-Triplet Metric Learning for Unsupervised Domain Adaptation in Video-Based Face Recognition
The scalability and complexity of deep learning models remain a key issue in
many visual recognition applications, e.g., video surveillance, where
fine-tuning with labeled image data from each new camera is required to reduce
the domain shift between videos captured in the source domain, e.g., a
laboratory setting, and the target domain, i.e., an operational environment. In
many video surveillance applications, like face recognition (FR) and person
re-identification, a pair-wise matcher is used to assign a query image captured
using a video camera to the corresponding reference images in a gallery. The
different configurations and operational conditions of video cameras can
introduce significant shifts in the pair-wise distance distributions, resulting
in degraded recognition performance for new cameras. In this paper, a new deep
domain adaptation (DA) method is proposed to adapt the CNN embedding of a
Siamese network using unlabeled tracklets captured with a new video camera. To
this end, a dual-triplet loss is introduced for metric learning, where two
triplets are constructed using video data from a source camera, and a new
target camera. In order to constitute the dual triplets, a mutual-supervised
learning approach is introduced where the source camera acts as a teacher,
providing the target camera with an initial embedding. Then, the student relies
on the teacher to iteratively label the positive and negative pairs collected
during, e.g., initial camera calibration. Both source and target embeddings
continue to simultaneously learn such that their pair-wise distance
distributions become aligned. For validation, the proposed metric learning
technique is used to train deep Siamese networks under different training
scenarios, and is compared to state-of-the-art techniques for still-to-video FR
on the COX-S2V and a private video-based FR dataset.
Comment: Submitted to IJCNN 2020
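To make the dual-triplet objective concrete, here is a minimal PyTorch sketch assuming source-camera and target-camera triplet embeddings are already available; the distribution-alignment term (matching mean positive and negative distances) is an illustrative assumption rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on batches of embeddings."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def dual_triplet_loss(src_triplet, tgt_triplet, margin=0.2, align_weight=1.0):
    """Sum of a source-camera and a target-camera triplet loss, plus a term
    nudging the two pair-wise distance distributions toward each other.
    The alignment term is an illustrative assumption, not the paper's
    published formulation."""
    sa, sp, sn = src_triplet   # source anchor / positive / negative embeddings
    ta, tp, tn = tgt_triplet   # target anchor / positive / negative embeddings
    loss = triplet_loss(sa, sp, sn, margin) + triplet_loss(ta, tp, tn, margin)
    # Align mean positive/negative distances between the two cameras.
    src_stats = torch.stack([F.pairwise_distance(sa, sp).mean(),
                             F.pairwise_distance(sa, sn).mean()])
    tgt_stats = torch.stack([F.pairwise_distance(ta, tp).mean(),
                             F.pairwise_distance(ta, tn).mean()])
    return loss + align_weight * F.mse_loss(tgt_stats, src_stats)
```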
Deep Learning Architectures for Face Recognition in Video Surveillance
Face recognition (FR) systems for video surveillance (VS) applications
attempt to accurately detect the presence of target individuals over a
distributed network of cameras. In video-based FR systems, facial models of
target individuals are designed a priori during enrollment using a limited
number of reference still images or video data. These facial models are not
typically representative of faces being observed during operations due to large
variations in illumination, pose, scale, occlusion, blur, and to camera
inter-operability. Specifically, in the still-to-video FR application, a single
high-quality reference still image captured with a still camera under controlled
conditions is employed to generate a facial model to be matched later against
lower-quality faces captured with video cameras under uncontrolled conditions.
Current video-based FR systems can perform well on controlled scenarios, while
their performance is not satisfactory in uncontrolled scenarios mainly because
of the differences between the source (enrollment) and the target (operational)
domains. Most of the efforts in this area have been toward the design of robust
video-based FR systems in unconstrained surveillance environments. This chapter
presents an overview of recent advances in the still-to-video FR scenario through
deep convolutional neural networks (CNNs). In particular, deep learning
architectures proposed in the literature based on triplet-loss function (e.g.,
cross-correlation matching CNN, trunk-branch ensemble CNN and HaarNet) and
supervised autoencoders (e.g., canonical face representation CNN) are reviewed
and compared in terms of accuracy and computational complexity.
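As a rough illustration of the supervised-autoencoder family mentioned above, the sketch below encodes a degraded video face and is trained to reconstruct the corresponding high-quality still, so the bottleneck embedding absorbs nuisance variation; the layer sizes, input resolution, and reconstruction loss are assumptions for illustration and do not reproduce any specific reviewed architecture.

```python
import torch
import torch.nn as nn

class SupervisedFaceAutoencoder(nn.Module):
    """Toy supervised autoencoder: maps a degraded video face to a
    reconstruction of its canonical (high-quality still) counterpart, so the
    bottleneck embedding becomes less sensitive to blur, pose, and
    illumination. Layer sizes are illustrative only."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, video_face):
        z = self.encoder(video_face)   # embedding used for matching
        canon = self.decoder(z)        # reconstruction of the enrollment still
        return z, canon

# Training target: reconstruct the enrollment still from its degraded video
# counterpart (64x64 inputs assumed here).
model = SupervisedFaceAutoencoder()
video_face = torch.rand(8, 3, 64, 64)
still_face = torch.rand(8, 3, 64, 64)
z, recon = model(video_face)
loss = nn.functional.mse_loss(recon, still_face)
```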
Recurrent Embedding Aggregation Network for Video Face Recognition
Recurrent networks have been successful in analyzing temporal data and have
been widely used for video analysis. However, for video face recognition, where
the base CNNs trained on large-scale data already provide discriminative
features, using Long Short-Term Memory (LSTM), a popular recurrent network, for
feature learning could lead to overfitting and degrade the performance instead.
We propose a Recurrent Embedding Aggregation Network (REAN) for set-to-set face
recognition. Compared with LSTM, REAN is robust against overfitting because it
only learns how to aggregate the pre-trained embeddings rather than learning
representations from scratch. Compared with quality-aware aggregation methods,
REAN can take advantage of the context information to circumvent the noise
introduced by redundant video frames. Empirical results on three public domain
video face recognition datasets, IJB-S, YTF, and PaSC, show that the proposed
REAN significantly outperforms a naive CNN-LSTM structure and quality-aware
aggregation methods.
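A minimal sketch of this kind of recurrent aggregation, assuming frame embeddings come from a frozen base CNN; the GRU-plus-attention parameterization here is an assumption, not REAN's published architecture.

```python
import torch
import torch.nn as nn

class EmbeddingAggregator(nn.Module):
    """Illustrative aggregation module in the spirit of REAN: a GRU reads a
    sequence of fixed, pre-trained frame embeddings and emits per-frame
    weights, so only the aggregation is learned, not the representation."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, frame_embeddings):          # (batch, frames, dim)
        h, _ = self.rnn(frame_embeddings)          # context over the whole set
        w = torch.softmax(self.score(h), dim=1)    # per-frame weights
        face_embedding = (w * frame_embeddings).sum(dim=1)
        return nn.functional.normalize(face_embedding, dim=1)

# Usage: embeddings come from a frozen base CNN.
agg = EmbeddingAggregator()
video_embeddings = torch.randn(4, 30, 512)   # 4 videos, 30 frames each
template = agg(video_embeddings)             # (4, 512) set representation
```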
Single Image Action Recognition by Predicting Space-Time Saliency
We propose a novel approach based on deep Convolutional Neural Networks (CNN)
to recognize human actions in still images by predicting the future motion, and
detecting the shape and location of the salient parts of the image. We make the
following major contributions to this important area of research: (i) We use
the predicted future motion in the static image (Walker et al., 2015) as a
means of compensating for the missing temporal information, while using the
saliency map to represent the spatial information in the form of location
and shape of what is predicted as significant. (ii) We cast action
classification in static images as a domain adaptation problem by transfer
learning. We first map the input static image to a new domain that we refer to
as the Predicted Optical Flow-Saliency Map domain (POF-SM), and then fine-tune
the layers of a deep CNN model trained on classifying the ImageNet dataset to
perform action classification in the POF-SM domain. (iii) We tested our method
on the popular Willow dataset. But unlike existing methods, we also tested on a
more realistic and challenging dataset of over 2M still images that we
collected and labeled by taking random frames from the UCF-101 video dataset.
We call our dataset the UCF Still Image dataset, or UCFSI-101 for short. Our
results outperform the state of the art.
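A hedged sketch of the transfer-learning step: stack a predicted optical flow and a saliency map with the RGB image to form a POF-SM-style input and fine-tune an ImageNet-pretrained backbone. The placeholder predictors, backbone choice, and channel layout below are assumptions for illustration, not the paper's released pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder stubs standing in for the motion-prediction and saliency models;
# they are hypothetical and not part of this paper's code.
def predict_optical_flow(image):   # (B, 3, H, W) -> (B, 2, H, W)
    return torch.zeros(image.size(0), 2, *image.shape[2:])

def predict_saliency(image):       # (B, 3, H, W) -> (B, 1, H, W)
    return torch.zeros(image.size(0), 1, *image.shape[2:])

class POFSMClassifier(nn.Module):
    """Fine-tunes an ImageNet-pretrained ResNet on a POF-SM-style input by
    stacking predicted flow and saliency as extra channels."""
    def __init__(self, num_actions=101):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Replace the first conv to accept 3 (RGB) + 2 (flow) + 1 (saliency) channels.
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_actions)
        self.backbone = backbone

    def forward(self, image):
        pofsm = torch.cat([image,
                           predict_optical_flow(image),
                           predict_saliency(image)], dim=1)
        return self.backbone(pofsm)
```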
Imitating Targets from all sides: An Unsupervised Transfer Learning method for Person Re-identification
Person re-identification (Re-ID) models usually show limited performance
when they are trained on one dataset and tested on another due to the
inter-dataset bias (e.g., completely different identities and backgrounds) and
the intra-dataset difference (e.g., camera invariance). To address this issue,
given a labelled source training set and an unlabelled target training set, we
propose an unsupervised transfer learning method characterized by 1) bridging
inter-dataset bias and intra-dataset difference via a proposed ImitateModel
simultaneously; 2) regarding the unsupervised person Re-ID problem as a
semi-supervised learning problem formulated by a dual classification loss to
learn a discriminative representation across domains; 3) exploiting the
underlying commonality across different domains from the class-style space to
improve the generalization ability of re-ID models. Extensive experiments are
conducted on two widely employed benchmarks, including Market-1501 and
DukeMTMC-reID, and experimental results demonstrate that the proposed method
can achieve a competitive performance against other state-of-the-art
unsupervised Re-ID approaches.
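One plausible reading of a "dual classification loss" in this semi-supervised setting is a supervised cross-entropy on the labelled source set plus a down-weighted cross-entropy on pseudo-labels for the unlabelled target set, sketched below; the weighting and pseudo-labelling details are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_classification_loss(src_logits, src_labels,
                             tgt_logits, tgt_pseudo_labels, lam=0.5):
    """Supervised cross-entropy on labelled source identities plus a
    down-weighted cross-entropy on pseudo-labels assigned to the unlabelled
    target set. 'lam' and the pseudo-labelling strategy are assumptions."""
    src_loss = F.cross_entropy(src_logits, src_labels)
    tgt_loss = F.cross_entropy(tgt_logits, tgt_pseudo_labels)
    return src_loss + lam * tgt_loss

# Usage with dummy tensors (751 identity classes assumed, Market-1501 style).
src_logits = torch.randn(32, 751)
src_labels = torch.randint(0, 751, (32,))
tgt_logits = torch.randn(32, 751)
tgt_pseudo = torch.randint(0, 751, (32,))
loss = dual_classification_loss(src_logits, src_labels, tgt_logits, tgt_pseudo)
```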
Cross-Domain Visual Matching via Generalized Similarity Measure and Feature Learning
Cross-domain visual data matching is one of the fundamental problems in many
real-world vision tasks, e.g., matching persons across ID photos and
surveillance videos. Conventional approaches to this problem usually involve
two steps: i) projecting samples from different domains into a common space,
and ii) computing (dis-)similarity in this space based on a certain distance.
In this paper, we present a novel pairwise similarity measure that advances
existing models by i) expanding traditional linear projections into affine
transformations and ii) fusing affine Mahalanobis distance and Cosine
similarity by a data-driven combination. Moreover, we unify our similarity
measure with feature representation learning via deep convolutional neural
networks. Specifically, we incorporate the similarity measure matrix into the
deep architecture, enabling an end-to-end way of model optimization. We
extensively evaluate our generalized similarity model in several challenging
cross-domain matching tasks: person re-identification under different views and
face verification over different modalities (i.e., faces from still images and
videos, older and younger faces, and sketch and photo portraits). The
experimental results demonstrate superior performance of our model over other
state-of-the-art methods.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2016
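A minimal sketch of a generalized similarity of this flavor: per-domain affine transforms followed by a learned fusion of a Mahalanobis-style distance and a bilinear (cosine-like) term; the parameterization below is illustrative and does not reproduce the paper's unified matrix form.

```python
import torch
import torch.nn as nn

class GeneralizedSimilarity(nn.Module):
    """Learnable cross-domain similarity: (i) each domain gets its own affine
    transform, and (ii) a Mahalanobis-style distance and a bilinear term are
    fused through a learned weight."""
    def __init__(self, dim):
        super().__init__()
        self.proj_x = nn.Linear(dim, dim)        # affine transform, domain A
        self.proj_y = nn.Linear(dim, dim)        # affine transform, domain B
        self.M = nn.Parameter(torch.eye(dim))    # Mahalanobis-style metric
        self.C = nn.Parameter(torch.eye(dim))    # bilinear (cosine-like) term
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x, y):
        u, v = self.proj_x(x), self.proj_y(y)
        diff = u - v
        mahalanobis = (diff @ self.M * diff).sum(dim=1)  # distance component
        bilinear = (u @ self.C * v).sum(dim=1)           # similarity component
        return self.alpha * bilinear - (1 - self.alpha) * mahalanobis

# Usage: score pairs of features from two domains (e.g., still vs. video faces).
sim = GeneralizedSimilarity(dim=256)
scores = sim(torch.randn(10, 256), torch.randn(10, 256))
```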
Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking
Action recognition has received increasing attention from the computer vision
and machine learning communities in the last decade. To enable the study of
this problem, there exist a vast number of action datasets, which are recorded
under controlled laboratory settings, real-world surveillance environments, or
crawled from the Internet. Apart from the "in-the-wild" datasets, the training
and test splits of conventional datasets often possess similar environmental
conditions, which leads to near-perfect performance on constrained
datasets. In this paper, we introduce a new dataset, namely Multi-Camera Action
Dataset (MCAD), which is designed to evaluate the open view classification
problem under the surveillance environment. In total, MCAD contains 14,298
action samples from 18 action categories, which are performed by 20 subjects
and independently recorded with 5 cameras. Inspired by the well-received
evaluation approach on the LFW dataset, we designed a standard evaluation
protocol and benchmarked MCAD under several scenarios. The benchmark shows that
while an average of 85% accuracy is achieved under the closed-view scenario,
the performance suffers from a significant drop under the cross-view scenario.
In the worst-case scenario, the performance of 10-fold cross-validation drops
from 87.0% to 47.4%.
UG2+ Track 2: A Collective Benchmark Effort for Evaluating and Advancing Image Understanding in Poor Visibility Environments
The UG2+ challenge in IEEE CVPR 2019 aims to evoke a comprehensive
discussion and exploration of how low-level vision techniques can benefit
high-level automatic visual recognition in various scenarios. In its second
track, we focus on object or face detection in poor visibility environments
caused by bad weather (haze, rain) and low-light conditions. While existing
enhancement methods are empirically expected to help the high-level end task,
this is observed not to always be the case in practice. To provide a more
thorough examination and fair comparison, we introduce three benchmark sets
collected in real-world hazy, rainy, and low-light conditions, respectively,
with objects/faces annotated. To the best of our knowledge, this is the first
and currently largest effort of its kind. Baseline results by cascading
existing enhancement and detection models are reported, indicating the highly
challenging nature of our new data as well as the large room for further
technical innovations. We expect a large participation from the broad research
community to address these challenges together.
Comment: A summary paper on datasets, fact sheets, baseline results, challenge
results, and winning methods in the UG2+ Challenge (Track 2). More materials
are provided at http://www.ug2challenge.org/index.htm
Learning Deep Feature Representations with Domain Guided Dropout for Person Re-identification
Learning generic and robust feature representations with data from multiple
domains for the same problem is of great value, especially for problems
that have multiple datasets, none of which is large enough to provide
abundant data variations. In this work, we present a pipeline for learning deep
feature representations from multiple domains with Convolutional Neural
Networks (CNNs). When training a CNN with data from all the domains, some
neurons learn representations shared across several domains, while some others
are effective only for a specific one. Based on this important observation, we
propose a Domain Guided Dropout algorithm to improve the feature learning
procedure. Experiments show the effectiveness of our pipeline and the proposed
algorithm. Our methods on the person re-identification problem outperform
state-of-the-art methods on multiple datasets by large margins.
Comment: To appear in CVPR 2016
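A simplified sketch of the Domain Guided Dropout idea: given per-neuron impact scores measured for the current domain, keep useful neurons deterministically or sample a keep-mask stochastically; the scoring and sampling details below are a condensed reading of the approach, not the exact published procedure.

```python
import torch

def domain_guided_dropout(features, neuron_impact,
                          deterministic=True, temperature=1.0):
    """'neuron_impact' holds, for the current domain, how much each neuron
    contributes (e.g., the loss increase when it is muted). Deterministic mode
    keeps only neurons with positive impact; the stochastic variant samples a
    keep-mask with probability related to the impact."""
    if deterministic:
        mask = (neuron_impact > 0).float()
    else:
        keep_prob = torch.sigmoid(neuron_impact / temperature)
        mask = torch.bernoulli(keep_prob)
    return features * mask

# Usage: features is (batch, num_neurons); neuron_impact is (num_neurons,)
feats = torch.randn(16, 256)
impact = torch.randn(256)            # per-domain impact scores (measured offline)
gated = domain_guided_dropout(feats, impact)
```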
Addressing Training Bias via Automated Image Annotation
Building accurate DNN models requires training on large, labeled,
context-specific datasets, especially those matching the target scenario. We believe
advances in wireless localization, working in unison with cameras, can produce
automated annotation of targets on images and videos captured in the wild.
Using pedestrian and vehicle detection as examples, we demonstrate the
feasibility, benefits, and challenges of an automatic image annotation system.
Our work calls for new technical development on passive localization, mobile
data analytics, and error-resilient ML models, as well as design issues in user
privacy policies.