6,475 research outputs found
AI Oriented Large-Scale Video Management for Smart City: Technologies, Standards and Beyond
Deep learning has achieved substantial success in a series of tasks in
computer vision. Intelligent video analysis, which can be broadly applied to
video surveillance in various smart city applications, can also be driven by
such powerful deep learning engines. To practically deploy deep neural
network models for large-scale video analysis, however, unprecedented challenges
remain in large-scale video data management. Deep feature coding,
instead of video coding, provides a practical solution for handling
large-scale video surveillance data. To enable interoperability in the context
of deep feature coding, standardization is urgent and important. However, due
to the explosion of deep learning algorithms and the particularities of feature
coding, numerous problems remain open in the standardization process.
This paper envisions a future deep feature coding standard for AI-oriented
large-scale video management, and discusses existing techniques,
standards, and possible solutions for these open problems.
Comment: 8 pages, 8 figures, 5 tables
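As a rough illustration of the idea above, the sketch below extracts a per-frame CNN feature at the camera side and stores a uniformly quantized version of it instead of the coded video. The ResNet-50 backbone and the 8-bit quantizer are illustrative assumptions, not part of any proposed standard.

```python
# Minimal sketch of deep feature coding: instead of compressing the video,
# extract a CNN feature per frame and transmit a quantized version of it.
# Backbone choice and 8-bit uniform quantizer are illustrative assumptions.
import numpy as np
import torch
import torchvision.models as models

backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()   # expose the 2048-d pooled feature
backbone.eval()

def encode_frame(frame: torch.Tensor) -> tuple[np.ndarray, float, float]:
    """frame: (3, 224, 224) float tensor -> (uint8 codes, min, max)."""
    with torch.no_grad():
        feat = backbone(frame.unsqueeze(0)).squeeze(0).numpy()
    lo, hi = feat.min(), feat.max()
    codes = np.round((feat - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return codes, float(lo), float(hi)

def decode_frame(codes: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Reconstruct an approximate feature for server-side analysis."""
    return codes.astype(np.float32) / 255.0 * (hi - lo) + lo
```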
An Automatic System for Unconstrained Video-Based Face Recognition
Although deep learning approaches have achieved performance surpassing humans
for still image-based face recognition, unconstrained video-based face
recognition is still a challenging task due to the large volume of data to be
processed and intra-/inter-video variations in pose, illumination, occlusion,
scene, blur, video quality, etc. In this work, we consider challenging
scenarios for unconstrained video-based face recognition from multiple-shot
videos and surveillance videos with low-quality frames. To handle these
problems, we propose a robust and efficient system for unconstrained
video-based face recognition, which is composed of modules for face/fiducial
detection, face association, and face recognition. First, we use multi-scale
single-shot face detectors to efficiently localize faces in videos. The
detected faces are then grouped through carefully designed face
association methods, especially for multi-shot videos. Finally, the faces are
recognized by the proposed face matcher based on an unsupervised subspace
learning approach and a subspace-to-subspace similarity metric. Extensive
experiments on challenging video datasets, such as Multiple Biometric Grand
Challenge (MBGC), Face and Ocular Challenge Series (FOCS), IARPA Janus
Surveillance Video Benchmark (IJB-S) for low-quality surveillance videos and
IARPA JANUS Benchmark B (IJB-B) for multiple-shot videos, demonstrate that the
proposed system can accurately detect and associate faces from unconstrained
videos and effectively learn robust and discriminative features for
recognition.
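The subspace-to-subspace matching step can be illustrated with a common formulation: represent each face set by the top principal subspace of its frame embeddings and score two sets by the cosines of the principal angles between the subspaces. This is a hedged stand-in for the paper's metric, not its exact definition.

```python
# A stand-in subspace-to-subspace similarity: PCA subspace per face set,
# scored by the mean cosine of the principal angles between two subspaces.
import numpy as np

def principal_subspace(feats: np.ndarray, k: int = 5) -> np.ndarray:
    """feats: (n_frames, d) -> (d, k) orthonormal basis via SVD."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def subspace_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine of principal angles between two orthonormal bases."""
    s = np.linalg.svd(a.T @ b, compute_uv=False)  # singular values = cos(angles)
    return float(s.mean())

gallery = principal_subspace(np.random.randn(40, 512))
probe = principal_subspace(np.random.randn(25, 512))
print(subspace_similarity(gallery, probe))
```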
Recurrent Embedding Aggregation Network for Video Face Recognition
Recurrent networks have been successful in analyzing temporal data and have
been widely used for video analysis. However, for video face recognition, where
the base CNNs trained on large-scale data already provide discriminative
features, using Long Short-Term Memory (LSTM), a popular recurrent network, for
feature learning could lead to overfitting and degrade the performance instead.
We propose a Recurrent Embedding Aggregation Network (REAN) for set-to-set face
recognition. Compared with LSTM, REAN is robust against overfitting because it
only learns how to aggregate the pre-trained embeddings rather than learning
representations from scratch. Compared with quality-aware aggregation methods,
REAN can take advantage of the context information to circumvent the noise
introduced by redundant video frames. Empirical results on three public-domain
video face recognition datasets, IJB-S, YTF, and PaSC, show that the proposed
REAN significantly outperforms a naive CNN-LSTM structure and quality-aware
aggregation methods.
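A minimal sketch of this aggregation idea, assuming a small GRU that reads frozen, pre-trained frame embeddings and emits per-frame weights (the actual REAN architecture and layer sizes may differ):

```python
import torch
import torch.nn as nn

class RecurrentAggregator(nn.Module):
    """Learns to weight pre-trained frame embeddings; the face CNN stays
    frozen, so only this small head can overfit. Sizes are illustrative."""
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, n_frames, dim) from a frozen face CNN
        h, _ = self.rnn(embeddings)               # context-aware states
        w = torch.softmax(self.score(h), dim=1)   # per-frame weights
        return (w * embeddings).sum(dim=1)        # (batch, dim) set embedding

agg = RecurrentAggregator()
video = torch.randn(2, 30, 512)    # 2 videos, 30 frames each
print(agg(video).shape)            # torch.Size([2, 512])
```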
Probabilistic Face Embeddings
Embedding methods have achieved success in face recognition by comparing
facial features in a latent semantic space. However, in a fully unconstrained
face setting, the facial features learned by the embedding model could be
ambiguous or may not even be present in the input face, leading to noisy
representations. We propose Probabilistic Face Embeddings (PFEs), which
represent each face image as a Gaussian distribution in the latent space. The
mean of the distribution estimates the most likely feature values while the
variance shows the uncertainty in the feature values. Probabilistic solutions
can then be naturally derived for matching and fusing PFEs using the
uncertainty information. Empirical evaluation on different baseline models,
training datasets, and benchmarks shows that the proposed method can improve the
face recognition performance of deterministic embeddings by converting them
into PFEs. The uncertainties estimated by PFEs also serve as good indicators of
the potential matching accuracy, which are important for a risk-controlled
recognition system.
Comment: To appear in ICCV 2019
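Matching two PFEs can be illustrated with the mutual likelihood score, the likelihood that two Gaussians share the same latent feature. The sketch below follows the paper's description with constant terms dropped, so only relative comparisons between scores are meaningful.

```python
# Mutual likelihood score between two Gaussian face embeddings
# (constants dropped; higher means more likely the same identity).
import numpy as np

def mutual_likelihood_score(mu1, var1, mu2, var2):
    """mu*: (d,) mean embeddings; var*: (d,) per-dimension variances."""
    s = var1 + var2
    return -0.5 * np.sum((mu1 - mu2) ** 2 / s + np.log(s))

rng = np.random.default_rng(0)
mu_a, mu_b = rng.normal(size=128), rng.normal(size=128)
var_a, var_b = np.full(128, 0.1), np.full(128, 0.3)
print(mutual_likelihood_score(mu_a, var_a, mu_b, var_b))
```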
Minor Privacy Protection Through Real-time Video Processing at the Edge
Closed-circuit television (CCTV) cameras collect a large amount of personal
information about individuals, including the minor members of a family, which
raises serious privacy concerns. In particular, revealing children's
identities or activities may compromise their well-being. In this paper,
we investigate lightweight solutions, affordable to edge surveillance
systems, that make it feasible to identify minors accurately so that
appropriate privacy-preserving measures can be applied accordingly.
State-of-the-art deep learning architectures are modified and re-purposed in a
cascaded fashion to maximize the accuracy of our model. A pipeline extracts
faces from the input frames and classifies each one as an adult or a child. Over
20,000 labeled sample points are used for classification. We explore the timing
and resources needed for such a model to be used in the Edge-Fog architecture
at the edge of the network, where we can achieve near real-time performance on
the CPU. Quantitative experimental results show the superiority of our proposed
model, with a classification accuracy of 92.1%, compared to some other
face-recognition-based child detection approaches.
Comment: Accepted by the 2nd International Workshop on Smart City
Communication and Networking at ICCCN 2020
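A hedged sketch of such a cascaded edge pipeline: detect faces, classify each as adult or child, and blur child faces before the frame leaves the device. The Haar detector and the is_child stub below stand in for the paper's re-purposed deep models.

```python
import cv2

# Haar face detector shipped with OpenCV; a stand-in for the paper's
# multi-scale deep face detector.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_child(face_crop) -> bool:
    """Placeholder for the adult/child CNN; conservatively treat every
    face as a minor until a trained classifier is plugged in."""
    return True

def protect_minors(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        if is_child(frame[y:y + h, x:x + w]):
            # Blur the face region in-place before the frame leaves the edge.
            frame[y:y + h, x:x + w] = cv2.GaussianBlur(
                frame[y:y + h, x:x + w], (51, 51), 0)
    return frame
```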
Face Recognition in Low Quality Images: A Survey
Low-resolution face recognition (LRFR) has received increasing attention over
the past few years. It is widely applicable in real-world environments
where high-resolution or high-quality images are hard to capture. One of the
biggest demands for LRFR technologies is video surveillance. As the number
of surveillance cameras in cities increases, the captured videos will need to
be processed automatically. However, those videos or images are usually
captured at large standoff distances, under arbitrary illumination conditions,
and from diverse angles of view. Faces in these images are generally small in
size. Several studies have addressed this problem with techniques such as super-resolution,
deblurring, or learning a relationship between different resolution domains. In
this paper, we provide a comprehensive review of approaches to low-resolution
face recognition in the past five years. First, a general problem definition is
given. Then, a systematic analysis of the works on this topic is presented
by category. In addition to describing the methods, we also focus on datasets
and experimental settings. We further address related work on unconstrained
low-resolution face recognition and compare it with results that use
synthetic low-resolution data. Finally, we summarize the general limitations
and speculate on priorities for future effort.
Comment: There are some mistakes in this paper that may mislead the reader,
and we will not have a new version in the short term. We will resubmit once
it is corrected.
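One detail worth making concrete is how synthetic low-resolution probes are commonly generated in this literature: bicubic down-sampling to a small face size, optionally up-sampling back to the network input size. Real surveillance imagery adds blur and noise that this protocol misses, which is one reason the comparison above matters.

```python
# Typical synthetic low-resolution protocol in LRFR papers: bicubic
# down-sample to a small face, then up-sample back to the original size.
from PIL import Image

def make_synthetic_lr(img: Image.Image, lr_size: int = 16) -> Image.Image:
    w, h = img.size
    small = img.resize((lr_size, lr_size), Image.BICUBIC)
    return small.resize((w, h), Image.BICUBIC)  # back to original size
```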
SeqFace: Make full use of sequence information for face recognition
Deep convolutional neural networks (CNNs) have greatly improved the Face
Recognition (FR) performance in recent years. Almost all CNNs in FR are trained
on carefully labeled datasets containing plenty of identities. However,
such high-quality datasets are very expensive to collect, which prevents many
researchers from achieving state-of-the-art performance. In this paper, we propose
a framework, called SeqFace, for learning discriminative face features. Besides
a traditional identity training dataset, SeqFace can train CNNs by
using an additional dataset which includes a large number of face sequences
collected from videos. Moreover, label smoothing regularization (LSR) and a
newly proposed discriminative sequence agent (DSA) loss are employed to enhance
the discriminative power of deep face features by making full use of the sequence
data. Our method achieves excellent performance on Labeled Faces in the Wild
(LFW) and YouTube Faces (YTF) with only a single ResNet. The code and models are
publicly available online (https://github.com/huangyangyu/SeqFace).
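Of the two losses, label smoothing regularization is standard and easy to illustrate; the sketch below shows an LSR cross-entropy. The DSA loss is the paper's own contribution and is not reproduced here.

```python
# Label smoothing regularization (LSR): soften the one-hot identity target
# so a small probability mass eps is spread over the other classes.
import torch
import torch.nn.functional as F

def lsr_cross_entropy(logits: torch.Tensor, target: torch.Tensor,
                      eps: float = 0.1) -> torch.Tensor:
    """logits: (batch, n_ids); target: (batch,) identity labels."""
    n = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_p, eps / (n - 1))
    smooth.scatter_(1, target.unsqueeze(1), 1.0 - eps)
    return -(smooth * log_p).sum(dim=1).mean()
```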
On Learning Density Aware Embeddings
Deep metric learning algorithms have been utilized to learn discriminative
and generalizable models which are effective for classifying unseen classes. In
this paper, a novel noise tolerant deep metric learning algorithm is proposed.
The proposed method, termed Density Aware Metric Learning, forces the
model to learn embeddings that are pulled towards the densest region of the
cluster for each class. This is achieved by iteratively shifting the estimate of
the center towards the dense region of the cluster thereby leading to faster
convergence and higher generalizability. In addition to this, the approach is
robust to noisy samples in the training data, often present as outliers.
Detailed experiments and analysis on two challenging cross-modal face
recognition databases and two popular object recognition databases exhibit the
efficacy of the proposed approach. It has superior convergence, requires less
training time, and yields better accuracy than several popular deep metric
learning methods.
Comment: Accepted in IEEE CVPR 2019
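A hedged reading of the density-aware center estimate: start from the class mean and iteratively re-weight samples by closeness to the current center, a mean-shift-style update under an assumed Gaussian kernel, so outliers stop dragging the center away from the dense region.

```python
# Mean-shift-style robust center: outliers get exponentially small weight,
# so the estimate converges toward the dense region of the class cluster.
import numpy as np

def density_aware_center(feats: np.ndarray, iters: int = 10,
                         bandwidth: float = 1.0) -> np.ndarray:
    """feats: (n, d) embeddings of one class -> (d,) robust center."""
    center = feats.mean(axis=0)
    for _ in range(iters):
        d2 = ((feats - center) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))   # outliers get tiny weight
        center = (w[:, None] * feats).sum(axis=0) / w.sum()
    return center
```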
Privacy-Preserving Deep Inference for Rich User Data on The Cloud
Deep neural networks are increasingly being used in a variety of machine
learning applications applied to rich user data on the cloud. However, this
approach introduces a number of privacy and efficiency challenges, as the cloud
operator can perform secondary inferences on the available data. Recently,
advances in edge processing have paved the way for more efficient, and private,
data processing at the source for simple tasks and lighter models, though they
remain a challenge for larger, and more complicated models. In this paper, we
present a hybrid approach for breaking down large, complex deep models for
cooperative, privacy-preserving analytics. We do this by breaking down
popular deep architectures and fine-tuning them in a particular way. We then
evaluate the privacy benefits of this approach based on the information exposed
to the cloud service. We also assess the local inference cost of different
layers on a modern handset for mobile applications. Our evaluations show that
by using certain kinds of fine-tuning and embedding techniques, and at a small
processing cost, we can greatly reduce the level of information available to
unintended tasks applied to the data features on the cloud, thereby achieving
the desired tradeoff between privacy and performance.
Comment: arXiv admin note: substantial text overlap with arXiv:1703.0295
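The layer-splitting idea can be sketched as follows: run the early layers on the handset and ship only the intermediate activation to the cloud. The split point below is an arbitrary assumption, and the fine-tuning and embedding steps that limit secondary inferences are omitted.

```python
# Split inference: early layers run on-device, the rest on the server;
# only the intermediate activation crosses the network.
import torch
import torchvision.models as models

full = models.resnet18(weights=None).eval()
layers = list(full.children())                   # conv1 ... avgpool, fc
edge_part = torch.nn.Sequential(*layers[:6])     # runs locally on the handset
cloud_part = torch.nn.Sequential(*layers[6:-1],  # runs on the server
                                 torch.nn.Flatten(), layers[-1])

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    intermediate = edge_part(x)        # only this tensor leaves the device
    logits = cloud_part(intermediate)
print(intermediate.shape, logits.shape)
```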
Tracking Persons-of-Interest via Unsupervised Representation Adaptation
Multi-face tracking in unconstrained videos is a challenging problem as faces
of one person often appear drastically different in multiple shots due to
significant variations in scale, pose, expression, illumination, and make-up.
Existing multi-target tracking methods often use low-level features which are
not sufficiently discriminative for identifying faces with such large
appearance variations. In this paper, we tackle this problem by learning
discriminative, video-specific face representations using convolutional neural
networks (CNNs). Unlike existing CNN-based approaches which are only trained on
large-scale face image datasets offline, we use the contextual constraints to
generate a large number of training samples for a given video, and further
adapt the pre-trained face CNN to specific videos using discovered training
samples. Using these training samples, we optimize the embedding space so that
the Euclidean distances correspond to a measure of semantic face similarity via
minimizing a triplet loss function. With the learned discriminative features,
we apply a hierarchical clustering algorithm to link tracklets across
multiple shots to generate trajectories. We extensively evaluate the proposed
algorithm on two sets of TV sitcoms and YouTube music videos, analyze the
contribution of each component, and demonstrate significant performance
improvement over existing techniques.
Comment: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking
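The two learning components can be sketched independently: a triplet loss that adapts the face CNN so Euclidean distance reflects identity, and hierarchical clustering that links tracklets across shots. The margin and distance threshold below are illustrative assumptions.

```python
# Triplet loss for video-specific adaptation, plus hierarchical clustering
# of tracklet features to form cross-shot trajectories.
import torch
import torch.nn.functional as F
from scipy.cluster.hierarchy import fcluster, linkage

def triplet_loss(anchor, positive, negative, margin: float = 0.5):
    """Pull same-identity pairs together, push different ones apart."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

def link_tracklets(tracklet_feats, threshold: float = 1.0):
    """tracklet_feats: (n_tracklets, d) array -> cluster label per tracklet."""
    z = linkage(tracklet_feats, method="average", metric="euclidean")
    return fcluster(z, t=threshold, criterion="distance")
```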