AI Oriented Large-Scale Video Management for Smart City: Technologies, Standards and Beyond
Deep learning has achieved substantial success in a series of tasks in
computer vision. Intelligent video analysis, which can be broadly applied to
video surveillance in various smart city applications, can also be driven by
such powerful deep learning engines. However, practically deploying deep neural
network models for large-scale video analysis still poses unprecedented
challenges for large-scale video data management. Deep feature coding,
instead of video coding, provides a practical solution for handling the
large-scale video surveillance data. To enable interoperability in the context
of deep feature coding, standardization is urgent and important. However, due
to the explosion of deep learning algorithms and the particularity of feature
coding, there are numerous remaining problems in the standardization process.
This paper envisions the future deep feature coding standard for AI-oriented
large-scale video management, and discusses existing techniques, standards, and
possible solutions for these open problems.
Comment: 8 pages, 8 figures, 5 tables
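The deep-feature-coding idea above can be sketched as follows: instead of compressing and transmitting the surveillance video itself, only a compact deep feature vector is quantized and stored. This is a toy uniform-quantization illustration with hypothetical names, not the envisioned standard's actual codec.

```python
import numpy as np

def encode_feature(feat, num_bits=8):
    """Uniformly quantize a float feature vector into num_bits-wide codes."""
    lo, hi = float(feat.min()), float(feat.max())
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((feat - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def decode_feature(codes, lo, scale):
    """Reconstruct an approximate feature vector from the integer codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
feat = rng.standard_normal(512).astype(np.float32)  # e.g. a CNN embedding
codes, lo, scale = encode_feature(feat)
recon = decode_feature(codes, lo, scale)
# 8-bit codes take 1/4 the space of float32; error is bounded by ~scale/2
print(codes.nbytes, feat.nbytes)
```

Transmitting only such codes (plus `lo` and `scale`) is what makes interoperable feature coding, rather than video coding, attractive at surveillance scale.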
Face Recognition in Low Quality Images: A Survey
Low-resolution face recognition (LRFR) has received increasing attention over
the past few years. Its applications lie widely in the real-world environment
when high-resolution or high-quality images are hard to capture. One of the
biggest demands for LRFR technologies is video surveillance. As the number
of surveillance cameras in cities increases, the captured videos will
need to be processed automatically. However, those videos or images are usually
captured with large standoffs, arbitrary illumination condition, and diverse
angles of view. Faces in these images are generally small in size. Several
studies addressing this problem employed techniques such as super-resolution,
deblurring, or learning a relationship between different resolution domains. In
this paper, we provide a comprehensive review of approaches to low-resolution
face recognition in the past five years. First, a general problem definition is
given. Then, a systematic analysis of the works on this topic is presented
by category. In addition to describing the methods, we also focus on datasets
and experiment settings. We further address related works on unconstrained
low-resolution face recognition and compare them with results that use
synthetic low-resolution data. Finally, we summarize the general limitations
and outline priorities for future effort.
Comment: There are some mistakes in this paper that may mislead the reader, and
we will not have a new version in the short term. We will resubmit once it is
corrected.
Minor Privacy Protection Through Real-time Video Processing at the Edge
The collection of large amounts of personal information about individuals,
including the minor members of a family, by closed-circuit television (CCTV)
cameras raises serious privacy concerns. In particular, revealing children's
identities or activities may compromise their well-being. In this paper, we
investigate lightweight solutions affordable to edge surveillance systems that
make it feasible to accurately identify minors so that appropriate
privacy-preserving measures can be applied. State-of-the-art deep learning
architectures are modified and re-purposed in a cascaded
fashion to maximize the accuracy of our model. A pipeline extracts faces from
the input frames and classifies each one to be of an adult or a child. Over
20,000 labeled sample points are used for classification. We explore the timing
and resources needed for such a model to be used in the Edge-Fog architecture
at the edge of the network, where we can achieve near real-time performance on
the CPU. Quantitative experimental results show the superiority of our proposed
model with an accuracy of 92.1% in classification compared to some other face
recognition-based child detection approaches.
Comment: Accepted by the 2nd International Workshop on Smart City
Communication and Networking at the ICCCN 202
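The cascaded pipeline this abstract describes, namely extract faces, classify each as adult or child, then apply privacy masking to minors only, could be sketched as below. The detector, classifier, and masking here are toy stand-ins (a mean-fill "blur" and hardcoded boxes), not the paper's actual models.

```python
import numpy as np

def blur_region(frame, box):
    """Crude privacy mask: flatten a face region to its mean intensity."""
    x0, y0, x1, y1 = box
    frame[y0:y1, x0:x1] = frame[y0:y1, x0:x1].mean()
    return frame

def protect_minors(frame, detect_faces, classify_age):
    """Cascade: detect faces, classify each, mask only the children."""
    for box in detect_faces(frame):
        if classify_age(frame, box) == "child":
            frame = blur_region(frame, box)
    return frame

# hypothetical stand-ins for the real face detector and age classifier
frame = np.arange(100.0).reshape(10, 10)
faces = lambda f: [(0, 0, 4, 4), (5, 5, 9, 9)]
ages = lambda f, b: "child" if b[0] == 0 else "adult"
out = protect_minors(frame.copy(), faces, ages)
```

Running the classifier only on detected face crops, rather than whole frames, is what keeps such a cascade cheap enough for edge hardware.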
SeqFace: Make full use of sequence information for face recognition
Deep convolutional neural networks (CNNs) have greatly improved the Face
Recognition (FR) performance in recent years. Almost all CNNs in FR are trained
on the carefully labeled datasets containing plenty of identities. However,
such high-quality datasets are very expensive to collect, which prevents many
researchers from achieving state-of-the-art performance. In this paper, we propose
a framework, called SeqFace, for learning discriminative face features. Besides
a traditional identity training dataset, the designed SeqFace can train CNNs by
using an additional dataset which includes a large number of face sequences
collected from videos. Moreover, the label smoothing regularization (LSR) and a
new proposed discriminative sequence agent (DSA) loss are employed to enhance
discrimination power of deep face features via making full use of the sequence
data. Our method achieves excellent performance on Labeled Faces in the Wild
(LFW) and YouTube Faces (YTF) with only a single ResNet. The code and models are
publicly available online (https://github.com/huangyangyu/SeqFace).
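Of the two ingredients the abstract names, label smoothing regularization (LSR) is standard enough to sketch; the DSA loss is not reproduced here. The target construction below, with our own variable names, shows the LSR idea: spread a small probability mass uniformly over all classes.

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps=0.1):
    """LSR targets: the true class gets 1 - eps + eps/C, others eps/C."""
    t = np.full((len(labels), num_classes), eps / num_classes)
    t[np.arange(len(labels)), labels] += 1.0 - eps
    return t

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and soft targets."""
    logits = logits - logits.max(axis=1, keepdims=True)  # stability shift
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * logp).sum(axis=1).mean())

targets = smoothed_targets(np.array([0, 2]), num_classes=4, eps=0.1)
loss = cross_entropy(np.zeros((2, 4)), targets)  # uniform logits -> log(4)
```

Soft targets of this kind discourage the network from becoming over-confident on noisy sequence labels, which is the setting SeqFace targets.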
Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings
Image-to-video person re-identification identifies a target person by a probe
image from quantities of pedestrian videos captured by non-overlapping cameras.
Despite the great progress achieved, it is still challenging to match in the
multimodal scenario, i.e., between image and video. Currently, state-of-the-art
approaches mainly focus on task-specific data, neglecting the extra
information from different but related tasks. In this paper, we propose an
end-to-end neural network framework for image-to-video person re-identification
by leveraging cross-modal embeddings learned from extra information. Concretely
speaking, cross-modal embeddings from image captioning and video captioning
models are reused to help the learned features be projected into a coordinated
space, where similarity can be computed directly. Besides, training steps from
the fixed model reuse approach are integrated into our framework, which can
incorporate beneficial information and eventually make the target networks
independent of existing models. Apart from that, our proposed framework resorts
to CNNs and LSTMs for extracting visual and spatiotemporal features, and
combines the strengths of identification and verification models to improve the
discriminative ability of the learned features. The experimental results
demonstrate the effectiveness of our framework in narrowing the gap between
heterogeneous data and obtaining observable improvement in image-to-video
person re-identification.
Comment: under review for Pattern Recognition Letters
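The coordinated-space matching the abstract describes can be sketched minimally: features from each modality are projected (here with hypothetical linear maps `W_img` and `W_vid`, not the paper's learned networks), L2-normalized, and scored by cosine similarity.

```python
import numpy as np

def project(x, W):
    """Map a feature into the shared space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z)

def cross_modal_similarity(img_feat, vid_feat, W_img, W_vid):
    """Cosine similarity between image and video features after projection."""
    return float(project(img_feat, W_img) @ project(vid_feat, W_vid))

rng = np.random.default_rng(1)
img_feat, vid_feat = rng.standard_normal(128), rng.standard_normal(256)
W_img = rng.standard_normal((128, 64))
W_vid = rng.standard_normal((256, 64))
score = cross_modal_similarity(img_feat, vid_feat, W_img, W_vid)
```

Because both modalities land in one normalized space, ranking a probe image against a gallery of videos reduces to sorting these scores.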
Deep Learning Architectures for Face Recognition in Video Surveillance
Face recognition (FR) systems for video surveillance (VS) applications
attempt to accurately detect the presence of target individuals over a
distributed network of cameras. In video-based FR systems, facial models of
target individuals are designed a priori during enrollment using a limited
number of reference still images or video data. These facial models are not
typically representative of faces being observed during operations due to large
variations in illumination, pose, scale, occlusion, blur, and camera
interoperability. Specifically, in the still-to-video FR application, a single
high-quality reference still image captured with a still camera under controlled
conditions is employed to generate a facial model to be matched later against
lower-quality faces captured with video cameras under uncontrolled conditions.
Current video-based FR systems can perform well in controlled scenarios, but
their performance is not satisfactory in uncontrolled scenarios, mainly because
of the differences between the source (enrollment) and the target (operational)
domains. Most of the efforts in this area have been toward the design of robust
video-based FR systems in unconstrained surveillance environments. This chapter
presents an overview of recent advances in still-to-video FR scenario through
deep convolutional neural networks (CNNs). In particular, deep learning
architectures proposed in the literature based on triplet-loss function (e.g.,
cross-correlation matching CNN, trunk-branch ensemble CNN and HaarNet) and
supervised autoencoders (e.g., canonical face representation CNN) are reviewed
and compared in terms of accuracy and computational complexity.
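The triplet-loss objective underlying several of the reviewed architectures can be stated in its plain form (variable names are ours, not the chapter's): pull the anchor toward a positive of the same identity and push it from a negative by at least a margin.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared-Euclidean distances: max(0, d_ap - d_an + margin)."""
    d_ap = float(np.sum((anchor - positive) ** 2))  # anchor-positive distance
    d_an = float(np.sum((anchor - negative) ** 2))  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same identity, nearby embedding
n = np.array([1.0, 1.0])   # different identity, far away
loss = triplet_loss(a, p, n)  # margin already satisfied here
```

When the margin is already satisfied the loss is zero and the triplet contributes no gradient, which is why triplet-based FR training depends heavily on how triplets are mined.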
ActionXPose: A Novel 2D Multi-view Pose-based Algorithm for Real-time Human Action Recognition
We present ActionXPose, a novel 2D pose-based algorithm for posture-level
Human Action Recognition (HAR). The proposed approach exploits 2D human poses
provided by the OpenPose detector from RGB videos. ActionXPose processes the
pose data and feeds it to a Long Short-Term Memory neural network and a
1D convolutional neural network, which solve the classification problem.
ActionXPose is one of the first algorithms that exploits 2D human poses for
HAR. The algorithm has real-time performance and is robust to camera
movement, subject proximity changes, viewpoint changes, and subject appearance
changes, and it generalizes well. In fact, extensive simulations
show that ActionXPose can be successfully trained using different datasets at
once. State-of-the-art performance on popular datasets for posture-related HAR
problems (i3DPost, KTH) is reported, and results are compared with those
obtained by other methods, including the selected ActionXPose baseline.
Moreover, we also propose two novel datasets, MPOSE and ISLD, recorded in
our Intelligent Sensing Lab, to show the generalization performance of ActionXPose.
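One common way to obtain the robustness to camera movement and subject proximity that the abstract claims is to normalize each 2D pose sequence before feeding it to the networks: center on a root joint and scale by the pose extent. The scheme and root-joint choice below are our assumptions, not ActionXPose's actual preprocessing.

```python
import numpy as np

def normalize_pose_sequence(seq, root=0):
    """seq: (T, J, 2) array of J 2D joints over T frames.
    Centering removes translation; dividing by the extent removes scale."""
    centered = seq - seq[:, root:root + 1, :]
    scale = np.abs(centered).max(axis=(1, 2), keepdims=True)
    scale[scale == 0] = 1.0  # guard against degenerate all-zero poses
    return centered / scale

rng = np.random.default_rng(2)
seq = rng.standard_normal((30, 18, 2))       # 30 frames, 18 joints
moved = 3.0 * seq + np.array([5.0, -2.0])    # zoomed and shifted camera
```

A sequence and its shifted, rescaled copy normalize to the same array, so the downstream LSTM/1D-CNN classifier sees camera-invariant input.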
Improved Hard Example Mining by Discovering Attribute-based Hard Person Identity
In this paper, we propose Hard Person Identity Mining (HPIM) that attempts to
refine the hard example mining to improve the exploration efficacy in person
re-identification. It is motivated by the following observation: the more
attributes some people share, the more difficult it is to separate their
identities. Based on this observation, we develop HPIM via a transferred
attribute describer, a deep multi-attribute classifier trained on noisy source
person-attribute datasets. We encode each image in the target person re-ID
dataset into an attribute probabilistic description. Afterwards, in the
attribute code space, we consider each person as a distribution to generate his
view-specific attribute codes in different practical scenarios. Hence we
estimate the person-specific statistical moments from zeroth to higher order,
which are further used to calculate the central moment discrepancies between
persons. Such a discrepancy provides a ground for choosing hard identities to
organize proper mini-batches, without concern for the person representation
changing during metric learning. It serves as a complementary tool to hard
example mining, helping to explore the global instead of the local hard-example
constraint in mini-batches built from randomly sampled identities. Extensive
experiments on two person re-identification benchmarks validate the
effectiveness of our proposed algorithm.
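The central-moment-discrepancy idea can be sketched as follows: treat each person's attribute codes as a distribution, summarize it by per-dimension central moments, and use the moment differences to rank identity pairs by hardness. The unweighted moment sum below is our simplification, not HPIM's exact formulation.

```python
import numpy as np

def central_moment_discrepancy(a, b, order=3):
    """Sum of per-dimension differences between the means and the
    central moments (orders 2..order) of two attribute-code sets."""
    d = float(np.abs(a.mean(0) - b.mean(0)).sum())   # first moments
    for k in range(2, order + 1):                    # higher central moments
        ma = ((a - a.mean(0)) ** k).mean(0)
        mb = ((b - b.mean(0)) ** k).mean(0)
        d += float(np.abs(ma - mb).sum())
    return d

rng = np.random.default_rng(3)
codes_a = rng.random((50, 8))        # attribute codes of person A's images
codes_b = rng.random((50, 8)) + 0.5  # person B, shifted attribute profile
```

Pairs with a small discrepancy share attribute statistics and are therefore the hard identities worth grouping into the same mini-batch.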
PVSS: A Progressive Vehicle Search System for Video Surveillance Networks
This paper focuses on the task of searching for a specific vehicle that
appeared in a surveillance network. Existing methods usually assume that the
vehicle images are well cropped from the surveillance videos, then use visual
attributes, like colors and types, or license plate numbers to match the target
vehicle in the image set. However, a complete vehicle search system should
consider the problems of vehicle detection, representation, indexing, storage,
matching, and so on. Besides, attribute-based search cannot accurately find the
same vehicle due to intra-instance changes in different cameras and the
extremely uncertain environment. Moreover, the license plates may be
misrecognized in surveillance scenes due to the low resolution and noise. In
this paper, a Progressive Vehicle Search System, named PVSS, is designed to
solve the above problems. PVSS consists of three modules: the crawler,
the indexer, and the searcher. The vehicle crawler aims to detect and track
vehicles in surveillance videos and transfer the captured vehicle images,
metadata and contextual information to the server or cloud. Then multi-grained
attributes, such as the visual features and license plate fingerprints, are
extracted and indexed by the vehicle indexer. At last, a query triplet with an
input vehicle image, the time range, and the spatial scope is taken as the
input by the vehicle searcher. The target vehicle will be searched in the
database through a progressive process. Extensive experiments on a public
dataset from a real surveillance network validate the effectiveness of PVSS.
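The progressive query flow the abstract describes, cheap metadata filters first and expensive visual matching only on the survivors, could be sketched like this. The record fields and the similarity callback are hypothetical, not PVSS's actual index schema.

```python
def progressive_search(records, query_feat, t_range, cameras,
                       similarity, top_k=5):
    """Filter by time range and spatial scope, then rank by visual score."""
    t0, t1 = t_range
    survivors = [r for r in records
                 if t0 <= r["time"] <= t1 and r["camera"] in cameras]
    survivors.sort(key=lambda r: similarity(query_feat, r["feat"]),
                   reverse=True)
    return survivors[:top_k]

records = [
    {"time": 10, "camera": "c1", "feat": 0.9},
    {"time": 12, "camera": "c2", "feat": 0.4},
    {"time": 99, "camera": "c1", "feat": 1.0},  # outside the time range
]
hits = progressive_search(records, 1.0, (0, 20), {"c1", "c2"},
                          lambda q, f: -abs(q - f))
```

Ordering the stages from cheapest to most expensive is what lets such a system scale to a whole surveillance network's footage.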
Person Identification with Visual Summary for a Safe Access to a Smart Home
SafeAccess is an integrated system designed to provide easier and safer
access to a smart home for people with or without disabilities. The system is
designed to enhance safety and promote the independence of people with
disabilities (i.e., the visually impaired). The key functionality of the system
includes detecting and identifying humans and generating a contextual
visual summary from the real-time video streams obtained from cameras
placed in strategic locations around the house. In addition, the system
classifies humans into groups (i.e., friends/families/caregivers versus
intruders/burglars/unknowns). These features allow the user to grant or deny
remote access to the premises, or to call emergency services. In this paper, we
focus on designing a prototype system for the smart home and building a robust
recognition engine that meets the system criteria and addresses speed,
accuracy, deployment and environmental challenges under a wide variety of
practical and real-life situations. To interact with the system, we implemented
a dialog enabled interface to create a personalized profile using face images
or videos of friends/families/caregivers. To improve computational efficiency, we
apply change detection to filter out frames and use Faster-RCNN to detect the
human presence and extract faces using Multitask Cascaded Convolutional
Networks (MTCNN). Subsequently, we apply LBP/FaceNet to identify a person and
groups by matching the extracted faces with the profile. SafeAccess sends a
visual summary to the users via MMS containing the person's name if a match is
found (or "Unknown" otherwise), a scene image, a facial description, and
contextual information. SafeAccess identifies friends/families/caregivers
versus intruders/unknowns with an average F-score of 0.97 and generates a
visual summary from 10 classes with an average accuracy of 98.01%.
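The change-detection gate this abstract places in front of the detection stage can be sketched in its simplest form: a frame is forwarded to the expensive Faster-RCNN stage only if it differs enough from the previous one. The mean-absolute-difference test and threshold below are our assumptions, not SafeAccess's actual detector.

```python
import numpy as np

def changed(prev, curr, thresh=10.0):
    """True if the mean absolute pixel difference exceeds the threshold."""
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    return float(diff.mean()) > thresh

still = np.zeros((120, 160), dtype=np.uint8)   # empty hallway frame
person_enters = still.copy()
person_enters[40:80, 60:100] = 255             # a bright region appears
```

Dropping unchanged frames this way is what keeps the downstream Faster-RCNN plus MTCNN plus LBP/FaceNet pipeline within real-time budgets on home hardware.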