CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification
Urban traffic optimization using traffic cameras as sensors is driving the
need to advance state-of-the-art multi-target multi-camera (MTMC) tracking.
This work introduces CityFlow, a city-scale traffic camera dataset consisting
of more than 3 hours of synchronized HD videos from 40 cameras across 10
intersections, with the longest distance between two simultaneous cameras being
2.5 km. To the best of our knowledge, CityFlow is the largest-scale dataset in
terms of spatial coverage and the number of cameras/videos in an urban
environment. The dataset contains more than 200K annotated bounding boxes
covering a wide range of scenes, viewing angles, vehicle models, and urban
traffic flow conditions. Camera geometry and calibration information are
provided to aid spatio-temporal analysis. In addition, a subset of the
benchmark is made available for the task of image-based vehicle
re-identification (ReID). We conducted an extensive experimental evaluation of
baselines/state-of-the-art approaches in MTMC tracking, multi-target
single-camera (MTSC) tracking, object detection, and image-based ReID on this
dataset, analyzing the impact of different network architectures, loss
functions, spatio-temporal models and their combinations on task effectiveness.
An evaluation server is launched with the release of our benchmark at the 2019
AI City Challenge (https://www.aicitychallenge.org/) that allows researchers to
compare the performance of their newest techniques. We expect this dataset to
catalyze research in this field, propel the state-of-the-art forward, and lead
to deployed traffic optimization(s) in the real world.
Comment: Accepted for oral presentation at CVPR 2019 with review ratings of 2 strong accepts and 1 accept (work done during an internship at NVIDIA).
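Since the dataset ships camera geometry and calibration to aid spatio-temporal analysis, a typical use is projecting image detections onto the ground plane. Below is a minimal sketch assuming the calibration takes the form of a 3x3 image-to-ground homography; the matrix values and function names are illustrative, not CityFlow's actual calibration data.

```python
# Minimal sketch: projecting an image point to ground-plane coordinates
# with a 3x3 homography, as one might do with per-camera calibration data.
# The matrix values below are placeholders, not real calibration.
import numpy as np

def image_to_ground(H: np.ndarray, u: float, v: float):
    """Map pixel (u, v) to ground-plane (x, y) via homography H."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]  # normalize homogeneous coordinates

H = np.array([[1.2e-5, 3.0e-6, -40.123],   # hypothetical calibration matrix
              [2.0e-6, 1.5e-5, 115.456],
              [1.0e-8, 2.0e-8, 1.0]])
x, y = image_to_ground(H, 960.0, 540.0)    # bottom-center of a bounding box
print(f"ground coordinates: ({x:.6f}, {y:.6f})")
```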
Deep Recurrent Convolutional Networks for Video-based Person Re-identification: An End-to-End Approach
In this paper, we present an end-to-end approach to simultaneously learn
spatio-temporal features and corresponding similarity metric for video-based
person re-identification. Given the video sequence of a person, features from
each frame that are extracted from all levels of a deep convolutional network
can preserve a higher spatial resolution from which we can model finer motion
patterns. These low-level visual percepts are fed into a variant of a
recurrent model to characterize the temporal variation between time-steps.
Features from all time-steps are then summarized using temporal pooling to
produce an overall feature representation for the complete sequence. The deep
convolutional network, recurrent layer, and the temporal pooling are jointly
trained to extract comparable hidden-unit representations from an input pair
of time series and to compute their corresponding similarity value. The proposed
framework combines time series modeling and metric learning to jointly learn
relevant features and a good similarity measure between time sequences of
persons.
Experiments demonstrate that our approach achieves the state-of-the-art
performance for video-based person re-identification on iLIDS-VID and PRID
2011, the two primary public datasets for this purpose.
Comment: 11 pages.
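As a rough illustration of the described pipeline (per-frame CNN features, a recurrent layer over time-steps, temporal pooling, and a Siamese-style distance), here is a minimal PyTorch sketch; the layer sizes and the mean-pooling choice are assumptions, not the paper's exact architecture.

```python
# A minimal PyTorch sketch of the described pipeline: per-frame CNN features,
# a recurrent layer over time-steps, and temporal (mean) pooling into one
# sequence embedding. Layer sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class SequenceEmbedder(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                 # tiny per-frame encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRU(32, feat_dim, batch_first=True)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape
        frames = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        hidden, _ = self.rnn(frames)              # temporal modeling
        return hidden.mean(dim=1)                 # temporal pooling

# Siamese-style similarity: Euclidean distance between sequence embeddings.
model = SequenceEmbedder()
seq_a = torch.randn(1, 8, 3, 64, 32)              # 8 frames of 64x32 crops
seq_b = torch.randn(1, 8, 3, 64, 32)
dist = torch.cdist(model(seq_a), model(seq_b))
print(dist.item())
```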
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers in
several conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
GAN-based Pose-aware Regulation for Video-based Person Re-identification
Video-based person re-identification deals with the inherent difficulty of
matching unregulated sequences of different lengths and with incomplete target
pose/viewpoint structure. Common approaches operate either by reducing the
problem to the still images case, facing a significant information loss, or by
exploiting inter-sequence temporal dependencies as in Siamese Recurrent Neural
Networks or in gait analysis. However, in all cases, the inter-sequence
pose/viewpoint misalignment is not considered, and the existing spatial
approaches are mostly limited to the still images context. To this end, we
propose a novel approach that can exploit more effectively the rich video
information, by accounting for the role that the changing pose/viewpoint factor
plays in the sequences matching process. Specifically, our approach consists of
two components. The first one attempts to complement the original
pose-incomplete information carried by the sequences with synthetic
GAN-generated images, and fuse their feature vectors into a more discriminative
viewpoint-insensitive embedding, namely Weighted Fusion (WF). The second one
performs an explicit pose-based alignment of sequence pairs to promote coherent
feature matching, namely Weighted-Pose Regulation (WPR). Extensive experiments
on two large video-based benchmark datasets show that our approach
considerably outperforms existing methods.
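To make the Weighted Fusion (WF) idea concrete, here is a minimal sketch in which feature vectors from real frames and GAN-generated frames are fused under normalized weights into a single embedding; the softmax weighting and the scoring head are assumptions, not the paper's formulation.

```python
# A minimal sketch of the Weighted Fusion (WF) idea as described: feature
# vectors from real frames and GAN-completed (synthetic) frames are fused
# with per-frame weights into one viewpoint-insensitive embedding.
# The softmax weighting scheme here is an assumption.
import torch
import torch.nn.functional as F

def weighted_fusion(real_feats, synth_feats, scores):
    """real_feats, synth_feats: (n, d) tensors; scores: (2n,) fusion logits."""
    feats = torch.cat([real_feats, synth_feats], dim=0)   # (2n, d)
    w = F.softmax(scores, dim=0).unsqueeze(1)             # weights sum to 1
    return F.normalize((w * feats).sum(dim=0), dim=0)     # fused embedding

real = torch.randn(6, 256)        # features of 6 observed frames
synth = torch.randn(4, 256)       # features of 4 GAN-generated views
logits = torch.randn(10)          # e.g., output of a small scoring head
embedding = weighted_fusion(real, synth, logits)
print(embedding.shape)            # torch.Size([256])
```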
Cross Domain Knowledge Learning with Dual-branch Adversarial Network for Vehicle Re-identification
The widespread popularization of vehicles has made daily life more convenient
over the last decades. However, the sheer number of vehicles poses the
critical but challenging problem of vehicle re-identification (reID). To date,
most vehicle reID algorithms are trained and tested on the same annotated
dataset under full supervision. However, even a well-trained model still
suffers a severe performance drop due to the domain bias between the training
dataset and real-world scenes.
To address this problem, this paper proposes a domain adaptation framework
for vehicle reID (DAVR), which narrows the cross-domain bias by fully
exploiting the labeled data from the source domain to adapt to the target domain.
DAVR develops an image-to-image translation network named Dual-branch
Adversarial Network (DAN), which transfers images from the well-labeled source
domain into the style of the unlabeled target domain without any annotation
while preserving identity information from the source domain. The translated
images are then employed to train the vehicle reID model via a proposed
attention-based feature learning model. Through the
proposed framework, the well-trained reID model has better domain adaptation
ability for various scenes in real-world situations. Comprehensive
experimental results demonstrate that the proposed DAVR achieves excellent
performance on both the VehicleID and VeRi-776 datasets.
Comment: arXiv admin note: substantial text overlap with arXiv:1903.0786
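As a rough sketch of the described training flow (translate source images into the target style, then train the reID model on them with their original identity labels), consider the following; `dan`, `reid_net`, and the toy stand-ins are hypothetical, not the authors' networks.

```python
# A high-level sketch (not the authors' code) of the DAVR training flow as
# described: source images are first translated into the target-domain style
# by the image-to-image network (DAN in the paper), and the translated images,
# which keep their source identity labels, train the reID model.
import torch

def train_step(dan, reid_net, optimizer, images, identity_labels):
    """One reID training step on style-translated source images."""
    with torch.no_grad():
        styled = dan(images)                    # source content, target style
    logits = reid_net(styled)                   # identity labels carry over
    loss = torch.nn.functional.cross_entropy(logits, identity_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy stand-ins just to exercise the step:
dan = torch.nn.Identity()
reid_net = torch.nn.Sequential(torch.nn.Flatten(),
                               torch.nn.Linear(3 * 64 * 64, 10))
opt = torch.optim.SGD(reid_net.parameters(), lr=0.01)
print(train_step(dan, reid_net, opt,
                 torch.randn(4, 3, 64, 64), torch.randint(0, 10, (4,))))
```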
VGR-Net: A View Invariant Gait Recognition Network
Biometric identification systems have become immensely popular and important
because of their high reliability and efficiency. However, person
identification at a distance still remains a challenging problem. Gait can be seen as an
essential biometric feature for human recognition and identification. It can be
easily acquired from a distance and does not require any user cooperation thus
making it suitable for surveillance. However, recognizing an individual by
gait can be adversely affected by varying viewpoints, which makes the task
considerably more challenging. Our proposed approach tackles this problem by
identifying spatio-temporal features through extensive experimentation and a
dedicated training mechanism. In this paper, we propose a 3-D Convolutional
Deep Neural Network for person identification using gait under multiple
views. It is a 2-stage network: a classification network first identifies the
viewing angle, after which a separate network (one trained per angle)
identifies the person under that particular viewing angle. We have tested
this network on the publicly available CASIA-B database and achieved
state-of-the-art results. The proposed system is much more efficient in terms
of time and space and performs better for almost all angles.
Comment: Accepted in ISBA (IEEE International Conference on Identity, Security and Behaviour Analysis)-201
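The 2-stage design can be illustrated with a minimal sketch: an angle classifier routes the input to one of several angle-specific identification networks. The tiny linear stand-ins and the 11-angle setup (CASIA-B's view count) are illustrative only, not the paper's 3-D CNNs.

```python
# A minimal sketch (not the authors' code) of the described 2-stage inference:
# a classification network first predicts the viewing angle, then the network
# trained for that angle identifies the person.
import torch

def identify(gait_clip, angle_classifier, per_angle_nets):
    angle = angle_classifier(gait_clip).argmax(dim=1).item()   # stage 1
    person_logits = per_angle_nets[angle](gait_clip)           # stage 2
    return angle, person_logits.argmax(dim=1).item()

flat = torch.nn.Flatten()
clip_dim = 3 * 8 * 32 * 32                                # C*T*H*W, toy size
angle_net = torch.nn.Sequential(flat, torch.nn.Linear(clip_dim, 11))
person_nets = [torch.nn.Sequential(flat, torch.nn.Linear(clip_dim, 50))
               for _ in range(11)]                        # 50 toy identities
clip = torch.randn(1, 3, 8, 32, 32)
print(identify(clip, angle_net, person_nets))
```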
Tracking Persons-of-Interest via Unsupervised Representation Adaptation
Multi-face tracking in unconstrained videos is a challenging problem as faces
of one person often appear drastically different in multiple shots due to
significant variations in scale, pose, expression, illumination, and make-up.
Existing multi-target tracking methods often use low-level features which are
not sufficiently discriminative for identifying faces with such large
appearance variations. In this paper, we tackle this problem by learning
discriminative, video-specific face representations using convolutional neural
networks (CNNs). Unlike existing CNN-based approaches which are only trained on
large-scale face image datasets offline, we use the contextual constraints to
generate a large number of training samples for a given video, and further
adapt the pre-trained face CNN to specific videos using discovered training
samples. Using these training samples, we optimize the embedding space so that
the Euclidean distances correspond to a measure of semantic face similarity via
minimizing a triplet loss function. With the learned discriminative features,
we apply the hierarchical clustering algorithm to link tracklets across
multiple shots to generate trajectories. We extensively evaluate the proposed
algorithm on two sets of TV sitcoms and YouTube music videos, analyze the
contribution of each component, and demonstrate significant performance
improvement over existing techniques.
Comment: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking
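The video-specific adaptation relies on a triplet objective over the discovered training samples, so that Euclidean distances in the embedding reflect identity; a minimal sketch of such a loss follows, with the margin value and batch shapes chosen arbitrarily.

```python
# A minimal sketch of the triplet objective described: for video-specific
# samples (anchor, positive from the same tracklet, negative from a
# co-occurring tracklet), push Euclidean distances to reflect identity.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.5):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()  # hinge on the distance gap

a, p, n = (torch.randn(32, 128) for _ in range(3))  # batch of embeddings
print(triplet_loss(a, p, n).item())
```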
ReXCam: Resource-Efficient, Cross-Camera Video Analytics at Scale
Enterprises are increasingly deploying large camera networks for video
analytics. Many target applications entail a common problem template: searching
for and tracking an object or activity of interest (e.g. a speeding vehicle, a
break-in) through a large camera network in live video. Such cross-camera
analytics is compute and data intensive, with cost growing with the number of
cameras and time. To address this cost challenge, we present ReXCam, a new
system for efficient cross-camera video analytics. ReXCam exploits spatial and
temporal locality in the dynamics of real camera networks to guide its
inference-time search for a query identity. In an offline profiling phase,
ReXCam builds a cross-camera correlation model that encodes the locality
observed in historical traffic patterns. At inference time, ReXCam applies this
model to filter frames that are not spatially and temporally correlated with
the query identity's current position. In cases of occasional missed
detections, ReXCam performs a fast-replay search on recently filtered video
frames, enabling graceful recovery. Together, these techniques allow ReXCam
to reduce compute workload by 8.3x on an 8-camera dataset, and by 23x - 38x on
a simulated 130-camera dataset. ReXCam has been implemented and deployed on a
testbed of 5 AWS DeepLens cameras.
Comment: 15 pages.
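The inference-time filtering step can be pictured as a lookup into a cross-camera correlation table built during offline profiling; the random table and the threshold below are illustrative stand-ins, not ReXCam's learned model.

```python
# A minimal sketch of ReXCam-style spatio-temporal filtering as described:
# a correlation model built from historical traffic says, for a query last
# seen at camera i, which cameras are worth searching next. The table here
# is random for illustration, not learned from real traffic patterns.
import numpy as np

num_cams = 8
# corr[i, j] = historical probability of reappearing at camera j after i
corr = np.random.dirichlet(np.ones(num_cams), size=num_cams)

def cameras_to_search(last_cam: int, threshold: float = 0.1):
    """Skip cameras weakly correlated with the query's last position."""
    return [j for j in range(num_cams) if corr[last_cam, j] >= threshold]

print(cameras_to_search(last_cam=3))
```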
Distance-based Camera Network Topology Inference for Person Re-identification
In this paper, we propose a novel distance-based camera network topology
inference method for efficient person re-identification. To this end, we first
calibrate each camera and estimate relative scales between cameras. Using the
calibration results of multiple cameras, we calculate the speed of each person
and infer the distance between cameras to generate distance-based camera
network topology. The proposed distance-based topology can be applied
adaptively to each person according to their speed and handles the diverse
transition times of people between non-overlapping cameras. To validate the
proposed method, we tested it on an open person re-identification dataset and
compared it to state-of-the-art methods. The experimental results show that
the proposed method is effective for person re-identification in a
large-scale camera network with diverse person transition times.
Comment: 10 pages, 11 figures.
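The core inference step, distance as speed times transition time aggregated over many observed people, can be sketched as follows; the speeds and times are illustrative values, not measurements from the paper.

```python
# A minimal sketch of the described inference: with calibrated cameras, a
# person's walking speed and observed transition time between two
# non-overlapping cameras imply a distance; averaging over many people
# yields one edge of the distance-based topology.
def camera_distance(speed_mps: float, transition_s: float) -> float:
    """Distance between two cameras implied by one person's transition."""
    return speed_mps * transition_s

observations = [(1.4, 52.0), (1.2, 61.0), (1.5, 48.0)]  # (speed, time) pairs
estimates = [camera_distance(v, t) for v, t in observations]
print(sum(estimates) / len(estimates))  # averaged distance estimate, meters
```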
Probabilistic Semantic Retrieval for Surveillance Videos with Activity Graphs
We present a novel framework for finding complex activities matching
user-described queries in cluttered surveillance videos. The wide diversity of
queries coupled with unavailability of annotated activity data limits our
ability to train activity models. To bridge the semantic gap we propose to let
users describe an activity as a semantic graph with object attributes and
inter-object relationships associated with nodes and edges, respectively. We
learn node/edge-level visual predictors during training and, at test-time,
propose to retrieve activity by identifying likely locations that match the
semantic graph. We formulate a novel CRF based probabilistic activity
localization objective that accounts for mis-detections, mis-classifications
and track-losses, and outputs a likelihood score for a candidate grounded
location of the query in the video. We seek groundings that maximize overall
precision and recall. To handle the combinatorial search over all
high-probability groundings, we propose a highest precision subgraph matching
algorithm. Our method outperforms existing retrieval methods on benchmarked
datasets.
Comment: (c) 2018 IEEE. This paper has been accepted by IEEE Transactions on Multimedia. Print ISSN: 1520-9210. Online ISSN: 1941-0077. Preprint: https://ieeexplore.ieee.org/document/8438958
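One way to picture the likelihood scoring of a candidate grounding is as a combination of node-level (object/attribute) and edge-level (relationship) predictor scores; the independence-style factorization below is a simplification of the paper's CRF objective, and all names and values are illustrative.

```python
# A minimal sketch of scoring one candidate grounding of a query activity
# graph: node detector confidences and edge relationship confidences combine
# into a log-likelihood for that placement. Treating nodes and edges as
# independent factors is a simplification of the full CRF model.
import math

def grounding_score(node_probs: dict, edge_probs: dict) -> float:
    """node_probs: {node: p}; edge_probs: {(node_a, node_b): p}."""
    log_p = sum(math.log(p) for p in node_probs.values())
    log_p += sum(math.log(p) for p in edge_probs.values())
    return log_p

nodes = {"person": 0.9, "red_car": 0.8}            # detector confidences
edges = {("person", "red_car"): 0.7}               # relationship predictor
print(grounding_score(nodes, edges))               # higher = better match
```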